GeeCON Prague 2014*
Ranchers,Clarification on UTF-8 in Java

Deepak Lal
Ranch Hand

Joined: Jul 01, 2008
Posts: 507

Hi Ranchers,
I'm reading a file (C:/Documents and Settings/Administrator/Desktop/myFiles/Example.txt) that is saved in UTF-8 encoding (attached to this post).

/*
* C:/Documents and Settings/Administrator/Desktop/myFiles/Example.txt is a UTF-8 encoded file. Opened in UltraEdit in hex mode, it shows 3 junk characters
* prefixed to 1001528770 at the beginning of the file.
* When the file is read as shown below, those characters are replaced by ?: the first line from the BufferedReader object has a ? prefixed to it (see the displayed output).
* Can you help me with this? I need the question mark (?) to disappear from the BufferedReader object.
* Please, calling all Java experts to help me.
*/

See the example program below, executed on JDK 1.6 under Eclipse; its output is also shown below.

Requirement: I want to remove the ? prefixed to the line (Line Read is ?1001528770;PHP;ABQ;PHP;22;SHA;123456789;Tan Ya Ling;26:11:2008;07).
How do we accomplish this? Please, I need your inputs.




Sample program: see below. (The attached file, Example.txt, is the one used for parsing.)



When The Going Gets Tougher,The Tougher gets Going
Deepak Lal
Ranch Hand

Joined: Jul 01, 2008
Posts: 507

Please help me with this.
Inputs/suggestions, please.



Deepak
K. Tsang
Bartender

Joined: Sep 13, 2007
Posts: 2452
    
    8

Once the data is read into a byte array, try using new String(byte[], "UTF-8"). I don't know whether this will work. You may want to look through the Java Internationalization tutorial.


K. Tsang JavaRanch SCJP5 SCJD/OCM-JD OCPJP7 OCPWCD5 OCPBCD5
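
A minimal sketch of the byte-array suggestion above, not code from the thread: the class and method names are made up for illustration, and it assumes the whole file is already in memory as bytes. Java's UTF-8 decoder does not strip a byte-order mark, so the bytes EF BB BF decode to the single character U+FEFF, which the sketch drops by hand.

```java
import java.nio.charset.Charset;

final class Utf8BytesToString {
    private static final Charset UTF8 = Charset.forName("UTF-8");

    /** Decodes UTF-8 bytes and drops one leading U+FEFF (the decoded BOM), if present. */
    static String decodeWithoutBom(byte[] bytes) {
        String text = new String(bytes, UTF8);
        if (text.length() > 0 && text.charAt(0) == '\uFEFF') {
            return text.substring(1);
        }
        return text;
    }

    public static void main(String[] args) {
        // Stand-in for the file contents: a UTF-8 BOM followed by "1001".
        byte[] raw = {(byte) 0xEF, (byte) 0xBB, (byte) 0xBF, '1', '0', '0', '1'};
        System.out.println(new String(raw, UTF8).length()); // 5: U+FEFF plus four digits
        System.out.println(decodeWithoutBom(raw));          // 1001
    }
}
```

To hand back a BufferedReader, as the original poster requires, the cleaned text could be wrapped as `new BufferedReader(new StringReader(text))`.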
Deepak Lal
Ranch Hand

Joined: Jul 01, 2008
Posts: 507



TSang wrote:
once the data is read into a byte array, try using new String (byte[], "UTF-8").




My comments: I want the ? (question mark) removed from what the BufferedReader object returns, and I need to return the BufferedReader object ONLY. I have googled a lot and did not find any help.

The return type should be a BufferedReader object, without the ? (question mark).



Please suggest how to proceed....
Henry Wong
author
Sheriff

Joined: Sep 28, 2004
Posts: 18876
    
  40

Please suggest on how to proceed....


Good, bad, or indifferent, you kinda decided on how to proceed...

i want the ?(question mark) to be removed in the bufferedReader object and again i need to return the bufferedReder object. I have googled enough and did not find any help..

return type should be bufferedReader object and without the ?


Extend the BufferedReader class, and override the readXXX() methods to remove any question marks. Then instantiate it, and return it. An instance of your new class IS-A BufferedReader instance, and it will remove the question marks.

Do I think that this is a good idea? Not really. But this is what you want -- and isn't hard to implement.

Henry


Books: Java Threads, 3rd Edition, Jini in a Nutshell, and Java Gems (contributor)
Deepak Lal
Ranch Hand

Joined: Jul 01, 2008
Posts: 507

Henry ,
could you please send me the sample code...this is really difficult for me...
Please help me for this...i have been struggling to get this thing ......

Please henry sample code needed.

Please again requesting you for sample code...




Deepak
Henry Wong
author
Sheriff

Joined: Sep 28, 2004
Posts: 18876
    
  40


Sorry, but the JavaRanch is not a code mill...

http://faq.javaranch.com/java/NotACodeMill

I described it enough for you to do it -- or at the least, get started. I would recommend that you attempt it. You can always ask for more clarification once you run into an issue.

Henry
Deepak Lal
Ranch Hand

Joined: Jul 01, 2008
Posts: 507

OK Henry, I will try and get back to you in this same thread. Please reply in case I need any clarifications.
Narendira Sarma
Greenhorn

Joined: Nov 14, 2008
Posts: 18
Deepak,

No one is going to feel the same urgency about your question that you do, because it's your problem.

Removing a "?" is very simple using the String.replace() method. But that is not going to help you, as more and more question marks can appear anywhere in the lines you read from the file.

Try to analyze your file. If Java (or the IDE) is displaying a ?, then it clearly suggests that the character is not supported by the encoding in use. Try to print "byte by byte" and see if the byte relating to "?" is outside the range of printable characters.

Check if your IDE (if you are using some IDE) is capable of showing those characters.

I am just giving you suggestions from what you have presented.

These might help you.
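
A minimal sketch of the "byte by byte" check suggested above, not code from the thread: the class and method names are illustrative, and an in-memory byte array stands in for the real file (a FileInputStream could be passed instead).

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;

final class LeadingBytesDump {
    /** Returns up to the first {@code count} bytes of the stream as space-separated hex pairs. */
    static String hexOfLeadingBytes(InputStream in, int count) throws IOException {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < count; i++) {
            int b = in.read();          // 0..255, or -1 at end of stream
            if (b == -1) {
                break;
            }
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(String.format("%02X", b));
        }
        return sb.toString();
    }

    public static void main(String[] args) throws IOException {
        // Stand-in for the real file: a UTF-8 BOM followed by "10".
        byte[] sample = {(byte) 0xEF, (byte) 0xBB, (byte) 0xBF, '1', '0'};
        System.out.println(hexOfLeadingBytes(new ByteArrayInputStream(sample), 5));
        // prints: EF BB BF 31 30
    }
}
```

A file that starts with EF BB BF carries a UTF-8 byte-order mark; those bytes are not printable text, which is why a console or editor may show a ? in their place.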
Deepak Lal
Ranch Hand

Joined: Jul 01, 2008
Posts: 507

Thanks for the information. Let me try again at my end.
Deepak Lal
Ranch Hand

Joined: Jul 01, 2008
Posts: 507

Hi Sarma and Henry,
I have tried the below. Could you help me further, please?




The first 3 are illegal. I need to remove them and assign the result back to the BufferedReader object ... please advise....

Henry Wong
author
Sheriff

Joined: Sep 28, 2004
Posts: 18876
    
  40

The first 3 are illegal i need to remove them and assign it back to bufferedReader object ... please advice....


I don't know what more to tell you... You have demonstrated that there are illegal characters in the stream (twice) and the correct answer here is to remove them from the string.... But you don't want that -- you want the BufferedReader to not return them.

This means you have two choices...

1. Do the suggestion that I mentioned before (see previous post).

2. Or read it all in -- to a char array. Get rid of the illegal chars. And then use a BufferedReader on a CharArrayReader.

Either will work.... You just have to do it. And we can't give you any hints in the right direction, if you don't actually start.

I have tried the below,could you help me further please.....


It is nice that you tried something, but it would have been better if you had tried what was mentioned. So, I guess, my recommendation would be... to reread the previous posts, as you haven't used that help yet.

Henry
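
A minimal sketch of the second option above (read everything in, drop the illegal chars, then put a BufferedReader on a CharArrayReader). It is not code from the thread: the names are illustrative, and it assumes the only "illegal" character is U+FEFF, which is what a UTF-8 byte-order mark decodes to.

```java
import java.io.BufferedReader;
import java.io.CharArrayReader;
import java.io.CharArrayWriter;
import java.io.IOException;
import java.io.Reader;

final class CleanedReaderFactory {
    /** Reads everything from source, drops every U+FEFF, and returns a BufferedReader over the rest. */
    static BufferedReader withoutBomChars(Reader source) throws IOException {
        CharArrayWriter kept = new CharArrayWriter();
        char[] chunk = new char[4096];
        int n;
        while ((n = source.read(chunk)) != -1) {
            for (int i = 0; i < n; i++) {
                if (chunk[i] != '\uFEFF') {   // assumption: U+FEFF is the only char to remove
                    kept.write(chunk[i]);
                }
            }
        }
        return new BufferedReader(new CharArrayReader(kept.toCharArray()));
    }
}
```

The trade-off is memory: the whole file is held in a char array, which is fine for a few hundred short lines but not for very large inputs.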
Paul Clapham
Bartender

Joined: Oct 14, 2005
Posts: 18570
    
    8

You probably just have a byte-order mark (BOM) at the beginning of your file. Some Microsoft programs put that there even though it isn't required or even necessary for documents formatted in UTF-8. So why not just skip it?
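
A minimal sketch of skipping the BOM at the byte level, before any decoding happens. It is not code from the thread; the class and method names are illustrative. A PushbackInputStream lets the first three bytes be inspected and pushed back if they turn out not to be EF BB BF.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.PushbackInputStream;
import java.nio.charset.Charset;

final class BomSkippingReaders {
    /** Opens a UTF-8 BufferedReader over the stream, skipping a leading EF BB BF if present. */
    static BufferedReader openUtf8(InputStream raw) throws IOException {
        PushbackInputStream in = new PushbackInputStream(raw, 3);
        byte[] head = new byte[3];
        int n = 0;
        // read() may return fewer bytes than requested, so loop until 3 bytes or end of stream.
        while (n < 3) {
            int r = in.read(head, n, 3 - n);
            if (r == -1) {
                break;
            }
            n += r;
        }
        boolean bom = n == 3
                && head[0] == (byte) 0xEF
                && head[1] == (byte) 0xBB
                && head[2] == (byte) 0xBF;
        if (!bom && n > 0) {
            in.unread(head, 0, n);   // not a BOM: push the bytes back untouched
        }
        return new BufferedReader(new InputStreamReader(in, Charset.forName("UTF-8")));
    }
}
```

Called as `BomSkippingReaders.openUtf8(new FileInputStream(path))`, this returns a plain BufferedReader, so code that expects that type does not need to change.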
Henry Wong
author
Sheriff

Joined: Sep 28, 2004
Posts: 18876
    
  40


Just thought of something -- if the illegal characters are always in the beginning, then one other option would be to read off the characters before returning the BufferedReader...

Henry
Paul Clapham
Bartender

Joined: Oct 14, 2005
Posts: 18570
    
    8

Henry Wong wrote:
Just thought of something -- if the illegal characters are always in the beginning, then one other option would be to read off the characters before returning the BufferedReader...

Henry

Which they are, according to the very detailed post. Which also shows that they are indeed a UTF-8 byte order mark; see the Wikipedia article http://en.wikipedia.org/wiki/Byte_order_mark to confirm that.
Deepak Lal
Ranch Hand

Joined: Jul 01, 2008
Posts: 507

Yes, you are all absolutely correct: it is a UTF-8 file with a BOM. We need to parse it, and if there are any illegal characters at the beginning of the file, skip them. How can I accomplish that? Please help me.


henry wrote:
Just thought of something -- if the illegal characters are always in the beginning, then one other option would be to read off the characters before returning the BufferedReader...





But how do I accomplish it? Please, I need help on this, Henry.... You are saying read off the characters, but I don't know how to accomplish that.... Can you tell me which API supports reading off the first 3 bytes and then returning the rest back as a BufferedReader?



Regards
Deepak
Henry Wong
author
Sheriff

Joined: Sep 28, 2004
Posts: 18876
    
  40

Deepak Lal wrote:
But how do I accomplish it? Please, I need help on this, Henry.... You are saying read off the characters, but I don't know how to accomplish that.... Can you tell me which API supports reading off the first 3 bytes and then returning the rest back as a BufferedReader?


Take a look at the BufferedReader methods -- particularly the read(), mark(), and reset() methods.

Basically, mark() the beginning of the file, then read() the first three characters (not a line, just the first three characters). If the first three characters are junk characters, then you are done. Simply return the BufferedReader back from the method. If the first three characters are not junk characters, then reset() the buffered reader, and then return it back from the method.

Anyone who uses the returned BufferedReader will not get the junk characters when they try to read from it (as you have already taken them off the stream).


And again, this only works if the junk characters are at the beginning. For anything else, you have to use either of the other two suggestions already mentioned.

Henry
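
The mark() / read() / reset() steps described above can be sketched as follows. This is not code from the thread, and it makes one assumption that differs from the description: the BufferedReader sits on a reader that decodes UTF-8, so the three BOM bytes arrive as the single character U+FEFF and only one char has to be read off. The names and the read-ahead value are illustrative.

```java
import java.io.BufferedReader;
import java.io.IOException;

final class LeadingBomSkipper {
    // Small read-ahead budget: only one char is read before a possible reset().
    private static final int READ_AHEAD_LIMIT = 4;

    /** Consumes one leading U+FEFF if present; otherwise leaves the read position unchanged. */
    static BufferedReader skipLeadingBom(BufferedReader reader) throws IOException {
        reader.mark(READ_AHEAD_LIMIT);  // marks the current position (the start of the stream)
        int first = reader.read();      // one char, not a line
        if (first != '\uFEFF') {
            reader.reset();             // not a BOM (or empty stream): rewind to the mark
        }
        return reader;
    }
}
```

This must be called before anything else reads from the reader; once a line has been read, the BOM is already gone from the stream and there is nothing left to skip.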
Deepak Lal
Ranch Hand

Joined: Jul 01, 2008
Posts: 507

Hi Henry
Tried, but it's not working.

see below:



Please suggest.




Deepak
Henry Wong
author
Sheriff

Joined: Sep 28, 2004
Posts: 18876
    
  40

Please advice on this now. I did as [you] suggested...what is wrong now....


Please read my post again. You are not doing anything close to what I suggested.

Henry
Deepak Lal
Ranch Hand

Joined: Jul 01, 2008
Posts: 507





Henry Wrote :
Take a look at the BufferedReader methods -- particularly the read(), mark(), and reset() methods.

Basically, mark() the beginning of the file, then read() the first three characters (not a line, just the first three characters). If the first three characters are junk characters, then you are done. Simply return the BufferedReader back from the method. If the first three characters are not junk characters, then reset() the buffered reader, and then return it back from the method.

Anyone who uses the returned BufferedReader will not get the junk characters when they try to read from it (as you have already taken them off the stream).






Henry Wong
author
Sheriff

Joined: Sep 28, 2004
Posts: 18876
    
  40

Even the code in my previous post is doing the same steps.( as you have suggested)(junk character is @ beginning of file)


Huh?!?!?


mark() the beginning of the file,


You didn't do this. At least, not correctly...

then read() the first three characters


You didn't do this.

(not a line, just the first three characters).


You definitely didn't do this. In fact, you did exactly what I said not to do.

If the first three characters are junk characters, then you are done. Simply return the BufferedReader back from the method. If the first three characters are not junk characters, then reset() the buffered reader, and then return it back from the method.


Since, you didn't mark correctly, and you didn't read 3 characters -- you can't do any of these.



Anyway, I don't know what more to say. This may be a language issue. Maybe someone else can explain it better than me...

Henry
Deepak Lal
Ranch Hand

Joined: Jul 01, 2008
Posts: 507

When I read the line, it is as below:

previous line is ?1001528770;PHP;ABQ;PHP;22;SHA;123456789;Tan Ya Ling;26:11:2008;07

// Hence there is only one junk character now (we don't have 3 junk characters any more).

So I did the following:

bufferedReader.mark(0); // Marks the present position in the stream, hence I passed zero. Now what is wrong with this statement?

The BufferedReader API


mark

public void mark(int readAheadLimit)
throws IOException
Marks the present position in the stream. Subsequent calls to reset() will attempt to reposition the stream to this point.

Overrides:
mark in class Reader
Parameters:
readAheadLimit - Limit on the number of characters that may be read while still preserving the mark. An attempt to reset the stream after reading characters up to this limit or beyond may fail. A limit value larger than the size of the input buffer will cause a new buffer to be allocated whose size is no smaller than limit. Therefore large values should be used with care.
Throws:
IllegalArgumentException - If readAheadLimit is < 0
IOException - If an I/O error occurs



Next step: since there is only one junk character, I'm reading a single character.

int k = bufferedReader.read(); // Reads a single character, so I'm reading a single character; see the API below.



BufferedReader Api says:

read
public int read()
throws IOException
Reads a single character.

Overrides:
read in class Reader
Returns:
The character read, as an integer in the range 0 to 65535 (0x00-0xffff), or -1 if the end of the stream has been reached
Throws:
IOException - If an I/O error occurs


System.out.println("k is "+k); // Here k prints as 49, which is the char value of '1'. That is not a junk character, yet the line we saw has a junk character.

Please tell me where I have gone wrong now....

K. Tsang
Bartender

Joined: Sep 13, 2007
Posts: 2452
    
    8

Hello Deepak, having looked through your posts in detail: have you considered changing your approach to use the RandomAccessFile readUTF() method? I think this may make life easier than the BufferedReader stuff.
Henry Wong
author
Sheriff

Joined: Sep 28, 2004
Posts: 18876
    
  40

bufferedReader.mark(0) ; //Marks the present position in the stream hence i have marked as Zero. now what is wrong with this statement.


The parameter to the mark() method is not the position to mark -- the position to mark is the present position. In fact, passing zero as the parameter should effectively disable the marking mechanism.

Henry
Henry Wong
author
Sheriff

Joined: Sep 28, 2004
Posts: 18876
    
  40

Next Step: Since there is only one junk character,im reading a single character

int k = bufferedReader.read(); // Reads a single character. hence im reading a single character see api below


Actually, it is three junk bytes -- as shown by your other output. It is just that the string output shows them as one question mark.

Regardless, you didn't read one character or three characters. You read one line AND one character.

Henry
Henry Wong
author
Sheriff

Joined: Sep 28, 2004
Posts: 18876
    
  40

System.out.println("k is "+k); //here k is displaying the value as 49 which is charValue(1) which is not a junk character but we see line is having a junk character.

Please tell me where i have gone wrong now....


It is not a junk character, because you read the junk character off already -- when you read the first line earlier.

Henry
Henry Wong
author
Sheriff

Joined: Sep 28, 2004
Posts: 18876
    
  40

K. Tsang wrote: Hello Deepak, having looked through your posts in detail: have you considered changing your approach to use the RandomAccessFile readUTF() method? I think this may make life easier than the BufferedReader stuff.


Unfortunately, the OP is very particular about returning a BufferedReader. I am guessing it is because this code will be called by some other code that expects a BufferedReader and can't be changed.

As a side note, of all the suggestions so far, I still like the first one the best -- writing a FilterReader that can be wrapped by the BufferedReader (and that removes all the illegal chars automatically). It's definitely the most elegant, IMHO.

Henry
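
A minimal sketch of the FilterReader idea, assuming the only illegal character is U+FEFF (the decoded byte-order mark). It is not the FilterReader mentioned in the thread, which was never posted; the class name is illustrative.

```java
import java.io.FilterReader;
import java.io.IOException;
import java.io.Reader;

/** Drops every U+FEFF coming out of the wrapped reader. */
final class BomFilterReader extends FilterReader {
    BomFilterReader(Reader in) {
        super(in);
    }

    @Override
    public int read() throws IOException {
        int c;
        do {
            c = in.read();
        } while (c == '\uFEFF');     // keep reading past filtered chars
        return c;                    // a real char, or -1 at end of stream
    }

    @Override
    public int read(char[] buf, int off, int len) throws IOException {
        if (len == 0) {
            return 0;
        }
        while (true) {
            int n = in.read(buf, off, len);
            if (n <= 0) {
                return n;            // end of stream
            }
            int kept = 0;
            for (int i = 0; i < n; i++) {
                char ch = buf[off + i];
                if (ch != '\uFEFF') {
                    buf[off + kept++] = ch;   // compact the kept chars in place
                }
            }
            if (kept > 0) {
                return kept;
            }
            // Everything read was filtered out: read again rather than return 0.
        }
    }
}
```

It would be used as `new BufferedReader(new BomFilterReader(new InputStreamReader(new FileInputStream(path), "UTF-8")))`. Two caveats: this removes U+FEFF anywhere in the text, not only at the start, and skip(), mark() and reset() are not overridden, so they still act on the unfiltered stream.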
Deepak Lal
Ranch Hand

Joined: Jul 01, 2008
Posts: 507

Henry ,
Please, what parameter should I pass to the mark() method in this scenario? Could you correct that, and I will proceed further.
I'm confused.
Henry Wong
author
Sheriff

Joined: Sep 28, 2004
Posts: 18876
    
  40

Deepak Lal wrote: Please what parameter should i pass to my mark method in this scenario...could you correct that and i will proceed further...
Im confused.


I think you have the misconception that the mark() method will move the position to a certain location in the stream -- it doesn't. It is the reset() method that moves the position. The mark() method just marks the current position.

So, if you want to mark the beginning of the stream, you have to mark it when you are at the beginning of the stream.

I am assuming you want to read a line from the beginning, then go back to the beginning and read a few characters. To do that, you must mark it when it is at the beginning, which is before you read the line. And when you want to go back to the beginning, you need to call the reset() method.

As for the parameter to the mark method, that is the number of characters to buffer -- meaning the number of characters that you are expecting to read before you call the reset() method. You have to make sure that this number is large enough, or you won't be able to reset. In your example, you need to have enough room to hold the line you are reading and will later reset.

Henry
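
A minimal sketch of that mark-then-reset sequence, not code from the thread. The names are illustrative, and the value 256 is an assumed upper bound on the length of the first line (plus its line terminator) in characters.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.StringReader;

final class MarkResetDemo {
    /** Returns the next line without consuming it: the reader is rewound to where it was. */
    static String peekLine(BufferedReader reader, int readAheadLimit) throws IOException {
        reader.mark(readAheadLimit);    // marks the CURRENT position; the argument is a limit, not a position
        String line = reader.readLine();
        reader.reset();                 // jumps back to the marked position
        return line;
    }

    public static void main(String[] args) throws IOException {
        BufferedReader reader = new BufferedReader(
                new StringReader("1001528770;PHP;ABQ\nsecond line\n"));
        System.out.println(peekLine(reader, 256)); // 1001528770;PHP;ABQ
        System.out.println(reader.readLine());     // 1001528770;PHP;ABQ again, thanks to reset()
    }
}
```

If the limit is smaller than the number of characters actually read before reset(), the mark may be invalidated and reset() can throw an IOException.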
Campbell Ritchie
Sheriff

Joined: Oct 13, 2005
Posts: 39084
    
  23
Too difficult a question for "beginners". Moving.
Martijn Verburg
author
Bartender

Joined: Jun 24, 2003
Posts: 3274
    
    5

Also for those interested, reading this is highly recommended.


Cheers, Martijn - Blog,
Twitter, PCGen, Ikasan, My The Well-Grounded Java Developer book!,
My start-up.
Deepak Lal
Ranch Hand

Joined: Jul 01, 2008
Posts: 507

Hi Henry,
Could you please paste the code which could help me to achieve the impossible? I have been struggling to get this done.... Please.
Henry Wong
author
Sheriff

Joined: Sep 28, 2004
Posts: 18876
    
  40

Deepak Lal wrote:Hi Henry,
Could you please paste the code which could help me to achieve the impossible? I have been struggling to get this done.... Please.


Nope. Still not a code mill.

No offense intended. But I gave you three possible solutions. You completely ignored the first two. And with this last one, for which I described the solution completely -- enough that it can be considered pseudo code -- you did something only remotely related, then complained that you had done it correctly. And when I explained that you hadn't, and gave you clarification, you ignored the clarification and are now blatantly asking for the solution.

Sorry, but since you have little interest in doing this yourself, there is no longer a need for me here.

Henry
Henry Wong
author
Sheriff

Joined: Sep 28, 2004
Posts: 18876
    
  40

As for this...

As a side note, of all the suggestions so far, I still like the first one the best -- writing a FilterReader that can be wrapped by the BufferedReader (and that removes all the illegal chars automatically). It's definitely the most elegant, IMHO.


It was pretty straightforward. It took me 20 minutes to write the FilterReader.

Henry
Deepak Lal
Ranch Hand

Joined: Jul 01, 2008
Posts: 507


You have to make sure that this number is large enough, or you won't be able to reset. In your example, you need to have enough room to hold the line you are reading and will later reset.


"This number is large enough": can you at least tell me what value "large enough" could mean?

example if there are 300 lines,then approximately
1001528770;PHP;ABQ;PHP;22;SHA;123456789;Tan Ya Ling;26:11:2008;07
1001528770;PHP;ABQ;PHP;22;SHA;123456789;Tan Ya Ling;26:11:2008;07
1001528770;PHP;ABQ;PHP;22;SHA;123456789;Tan Ya Ling;26:11:2008;07
1001528770;PHP;ABQ;PHP;22;SHA;123456789;Tan Ya Ling;26:11:2008;07

upto 300 lines.

please suggest... im
Deepak Lal
Ranch Hand

Joined: Jul 01, 2008
Posts: 507

Hi Henry,
Need your help again...Please ....


Could you please tell me where I have gone wrong now? I'm reading the first 3 characters... isn't that correct? Could you please refine the above code?




Paul Clapham
Bartender

Joined: Oct 14, 2005
Posts: 18570
    
    8

(I haven't been following this thread, so my apologies if this has already been covered.)

So okay, you read those three characters from the Reader. You seem to think you have "gone wrong" in some way. But I don't see a problem. What do you see as the problem with that code?
Deepak Lal
Ranch Hand

Joined: Jul 01, 2008
Posts: 507

I should get 3 junk characters, but now I'm getting the output below:
{
The value of k 65279 (what is this referring to?)
The value of k 49 which is correct for the char value of '1'
The value of k 48 which is correct for the char value of '0'
}
Where are the three junk characters referred to by Henry? Or is this the correct output?

Please help me, I'm still confused.

Deepak
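
The 65279 in that output is 0xFEFF in hexadecimal, which is the byte-order mark itself: a reader that decodes UTF-8 turns the three bytes EF BB BF into that one character, so there is one junk char rather than three. A minimal sketch that reproduces the three values, with an in-memory byte array standing in for the file and illustrative names throughout:

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.Reader;
import java.nio.charset.Charset;

final class BomAsSingleChar {
    /** Decodes the bytes as UTF-8 and returns the numeric values of the first {@code count} reads. */
    static int[] firstCharValues(byte[] utf8, int count) throws IOException {
        Reader reader = new InputStreamReader(
                new ByteArrayInputStream(utf8), Charset.forName("UTF-8"));
        int[] values = new int[count];
        for (int i = 0; i < count; i++) {
            values[i] = reader.read();  // -1 once the stream is exhausted
        }
        return values;
    }

    public static void main(String[] args) throws IOException {
        // EF BB BF is the UTF-8 encoding of U+FEFF; then the digits '1' and '0'.
        byte[] sample = {(byte) 0xEF, (byte) 0xBB, (byte) 0xBF, '1', '0'};
        for (int k : firstCharValues(sample, 3)) {
            System.out.println("The value of k " + k);
        }
        // prints 65279, 49 and 48 on separate lines; 65279 == 0xFEFF
    }
}
```

So after a single read() that returns 65279, the reader is already positioned at the first real character; reading two more characters consumes the "1" and "0" of the data.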
Henry Wong
author
Sheriff

Joined: Sep 28, 2004
Posts: 18876
    
  40

this number is large enough. can you atleast tell me large enough means what could be the value.

example if there are 300 lines,then approximately
1001528770;PHP;ABQ;PHP;22;SHA;123456789;Tan Ya Ling;26:11:2008;07
1001528770;PHP;ABQ;PHP;22;SHA;123456789;Tan Ya Ling;26:11:2008;07
1001528770;PHP;ABQ;PHP;22;SHA;123456789;Tan Ya Ling;26:11:2008;07
1001528770;PHP;ABQ;PHP;22;SHA;123456789;Tan Ya Ling;26:11:2008;07

upto 300 lines.


Since you only intend to reset at most 3 characters, shouldn't a value of 3 be okay?

Henry
 