Quick String to <char> Array Conversion

Kevin Simonson
Ranch Hand

Joined: Oct 22, 2011
Posts: 103
Is there some quick way to transform a <String> argument to an array of <byte>s and vice versa?

Right now my code is:

which works just fine; if I type in:

java Stb QuickBrownFoxJumpsOverTheLazyDog QuickBrownFoxJumpsOverTheLazyDog

it apparently makes the conversion without any problem and announces that the two strings are equal.

But I've noticed that I'm processing the <String> object character by character, and I wonder if that might be slowing me down, if perhaps I'm going to process a large number of <String> objects, or perhaps some very long <String> objects.

I know that when doing file I/O with large files I can create a large <char> array as a buffer and call the <read()> and <write()> methods of <InputStream> and <OutputStream> objects respectively that use <char> arrays as arguments, and therefore move a lot of data from the file to the buffer in one method call, or a lot of data from the buffer to another file in one method call. I was just wondering if there was a quick way to move the contents of a <String> object to a <char> array and back again that involved just one call to a method. Anybody know of any?

Kevin Simonson
Paul Clapham
Bartender

Joined: Oct 14, 2005
Posts: 18541
    

Sure, both of those things exist. And they are right there in the documentation for java.lang.String. You can't miss them.
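
For reference, a minimal sketch of the one-call conversions between a String and a char array that the java.lang.String documentation describes. The class and method names here are illustrative only and do not come from the thread.

```java
// Sketch only: class and helper names are hypothetical, chosen for illustration.
class CharArrayRoundTrip {
    // One call: String -> char[] (copies the string's 16-bit chars into a new array).
    static char[] toChars(String s) {
        return s.toCharArray();
    }

    // One call: char[] -> String (copies the array's contents into a new String).
    static String fromChars(char[] chars) {
        return new String(chars);
    }

    public static void main(String[] args) {
        String original = "QuickBrownFoxJumpsOverTheLazyDog";
        char[] chars = toChars(original);
        String restored = fromChars(chars);
        System.out.println("chars: " + chars.length);
        System.out.println("equal after round trip: " + original.equals(restored));
    }
}
```

Because a char array holds exactly what the String holds, this round trip involves no encoding and cannot lose data. Converting to a byte array is a different operation and does involve an encoding.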
Kevin Simonson
Ranch Hand

Joined: Oct 22, 2011
Posts: 103
Paul Clapham wrote:Sure, both of those things exist. And they are right there in the documentation for java.lang.String. You can't miss them.

As I stated in my original post, I was primarily interested in a quick way to transform a <String> argument into an array of <byte>s. I followed the link you gave me, Paul, and tried timing it on transforming <String> objects into <char> arrays, and then back again, both with the <getBytes( int, int, byte[], int)> method one way and the <String( byte[], int, int)> constructor the other way, and with the <getBytes()> method (that returns an object of type <byte[]>) the first way, and with the <String( byte[])> constructor the other way. I also timed the method that I had been using, where I went through the <String> object character by character, and converted each character into a pair of bytes.

The code I used was:

To get a really large file I took the first 58,248 digits of pi, split them into 7281 lines, giving me 576 characters to each line, and then concatenated that file onto itself to make it fifty times bigger, so the file I ended up with has 364,050 lines, each 576 characters wide. Then I just used <Scanner>'s method <nextLine()> on each of those 364,050 lines to get fairly long <String>s to work with.

When I timed it, the first algorithm took 1.775 seconds, and the second took 2.638 seconds. My own personal algorithm took 2.134 seconds. But note that the <getBytes( int, int, byte[], int)> method used by the first algorithm is deprecated. Also, I wrote a second version of my code:

and when I ran the two versions I got results:

C:\Users\kvnsmnsn\D\Java\Misc>javac StrngBte.java
Note: StrngBte.java uses or overrides a deprecated API.
Note: Recompile with -Xlint:deprecation for details.

C:\Users\kvnsmnsn\D\Java\Misc>java StrngBte PiHun.Txt 364050 1152
Using built in constructor and deprecated method, took 1775 ms.
Using my homemade code, took 2134 ms.
Using the non-deprecated one, returning a new <byte> array, took 2638 ms.

C:\Users\kvnsmnsn\D\Java\Misc>javac StrngBte2.java
Note: StrngBte2.java uses or overrides a deprecated API.
Note: Recompile with -Xlint:deprecation for details.

C:\Users\kvnsmnsn\D\Java\Misc>java StrngBte2 PiHun.Txt 364050 1152
Using built in constructor and deprecated method, took 1715 ms.
Using my homemade code, took 2161 ms.
Using the non-deprecated one, returning a new <byte> array, took 2664 ms.
(Ratio was 100%.)

That last parenthetical remark indicates that the length of the <String> object was invariably 100% of the length of the <byte[]> object, which indicates that the second algorithm potentially loses information, since a <String> of 576 characters has 9216 bits (16 bits per character), while a <byte> array of 576 elements has only 4608 bits (8 per element).

I did some of my own testing and concluded that the first, deprecated, algorithm has the same problem; it converts a <String> of 576 characters into a <byte> array of 576 elements. Even the first algorithm doesn't quite run twice as fast as my homemade algorithm, so if I modified my code so that it only stores the least significant eight bits of each character in each <String> object, then my algorithm would probably be faster than either of the other two algorithms.
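
The homemade code itself is not preserved in this thread, so the following is only a guess at what "a pair of bytes per character" could look like. It assumes big-endian byte order, and the class name and sample string are made up for illustration. It also compares the result with the built-in getBytes(StandardCharsets.UTF_16BE), which uses two bytes per char as well.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

// Sketch only: NOT the original poster's code, which was not preserved.
class TwoBytesPerChar {
    // Stores each 16-bit char as two bytes, high byte first (assumed order).
    static byte[] encode(String s) {
        byte[] out = new byte[2 * s.length()];
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            out[2 * i] = (byte) (c >>> 8);   // high 8 bits
            out[2 * i + 1] = (byte) c;       // low 8 bits
        }
        return out;
    }

    // Rebuilds each char from its two bytes; masks undo sign extension.
    static String decode(byte[] bytes) {
        char[] chars = new char[bytes.length / 2];
        for (int i = 0; i < chars.length; i++) {
            chars[i] = (char) (((bytes[2 * i] & 0xFF) << 8) | (bytes[2 * i + 1] & 0xFF));
        }
        return new String(chars);
    }

    public static void main(String[] args) {
        // Arbitrary sample containing chars above 255.
        String s = "a" + (char) 21857 + (char) 43617 + (char) 65377;
        byte[] manual = encode(s);
        byte[] builtIn = s.getBytes(StandardCharsets.UTF_16BE);
        System.out.println("manual round trip equal: " + s.equals(decode(manual)));
        System.out.println("same bytes as UTF-16BE:  " + Arrays.equals(manual, builtIn));
    }
}
```

If the two byte arrays match, as they should for any string without unpaired surrogates, the manual loop is doing the same job as a built-in encoding that was simply never selected in the timing runs.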

So, taking the deprecated code and the fact that both of the other algorithms potentially lose data, into consideration, it looks like my way of converting <String>s into <byte> arrays is probably the best of the three ways.

What do you guys think?

Kevin Simonson
Winston Gutkowski
Bartender

Joined: Mar 17, 2011
Posts: 7549
    

Kevin Simonson wrote:So, taking the deprecated code and the fact that both of the other algorithms potentially lose data, into consideration, it looks like my way of converting <String>s into <byte> arrays is probably the best of the three ways.

What do you guys think?

Lots. I hope you don't mind.

1. Your method only does a default String-->byte[] conversion; String's methods do all sorts.
2. It seems like re-inventing the wheel for an extraordinarily small difference in time (I did read it right - 1.1 seconds overall for 364,000 lines).
3. I hope you've tested your code exhaustively so that you know what happens in all corner cases (I'm pretty sure Sun did). Just one I can think of is how it deals with surrogate pairs, should you ever be unfortunate enough to encounter them.
4. What could be easier than s.getBytes() or String s = new String(bytes)?
5. I refer you to my quote.
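
As a small illustration of the surrogate-pair corner case in point 3, the sketch below builds a string holding one character above U+FFFF and shows that it occupies two chars. The code point U+1F600 and the class name are arbitrary choices for the example, not something from the thread.

```java
import java.nio.charset.StandardCharsets;

// Sketch only: names and the sample code point are hypothetical.
class SurrogatePairDemo {
    // One character above U+FFFF, which Java stores as a surrogate pair (two chars).
    static String supplementary() {
        return new String(Character.toChars(0x1F600));
    }

    public static void main(String[] args) {
        String s = supplementary();
        System.out.println("length() in chars:  " + s.length());
        System.out.println("codePointCount:     " + s.codePointCount(0, s.length()));
        System.out.println("high surrogate at 0: " + Character.isHighSurrogate(s.charAt(0)));
        System.out.println("low surrogate at 1:  " + Character.isLowSurrogate(s.charAt(1)));

        // The whole pair encodes and decodes cleanly.
        String whole = new String(s.getBytes(StandardCharsets.UTF_8), StandardCharsets.UTF_8);
        System.out.println("pair round trip equal: " + s.equals(whole));

        // Half a pair is not a valid character, so an encoding round trip cannot restore it.
        String half = s.substring(0, 1);
        String halfBack = new String(half.getBytes(StandardCharsets.UTF_8), StandardCharsets.UTF_8);
        System.out.println("half round trip equal: " + half.equals(halfBack));
    }
}
```

Any conversion that handles a string one char at a time has to keep the two halves of such a pair together, which is one of the cases the library methods already take care of.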

You did ask ;-)

Winston


Isn't it funny how there's always time and money enough to do it WRONG?
Kevin Simonson
Ranch Hand

Joined: Oct 22, 2011
Posts: 103
Winston Gutkowski wrote:Lots. I hope you don't mind.

I don't mind at all. I enjoy discussions like this!

Winston Gutkowski wrote:1. Your method only does a default String-->byte[] conversion; String's methods do all sorts.
2. It seems like re-inventing the wheel for an extraordinarily small difference in time (I did read it right - 1.1 seconds overall for 364,000 lines).
3. I hope you've tested your code exhaustively so that you know what happens in all corner cases (I'm pretty sure Sun did).

That's part of the problem; I tested both my code and the code that uses Sun's constructors. My test code included:

When I compile and run it I get:

C:\Users\kvnsmnsn\K\Java\Misc>javac TsCheck.java
Note: TsCheck.java uses or overrides a deprecated API.
Note: Recompile with -Xlint:deprecation for details.

C:\Users\kvnsmnsn\K\Java\Misc>java TsCheck
problem.equals( result_l9) == false.
problem.equals( result_vl) == false.

buffer:
0: 97
1: 97
2: 97
3: 97
4: -1
5: -1
6: -1
7: -1
8: -1
vlBuffer:
0: 97
1: 63
2: 63
3: 63

C:\Users\kvnsmnsn\K\Java\Misc>

The deprecated method the compiler is complaining about is <getBytes( int, int, byte[], int)>. When I convert from a <String> to a <byte> array using that deprecated method, and then convert back to a <String> again using the <String( byte[], int, int)> constructor, I get a <String> object whose <equals()> indicates it doesn't equal the original <String>. And when I convert from a <String> to a <byte[]> using method <getBytes()>, and then convert back to a <String> again using the <String( byte[])> constructor, once again I get a <String> object whose <equals()> method indicates it doesn't equal the original <String>.

Which isn't too surprising, since the rest of what is printed out shows that although the <new byte[ 9]> statement created an array big enough to store all 64 bits of string <problem>, only four elements (that's 32 bits) of array <buffer> got used. Elements 4 through 8 of array <buffer> are each still set to the -1 values I originally set them to. So data is potentially lost with a <getBytes( int, int, byte[], int)> call any time, and it certainly was lost when dealing with string <problem>.
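
The test program is not preserved here, so this is a reconstruction under assumptions: it guesses that <problem> was an "a" followed by three chars with large values, and it substitutes an explicit ISO-8859-1 charset for the platform default that the no-argument <getBytes()> would have used. With those assumptions it should print byte values matching the pattern in the output above.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

// Sketch only: the string and class name are assumptions, not the original test code.
class LossyConversions {
    // Deprecated String.getBytes(int, int, byte[], int): copies only the
    // low-order 8 bits of each char, so one byte per char and the rest is dropped.
    @SuppressWarnings("deprecation")
    static byte[] lowBytes(String s) {
        byte[] buffer = new byte[s.length()];
        s.getBytes(0, s.length(), buffer, 0);
        return buffer;
    }

    // Charset-based encoding: chars that ISO-8859-1 cannot represent become '?' (63).
    static byte[] latin1(String s) {
        return s.getBytes(StandardCharsets.ISO_8859_1);
    }

    public static void main(String[] args) {
        // Assumed test string: each of the three large chars has 0x61 (97) as its low byte.
        String problem = "a" + (char) 21857 + (char) 43617 + (char) 65377;
        System.out.println("low bytes:  " + Arrays.toString(lowBytes(problem)));
        System.out.println("ISO-8859-1: " + Arrays.toString(latin1(problem)));
    }
}
```

Both conversions produce one byte per char, so neither can be reversed for chars above 255. That is a property of the chosen conversion rather than a fault in the methods.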

Winston, you mentioned that I only did a couple of default <String> to <byte> conversions, and mentioned that "String's methods do all sorts." Can you tell me whether there are any methods that would convert <problem> above to a <byte> array and back to a <String>, resulting in a <String> object that I could use on the original value and get a <true> result from <equals()>? If there are some, could you tell me what some of them are?

Winston Gutkowski wrote:4. What could be easier than s.getBytes() or String s = new String(bytes)?

Ease is nice, but correctness is pretty important too, and is probably more important than ease. I'm looking for a way to convert <String>s to <byte> arrays in such a way that I can convert them back to <String>s and get the same values.

Winston Gutkowski wrote:5. I refer you to my quote.

I couldn't find your quote. What quote were you referring to?

Kevin Simonson
Jeff Verdegan
Bartender

Joined: Jan 03, 2004
Posts: 6109
    

Kevin Simonson wrote:
Ease is nice, but correctness is pretty important too, and is probably more important than ease. I'm looking for a way to convert <String>s to <byte> arrays in such a way that I can convert them back to <String>s and get the same values.


If I'm understanding your questions and issues correctly, String's methods do what you want, and the ability to do that conversion correctly for different encodings is part of where their "slowness" comes from.
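
A minimal sketch of such a round trip, assuming UTF-8 is an acceptable encoding for the purpose. The sample string and the class name are illustrative only. The point is that the same charset is named explicitly in both directions instead of relying on the platform default.

```java
import java.nio.charset.StandardCharsets;

// Sketch only: names and the sample string are hypothetical.
class Utf8RoundTrip {
    // Encode with an explicit charset that can represent every Unicode character.
    static byte[] toBytes(String s) {
        return s.getBytes(StandardCharsets.UTF_8);
    }

    // Decode with the same charset that was used to encode.
    static String fromBytes(byte[] bytes) {
        return new String(bytes, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        String problem = "a" + (char) 21857 + (char) 43617 + (char) 65377;
        byte[] bytes = toBytes(problem);
        String restored = fromBytes(bytes);
        System.out.println("bytes used: " + bytes.length);
        System.out.println("equal after round trip: " + problem.equals(restored));
    }
}
```

UTF-8 uses a variable number of bytes per character, so the byte array is not the same length as the String, but nothing is lost for well-formed text.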

I share Winston's opinion that the difference over the size of the input seems pretty small. Especially if you're reading this stuff from a file or a socket, the conversion times ought to be small compared to the I/O cost. Is that difference actually meaningful in terms of hard requirements that you have? Or are you just trying to make it "as fast as possible"?

If the former, then you may want to investigate avoiding Strings altogether and just starting with bytes in the first place. Not having direct access to String's internal data, i.e., the backing char[], you can't really get much faster than what it decides to give you.

If the latter, then I would go with what String offers, since it's easy, highly likely to be correct, and, in the absence of hard numerical requirements, can be deemed "fast enough."
Winston Gutkowski
Bartender

Joined: Mar 17, 2011
Posts: 7549
    

Kevin Simonson wrote:I couldn't find your quote. What quote were you referring to?

The one at the bottom of my posts. This one included.

Winston
Paul Clapham
Bartender

Joined: Oct 14, 2005
Posts: 18541
    

Kevin Simonson wrote:
Paul Clapham wrote:Sure, both of those things exist. And they are right there in the documentation for java.lang.String. You can't miss them.

As I stated in my original post, I was primarily interested in a quick way to transform a <String> argument into an array of <byte>s. I followed the link you gave me, Paul, and tried timing it on transforming <String> objects into <char> arrays, and then back again, both with the <getBytes( int, int, byte[], int)> method one way and the <String( byte[], int, int)> constructor the other way, and with the <getBytes()> method (that returns an object of type <byte[]>) the first way, and with the <String( byte[])> constructor the other way....
What do you guys think?



Part of the problem is that you apparently don't understand the difference between a char and a byte. Contrary to what you say, you didn't transform a String to a char array at any point in any code you posted. That would involve using the toCharArray() method on a string. Your code only converts to and from byte arrays.

A char in Java is a 16-bit UTF-16 code unit. For most text one char corresponds to one Unicode character (a code point in the Unicode character set, which is supposed to represent the smallest possible unit of text), although characters outside the Basic Multilingual Plane take two chars. A byte, by contrast, is just 8 bits and can represent anything that fits. It's possible to convert a string of text to an array of bytes; this is referred to as "encoding". There are about 100 different encodings which can be used to do that. Many of them convert one char to one byte, at the expense of ignoring the majority of Unicode. These are commonly (and loosely) referred to as "ASCII" encodings. Others convert one char to one or more bytes depending on the value of the char; UTF-8 is an example of that. Each Java environment has a default encoding, which you used by not specifying any other encoding. You're using Windows so it's one of those ASCII encodings, but if you were on some other machine the default might be UTF-8.
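
To make the difference between encodings concrete, here is a small sketch that encodes one string with several standard charsets and reports how many bytes each produces. The sample text and class name are made up for illustration. The last line prints whatever default the local Java environment happens to use.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Sketch only: names and the sample text are hypothetical.
class EncodingSizes {
    // Number of bytes the given charset needs for the given string.
    static int encodedLength(String s, Charset cs) {
        return s.getBytes(cs).length;
    }

    public static void main(String[] args) {
        // 8 chars: "na", i-diaeresis (U+00EF), "ve ", euro sign (U+20AC), "5".
        String text = "na\u00EFve \u20AC5";
        System.out.println("chars:      " + text.length());
        // One byte per char; anything outside the charset is replaced with '?'.
        System.out.println("US-ASCII:   " + encodedLength(text, StandardCharsets.US_ASCII));
        System.out.println("ISO-8859-1: " + encodedLength(text, StandardCharsets.ISO_8859_1));
        // One to four bytes per character, depending on its value.
        System.out.println("UTF-8:      " + encodedLength(text, StandardCharsets.UTF_8));
        // Two bytes per char.
        System.out.println("UTF-16BE:   " + encodedLength(text, StandardCharsets.UTF_16BE));
        // The encoding used when none is specified varies by platform and Java version.
        System.out.println("default charset here: " + Charset.defaultCharset());
    }
}
```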

Your code, on the other hand, ignores all of that. And it is doing binary manipulations on the bytes, which suggests to me that they aren't meant to be text. In which case they shouldn't have been put into a String in the first place; they should have been put into a byte array. Which would have made this thread entirely unnecessary.
Kevin Simonson
Ranch Hand

Joined: Oct 22, 2011
Posts: 103
I don't know if it's proper etiquette to quote myself, but in this case I didn't know what else to do.
Kevin Simonson wrote:That's part of the problem; I tested both my code and the code that uses Sun's constructors. My test code included:

When I compile and run it I get:

C:\Users\kvnsmnsn\K\Java\Misc>javac TsCheck.java
Note: TsCheck.java uses or overrides a deprecated API.
Note: Recompile with -Xlint:deprecation for details.

C:\Users\kvnsmnsn\K\Java\Misc>java TsCheck
problem.equals( result_l9) == false.
problem.equals( result_vl) == false.

buffer:
0: 97
1: 97
2: 97
3: 97
4: -1
5: -1
6: -1
7: -1
8: -1
vlBuffer:
0: 97
1: 63
2: 63
3: 63

Did anybody notice those two lines up there right under the "java TsCheck" command? The result of calling <problem.equals( result_l9)> was <false>! The result of calling <problem.equals( result_vl)> was also <false>! In other words, using the built-in <getBytes()> methods and the built-in constructors, I got back <String>s that contained different <char>s than when I started.

So what good is increased speed of execution or ease of calling methods and constructors when the quickly executing and easily called methods and constructors don't actually do the job?

The actual <String>s I'm going to be using this for are file names and directory names. Now ordinarily my tool is going to be used on machines where file names and directory names are made up of ASCII <char>s, so the built-in methods and constructors would work just fine. But I'd like my tool to be general purpose so that it can work on anybody's file names and directory names. Is it impossible to name a file or directory using one or two <char>s whose actual values get up around <(char) 21857> or <(char) 43617> or <(char) 65377> like I have in my program?

Kevin Simonson
Jeff Verdegan
Bartender

Joined: Jan 03, 2004
Posts: 6109
    

Kevin Simonson wrote:
Did anybody notice those two lines up there right under the "java TsCheck" command? The result of calling <problem.equals( result_l9)> was <false>! The result of calling <problem.equals( result_vl)> was also <false>! In other words, using the built-in <getBytes()> methods and the built-in constructors, I got back <String>s that contained different <char>s than when I started.

So what good is increased speed of execution or ease of calling methods and constructors when the quickly executing and easily called methods and constructors don't actually do the job?


Without having read your code or worked through which values you're dealing with and how I'd expect them to be processed, I'm going to go out on a limb here and say you have one or more misconceptions, about Unicode, or the methods/c'tors in question, or both.

If nobody else provides more detail about where those misconceptions lie (or tells me I'm wrong), I'll be happy to look more closely if you post an SSCCE that just demonstrates the "out and back doesn't give me what I started with" problem you're seeing, with nothing superfluous.

EDIT: And I'd suggest that using a deprecated method is probably a recipe for trouble here.
Paul Clapham
Bartender

Joined: Oct 14, 2005
Posts: 18541
    

Kevin Simonson wrote:The actual <String>s I'm going to be using this for are file names and directory names. Now ordinarily my tool is going to be used on machines where file names and directory names are made up of ASCII <char>s, so the built-in methods and constructors would work just fine. But I'd like my tool to be general purpose so that it can work on anybody's file names and directory names. Is it impossible to name a file or directory using one or two <char>s whose actual values get up around <(char) 21857> or <(char) 43617> or <(char) 65377> like I have in my program?


Ah, I see. Then what's the purpose of all the code which converts the alleged file names to bytes?

If you're working in an operating system which supports Unicode (which is all non-obsolete operating systems which run Java) then sure, you can use any Unicode characters in your file names. You don't have to simulate that in your code, just create files with whatever names you like.

As for your rant about code which doesn't work, it's just that you didn't understand what happens in the conversion between chars and bytes. That process uses an encoding which maps a sequence of 16-bit chars into a sequence of 8-bit bytes. Obviously it isn't possible to convert every 16-bit char to a single byte without loss, so data has to change somehow. Different encodings do that in different ways. You didn't specify any encoding, so the system used the default encoding. On Windows machines with a Western European locale that encoding is typically windows-1252 (a close relative of ISO-8859-1), which maps every character it cannot represent to a question mark. Obviously this is a non-reversible transformation.
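
If silent question-mark substitution is the worry, one option is to ask the charset up front whether it can represent a string, or to encode through a CharsetEncoder that reports unmappable characters instead of replacing them. The following is a sketch under the assumption that failing loudly is preferable to losing data; the helper names and the sample string are hypothetical.

```java
import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.Charset;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

// Sketch only: helper names are hypothetical.
class StrictEncoding {
    // True if every character of s can be represented in the given charset.
    static boolean isLossless(String s, Charset cs) {
        return cs.newEncoder().canEncode(s);
    }

    // Encodes s, throwing instead of substituting '?' when a character does not fit.
    static byte[] encodeStrict(String s, Charset cs) throws CharacterCodingException {
        ByteBuffer encoded = cs.newEncoder()
                .onMalformedInput(CodingErrorAction.REPORT)
                .onUnmappableCharacter(CodingErrorAction.REPORT)
                .encode(CharBuffer.wrap(s));
        byte[] out = new byte[encoded.remaining()];
        encoded.get(out);
        return out;
    }

    public static void main(String[] args) {
        String problem = "a" + (char) 21857 + (char) 43617 + (char) 65377;
        System.out.println("ISO-8859-1 lossless? " + isLossless(problem, StandardCharsets.ISO_8859_1));
        System.out.println("UTF-8 lossless?      " + isLossless(problem, StandardCharsets.UTF_8));
        try {
            encodeStrict(problem, StandardCharsets.ISO_8859_1);
            System.out.println("encoded without loss");
        } catch (CharacterCodingException e) {
            System.out.println("refused to encode: " + e);
        }
    }
}
```

String.getBytes(Charset) always substitutes a replacement byte for characters it cannot map, which is why the strict path goes through CharsetEncoder directly.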

But then all of that is academic, since you shouldn't have to be converting to bytes anyway.
 