
finding byte[] in subset of another byte[]

 
Jon Dornback
Ranch Hand
Posts: 137
Hey Ranchers,
when using the read methods of an InputStream, the data is returned as bytes (either single or in an array). i want to test if the input (or part of the input) is equal to a given string. i can convert the string to an array of bytes, but then how can i compare it to the subset of the input array? i know i could do it by converting the input bytes to a string and comparing to the given string, but this produces a lot of overhead in creating new objects, string compares, etc. so i want to compare the raw bytes. any suggestions?
thanks,
Jon
 
Neil Laurance
Ranch Hand
Posts: 183
I'm sure there is a more efficient, more elegant solution, but something like the following should be a good starting point...
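
A minimal sketch of a naive search along these lines (an illustration only, not necessarily the code originally posted; the class name ByteSearch and the method signature are made up here). It checks every candidate start position in the filled part of the buffer and compares byte by byte:

```java
final class ByteSearch {
    private ByteSearch() {}

    /**
     * Returns the index of the first occurrence of pattern inside
     * data[from, to), or -1 if there is none.
     */
    static int indexOf(byte[] data, int from, int to, byte[] pattern) {
        if (pattern.length == 0) {
            return from <= to ? from : -1;
        }
        // Last start position at which the whole pattern still fits.
        int last = to - pattern.length;
        outer:
        for (int i = from; i <= last; i++) {
            for (int j = 0; j < pattern.length; j++) {
                if (data[i + j] != pattern[j]) {
                    continue outer; // mismatch, try the next start position
                }
            }
            return i; // every pattern byte matched
        }
        return -1;
    }
}
```

After int n = in.read(buf), the call would be ByteSearch.indexOf(buf, 0, n, pattern), so that only the bytes actually read are searched. If you only need to test equality at one known offset, Java 9 and later also offer the range overload Arrays.equals(a, aFrom, aTo, b, bFrom, bTo).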

Cheers, Neil
 
Jon Dornback
Ranch Hand
Posts: 137
thanks - that's the basic algorithm i was thinking of. wasn't sure if there was something already in the APIs that i was missing. i am assuming that is the premise of how the String.indexOf(String) method works, but hopefully faster since it is using a byte array instead of the String object. when i get the chance i will look into the indexOf source and maybe do some time trials of the two methods. thanks for the help!
Jon
 
Jim Yingst
Wanderer
Sheriff
Posts: 18671
Remember that when you use read(byte[]), you won't necessarily read all the bytes you want. The method returns the number of bytes it actually read, which can be fewer than the array length: less data may be available at that moment than you asked for, or there may be other reasons for a delay, such as disk access. So you need to consider what happens if the pattern you're searching for happens to straddle the boundary between two consecutive reads. The simplest solution may be to force your reads to be complete before attempting to find a match. E.g. use a BufferedReader and the readLine() method (if you know the pattern won't have any newlines). Or write everything you read into a ByteArrayOutputStream to collect successive reads into one big array. These techniques involve more overhead in terms of memory and object creation - but they will be simpler, so don't overlook them.
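
A minimal sketch of the ByteArrayOutputStream approach, assuming the whole input fits comfortably in memory (the class name StreamCollector and the 4096-byte buffer size are arbitrary choices for illustration). Each read may return fewer bytes than the buffer holds, so only the first n bytes are copied on each pass:

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;

final class StreamCollector {
    private StreamCollector() {}

    /** Reads the stream to its end and returns everything as one array. */
    static byte[] readFully(InputStream in) throws IOException {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] buf = new byte[4096];
        int n;
        // read() returns -1 at end of stream; otherwise n bytes of buf are valid.
        while ((n = in.read(buf)) != -1) {
            out.write(buf, 0, n);
        }
        return out.toByteArray();
    }
}
```

Once everything is in one array, a pattern can no longer straddle a read boundary, and a plain array search is enough.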
Alternatively you can modify Neil's algorithm to remember state between successive calls, so that if one match attempt finds the first part of a pattern at the end of the input array, the next attempt can check for the remainder of the pattern at the beginning of the input array. One of the hardest parts here may be defining an API for the find() method - what do you return to indicate, first, that there was a partial match at the end of an input array, and second, that the next input array matched the remainder of the pattern? (Or not?) Probably you'll need to make some sort of Matcher object which has enough methods to convey the required information without ambiguity. Check out the java.util.regex Pattern and Matcher classes for ideas - but remember, they still assume that the input to match against is all available at once.
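
One possible shape for such a stateful matcher, as a sketch only. It swaps the naive search for the Knuth-Morris-Pratt algorithm, because KMP's "bytes matched so far" counter is exactly the state that has to survive between reads. The class name StreamingMatcher and its feed() method are hypothetical. To sidestep the return-value question, feed() reports the absolute offset of the first match within the whole stream, or -1 while none has been found:

```java
final class StreamingMatcher {
    private final byte[] pattern;
    private final int[] fail;   // KMP failure table for the pattern
    private int matched;        // pattern bytes matched so far, kept across feed() calls
    private long consumed;      // total bytes seen in earlier feed() calls
    private long found = -1;    // stream offset of the first match, or -1

    StreamingMatcher(byte[] pattern) {
        if (pattern.length == 0) {
            throw new IllegalArgumentException("empty pattern");
        }
        this.pattern = pattern.clone();
        this.fail = new int[pattern.length];
        for (int i = 1, k = 0; i < pattern.length; i++) {
            while (k > 0 && pattern[i] != pattern[k]) k = fail[k - 1];
            if (pattern[i] == pattern[k]) k++;
            fail[i] = k;
        }
    }

    /**
     * Feeds buf[0, len), typically the result of one read(buf) call.
     * Returns the offset of the first match in the whole stream, or -1 so far.
     */
    long feed(byte[] buf, int len) {
        for (int i = 0; i < len && found < 0; i++) {
            byte b = buf[i];
            // On mismatch, fall back to the longest prefix that is still viable.
            while (matched > 0 && b != pattern[matched]) matched = fail[matched - 1];
            if (b == pattern[matched]) matched++;
            if (matched == pattern.length) {
                found = consumed + i + 1 - pattern.length;
            }
        }
        consumed += len;
        return found;
    }
}
```

A pattern split across two reads is then found on the second feed() call, with no need to re-examine the previous buffer. This version stops at the first match; reporting every match would need a richer return type.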
 
Dave Landers
Ranch Hand
Posts: 401
...i want to test if the input (or part of the input) is equal to a given string. i can convert the string to an array of bytes...

Note that you can get yourself in trouble doing String/byte[] conversions. Strings in Java are not arrays of bytes, but arrays of char (Unicode). To go back and forth between bytes and strings requires an encoding. Unless you know what encoding the bytes are in, and specify it, you will get different answers on differently configured machines. Take your code from one machine to another and it may act differently.
If your string (or the file) consists of only US-ASCII 7-bit characters, then you are probably OK. As soon as it goes outside that space, you may have trouble.
Some characters in some encodings are not even reversible (don't go back where they started) - so that the following is not guaranteed to be true:
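
A sketch of the kind of round-trip check meant here (the class RoundTrip and the sample text are made up for illustration). It encodes a string to bytes and decodes it again with the same charset, then tests whether the original text came back:

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

final class RoundTrip {
    private RoundTrip() {}

    /** True if s survives String -> byte[] -> String in the given charset. */
    static boolean roundTrips(String s, Charset cs) {
        byte[] encoded = s.getBytes(cs);
        return new String(encoded, cs).equals(s);
    }

    public static void main(String[] args) {
        String text = "caf\u00e9"; // "café", written with an escape to avoid source-encoding issues
        // US-ASCII cannot represent U+00E9, so getBytes substitutes '?'.
        System.out.println(roundTrips(text, StandardCharsets.US_ASCII)); // false
        System.out.println(roundTrips(text, StandardCharsets.UTF_8));    // true
    }
}
```

With the no-argument getBytes() and new String(byte[]), the charset is whatever the platform default happens to be, which is why the result can differ from machine to machine.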

If you really are trying to compare bytes and Strings, just make sure you specify the character encoding, for example using String.getBytes(encoding) rather than String.getBytes().
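
For example, a small sketch that compares a region of the input against a string using an explicitly named charset (the class EncodedCompare and its method are hypothetical, and UTF-8 is only an assumption about what the input contains):

```java
import java.nio.charset.StandardCharsets;

final class EncodedCompare {
    private EncodedCompare() {}

    /** True if the bytes starting at data[offset] equal the UTF-8 encoding of text. */
    static boolean regionEquals(byte[] data, int offset, String text) {
        // The charset is named explicitly, so the result does not depend
        // on the platform default encoding.
        byte[] expected = text.getBytes(StandardCharsets.UTF_8);
        if (offset < 0 || offset + expected.length > data.length) {
            return false;
        }
        for (int i = 0; i < expected.length; i++) {
            if (data[offset + i] != expected[i]) {
                return false;
            }
        }
        return true;
    }
}
```

In a loop you would encode the string once up front and reuse the resulting byte[] rather than calling getBytes on every comparison.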
If the file you are reading is actually text/character data, you probably should be using a Reader (either specifying an encoding or assuming platform-local encoding).
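
A sketch of the Reader route with an explicit encoding (the class LineSearch and its method are hypothetical, and it assumes the text being searched for contains no line breaks):

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.nio.charset.Charset;

final class LineSearch {
    private LineSearch() {}

    /** True if any line of the decoded stream contains text. */
    static boolean anyLineContains(InputStream in, Charset cs, String text) throws IOException {
        // InputStreamReader does the byte-to-char decoding with the given charset.
        BufferedReader reader = new BufferedReader(new InputStreamReader(in, cs));
        String line;
        while ((line = reader.readLine()) != null) {
            if (line.contains(text)) {
                return true;
            }
        }
        return false;
    }
}
```

This brings back the String overhead mentioned at the top of the thread, but the comparison is done on characters, so multi-byte encodings are handled correctly.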
... just in case you wanted to know...
 