finding byte[] in subset of another byte[]

Jon Dornback
Ranch Hand

Joined: Apr 24, 2002
Posts: 137
Hey Ranchers,
When using the read methods of an InputStream, the data is returned as bytes (either a single byte or an array). I want to test whether the input (or part of the input) is equal to a given string. I can convert the string to an array of bytes, but then how can I compare it to a subset of the input array? I know I could do it by converting the input bytes to a String and comparing that to the given string, but this produces a lot of overhead in creating new objects, string compares, etc., so I want to compare the raw bytes. Any suggestions?

Neil Laurance
Ranch Hand

Joined: Jul 18, 2002
Posts: 183
I'm sure there is a more efficient and elegant solution, but something like the following should be a good starting point...
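The code originally posted at this point did not survive in this copy of the thread. As a stand-in, here is a minimal sketch of a naive search for one byte[] inside a sub-range of another. The class name ByteSearch and the indexOf signature are assumptions made for illustration, not the code that was actually posted.

```java
// Hypothetical helper: naive search for a byte pattern inside a
// sub-range of a larger array. Not the code from the original post.
final class ByteSearch {
    private ByteSearch() {}

    /**
     * Returns the index of the first occurrence of pattern that lies
     * entirely within data[from, to), or -1 if there is none.
     */
    static int indexOf(byte[] data, int from, int to, byte[] pattern) {
        if (from < 0 || to > data.length || from > to) {
            throw new IndexOutOfBoundsException("from=" + from + ", to=" + to);
        }
        int last = to - pattern.length; // last possible start position
        outer:
        for (int i = from; i <= last; i++) {
            for (int j = 0; j < pattern.length; j++) {
                if (data[i + j] != pattern[j]) {
                    continue outer; // mismatch, try the next start position
                }
            }
            return i; // every pattern byte matched
        }
        return -1;
    }
}
```

The to argument would normally be the count returned by read(byte[]), so that stale bytes left at the end of the buffer are never searched.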

Cheers, Neil
Jon Dornback
Ranch Hand

Joined: Apr 24, 2002
Posts: 137
Thanks - that's the basic algorithm I was thinking of. I wasn't sure if there was something already in the APIs that I was missing. I am assuming that is the premise of how the String.indexOf(String) method works, but hopefully this is faster since it uses a byte array instead of a String object. When I get the chance I will look into the indexOf source and maybe do some time trials of the two methods. Thanks for the help!
Jim Yingst

Joined: Jan 30, 2000
Posts: 18671
Remember that when you use read(byte[]), you won't necessarily read all the bytes you want to; the method returns the number of bytes it actually read, which can be less than the length of the array. The buffer may be full, or there may be other reasons for a delay, such as disk access. So you need to consider what happens if the pattern you're searching for happens to straddle the boundary between two consecutive reads. The simplest solution may be to force your reads to be complete before attempting to find a match. E.g. use a BufferedReader and the readLine() method (if you know the pattern won't have any newlines). Or write everything you read into a ByteArrayOutputStream to collect successive reads into one big array. These techniques involve more overhead in terms of memory and object creation - but they will be simpler, so don't overlook them.
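A rough sketch of the ByteArrayOutputStream approach might look like the following. The class name StreamSlurper, the method name readFully and the 4096-byte chunk size are arbitrary choices for illustration.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;

// Hypothetical helper: collects successive partial reads into one array.
final class StreamSlurper {
    private StreamSlurper() {}

    static byte[] readFully(InputStream in) throws IOException {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] chunk = new byte[4096];
        int n;
        // read() may return fewer bytes than chunk.length; -1 means end of stream
        while ((n = in.read(chunk)) != -1) {
            out.write(chunk, 0, n); // copy only the bytes actually read
        }
        return out.toByteArray();
    }
}
```

Once the whole input sits in one array, a pattern can never straddle a read boundary, at the cost of holding everything in memory.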
Alternately you can modify Neil's algorithm to remember state between successive calls, so that if one match attempt finds the first part of a pattern at the end of the input array, the next attempt can check for the remainder of the pattern at the beginning of the input array. One of the hardest parts here may be defining an API for the find() method - what do you return to indicate, first, that there was a partial match at the end of an input array, and second, that the next input array matched the remainder of the pattern? (Or not?) Probably you'll need to make some sort of Matcher object which has enough methods to convey the required information without ambiguity. Check out the java.util.regex Pattern and Matcher classes for ideas - but remember, they still assume that the input to match against is all available at once.
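One possible shape for such a stateful matcher is sketched below. Note that it swaps in a Knuth-Morris-Pratt style failure table rather than extending the naive search, because that makes it straightforward to carry a partial match from one chunk to the next. The names StreamingByteMatcher, feed and pending are hypothetical, and the sketch only reports the first match.

```java
// Hypothetical matcher that remembers a partial match between calls.
final class StreamingByteMatcher {
    private final byte[] pattern;
    private final int[] failure; // KMP failure table for the pattern
    private int matched;         // pattern bytes matched so far (kept between calls)
    private long consumed;       // total bytes examined so far

    StreamingByteMatcher(byte[] pattern) {
        if (pattern.length == 0) {
            throw new IllegalArgumentException("empty pattern");
        }
        this.pattern = pattern.clone();
        this.failure = new int[pattern.length];
        for (int i = 1, k = 0; i < pattern.length; i++) {
            while (k > 0 && pattern[i] != pattern[k]) k = failure[k - 1];
            if (pattern[i] == pattern[k]) k++;
            failure[i] = k;
        }
    }

    /**
     * Examines buf[off, off + len). Returns the offset, counted from the
     * start of the whole stream, of the first match completed in this
     * chunk, or -1. Scanning stops at the first match, so any bytes after
     * it in this chunk are not examined.
     */
    long feed(byte[] buf, int off, int len) {
        for (int i = off; i < off + len; i++) {
            byte b = buf[i];
            while (matched > 0 && b != pattern[matched]) matched = failure[matched - 1];
            if (b == pattern[matched]) matched++;
            consumed++;
            if (matched == pattern.length) {
                matched = failure[matched - 1];
                return consumed - pattern.length;
            }
        }
        return -1;
    }

    /** Number of pattern bytes currently matched at the end of the input seen so far. */
    int pending() {
        return matched;
    }
}
```

A caller would pass each buffer together with the count returned by read(byte[]). A return of -1 with pending() greater than zero corresponds to the "partial match at the end of an input array" case described above.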

Dave Landers
Ranch Hand

Joined: Jul 24, 2002
Posts: 401
...i want to test if the input (or part of the input) is equal to a given string. i can convert the string to an array of bytes...

Note that you can get yourself in trouble doing String/byte[] conversions. Strings in Java are not arrays of bytes, but arrays of char (Unicode). To go back and forth between bytes and strings requires an encoding. Unless you know what encoding the bytes are in, and specify it, you will get different answers on differently configured machines. Take your code from one machine to another and it may act differently.
If your string (or the file) consists of only US-ASCII 7-bit characters, then you are probably OK. As soon as it goes outside that space, you may have trouble.
Some characters in some encodings are not even reversible (don't go back where they started) - so that the following is not guaranteed to be true:
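The snippet originally posted at this point is missing from this copy of the thread. A hedged sketch of the kind of round trip meant (bytes to String and back to bytes) could look like this. The class and method names are made up for illustration, and the charset is passed explicitly so the result does not depend on the machine's default encoding.

```java
import java.nio.charset.Charset;
import java.util.Arrays;

// Hypothetical demo: does bytes -> String -> bytes give back the same bytes?
final class RoundTripDemo {
    private RoundTripDemo() {}

    static boolean survivesRoundTrip(byte[] original, Charset cs) {
        // Malformed input is replaced during decoding, so the bytes
        // that come back out can differ from the bytes that went in.
        byte[] back = new String(original, cs).getBytes(cs);
        return Arrays.equals(original, back);
    }
}
```

For instance, the single byte 0xFF is not valid UTF-8, so it decodes to the replacement character and does not come back as 0xFF. The same byte does survive a round trip through ISO-8859-1, which maps every byte value to a character.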

If you really are trying to compare bytes and Strings, just make sure you specify the character encoding, for example using String.getBytes(encoding) rather than String.getBytes().
If the file you are reading is actually text/character data, you probably should be using a Reader (either specifying an encoding or assuming platform-local encoding).
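As a small illustration of both suggestions, the sketch below converts a search string to bytes with an explicit charset and wraps an InputStream in a Reader with an explicit charset. UTF-8 is only an assumed example encoding, and the class and method names are hypothetical. StandardCharsets requires Java 7 or later; on older versions the encoding can be given by name, as in getBytes("UTF-8").

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;

// Hypothetical helpers showing an explicitly specified encoding.
final class ExplicitEncoding {
    private ExplicitEncoding() {}

    /** Encodes the search string the same way on every machine. */
    static byte[] patternBytes(String s) {
        return s.getBytes(StandardCharsets.UTF_8);
    }

    /** Reads the first line of character data, decoding it as UTF-8. */
    static String firstLine(InputStream in) throws IOException {
        BufferedReader reader =
                new BufferedReader(new InputStreamReader(in, StandardCharsets.UTF_8));
        return reader.readLine();
    }
}
```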
... just in case you wanted to know...