aspose file tools*
The moose likes I/O and Streams and the fly likes finding a String in an InputStream Big Moose Saloon
  Search | Java FAQ | Recent Topics | Flagged Topics | Hot Topics | Zero Replies
Register / Login
JavaRanch » Java Forums » Java » I/O and Streams
Bookmark "finding a String in an InputStream" Watch "finding a String in an InputStream" New topic
Author

finding a String in an InputStream

Ulrich Heeger
Ranch Hand

Joined: Jun 06, 2003
Posts: 266
Hi everyone,
I have an InputStream containing both binary and ASCII datas.
I would like to scan through an InputStream to find if it contains a certain String. I was thinking about to convert the String to a byte-Array and to scan through the byte-Array gained of the InputStream in order to replace the matching bytes with a new String. I can't transform the InputStream in a String because in this case the binary datas may be corrupted by converting them back in bytes.
Does somebody know a Utility class /methods or an alternative for these purposes?

Thanks a lot in advance & greetings,
Ulrich
Stan James
(instanceof Sidekick)
Ranch Hand

Joined: Jan 29, 2003
Posts: 8791
Look into FilteredInputStream as a place to put this code.

What do you want to do when you find the string? Is it ok to raise the "found it!" flag after it has already gone by, or do you need to capture it, replace it or remove it or something? (By "raise the flag" I mean take some action ... call a method or whatever.)

This would raise the flag just as the last byte of the search target comes through:

This is going to be really clumsy in pseudo code but it would remove the search target from the stream:

If you have really massive input and a fairly long search target, look up the Boyer Moore search algorithm. It's way too complex to describe here but it's orders of magnitude faster than brute force. The "partial match" test is very tough on some patterns, and Boyer Moore handles it brilliantly.

Any of that sound useful?


A good question is never answered. It is not a bolt to be tightened into place but a seed to be planted and to bear more seed toward the hope of greening the landscape of the idea. John Ciardi
Ulrich Heeger
Ranch Hand

Joined: Jun 06, 2003
Posts: 266
Hi Stan,
thanks a lot for your help.
Your pseudocode sounds really reasonnable to me and I will try the way you suggested. Also thanks for your advice concerning Boyer Moore search algorithm, I will have a look on it.
I wish you a nice week-end,
Ulrich
Stan James
(instanceof Sidekick)
Ranch Hand

Joined: Jan 29, 2003
Posts: 8791
Let us know what you wind up with. This sounds like an interesting problem. In fact, it sounds like one we had recently with invalid characters in an XML stream. We decided an exception was just what the XML producer deserved and did nothing. Ho ho ho.
Ulrich Heeger
Ranch Hand

Joined: Jun 06, 2003
Posts: 266
Hi Stan,
today i have created a new method with following pseudo-code:

That works fine. Thanks a lot. But now I have a new problem , see my new thread. If you have time & patience I would be very thankful if you could help me.

Regards,
ulrich
Stan James
(instanceof Sidekick)
Ranch Hand

Joined: Jan 29, 2003
Posts: 8791
I'd be concerned about the action you're taking based on matching the first letter of the pattern. For instance

Pattern: BABANANA
Text: BABABANANABA

You will find the matching "B" at 1 and read 7 more characters, find the buffer doesn't match and start looking again at 9. You would miss the match starting at 3!

This is precisely what the BM algorithm does so well. It computes the next place you should look very efficiently. If your pattern is longish, it can skip lots of positions making it very fast.

The "brute force" method which is easy enough for me to understand never skips anything. When I started at 1 and didn't match, I'd start next time at 2. Back in one of my examples I said write the buffer (or some complex subset) to hint at this kind of trouble. The simple subset is one character.

I also mentioned circular buffers. Is that a familiar concept? With an 8 character pattern like BABANANA it would let me always look at the next 8 characters.
 
I agree. Here's the link: http://aspose.com/file-tools
 
subject: finding a String in an InputStream