How to search a pattern in a large file using java

palanisamy subramani
Greenhorn

Joined: Aug 30, 2010
Posts: 29
I tried searching for a pattern in a small file and got results, but when I search a 1GB file I get a heap error.

What is the best way to search for a pattern in big files using Java?

Thanks
Rob Spoor
Sheriff

Joined: Oct 27, 2005
Posts: 19784
    
  20

If the pattern only appears on separate lines you shouldn't store each line, but only the current one. In pseudo code:
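A minimal Java sketch of that idea, assuming a UTF-8 text file and a regular expression that never spans a line break. The class and method names (`LineSearch`, `countMatchingLines`) are made up for illustration:

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.regex.Pattern;

class LineSearch {
    // Counts the lines containing a match. Only the current line is held in
    // memory, so heap usage does not grow with the size of the file.
    static long countMatchingLines(Path file, Pattern pattern) throws IOException {
        long count = 0;
        try (BufferedReader reader = Files.newBufferedReader(file, StandardCharsets.UTF_8)) {
            String line;
            while ((line = reader.readLine()) != null) {
                if (pattern.matcher(line).find()) {
                    count++;
                }
            }
        }
        return count;
    }
}
```

It could be called as `LineSearch.countMatchingLines(Paths.get("big.log"), Pattern.compile("ERROR"))`, where `big.log` is a placeholder file name.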


palanisamy subramani
Greenhorn

Joined: Aug 30, 2010
Posts: 29
Thanks Rob for your quick reply.
Here the pattern spans more than one line, so I cannot go line by line.

To search for the pattern (more than one line) I have to load the entire file into memory and search it, and that is what causes the heap memory error.

What would be the best way to search for the pattern in this scenario?

Thanks
John de Michele
Rancher

Joined: Mar 09, 2009
Posts: 600
Palanisamy:

That is almost certainly the wrong way to go about it. What precisely are you searching for, and what have you tried so far?

John.
Winston Gutkowski
Bartender

Joined: Mar 17, 2011
Posts: 8419
    
  23

palanisamy subramani wrote: Thanks Rob for your quick reply.
Here the pattern spans more than one line, so I cannot go line by line.

To search for the pattern (more than one line) I have to load the entire file into memory and search it

Like John, I suspect that's not correct.

Are you saying that the pattern can be 1GB long? It seems unlikely.

So the usual solution is to read in as much as you need to guarantee a match.
Alternatively, break up the pattern into logical pieces that can be searched for procedurally.
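
One way to picture "read in as much as you need" is a chunked read that carries a small tail over from one chunk to the next. The sketch below is only an illustration under some assumptions: the pattern is a fixed multi-line string rather than a regex, its line endings match those in the file, and the names `ChunkedSearch`, `findAll` and `chunkSize` are invented here. It reports non-overlapping occurrences as character offsets (not byte offsets).

```java
import java.io.IOException;
import java.io.Reader;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;

class ChunkedSearch {
    // Returns the character offsets of all non-overlapping occurrences of needle.
    // At most chunkSize + needle.length() - 1 characters are buffered at a time.
    static List<Long> findAll(Path file, String needle, int chunkSize) throws IOException {
        if (needle.isEmpty() || chunkSize <= 0) {
            throw new IllegalArgumentException("needle must be non-empty and chunkSize positive");
        }
        List<Long> hits = new ArrayList<>();
        StringBuilder window = new StringBuilder();
        long windowStart = 0; // absolute offset of window.charAt(0)
        char[] buf = new char[chunkSize];
        try (Reader reader = Files.newBufferedReader(file, StandardCharsets.UTF_8)) {
            int n;
            while ((n = reader.read(buf)) != -1) {
                window.append(buf, 0, n);
                int from = 0;
                int idx;
                while ((idx = window.indexOf(needle, from)) >= 0) {
                    hits.add(windowStart + idx);
                    from = idx + needle.length();
                }
                // Keep only the tail that could still be the start of a match
                // completed by the next chunk, and never re-scan a reported match.
                int keepFrom = Math.max(from, window.length() - (needle.length() - 1));
                window.delete(0, keepFrom);
                windowStart += keepFrom;
            }
        }
        return hits;
    }
}
```

Because the carried-over tail is one character shorter than the pattern, a match split across two chunks is still found, and the buffer stays small no matter how large the file is.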


Winston
Rob Spoor
Sheriff

Joined: Oct 27, 2005
Posts: 19784
    
  20

I am right now going through a 3GB log file, matching a specific pattern on each line, and moving that line to a file depending on the value of that pattern. (To be more precise, I'm splitting a single 3GB Apache HTTPD log file of several months into one log file per day.) No problem with that.
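
For illustration only, a per-day split along those lines might look roughly like this. It is not the actual program described above: the timestamp shape `[10/Oct/2000:13:55:36 -0700]`, the `access-YYYY-Mon-DD.log` output names, and the `LogSplitter` class are all assumptions.

```java
import java.io.BufferedReader;
import java.io.BufferedWriter;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.HashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class LogSplitter {
    // Assumed timestamp shape: [10/Oct/2000:13:55:36 -0700]; captures day, month, year.
    private static final Pattern DAY = Pattern.compile("\\[(\\d{2})/(\\w{3})/(\\d{4}):");

    // Copies each line of log into outDir/access-<year>-<month>-<day>.log.
    // One writer per day stays open until the end, which is fine for a few
    // months of logs but would need a cap for many thousands of days.
    static void splitByDay(Path log, Path outDir) throws IOException {
        Map<String, BufferedWriter> writers = new HashMap<>();
        try (BufferedReader reader = Files.newBufferedReader(log, StandardCharsets.UTF_8)) {
            String line;
            while ((line = reader.readLine()) != null) {
                Matcher m = DAY.matcher(line);
                if (!m.find()) {
                    continue; // skip lines without a recognisable timestamp
                }
                String name = "access-" + m.group(3) + "-" + m.group(2) + "-" + m.group(1) + ".log";
                BufferedWriter w = writers.get(name);
                if (w == null) {
                    w = Files.newBufferedWriter(outDir.resolve(name), StandardCharsets.UTF_8);
                    writers.put(name, w);
                }
                w.write(line);
                w.newLine();
            }
        } finally {
            for (BufferedWriter w : writers.values()) {
                w.close();
            }
        }
    }
}
```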
palanisamy subramani
Greenhorn

Joined: Aug 30, 2010
Posts: 29
John,

I have tried it with files smaller than 1MB and got results. When I go to a 1GB file, I get a heap error.
If this is the wrong way, then what is the best way to do it?

Thanks
palanisamy subramani
Greenhorn

Joined: Aug 30, 2010
Posts: 29
Giving more information on this:

Notes:
A single day's file may end up larger than 1GB, with log data and XML data inside.
The pattern is about 5 lines of XML. That pattern may repeat many times.

Summarising the options you have suggested:
1) Split the file into small files and read from those. -- A multiline pattern may be split across two files, so a match may be missed.
2) Split the pattern into line-by-line patterns. -- Complex logic is required to filter the pattern.


All your comments are appreciated.

Mike Simmons
Ranch Hand

Joined: Mar 05, 2008
Posts: 3018
    
  10
[composed without seeing the last comment above]

Can you tell us about the pattern? What is it? Can you identify an initial part of the pattern that only takes up one line, and search for that first? Is there any size limit for how much text can be inside the pattern?
Mike Simmons
Ranch Hand

Joined: Mar 05, 2008
Posts: 3018
    
  10
palanisamy subramani wrote:Pattern is like 5 lines of XML .That pattern may repeat many times.

Is the whole file XML? It may well be easier to use an XML parser, one that doesn't load the whole DOM into memory. In the early days of Java XML processing, that would have meant using a SAX parser; I'm not sure what the best choices are now.
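
As a rough illustration of a streaming parser, here is a sketch using StAX (`javax.xml.stream`), which ships with the JDK and reads a document one event at a time. It only applies if the whole file is well-formed XML, which is an assumption; the `ElementCounter` class and the idea of counting start tags by name are illustrative, not something established in this thread.

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;

class ElementCounter {
    // Counts start tags with the given local name without building a DOM.
    static long countElements(Path xmlFile, String localName)
            throws IOException, XMLStreamException {
        XMLInputFactory factory = XMLInputFactory.newInstance();
        long count = 0;
        try (InputStream in = Files.newInputStream(xmlFile)) {
            XMLStreamReader reader = factory.createXMLStreamReader(in);
            try {
                while (reader.hasNext()) {
                    if (reader.next() == XMLStreamConstants.START_ELEMENT
                            && localName.equals(reader.getLocalName())) {
                        count++;
                    }
                }
            } finally {
                reader.close(); // does not close the underlying stream
            }
        }
        return count;
    }
}
```

If the file mixes plain log lines with XML fragments, a parser like this would fail on the first non-XML line, so a line-based approach is likely the better fit in that case.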

Is there a particular start and end tag that you're looking for? Do you want all instances of that start and end tag? Or is the pattern more complex than that?
John de Michele
Rancher

Joined: Mar 09, 2009
Posts: 600
Palanisamy:

The problem with reading large files whole into memory is exactly what you describe: you run out of memory, and it is horribly inefficient and wasteful of resources. If that five-line XML pattern is consistent, then what you probably want to do is check for the first line, and if that matches, check whether the next four lines match. That way, your file can be 1MB, or 1GB, or 1TB, and you don't have the problem of accidentally splitting files in the middle of the pattern you're looking for.
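
A minimal sketch of that approach, assuming the multi-line pattern can be written as one regex per line. It keeps only the last few lines in a small window and tests the window whenever it is full, so memory use depends on the pattern length rather than the file size. The names `MultiLineMatcher` and `findBlocks` are hypothetical, and the pattern list is assumed to be non-empty.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.Iterator;
import java.util.List;
import java.util.regex.Pattern;

class MultiLineMatcher {
    // Returns the 1-based line numbers at which a run of consecutive lines
    // matches linePatterns, one pattern per line, in order.
    static List<Long> findBlocks(Path file, List<Pattern> linePatterns) throws IOException {
        List<Long> starts = new ArrayList<>();
        Deque<String> window = new ArrayDeque<>(); // the most recent lines only
        long lineNo = 0;
        try (BufferedReader reader = Files.newBufferedReader(file, StandardCharsets.UTF_8)) {
            String line;
            while ((line = reader.readLine()) != null) {
                lineNo++;
                window.addLast(line);
                if (window.size() > linePatterns.size()) {
                    window.removeFirst();
                }
                if (window.size() == linePatterns.size() && matches(window, linePatterns)) {
                    starts.add(lineNo - linePatterns.size() + 1);
                }
            }
        }
        return starts;
    }

    // Checks the first line first and stops at the first line that does not match.
    private static boolean matches(Deque<String> window, List<Pattern> linePatterns) {
        Iterator<Pattern> patterns = linePatterns.iterator();
        for (String line : window) {
            if (!patterns.next().matcher(line).find()) {
                return false;
            }
        }
        return true;
    }
}
```

Since most windows fail on their first line, the extra cost over a plain line-by-line scan should be small.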

John.
palanisamy subramani
Greenhorn

Joined: Aug 30, 2010
Posts: 29

I broke the multiline pattern into single-line patterns and am now able to search the huge file without any issue.

Thanks to all for your valuable comments!!!