Parsing a huge File

Holger Prause
Ranch Hand

Joined: Oct 09, 2000
Posts: 47
Hi,

I have to parse a huge CSV file and replace every ' character
in it. The file is 70 MB and I have 176 MB of free memory,
but I get a java.lang.OutOfMemoryError. I must be doing something wrong.
The following shows what I am doing:
<pre>
public void parse(File file) throws IOException {
    String line;
    BufferedReader reader = new BufferedReader(new FileReader(file));
    writer = new BufferedWriter(new FileWriter("/somepath/somefile"));
    while ((line = reader.readLine()) != null) {
        writer.write(replace(line, "'", ""));
        writer.newLine();
    }
    writer.flush();
    writer.close();
}

public String replace(String original, String searchFor, String replaceWith) {
    String orig = original;
    StringBuffer changed = new StringBuffer("");
    int indexof;
    while ((indexof = original.lastIndexOf(searchFor)) != -1) {
        changed.append(orig.substring(0, indexof)).append(replaceWith).append(orig.substring(indexof + searchFor.length()));
    }
    return changed.toString();
}
</pre>
I think my replace method is the problem.
Now I have the idea that I don't need the replace method at all: instead I could use a reader that reads characters and, whenever the specified character occurs, simply not write it out.
Which Reader should I use? Is my idea right?
Can you please post a code example?
Thanks a lot,
Holger
Jim Yingst
Wanderer
Sheriff

Joined: Jan 30, 2000
Posts: 18670
I think the problem is that in your replace() method, the while loop is infinite. Since the String referenced by "original" is unchanging, original.lastIndexOf(searchFor) just keeps finding the same value every time, never equal to -1 - and then appending some stuff to the StringBuffer each time, eventually leading to the OutOfMemoryError. I suggest you work on the replace() method by itself, not called through parse(). Just write a main() method that calls replace("sample a", "a", "i") and prints the result (should be "simple i"). Then you can focus on the one method without worrying about the other - if it throws OutOfMemoryError, you have a much better idea where the problem is. Then add some print statements inside the while loop:
<code><pre> System.out.println("indexof: " + indexof);
System.out.println("changed: " + changed);</pre></code>
This will give you a better idea just what your loop is doing each time through it. As you work on the loop, you may also want to consider the String methods indexOf(int, int) or lastIndexOf(int, int) to make sure you don't just find the same substring each time, as well as the StringBuffer method replace(int, int, String) which will do some of the work for you. Good luck.
[This message has been edited by Jim Yingst (edited January 13, 2001).]
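For readers following along, here is one way the replace() method could be repaired along the lines Jim describes. It is only a sketch: the class name ReplaceDemo is made up for illustration, and it uses String.indexOf(String, int) with a moving start position (and StringBuilder, which needs Java 5 or later) rather than whatever Holger ended up writing.

```java
class ReplaceDemo {
    /** Returns original with every occurrence of searchFor replaced by replaceWith. */
    static String replace(String original, String searchFor, String replaceWith) {
        if (searchFor.isEmpty()) {
            return original; // an empty search string would match everywhere
        }
        StringBuilder changed = new StringBuilder(original.length());
        int from = 0; // where the next search starts; this is what makes the loop terminate
        int index;
        while ((index = original.indexOf(searchFor, from)) != -1) {
            // copy the text between the previous match and this one, then the replacement
            changed.append(original, from, index).append(replaceWith);
            from = index + searchFor.length();
        }
        // copy whatever follows the last match
        changed.append(original, from, original.length());
        return changed.toString();
    }

    public static void main(String[] args) {
        System.out.println(replace("sample a", "a", "i")); // prints "simple i"
    }
}
```

Because the search position advances past each match, the same occurrence is never found twice, which is exactly what the original loop was missing.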


"I'm not back." - Bill Harding, Twister
Peter Tran
Bartender

Joined: Jan 02, 2001
Posts: 783
Holger,
Try this on your large CSV file.

Let me know how fast it runs. It can use some more tweaking to make it run faster, but try this solution first.
-Peter
[This message has been edited by Peter Tran (edited January 14, 2001).]
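The listing Peter posted here is not preserved in this copy of the thread. Going only by later replies (Holger mentions reading with a char array, Jim mentions a filter() method), a solution of that general shape might look like the sketch below. The class name QuoteStripper, the buffer size, and the method signature are assumptions, not Peter's actual code.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.io.StringWriter;
import java.io.Writer;

class QuoteStripper {
    /** Copies everything from in to out, dropping each occurrence of unwanted. */
    static void filter(Reader in, Writer out, char unwanted) throws IOException {
        char[] buffer = new char[8192]; // assumed size; tune for the real file
        int read;
        while ((read = in.read(buffer)) != -1) {
            int kept = 0;
            // compact the buffer in place, keeping only the wanted characters
            for (int i = 0; i < read; i++) {
                if (buffer[i] != unwanted) {
                    buffer[kept++] = buffer[i];
                }
            }
            out.write(buffer, 0, kept);
        }
    }

    public static void main(String[] args) throws IOException {
        StringWriter out = new StringWriter();
        filter(new StringReader("a,'b',c\n"), out, '\'');
        System.out.print(out); // prints a,b,c
    }
}
```

For a real file the Reader and Writer would wrap a FileReader and FileWriter. No String is created per line, so memory use stays constant regardless of file size.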
Holger Prause
Ranch Hand

Joined: Oct 09, 2000
Posts: 47
Hey, thanks to both of you for helping. It's working now. It took 22 seconds to parse the file, so that's OK. Thank you very much.

But there's one question left. I have to find out when the 5000th line is reached, and then I have to create a new output file.
But I am now reading the data in with a char array. How do I find out which line I am on?
Thanks again,
Holger
Peter Tran
Bartender

Joined: Jan 02, 2001
Posts: 783
Holger,
Which solution are you using? You can try to keep a count of a unique character appearing on a line. For example, there should only be one new line character per line. Once you hit 5000, you can create your new file.
-Peter
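Peter's idea of counting newline characters while reading through a char array could be sketched roughly as follows. The LineSplitter class and its WriterFactory interface are hypothetical names introduced for this example; in real use the factory would open a new FileWriter with a numbered file name.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.Writer;

class LineSplitter {
    /** Hypothetical callback that opens the output for the given file number. */
    interface WriterFactory {
        Writer open(int fileIndex) throws IOException;
    }

    /** Copies in to successive outputs, starting a new one after every linesPerFile newlines. */
    static int split(Reader in, WriterFactory factory, int linesPerFile) throws IOException {
        char[] buffer = new char[8192];
        int files = 0;     // number of outputs opened so far
        int lines = 0;     // newlines seen in the current output
        Writer out = null; // opened lazily so no empty trailing file is created
        try {
            int read;
            while ((read = in.read(buffer)) != -1) {
                int start = 0; // first character in the buffer not yet written
                for (int i = 0; i < read; i++) {
                    if (buffer[i] == '\n' && ++lines == linesPerFile) {
                        if (out == null) out = factory.open(files++);
                        out.write(buffer, start, i + 1 - start); // include the newline
                        out.close();
                        out = null;
                        start = i + 1;
                        lines = 0;
                    }
                }
                if (start < read) {
                    if (out == null) out = factory.open(files++);
                    out.write(buffer, start, read - start);
                }
            }
        } finally {
            if (out != null) out.close();
        }
        return files;
    }
}
```

Counting '\n' works for both Unix and Windows line endings, since "\r\n" still contains exactly one '\n' per line. It would miscount if a quoted CSV field contained an embedded newline, which this sketch assumes does not happen.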
Peter Tran
Bartender

Joined: Jan 02, 2001
Posts: 783
Can you zip up your input file and send it to me (if it doesn't contain confidential information)? I would like to try some different solutions to compare the performance.
Thanks,
-Peter
P.S. I will post my results if you send me your input. Remember to zip it up, because a 76Meg file is pretty large to pass over email.
Holger Prause
Ranch Hand

Joined: Oct 09, 2000
Posts: 47
I am using Peter's solution for my huge CSV file.
But thanks also to Jim, who showed me that my replace method was absolute nonsense.
Thanks.
Bye,

Holger
Jim Yingst
Wanderer
Sheriff

Joined: Jan 30, 2000
Posts: 18670
I'd be inclined to take Peter's filter() method and adapt it into a new FilterReader class, CommaFilter. Then you can do something like
<code><pre>BufferedReader reader = new BufferedReader(new CommaFilter(new FileReader(file)));</pre></code>
...and then take advantage of BufferedReader's readLine() method to count lines. This way you get nice clean separation of the comma filtering functionality from the line counting functionality. I imagine it will be a little slower than Peter's version (since it would create a String object for each line), but probably not much (file IO is the main delay here I expect - the other parts of the system are probably fast enough to keep up without complaint). You can also experiment with different orders of the Readers, or additional buffers (should there be a BufferedReader between the FileReader and the CommaFilter for speed?)
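A FilterReader along the lines Jim suggests might look like the sketch below. The class name CommaFilter is Jim's; everything else is an assumption, including the choice to drop the ' character by default (that being the character Holger wants removed) and the second constructor that takes the character to drop.

```java
import java.io.BufferedReader;
import java.io.FilterReader;
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;

class CommaFilter extends FilterReader {
    private final char unwanted;

    CommaFilter(Reader in) {
        this(in, '\''); // assumed default: strip single quotes
    }

    CommaFilter(Reader in, char unwanted) {
        super(in);
        this.unwanted = unwanted;
    }

    @Override
    public int read() throws IOException {
        int c;
        do {
            c = in.read();
        } while (c == unwanted); // skip unwanted characters; -1 (end of stream) passes through
        return c;
    }

    @Override
    public int read(char[] cbuf, int off, int len) throws IOException {
        if (len == 0) return 0;
        while (true) {
            int read = in.read(cbuf, off, len);
            if (read == -1) return -1;
            int kept = off;
            for (int i = off; i < off + read; i++) {
                if (cbuf[i] != unwanted) cbuf[kept++] = cbuf[i];
            }
            if (kept > off) return kept - off;
            // the whole chunk was filtered out; read again rather than return 0
        }
    }

    public static void main(String[] args) throws IOException {
        BufferedReader reader =
            new BufferedReader(new CommaFilter(new StringReader("it's\n'a','b'\n")));
        String line;
        int lineNumber = 0;
        while ((line = reader.readLine()) != null) {
            lineNumber++; // a new output file could be started here every 5000 lines
            System.out.println(lineNumber + ": " + line);
        }
    }
}
```

With the filter in its own class, the line counting lives entirely in the readLine() loop and the filtering code never needs to know about lines.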
Peter Tran
Bartender

Joined: Jan 02, 2001
Posts: 783
Jim,
I'll take your suggestion into consideration when I try some tweaks to get some benchmarks. It just takes so much time to get accurate benchmarks. *sigh*
Thanks for the suggestion.
-Peter
Jim Yingst
Wanderer
Sheriff

Joined: Jan 30, 2000
Posts: 18670
"It just takes so much time to get accurate benchmarks."
Sure. That's why I'm just kibitzing from the sidelines, not volunteering to find out myself. Let me know what you find out though.
Thomas Paul
mister krabs
Ranch Hand

Joined: May 05, 2000
Posts: 13974
I go along with Jim. My first thought was that this is crying out for a FilterReader.


Associate Instructor - Hofstra University
Amazon Top 750 reviewer - Blog - Unresolved References - Book Review Blog
 