JavaRanch » Java Forums » Java » I/O and Streams

Converting xml files to text files

Bhasker Reddy
Ranch Hand

Joined: Jun 13, 2000
Posts: 176
I am processing xml files and converting them to text files (with specific record types). It takes me around an hour to process a one-gig file.
I am using a PrintWriter to write to the text file, calling println for each line. Do you have any suggestions to improve this? I need to do
this in 10 minutes (instead of an hour), as we are going to process around 100 gigs of data every day.
Please let me know if you have any suggestions.


Bhasker Reddy
Joe Ess
Bartender

Joined: Oct 29, 2001
Posts: 8877
    

First and foremost, make sure your hardware is up to the task. If your computer's CPU is at 100%, your memory use is at 100%, and your disk is thrashing, the machine is more occupied with swapping than with running your program. When running an enterprise application, there is no substitute for enterprise-class hardware. Since you want to process gigs of information, you probably won't be able to run many other processes on this computer, especially not CPU-intensive things like web servers or databases.
Next, since you are working with XML, you may be using DOM to parse it. I've found DOM to be a performance bottleneck and was able to get an order of magnitude better performance by parsing XML files manually. DOM is primarily for editing XML, so SAX may be a better alternative in your case (I haven't tried it). Are you loading the file into memory, processing it, then writing it out? That's a BAD IDEA for a one-gig file unless you have MANY, MANY gigs of physical RAM. DOM loads the entire document into memory and lets you manipulate it; SAX is event-driven. It processes a subset of the document at a time and generates events, so your program can process and write those subsets out to disk, saving valuable resources.
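The streaming approach described above can be sketched roughly like this: a SAX handler that writes each record out as soon as it is parsed, so only a small window of the document is ever in memory. The <record> tag and the file name are hypothetical stand-ins for the real document structure:

```java
import java.io.BufferedWriter;
import java.io.FileWriter;
import java.io.PrintWriter;
import java.io.StringReader;
import javax.xml.parsers.SAXParser;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

// Streams records out as they are parsed, so only a small
// chunk of the document is ever held in memory at once.
public class XmlToText extends DefaultHandler {
    private final PrintWriter out;
    private final StringBuffer text = new StringBuffer();

    public XmlToText(PrintWriter out) { this.out = out; }

    public void characters(char[] ch, int start, int len) {
        text.append(ch, start, len);   // collect element text
    }

    public void startElement(String uri, String local, String qName, Attributes atts) {
        text.setLength(0);             // reset buffer for each element
    }

    public void endElement(String uri, String local, String qName) {
        if (qName.equals("record")) {  // hypothetical tag name
            out.println(text.toString().trim()); // one text line per record
        }
    }

    public static void main(String[] args) throws Exception {
        PrintWriter out = new PrintWriter(new BufferedWriter(new FileWriter("records.txt")));
        SAXParser parser = SAXParserFactory.newInstance().newSAXParser();
        String sample = "<accounts><record>a|1</record><record>b|2</record></accounts>";
        parser.parse(new InputSource(new StringReader(sample)), new XmlToText(out));
        out.close();
    }
}
```

The handler never sees more than one record's worth of text at a time, which is what keeps memory flat no matter how large the input file is.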
The online book from Sun, Java Platform Performance, has some good general information for getting the most out of the java api.


"blabbing like a narcissistic fool with a superiority complex" ~ N.A.
[How To Ask Questions On JavaRanch]
Bhasker Reddy
Ranch Hand

Joined: Jun 13, 2000
Posts: 176
We are using enterprise-class hardware with probably 18 GB of RAM. We have our own parsing routine that's pretty fast, so I guess writing to the text file is what takes most of the time. Parsing the xml is fast; I am using PrintWriter's println method. Is there something else that saves time? Is FileWriter better than PrintWriter? Do you have any other ideas to make it faster?
thanks
Bhasker
Abhik Sarkar
Ranch Hand

Joined: Jun 14, 2003
Posts: 61
Hi Bhaskar,

I hope you have a BufferedWriter between your PrintWriter and the FileWriter. If not, putting one in would help immediately.

Also, if you have constructed your PrintWriter with autoFlush enabled (it is off by default), then each time you call println(), the stream will be flushed. I have written a small program to demonstrate the difference it makes. Please ignore the bad exception handling... I just wanted to demonstrate the difference.
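The original listing is not shown here, but the comparison described might look something like the sketch below: one PrintWriter with autoFlush on, and one wrapped in a large BufferedWriter that is flushed only on close. The file names and line count are illustrative:

```java
import java.io.BufferedWriter;
import java.io.FileWriter;
import java.io.PrintWriter;

// Times a PrintWriter that autoFlushes on every println against
// one wrapped in a BufferedWriter that is flushed only at close().
public class FlushTest {
    static long write(PrintWriter out, int lines) {
        long start = System.currentTimeMillis();
        for (int i = 0; i < lines; i++) {
            out.println("some|pipe|delimited|record|" + i);
        }
        out.close();   // close() flushes whatever remains in the buffer
        return System.currentTimeMillis() - start;
    }

    public static void main(String[] args) throws Exception {
        int lines = 20000;
        // autoFlush = true: the stream is flushed on every println
        long flushed = write(new PrintWriter(new FileWriter("a.txt"), true), lines);
        // buffered: bytes reach the disk in large chunks
        long buffered = write(new PrintWriter(
                new BufferedWriter(new FileWriter("b.txt"), 64 * 1024)), lines);
        System.out.println("autoFlush: " + flushed + " ms, buffered: " + buffered + " ms");
    }
}
```

Both runs produce identical output files; only the number of trips to the operating system differs.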



Here is the output from some test runs...


As you can see, the process speeds up by around a factor of three! In your case, that could mean the time taken drops to around 20 minutes.

Whether or not you want the output to be flushed immediately depends on the nature of your application. If it is doing batch processing, you can do away with frequent flushing; if it needs to display data in real time, you need to flush frequently.

Hope this helps,
Abhik.
Bhasker Reddy
Ranch Hand

Joined: Jun 13, 2000
Posts: 176
I am not displaying any data. I am outputting it to a text file: I take an xml file and convert it to a text file, using println and a PrintWriter. Do you mean that it will be faster if I flush less frequently? Should I store all the output and then write it to the file in one go,
instead of writing every single line?
Bhasker Reddy
Ranch Hand

Joined: Jun 13, 2000
Posts: 176
I parse the xml file; at the end of parsing, I have an object and an ArrayList that contain all the parsed information. I read the object and ArrayList, apply business logic, and output the data to a pipe-delimited text file. I use a PrintWriter named out to do this, like so:
out.println(str + "\r"); I use this line 114 times to output all the information in the object and ArrayList. Instead of doing it this way, could I store all the "str" values (each is a String) into an ArrayList (say, strArrayList)
and, at the end of applying the business logic, output it to the PrintWriter using

ListIterator stringList = strArrayList.listIterator();
int i = 0;
while (stringList.hasNext()) {
    String str = (String) stringList.next();
    out.println(str);
    i++;
    if (i == 20) {   // flush every 20 lines ("i = 20" would be an assignment)
        out.flush();
        i = 0;
    }
}

Do you think it will be faster if I do it this way?
Abhik Sarkar
Ranch Hand

Joined: Jun 14, 2003
Posts: 61
Hi Bhasker,

My point was that using a BufferedWriter would improve the performance and that flushing too often slows things down. So, if you aren't using a BufferedWriter, you should definitely consider using it. Also, if you don't have a lot of content to write to the file, you could consider the possibility of putting everything into a StringBuffer before writing the entire StringBuffer to the file in one go.

You could look around on the internet for articles on improving performance. Here is one I came across on
Performance Tuning on Sun's Java site.

Hope this helps,
Abhik.
Bhasker Reddy
Ranch Hand

Joined: Jun 13, 2000
Posts: 176
Basically I am converting xml files to record-type-based text files. My xml files are based on multiple accounts, and every account has hundreds of records. When I parse, I have an account object; I read multiple tags and inner tags in the account and output them to the text file. Whenever I read a record or
tag, I output it to the text file using PrintWriter and println. I am talking about huge data (on the order of gigs).
Is there a way, instead of writing every line, to store it in an object (serialization??) and write it all at once? How would I do this? Will it be much
faster than outputting one line at a time?
Abhik Sarkar
Ranch Hand

Joined: Jun 14, 2003
Posts: 61
Hi Bhasker,

If it is only text, you can use a StringBuffer. Here is a modified version of my earlier example... you can see from the output that the execution time is reduced further.
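The modified listing is not shown above, but the idea would be along these lines: accumulate the records in a StringBuffer and hand the whole thing to the writer in a single call. The file name and field values here are made up:

```java
import java.io.BufferedWriter;
import java.io.FileWriter;
import java.io.PrintWriter;

// Builds the whole output in memory first, then writes it in one call.
// Only sensible when the output for one unit of work (e.g. one account)
// fits comfortably in memory.
public class BufferedOutput {
    public static void main(String[] args) throws Exception {
        StringBuffer sb = new StringBuffer();
        for (int i = 0; i < 5; i++) {
            sb.append("field1|field2|").append(i).append('\n');
        }
        PrintWriter out = new PrintWriter(new BufferedWriter(new FileWriter("out.txt")));
        out.print(sb);   // one write for the whole batch, not one per record
        out.close();
    }
}
```

For gigabytes of data you would still want to write out one account at a time rather than hold everything in the buffer at once.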


 