I have a very large HTML log file; it can be several hundred megabytes. I need to be able to append to the end of the file as my program runs, so I have to remove the </body> and </html> tags at the very end of the existing log file before appending to it.
The solution I currently have is to read in the existing HTML file and write each line of it to a temporary HTML file, leaving out the </body> and </html> tags at the end. I then delete the original HTML file and rename the temporary file to the name of the original.
The problem with this is, of course, performance. I would like to avoid rewriting the entire file every time the log is written to. Is there a better way to handle this?
Why don't you just find the position of the </body> end tag, overwrite the file from there, and add the end tags for body and html again when you're done? This assumes you always write enough to completely overwrite both old tags.
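A minimal sketch of that idea, using RandomAccessFile. The class name HtmlLogTail, the method names, and the 1 KiB search window are assumptions made for illustration; nothing here comes from the original poster's code. It truncates the file at the old </body> tag instead of relying on the new text being long enough to overwrite both tags.

```java
import java.io.IOException;
import java.io.RandomAccessFile;
import java.nio.charset.StandardCharsets;

final class HtmlLogTail {
    private static final byte[] BODY_END = "</body>".getBytes(StandardCharsets.US_ASCII);
    // Assumption: the closing tags sit within the last 1 KiB of the file.
    private static final int TAIL_WINDOW = 1024;

    /** Returns the byte offset of the last "</body>" in the tail window, or -1 if absent. */
    static long findBodyEnd(RandomAccessFile file) throws IOException {
        long length = file.length();
        long start = Math.max(0, length - TAIL_WINDOW);
        byte[] tail = new byte[(int) (length - start)];
        file.seek(start);
        file.readFully(tail);
        // Scan backwards so the last occurrence wins.
        for (int i = tail.length - BODY_END.length; i >= 0; i--) {
            if (matchesAt(tail, i)) {
                return start + i;
            }
        }
        return -1;
    }

    private static boolean matchesAt(byte[] data, int offset) {
        for (int j = 0; j < BODY_END.length; j++) {
            if (data[offset + j] != BODY_END[j]) {
                return false;
            }
        }
        return true;
    }

    /** Drops the old closing tags, then writes the fragment followed by fresh closing tags. */
    static void append(String path, String htmlFragment) throws IOException {
        try (RandomAccessFile file = new RandomAccessFile(path, "rw")) {
            long pos = findBodyEnd(file);
            if (pos < 0) {
                pos = file.length(); // no closing tag found: append at the end
            }
            file.setLength(pos);     // truncate, so leftover bytes cannot survive
            file.seek(pos);
            file.write((htmlFragment + "</body></html>").getBytes(StandardCharsets.UTF_8));
        }
    }
}
```

Only the tail of the file is read, so the cost of an append should not grow with the size of the log.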
There is just one issue with that. FileChannel and RandomAccessFile (another class that could be used) both deal with bytes, not text. With simple ASCII files that isn't a big problem, but as soon as you get more exotic characters you will need proper encoding. I'm not sure how FileChannel can handle that. Perhaps it's possible using Charset / CharsetEncoder, where you take a CharBuffer or String and convert it into a ByteBuffer first, which you then write to the FileChannel. There is one catch though: finding a safe place to start writing. What if the < of </body> is encoded in two bytes, with the first byte also being used for the previous character? (In other words, one byte contains data on two different characters.)
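The Charset route can be sketched as follows, assuming the file is UTF-8. The class ChannelTextWriter and the method writeAt are hypothetical names; Charset.encode and FileChannel.write(ByteBuffer, long) are standard JDK calls.

```java
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.charset.StandardCharsets;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

final class ChannelTextWriter {
    /**
     * Encodes the text as UTF-8 and writes it at the given byte position.
     * Returns the byte position just past the written data.
     */
    static long writeAt(Path path, long position, String text) throws IOException {
        try (FileChannel channel = FileChannel.open(path, StandardOpenOption.WRITE)) {
            ByteBuffer bytes = StandardCharsets.UTF_8.encode(text);
            long pos = position;
            // A single write call may not drain the buffer, so loop until it is empty.
            while (bytes.hasRemaining()) {
                pos += channel.write(bytes, pos);
            }
            return pos;
        }
    }
}
```

The position is a byte offset, not a character index, so the caller still has to pick an offset that falls on a character boundary.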
Most browsers, if not all, are very tolerant in that they try to display HTML even when the syntax is not perfect (i.e. not well formed). I suspect that you could get away with not having the closing </body></html> tags at all. Then it is a simple matter of just appending to the file.
In the unlikely event that you find the closing </body></html> tags are needed, you could make the process that presents the file (JSP or servlet?) append them.
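A rough sketch of both halves of this suggestion: the logger only ever appends, and whatever serves the file adds the closing tags on the way out. OpenEndedHtmlLog, appendEntry, and serve are made-up names, and the OutputStream stands in for something like a servlet response stream.

```java
import java.io.IOException;
import java.io.OutputStream;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

final class OpenEndedHtmlLog {
    /** Appends a fragment to the log; the file on disk never contains closing tags. */
    static void appendEntry(Path log, String htmlFragment) throws IOException {
        Files.write(log, htmlFragment.getBytes(StandardCharsets.UTF_8),
                StandardOpenOption.CREATE, StandardOpenOption.APPEND);
    }

    /** Streams the log to the client and adds the closing tags at the end. */
    static void serve(Path log, OutputStream out) throws IOException {
        Files.copy(log, out);
        out.write("</body></html>".getBytes(StandardCharsets.US_ASCII));
        out.flush();
    }
}
```

With this split, the writer never has to seek or truncate.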
I would have to go back and question the design. An HTML file which is hundreds of megabytes? Why? Who is going to load that into their browser and look at it? Or if it isn't intended to be loaded into a browser, then why is it HTML?
Steve Bassoli
Thanks for the suggestions, everyone. I ended up going with a RandomAccessFile (thanks, Rob Prime). We always encode the output file as UTF-8, so it's not a concern that a character consists of more than one byte. RandomAccessFile is very useful in that you can easily write UTF-8 with it.
It's not the slickest solution, but I had to know I could remove the characters after my <!--REMOVE_HTML--> comment.
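One caveat worth checking if "easily write UTF-8" means RandomAccessFile.writeUTF: that method writes a two-byte length prefix followed by modified UTF-8, so the prefix bytes would end up inside the HTML. Writing String.getBytes with the UTF-8 charset avoids that. The sketch below compares the two; the class and method names are invented for the comparison, and which call the poster actually used is not stated in the thread.

```java
import java.io.File;
import java.io.IOException;
import java.io.RandomAccessFile;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;

final class Utf8WriteComparison {
    /** Writes the text to a scratch file and returns the bytes that ended up on disk. */
    static byte[] bytesWritten(String text, boolean useWriteUTF) throws IOException {
        File tmp = File.createTempFile("utf8cmp", ".bin");
        try {
            try (RandomAccessFile raf = new RandomAccessFile(tmp, "rw")) {
                if (useWriteUTF) {
                    raf.writeUTF(text); // 2-byte length prefix + modified UTF-8
                } else {
                    raf.write(text.getBytes(StandardCharsets.UTF_8)); // raw UTF-8 only
                }
            }
            return Files.readAllBytes(tmp.toPath());
        } finally {
            tmp.delete();
        }
    }
}
```

For the three-character string "<p>", the writeUTF variant produces five bytes (the length 3 encoded in two bytes, then the text), while the getBytes variant produces three.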
Steve Bassoli wrote: We're always encoding the output file as UTF8 so it's not a concern that a character consists of more than one byte.
Err ... since UTF-8 is a variable-length encoding, surely this should be a concern. You will probably get away with it, since the characters in </body></html> are always just 1 byte each under UTF-8.
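To make the byte counts concrete, here is a small check. The class Utf8AsciiSafety and its methods are illustrative names only. It relies on a property of UTF-8 itself: every byte of a multi-byte sequence has its high bit set, so an ASCII byte such as < can never be part of another character's encoding.

```java
import java.nio.charset.StandardCharsets;

final class Utf8AsciiSafety {
    /** Number of bytes the string occupies when encoded as UTF-8. */
    static int utf8Length(String s) {
        return s.getBytes(StandardCharsets.UTF_8).length;
    }

    /** True if no multi-byte character in the text contains a byte in the ASCII range. */
    static boolean multiByteCharsAvoidAscii(String text) {
        for (int i = 0; i < text.length(); ) {
            int codePoint = text.codePointAt(i);
            byte[] encoded = new String(Character.toChars(codePoint))
                    .getBytes(StandardCharsets.UTF_8);
            if (encoded.length > 1) {
                for (byte b : encoded) {
                    if ((b & 0x80) == 0) {
                        return false; // would mean an ASCII byte inside a multi-byte sequence
                    }
                }
            }
            i += Character.charCount(codePoint);
        }
        return true;
    }
}
```

Here utf8Length("</body></html>") is 14, the same as the character count, whereas utf8Length("\u00e9") is 2. So a byte-level search for the closing tags is safe in UTF-8, but byte offsets and character counts diverge as soon as the log contains non-ASCII text.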