Assuming you are running a JRE that can support that much data in memory, it will depend on what you intend to do with the data. You will need to store it in a collection of some kind: an array, List, Set or Map.
But before you do that, you should revisit your design to make sure you really need to have all that data in memory at once. Is it possible to load just part of the data at a time? If so, your program will be less restricted in what machines it can run on.
Once you've done that you need to read up on the collections I mentioned to decide which is most suitable for your needs.
Joined: Nov 30, 2010
Thank you so much for your prompt reply. Let me explain in detail so that you can guide me, if possible.
RAM: 3 GB
I have around 20,000 files (with around 1 or 2 million records in each) containing column A, and another file with columns A and B (this is the lookup file). I have to iterate through all 20,000 files and replace each column A value with the corresponding column B value from the lookup file. This is the requirement.
I am looking for an option that does not require loading the data into a database.
Thanks in advance
Stuart A. Burkett
Joined: May 30, 2012
So it sounds like the only thing you need to keep in memory is the lookup file. You then just read each of the other files in line by line. For each line you make the required changes and then write it out to a temporary file. Once you've processed every line in the file, you delete the original and then rename your temporary file to the name of the original file.
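To make that concrete, here is a minimal sketch of the approach. It assumes the lookup file holds one comma-separated "A,B" pair per line and that each data file holds one column A value per line; neither format has been confirmed in this thread, and the class and method names are made up for illustration. It also assumes the lookup fits in a HashMap, which may not hold for a very large lookup file on a small heap.

```java
import java.io.BufferedReader;
import java.io.BufferedWriter;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;
import java.util.HashMap;
import java.util.Map;

// Hypothetical helper, not taken from any post in this thread.
class LookupRewriter {

    // Loads the lookup file into memory. Assumed format: "A,B" on each line.
    static Map<String, String> loadLookup(Path lookupFile) throws IOException {
        Map<String, String> lookup = new HashMap<>();
        try (BufferedReader in = Files.newBufferedReader(lookupFile, StandardCharsets.UTF_8)) {
            String line;
            while ((line = in.readLine()) != null) {
                int comma = line.indexOf(',');
                if (comma < 0) {
                    continue; // skip lines that do not look like "A,B"
                }
                lookup.put(line.substring(0, comma), line.substring(comma + 1));
            }
        }
        return lookup;
    }

    // Streams one data file line by line into a temporary file,
    // then replaces the original with the temporary file.
    static void rewrite(Path dataFile, Map<String, String> lookup) throws IOException {
        Path tmp = dataFile.resolveSibling(dataFile.getFileName() + ".tmp");
        try (BufferedReader in = Files.newBufferedReader(dataFile, StandardCharsets.UTF_8);
             BufferedWriter out = Files.newBufferedWriter(tmp, StandardCharsets.UTF_8)) {
            String line;
            while ((line = in.readLine()) != null) {
                // Values with no lookup entry are written through unchanged (an assumption).
                out.write(lookup.getOrDefault(line, line));
                out.newLine();
            }
        }
        // REPLACE_EXISTING covers the "delete the original, then rename" step in one call.
        Files.move(tmp, dataFile, StandardCopyOption.REPLACE_EXISTING);
    }
}
```

You would call loadLookup once and then call rewrite for each data file in turn, so only one line of any data file is held in memory at a time.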
Joined: Nov 30, 2010
Yes. I need to keep only the lookup file, which is around 3GB. Is that possible?
Balasubramaniam Muthusamy wrote: Yes. I need to keep only the lookup file, which is around 3GB. Is that possible?
A 3GB lookup file? Even assuming it's text (which it probably shouldn't be), I would reckon a 20,000-line lookup file would fit into a few megabytes.
Methinks your problems start a lot further back than this.
Why on earth would anyone keep 20,000 files around to support a system? Especially ones of that size?
The only possible reason I can think of is that it's independently distributed and that this is some sort of 'batched' update involving temporary files, or some "database" made up of a bunch of redundant copies of "data"; in which case why not just bite the bullet and implement a proper one?
Isn't it funny how there's always time and money enough to do it WRONG?
Articles by Winston can be found here
I think it also depends on whether this is a one-off run, or something you need to run daily/hourly/weekly....
If it's a one-off, you could read in part of the lookup file, process the 20,000 data files, then read the next chunk. You'd need logic to handle interruptions, but I think those would be solvable. So write it, and let it run for as long as it takes.
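A rough sketch of that chunked idea follows, under a few assumptions that are mine rather than anything stated in the thread: the lookup file holds one "A,B" pair per line, each data file holds one column A value per line, and no B value is also an A key (otherwise a later chunk could replace a value that an earlier chunk already replaced). The class and method names are hypothetical, and the interruption handling is left out.

```java
import java.io.BufferedReader;
import java.io.BufferedWriter;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper, not taken from any post in this thread.
class ChunkedLookup {

    // Reads up to maxEntries "A,B" pairs from the lookup reader.
    // An empty map means the lookup file is exhausted.
    static Map<String, String> nextChunk(BufferedReader in, int maxEntries) throws IOException {
        Map<String, String> chunk = new HashMap<>();
        String line;
        while (chunk.size() < maxEntries && (line = in.readLine()) != null) {
            int comma = line.indexOf(',');
            if (comma >= 0) {
                chunk.put(line.substring(0, comma), line.substring(comma + 1));
            }
        }
        return chunk;
    }

    // One pass over a data file with the current chunk, via a temporary file.
    static void applyChunk(Path dataFile, Map<String, String> chunk) throws IOException {
        Path tmp = dataFile.resolveSibling(dataFile.getFileName() + ".tmp");
        try (BufferedReader in = Files.newBufferedReader(dataFile, StandardCharsets.UTF_8);
             BufferedWriter out = Files.newBufferedWriter(tmp, StandardCharsets.UTF_8)) {
            String line;
            while ((line = in.readLine()) != null) {
                out.write(chunk.getOrDefault(line, line)); // no match in this chunk: keep as is
                out.newLine();
            }
        }
        Files.move(tmp, dataFile, StandardCopyOption.REPLACE_EXISTING);
    }

    // Every data file is rewritten once per chunk of the lookup file.
    static void processAll(Path lookupFile, List<Path> dataFiles, int maxEntries) throws IOException {
        try (BufferedReader lookupIn = Files.newBufferedReader(lookupFile, StandardCharsets.UTF_8)) {
            Map<String, String> chunk;
            while (!(chunk = nextChunk(lookupIn, maxEntries)).isEmpty()) {
                for (Path dataFile : dataFiles) {
                    applyChunk(dataFile, chunk);
                }
            }
        }
    }
}
```

The trade-off is that every data file is read and rewritten once per chunk, so a smaller chunk size saves memory at the cost of more passes over the data files.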
Alternatively, you could process one of the 20k files in its entirety, writing the changes to a .tmp file. Then, when you have finished going through the lookup file, you rename the .tmp to the original name. By looking at timestamps, you could figure out which files had been done and which hadn't. If the job is killed, the .tmp file can be discarded, and you would restart on the untouched source file...
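The restart logic for that second option might look something like the sketch below. It assumes all data files sit in one directory, that temporary files end in ".tmp", and that the job's original start time has been recorded somewhere (for example in a marker file); a file last modified at or after that time is treated as done. All of these are assumptions for illustration, and the names are hypothetical.

```java
import java.io.IOException;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.attribute.FileTime;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not taken from any post in this thread.
class ResumableJob {

    // A file counts as done if it was last modified at or after the job start (an assumption).
    static boolean alreadyProcessed(Path file, FileTime jobStart) throws IOException {
        return Files.getLastModifiedTime(file).compareTo(jobStart) >= 0;
    }

    // Discards leftover .tmp files and returns the data files that still need processing.
    static List<Path> pending(Path dir, FileTime jobStart) throws IOException {
        List<Path> todo = new ArrayList<>();
        List<Path> leftovers = new ArrayList<>();
        try (DirectoryStream<Path> entries = Files.newDirectoryStream(dir)) {
            for (Path entry : entries) {
                if (entry.getFileName().toString().endsWith(".tmp")) {
                    leftovers.add(entry); // half-written output from a killed run
                } else if (Files.isRegularFile(entry) && !alreadyProcessed(entry, jobStart)) {
                    todo.add(entry);
                }
            }
        }
        for (Path leftover : leftovers) {
            Files.delete(leftover); // the untouched source file is still in place
        }
        return todo;
    }
}
```

On a restart you would call pending with the recorded start time and carry on with whatever it returns. Timestamp resolution varies between file systems, so treat the comparison as approximate.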
Sure, this will take a while, but what do you expect when you have 20,000,000,000 records to process against a 3GB file...
There are only two hard things in computer science: cache invalidation, naming things, and off-by-one errors
Joined: Nov 30, 2010
Thank you so much for all your replies.
This is just a one-shot fix and is not going to run again. My lookup file has around 125 million records. Is there any way we can split it into chunks of data and process them? I am also wondering whether there is any kind of index option, or RandomAccessFile?
Balasubramaniam Muthusamy wrote: This is just a one-shot fix and is not going to run again.
Hmmm. Hate to say, but I've heard that before.
My lookup file has around 125 million records. Is there any way we can split it into chunks of data and process them? I am also wondering whether there is any kind of index option, or RandomAccessFile?
Sure, there are plenty of splitter utilities out there; or you could simply write one yourself (possibly better if the "splitting" is dependent on the data you're working on) and run it before your main update. Perl or awk are also very good for that sort of stuff.
But like I said before, from the little you've given us to go on, I suspect your problems start long before this. It just sounds 'off'. And unless you fix that you're probably doomed to repeat this exercise.
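If you do go the "write a splitter yourself" route mentioned above, a minimal sketch in Java could look like this. It simply cuts a text file into parts of a fixed number of lines; the class name, method name and part-file naming scheme are hypothetical, and a data-dependent split (for example by key range) would need different logic.

```java
import java.io.BufferedReader;
import java.io.BufferedWriter;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not taken from any post in this thread.
class FileSplitter {

    // Splits source into files of at most linesPerChunk lines each, written to outDir.
    // Returns the part files in order.
    static List<Path> split(Path source, Path outDir, int linesPerChunk) throws IOException {
        if (linesPerChunk < 1) {
            throw new IllegalArgumentException("linesPerChunk must be at least 1");
        }
        List<Path> parts = new ArrayList<>();
        try (BufferedReader in = Files.newBufferedReader(source, StandardCharsets.UTF_8)) {
            BufferedWriter out = null;
            int linesInPart = 0;
            try {
                String line;
                while ((line = in.readLine()) != null) {
                    if (out == null || linesInPart == linesPerChunk) {
                        if (out != null) {
                            out.close(); // current part is full
                        }
                        // Part names like "lookup.csv.part00000" are an arbitrary choice.
                        Path part = outDir.resolve(
                                String.format("%s.part%05d", source.getFileName(), parts.size()));
                        parts.add(part);
                        out = Files.newBufferedWriter(part, StandardCharsets.UTF_8);
                        linesInPart = 0;
                    }
                    out.write(line);
                    out.newLine();
                    linesInPart++;
                }
            } finally {
                if (out != null) {
                    out.close();
                }
            }
        }
        return parts;
    }
}
```

Each part can then be loaded as its own lookup chunk. The output directory has to exist before you call split.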