I have a JSON file that is 1 terabyte in size. Each JSON object contains text of 500-600 words, and there are 50 million JSON objects.
This is what I have to do with the file. I enter 200-300 words and a percentage value into a web page. The web application then reads the entire JSON file, checks whether the entered words are available in each JSON object, and works out what percentage of them is available. If that percentage is higher than the percentage I entered, the application also keeps track of which input words are present in the JSON object and which input words are missing from it.
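To make the per-object check concrete, here is a minimal sketch in Python. It assumes that the "availability percentage" is the share of input words found in the object's text, and that words are compared case-insensitively after splitting on whitespace. The name `match_object` is a placeholder for illustration, not the actual application code.

```python
def match_object(input_words, object_text, threshold_pct):
    """Compare one object's text against the input word list.

    Returns (percentage, present_words, missing_words) when the
    percentage is strictly higher than threshold_pct, else None.
    Assumption: percentage = input words found / total input words.
    """
    wanted = {w.lower() for w in input_words}
    if not wanted:
        return None  # nothing to look for

    # Assumption: plain whitespace tokenisation, case-insensitive.
    object_words = set(object_text.lower().split())

    present = wanted & object_words   # input words found in the object
    missing = wanted - object_words   # input words absent from the object
    pct = 100.0 * len(present) / len(wanted)

    if pct > threshold_pct:
        return pct, sorted(present), sorted(missing)
    return None
```

For example, `match_object(["apple", "banana", "cherry", "date"], "Apple and banana pie", 40)` reports a 50% match with `apple` and `banana` present and `cherry` and `date` missing.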
Reading 1 TB felt like too much, so I used a trick. I converted the text in every JSON object into a hash (this hash represents any word with 3 characters) and saved it to a text file. Each line of this text file is the hash of the text in one particular JSON object. The text file is 120 GB, with 50 million lines.
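As an illustration of this kind of encoding, the sketch below maps every word to a 3-character code and writes one line of codes per object. The actual hash scheme is not shown in this post, so everything here is an assumption: the use of MD5, the 62-symbol alphabet, and the names `word_code` and `encode_text` are all hypothetical.

```python
import hashlib
import string

# Assumption: codes are drawn from letters and digits (62 symbols).
ALPHABET = string.ascii_letters + string.digits

def word_code(word):
    """Map a word to a deterministic 3-character code (hypothetical scheme)."""
    digest = hashlib.md5(word.lower().encode("utf-8")).digest()
    n = int.from_bytes(digest[:4], "big")
    chars = []
    for _ in range(3):
        n, r = divmod(n, len(ALPHABET))
        chars.append(ALPHABET[r])
    return "".join(chars)

def encode_text(text):
    """Encode one object's text as a single line of space-separated codes."""
    return " ".join(word_code(w) for w in text.split())
```

With 3 characters from 62 symbols there are only 62^3 = 238,328 possible codes, so in a scheme like this different words can share a code and a match on codes may be a false positive.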
My problem is that reading the file and performing the above job is still too slow. It takes hours to complete! Why? Because the application reads every line of the hash file and checks which words are available and which are not. So the checking algorithm runs 50 million times!
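The loop described above can be sketched as follows. This is a simplified illustration, not the real application: it assumes each line holds space-separated word codes, and the name `scan_lines` is a placeholder. The point is that the set comparison runs once per line, so 50 million lines means 50 million comparisons on top of reading all 120 GB from disk.

```python
def scan_lines(lines, input_codes, threshold_pct):
    """Linear scan: check every line (one per JSON object) against the input.

    `lines` is any iterable of strings, for example an open text file.
    Yields (line_number, percentage, present, missing) for each line
    whose match percentage is strictly higher than threshold_pct.
    """
    wanted = set(input_codes)
    if not wanted:
        return

    for line_no, line in enumerate(lines, start=1):
        # Assumption: codes on a line are separated by whitespace.
        codes = set(line.split())
        present = wanted & codes
        pct = 100.0 * len(present) / len(wanted)
        if pct > threshold_pct:
            yield line_no, pct, sorted(present), sorted(wanted - codes)
```

Passing an open file (for instance `with open("hashes.txt") as fh: hits = list(scan_lines(fh, codes, 80))`, where `hashes.txt` is a made-up file name) streams the file line by line instead of loading it into memory, but it still touches every line.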
Is there any way I can reduce the time of this operation and do it within a few seconds? I know that applications in chemistry and genetic medicine do exactly the same thing within seconds! I am open to all solutions, whether it is a big data solution, data mining, or a simple fix. Please help.
PS: I thought of a Hadoop-based solution, but that means purchasing a lot of computers, which is a huge cost! Even running it on Amazon costs double! I don't have the cash to buy multiple computers either!
PS: I was also advised to use GPU computing. The argument was that Hadoop uses a lot of cores to run the app, and GPU computing does the same (note that I am not saying Hadoop can be run on a GPU). It is also said that GPUs like the NVIDIA Tesla are built for running massive loops. But I have simple loops that just run a lot of times.
PS: Please note that I have posted the same question here, but I did not find the answer I am looking for.
Are you better than me? Then please show me my mistakes.