Processing text data inside loops

 
Ranch Hand
Posts: 507
Netbeans IDE Oracle Java

I have a JSON file which is 1 terabyte in size. Each JSON object contains text of 500-600 words, and there are 50 million JSON objects.

Here is what I have to do with this JSON file. I enter 200-300 words and a percentage value into a web page. Once this is done, the web application reads the entire JSON file, checking whether the entered words appear in each JSON object and what percentage of them is present. If that percentage is higher than the value I entered, the application also keeps track of which words from the input list are present in that JSON object and which words are missing.
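To make it clearer what I mean, this is roughly the check I do for a single JSON object (a simplified sketch only; the real class and field names in my application are different):

import java.util.*;

public class ObjectMatcher {

    // Result of checking one JSON object against the input word list.
    public static class MatchResult {
        public final double percentage;          // % of input words found in the object
        public final Set<String> presentWords;   // input words found in the object text
        public final Set<String> missingWords;   // input words not found in the object text

        MatchResult(double percentage, Set<String> presentWords, Set<String> missingWords) {
            this.percentage = percentage;
            this.presentWords = presentWords;
            this.missingWords = missingWords;
        }
    }

    // objectText: the 500-600 word text of one JSON object
    // inputWords: the 200-300 words entered on the web page
    public static MatchResult check(String objectText, Set<String> inputWords) {
        Set<String> objectWords =
                new HashSet<>(Arrays.asList(objectText.toLowerCase().split("\\s+")));

        Set<String> present = new HashSet<>();
        Set<String> missing = new HashSet<>();
        for (String word : inputWords) {
            if (objectWords.contains(word.toLowerCase())) {
                present.add(word);
            } else {
                missing.add(word);
            }
        }
        double percentage = 100.0 * present.size() / inputWords.size();
        return new MatchResult(percentage, present, missing);
    }
}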

I felt that reading 1 TB every time was too much, so I tried a trick: I converted the text of every JSON object into a hash (each word is represented by 3 characters) and saved it to a text file, so each line of this text file represents the text of one particular JSON object. This text file is 120 GB, with 50 million lines.
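The idea behind the hash is roughly this (again only a simplified sketch, not my actual hash function): every distinct word is mapped to a fixed 3-character code, and the codes for one object's text are written out as one line.

import java.util.*;

public class WordHasher {

    // Map each word to a fixed 3-character code. Here I just use the word's
    // hashCode() rendered in base 36 and truncated/padded to 3 characters;
    // my real encoding is different, this is only to show the idea.
    public static String hashWord(String word) {
        String code = Integer.toUnsignedString(word.toLowerCase().hashCode(), 36);
        return (code + "000").substring(0, 3);
    }

    // Turn the text of one JSON object into one line of 3-character codes.
    public static String hashObjectText(String objectText) {
        StringBuilder line = new StringBuilder();
        for (String word : objectText.toLowerCase().split("\\s+")) {
            line.append(hashWord(word)).append(' ');
        }
        return line.toString().trim();
    }

    public static void main(String[] args) {
        System.out.println(hashObjectText("the quick brown fox"));
    }
}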

My problem is that reading this file and performing the above job is still too slow. It takes hours to complete! Why? Because the application reads every line of this hash file and checks which words are present and which are not, so the checking algorithm runs 50 million times!
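The slow part is this full scan over the 120 GB hash file, which looks roughly like the following (simplified; in reality the input words are first converted with the same hash function):

import java.io.BufferedReader;
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.*;

public class FullScan {

    // hashFile:   the 120 GB file, one line of 3-character codes per JSON object
    // inputCodes: the 3-character codes of the 200-300 input words
    // threshold:  the percentage entered on the web page
    public static void scan(String hashFile, Set<String> inputCodes, double threshold)
            throws IOException {
        try (BufferedReader reader = Files.newBufferedReader(Paths.get(hashFile))) {
            String line;
            long lineNumber = 0;
            while ((line = reader.readLine()) != null) {   // 50 million iterations
                lineNumber++;
                Set<String> objectCodes = new HashSet<>(Arrays.asList(line.split(" ")));

                int found = 0;
                for (String code : inputCodes) {
                    if (objectCodes.contains(code)) {
                        found++;
                    }
                }
                double percentage = 100.0 * found / inputCodes.size();
                if (percentage > threshold) {
                    // here I record which input words were present and which were missing
                    System.out.println("Object " + lineNumber + " matches with " + percentage + "%");
                }
            }
        }
    }
}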

Is there any way I can reduce the time of this operation and do it within a few seconds? I know applications in chemistry and genetic medicine do very similar things within seconds! I am open to all solutions, whether it is a big data solution, data mining, or a simple fix. Please help.

PS: I thought of a Hadoop-based solution, but that means purchasing a lot of computers, which is a huge cost! Even running it on Amazon costs too much. I don't have the cash to buy multiple machines either!

PS: I had a suggestion to use GPU computing. The argument was that Hadoop uses a lot of cores to run the app, and GPU computing does the same (note: I am not saying Hadoop can run on a GPU). It is also said that GPUs like the NVIDIA Tesla are built for running massive loops. But I only have simple loops, just running a lot of times.

PS: (Please note that I have posted the same question here, but I did not find the answer I am looking for.)
 