• Post Reply Bookmark Topic Watch Topic
  • New Topic
programming forums Java Mobile Certification Databases Caching Books Engineering Micro Controllers OS Languages Paradigms IDEs Build Tools Frameworks Application Servers Open Source This Site Careers Other Pie Elite all forums
this forum made possible by our volunteer staff, including ...
Marshals:
  • Campbell Ritchie
  • Jeanne Boyarsky
  • Ron McLeod
  • Paul Clapham
  • Liutauras Vilda
Sheriffs:
  • paul wheaton
  • Rob Spoor
  • Devaka Cooray
Saloon Keepers:
  • Stephan van Hulst
  • Tim Holloway
  • Carey Brown
  • Frits Walraven
  • Tim Moores
Bartenders:
  • Mikalai Zaikin

IO and doing logic take a long time?

 
Ranch Hand
Posts: 507
Netbeans IDE Oracle Java
  • Mark post as helpful
  • send pies
    Number of slices to send:
    Optional 'thank-you' note:
  • Quote
  • Report post to moderator
**Please note that I have posted this question here but posting here because I still have no answer***

I am going to ask kind of a serious question. I have a file with "sentences" and the file size is 500MB. Since it take a long time to read, I created a *Hash* for this and saved it to another file (I first gathered list of words which will be in my program. Then created hashes for them. Then I added this into a `HashMap` so the 'key' is the word and the 'value' is the hash. My using this `HashMap` I converted the entire 500MB into a separate Hash file). Now this Hash is 77 MB. This hash can represent any word using 3 characters, and it create unique hashes for each word. One line in this Hash indicates one sentence in real file.

Now, I am going to enter a list of words into the program. Program will convert these words to *Hash* too. Then it will go through the Hash file I explained before (77MB) and find whether the words I entered are presented in the list (I am comparing Hashes). If presented, then I get the word (Hash indication of the word), convert it to the real word. Below is my program.



I tried my best to reduce the code so I can post something short. Above code is still big, but without all of its parts, you will not understand it.

Now my question is, my program is very very slow. If I insert 50 words into the application, it take more than 1 hour to do the work I explained before. I have tried 2 weeks to find a solution, but I could not. Can someone please help me here to speed this work? FYI, it takes no longer than 12 seconds to read the 77MB Hash file. Something else is wrong. Please help.>
 
lowercase baba
Posts: 13089
67
Chrome Java Linux
  • Likes 1
  • Mark post as helpful
  • send pies
    Number of slices to send:
    Optional 'thank-you' note:
  • Quote
  • Report post to moderator
Hi Yohan,

I edited your post to split up some REALLY long lines...it just makes your whole post easier to read.

I can't give you any specific advice, but I will tell you what we always say in the performance forum.

Get a profiler, and look and see where your program is REALLY spending its time. you can guess all you want, but you will inevitably be wrong. There is no reason to try and speed up some part that isn't really taking any time.
 
Yohan Weerasinghe
Ranch Hand
Posts: 507
Netbeans IDE Oracle Java
  • Mark post as helpful
  • send pies
    Number of slices to send:
    Optional 'thank-you' note:
  • Quote
  • Report post to moderator

fred rosenberger wrote:Hi Yohan,

I edited your post to split up some REALLY long lines...it just makes your whole post easier to read.

I can't give you any specific advice, but I will tell you what we always say in the performance forum.

Get a profiler, and look and see where your program is REALLY spending its time. you can guess all you want, but you will inevitably be wrong. There is no reason to try and speed up some part that isn't really taking any time.



Thanks for the reply. what did you mean by a 'profiler'?

 
Bartender
Posts: 10780
71
Hibernate Eclipse IDE Ubuntu
  • Mark post as helpful
  • send pies
    Number of slices to send:
    Optional 'thank-you' note:
  • Quote
  • Report post to moderator

Yohan Weerasinghe wrote:Thanks for the reply. what did you mean by a 'profiler'?


It's a program that can analyse your program while it's running, and give you some idea of what is being executed most.

An alternative to the above suggestion: Ask someone WHY you're trying to parse 500 megabytes of text on an ad hoc basis.

This (I suspect) is the actual source of your problem.

Winston
 
Yohan Weerasinghe
Ranch Hand
Posts: 507
Netbeans IDE Oracle Java
  • Mark post as helpful
  • send pies
    Number of slices to send:
    Optional 'thank-you' note:
  • Quote
  • Report post to moderator

Winston Gutkowski wrote:

Yohan Weerasinghe wrote:Thanks for the reply. what did you mean by a 'profiler'?


It's a program that can analyse your program while it's running, and give you some idea of what is being executed most.

An alternative to the above suggestion: Ask someone WHY you're trying to parse 500 megabytes of text on an ad hoc basis.

This (I suspect) is the actual source of your problem.

Winston



Thank you for the reply. I have 2 questions to ask.

1. If the way I am reading files is not the best way, what is the best way? How it should be handled? 500MB file is a Test case. Real file is 4 TeraBytes.
2. Do you think there is "any" place in this code I can use Multi-Threading?
 
fred rosenberger
lowercase baba
Posts: 13089
67
Chrome Java Linux
  • Mark post as helpful
  • send pies
    Number of slices to send:
    Optional 'thank-you' note:
  • Quote
  • Report post to moderator

Yohan Weerasinghe wrote:1. If the way I am reading files is not the best way, what is the best way? How it should be handled? 500MB file is a Test case. Real file is 4 TeraBytes.


It doesn't matter what you are doing...a 4TB file is going to take a long time to process. How it should be handled depends on what you are REALLY trying to accomplish - which we don't know. Don't tell us about hashmaps and so forth. Pretend you are trying to explain to a child the ultimate goal, and you only have about 15-20 seconds.

Yohan Weerasinghe wrote:2. Do you think there is "any" place in this code I can use Multi-Threading?


That is putting the cart before the horse. Multi-threading is not some panacea that will always fix a problem. In fact, there are many scenarios where threading can SLOW your program.

Never say "I'm going to try THIS to speed up my program" when you have no idea why your program is slow. and again, it doesn't matter what you are doing if you have to process a 4 TB file. You honestly might be better off buying a faster CPU than spending time optimizing your code.
 
Rancher
Posts: 2759
32
Eclipse IDE Spring Tomcat Server
  • Mark post as helpful
  • send pies
    Number of slices to send:
    Optional 'thank-you' note:
  • Quote
  • Report post to moderator
Agree with fred. You need to profile your application. And break it apart into well-named methods. And time each method.

Also, the reason you are not getting many responses is because you are trying to explain how you are trying to do something without explaining what you are doing. Also, it's tough to read your code because it's not broken out into methods. However, there are few things that I can make out

1) You are using String buffers to hold arrays of things. Then later you split the string returned by string buffer to recreate the array. Strings are very slow. Arrays are very fast. Don;t use Strings as arrays
2) Why don;t you read the input file once? Why do you read it every time?
3) Just looking at the structure of the code, for each line, the number of iterations will be (matchingWordsHolder.lenght+unmatchingWordsHolder.length)*wordMap.length.. which is finalUserDefinedWordCollection.length*wordMap.length. finalUserDefinedWordCollection.length is the number of unique words in the input. I can't figure out from the code how big wordMap is. Is it all the words in your input.txt. If yes, then that's where the problem is. You have a O(n^2) implementation here, which is bad. What you want to do is figure out how to make it O(n).


From what I understand, for each word in your input string, you want to find all the sentences that contain the word. Is that right? I might be wrong. You might wanna tell us what you are doing. If you you are trying to do this what you need is what is commonly called a reverse index A very simple implementation of a reverse index could be this

Create a Hashmap: the key is word, the value is a list. The list contains the line number of the line that contains the word
Store your original sentences in a format that allows you to look up the sentence by line number
When you want to find the sentences, look up the word in the hashmap, get list of lie numbers, and read those lines

That's it. You can extend this in many ways. If you want to do an AND operation, and find the lines that have all the words in the input, then you basically find the intersection of each word's line numbers. If you want to do an OR operation, you do an union of the line numbers. That's it.


BTW, is this for study or a real project? If it's for a real project use something like Apache Lucene. It does a lot of the things that I described here.. and more. Don;t reinvent the wheel.
 
Winston Gutkowski
Bartender
Posts: 10780
71
Hibernate Eclipse IDE Ubuntu
  • Mark post as helpful
  • send pies
    Number of slices to send:
    Optional 'thank-you' note:
  • Quote
  • Report post to moderator

Yohan Weerasinghe wrote:How it should be handled? 500MB file is a Test case. Real file is 4 TeraBytes...


Then your problem is 8 times as big. And (I suspect) will only get bigger until somebody can explain to you, in simple terms, WHY you are parsing all this text. I've worked on some very big systems (airlines and telecoms) and I've never dealt with that volume of text on an ad hoc basis. About the only thing I can think of that might would be the Google search engine.

Sounds to me me like someone has said to you: "Noah, we have a flood. Deal with it." (Note: not fix it)

Noah had to deal with an old-testament God. You don't.

Winston
 
Why should I lose weight? They make bigger overalls. And they sure don't make overalls for tiny ads:
a bit of art, as a gift, that will fit in a stocking
https://gardener-gift.com
reply
    Bookmark Topic Watch Topic
  • New Topic