JavaRanch » Java Forums » Java » Java in General
Forums: Java in General Performance

IO and doing logic take a long time?

Yohan Weerasinghe
Ranch Hand

Joined: Oct 07, 2010
Posts: 499

***Please note that I have posted this question before, but I am posting it here because I still have no answer.***

I am going to ask kind of a serious question. I have a file of sentences, and the file size is 500MB. Since it takes a long time to read, I created a *hash* for it and saved it to another file (I first gathered the list of words that will appear in my program, then created hashes for them, and added them to a `HashMap` so the 'key' is the word and the 'value' is the hash. Using this `HashMap`, I converted the entire 500MB file into a separate hash file). This hash file is now 77MB. The hash can represent any word using 3 characters, and it creates a unique hash for each word. One line in this hash file corresponds to one sentence in the real file.
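The encoding described above could be sketched roughly like this. This is only a minimal illustration with made-up names (`WordCodec`, `codeFor`) that are not from the original program; it assumes each code is 3 lowercase letters, which gives 26^3 = 17,576 distinct words:

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch: assign each known word a fixed 3-letter code and
// rewrite sentences as codes. All names here are illustrative.
public class WordCodec {
    private final Map<String, String> wordToCode = new HashMap<>();
    private final Map<String, String> codeToWord = new HashMap<>();
    private int next = 0;

    // Assign a unique 3-character base-26 code to each new word.
    public String codeFor(String word) {
        String code = wordToCode.get(word);
        if (code == null) {
            code = "" + (char) ('a' + next / 676)
                      + (char) ('a' + (next / 26) % 26)
                      + (char) ('a' + next % 26);
            next++;
            wordToCode.put(word, code);
            codeToWord.put(code, word);
        }
        return code;
    }

    // Reverse lookup: turn a code back into the real word.
    public String wordFor(String code) {
        return codeToWord.get(code);
    }

    // Encode one sentence as a space-separated line of codes.
    public String encodeSentence(String sentence) {
        StringBuilder sb = new StringBuilder();
        for (String w : sentence.split("\\s+")) {
            if (sb.length() > 0) sb.append(' ');
            sb.append(codeFor(w));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        WordCodec codec = new WordCodec();
        System.out.println(codec.encodeSentence("the cat sat")); // "aaa aab aac"
        System.out.println(codec.wordFor("aab"));                // "cat"
    }
}
```

The fixed-width codes are what shrink the file: every word, however long, becomes 3 characters, which is consistent with the 500MB → 77MB reduction described.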

Now, I am going to enter a list of words into the program. The program will convert these words to *hashes* too. Then it will go through the hash file I described above (77MB) and find whether the words I entered are present (I am comparing hashes). If they are present, I take the hash representation of each word and convert it back to the real word. Below is my program.



I tried my best to reduce the code so I could post something short. The code above is still big, but without all of its parts you would not understand it.

Now my question is: my program is very, very slow. If I insert 50 words into the application, it takes more than 1 hour to do the work I explained above. I have tried for 2 weeks to find a solution, but I could not. Can someone please help me speed this work up? FYI, it takes no longer than 12 seconds to read the 77MB hash file, so something else must be wrong. Please help.


Are you better than me? Then please show me my mistakes..
fred rosenberger
lowercase baba
Bartender

Joined: Oct 02, 2003
Posts: 11229
    

Hi Yohan,

I edited your post to split up some REALLY long lines...it just makes your whole post easier to read.

I can't give you any specific advice, but I will tell you what we always say in the performance forum.

Get a profiler, and see where your program is REALLY spending its time. You can guess all you want, but you will inevitably be wrong. There is no reason to try to speed up some part that isn't really taking any time.
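Even before reaching for a real profiler, the same idea can be approximated by timing each phase by hand with `System.nanoTime()`. A rough sketch, where the phase names (`loadHashFile`, `matchWords`) are placeholders standing in for the phases of the actual program:

```java
// Crude manual profiling: bracket each phase with System.nanoTime() and
// report where the time actually goes. Phase bodies are stand-ins here.
public class Timing {
    static void loadHashFile() { sleepMillis(120); } // placeholder for real I/O
    static void matchWords()   { sleepMillis(300); } // placeholder for matching

    static void sleepMillis(long ms) {
        try {
            Thread.sleep(ms);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
    }

    public static void main(String[] args) {
        long t0 = System.nanoTime();
        loadHashFile();
        long t1 = System.nanoTime();
        matchWords();
        long t2 = System.nanoTime();
        System.out.printf("load:  %d ms%n", (t1 - t0) / 1_000_000);
        System.out.printf("match: %d ms%n", (t2 - t1) / 1_000_000);
    }
}
```

This only gives coarse phase-level numbers; a profiler (VisualVM ships with the JDK) breaks the time down per method without editing the code.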


There are only two hard things in computer science: cache invalidation, naming things, and off-by-one errors
Yohan Weerasinghe
Ranch Hand

Joined: Oct 07, 2010
Posts: 499

fred rosenberger wrote:Hi Yohan,

I edited your post to split up some REALLY long lines...it just makes your whole post easier to read.

I can't give you any specific advice, but I will tell you what we always say in the performance forum.

Get a profiler, and see where your program is REALLY spending its time. You can guess all you want, but you will inevitably be wrong. There is no reason to try to speed up some part that isn't really taking any time.


Thanks for the reply. What did you mean by a 'profiler'?

Winston Gutkowski
Bartender

Joined: Mar 17, 2011
Posts: 7639
    

Yohan Weerasinghe wrote:Thanks for the reply. What did you mean by a 'profiler'?

It's a program that can analyse your program while it's running, and give you some idea of what is being executed most.

An alternative to the above suggestion: Ask someone WHY you're trying to parse 500 megabytes of text on an ad hoc basis.

This (I suspect) is the actual source of your problem.

Winston

Isn't it funny how there's always time and money enough to do it WRONG?
Articles by Winston can be found here
Yohan Weerasinghe
Ranch Hand

Joined: Oct 07, 2010
Posts: 499

Winston Gutkowski wrote:
Yohan Weerasinghe wrote:Thanks for the reply. What did you mean by a 'profiler'?

It's a program that can analyse your program while it's running, and give you some idea of what is being executed most.

An alternative to the above suggestion: Ask someone WHY you're trying to parse 500 megabytes of text on an ad hoc basis.

This (I suspect) is the actual source of your problem.

Winston


Thank you for the reply. I have 2 questions to ask.

1. If the way I am reading files is not the best way, what is the best way? How should it be handled? The 500MB file is a test case; the real file is 4 terabytes.
2. Do you think there is "any" place in this code where I can use multi-threading?
fred rosenberger
lowercase baba
Bartender

Joined: Oct 02, 2003
Posts: 11229
    

Yohan Weerasinghe wrote:1. If the way I am reading files is not the best way, what is the best way? How should it be handled? The 500MB file is a test case; the real file is 4 terabytes.

It doesn't matter what you are doing...a 4TB file is going to take a long time to process. How it should be handled depends on what you are REALLY trying to accomplish - which we don't know. Don't tell us about hashmaps and so forth. Pretend you are trying to explain to a child the ultimate goal, and you only have about 15-20 seconds.
Yohan Weerasinghe wrote:2. Do you think there is "any" place in this code where I can use multi-threading?

That is putting the cart before the horse. Multi-threading is not some panacea that will always fix a problem. In fact, there are many scenarios where threading can SLOW your program.

Never say "I'm going to try THIS to speed up my program" when you have no idea why your program is slow. And again, it doesn't matter what you are doing if you have to process a 4TB file. You honestly might be better off buying a faster CPU than spending time optimizing your code.
Jayesh A Lalwani
Bartender

Joined: Jan 17, 2008
Posts: 2321
    

Agree with fred. You need to profile your application. And break it apart into well-named methods. And time each method.

Also, the reason you are not getting many responses is that you are trying to explain how you are doing something without explaining what you are doing. It's also tough to read your code because it's not broken out into methods. However, there are a few things I can make out:

1) You are using StringBuffers to hold arrays of things, and later you split the string returned by the StringBuffer to recreate the array. String operations are very slow; arrays are very fast. Don't use Strings as arrays.
2) Why don't you read the input file once? Why do you read it every time?
3) Just looking at the structure of the code, for each line the number of iterations will be (matchingWordsHolder.length + unmatchingWordsHolder.length) * wordMap.length, which is finalUserDefinedWordCollection.length * wordMap.length. finalUserDefinedWordCollection.length is the number of unique words in the input. I can't figure out from the code how big wordMap is. Is it all the words in your input.txt? If yes, then that's where the problem is: you have an O(n^2) implementation here, which is bad. What you want to do is figure out how to make it O(n).
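Point 3 can be illustrated with a small sketch: put the input words in a `HashSet` so that each word of a line is checked with a single O(1) average-time lookup, instead of comparing it against every input word. The names here (`FastLookup`, `wordsPresent`) are illustrative, not from the original code:

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Illustrative sketch: replace the nested word-by-word comparison with
// HashSet membership tests, turning O(n*m) per line into O(n).
public class FastLookup {
    // Return the input words that occur in this line, in line order.
    public static List<String> wordsPresent(Set<String> inputWords, String line) {
        List<String> found = new ArrayList<>();
        for (String w : line.split("\\s+")) { // one pass over the line
            if (inputWords.contains(w)) {     // O(1) average lookup
                found.add(w);
            }
        }
        return found;
    }

    public static void main(String[] args) {
        Set<String> input = new HashSet<>(Arrays.asList("cat", "dog"));
        System.out.println(wordsPresent(input, "the cat chased the dog"));
        // prints [cat, dog]
    }
}
```

The same trick applies whether the comparisons are on real words or on the 3-character hashes: the key is that the inner loop over the word list disappears.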


From what I understand, for each word in your input string, you want to find all the sentences that contain the word. Is that right? I might be wrong; you might want to tell us what you are doing. If that is what you are trying to do, then what you need is what is commonly called a reverse index. A very simple implementation of a reverse index could be this:

1) Create a HashMap: the key is the word, the value is a list. The list contains the line numbers of the lines that contain the word.
2) Store your original sentences in a format that allows you to look up a sentence by its line number.
3) When you want to find the sentences, look up the word in the HashMap, get the list of line numbers, and read those lines.

That's it. You can extend this in many ways. If you want to do an AND operation and find the lines that have all the words in the input, you basically take the intersection of each word's line numbers. If you want to do an OR operation, you take the union of the line numbers.
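A minimal Java sketch of this reverse index, including the AND (intersection) query. This is an illustration, not production code, and the class and method names are made up:

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

// Illustrative reverse index: word -> list of line numbers containing it.
public class ReverseIndex {
    private final Map<String, List<Integer>> index = new HashMap<>();
    private final List<String> sentences = new ArrayList<>();

    // Build the index once, up front, as each sentence is added.
    public void add(String sentence) {
        int lineNo = sentences.size();
        sentences.add(sentence);
        for (String w : sentence.toLowerCase().split("\\s+")) {
            index.computeIfAbsent(w, k -> new ArrayList<>()).add(lineNo);
        }
    }

    // AND query: sentences containing ALL the given words (intersection
    // of each word's line-number list).
    public List<String> findAll(String... words) {
        Set<Integer> lines = null;
        for (String w : words) {
            List<Integer> hits =
                index.getOrDefault(w.toLowerCase(), Collections.emptyList());
            if (lines == null) {
                lines = new TreeSet<>(hits); // TreeSet also dedupes and sorts
            } else {
                lines.retainAll(hits);       // intersection
            }
        }
        List<String> result = new ArrayList<>();
        if (lines != null) {
            for (int n : lines) result.add(sentences.get(n));
        }
        return result;
    }

    public static void main(String[] args) {
        ReverseIndex idx = new ReverseIndex();
        idx.add("the cat sat on the mat");
        idx.add("the dog chased the cat");
        idx.add("dogs and cats");
        System.out.println(idx.findAll("the", "cat"));
        // prints the first two sentences
    }
}
```

An OR query would be the same loop with `addAll` (union) instead of `retainAll`. Each query then touches only the matching lines, never the whole file.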


BTW, is this for study or a real project? If it's for a real project, use something like Apache Lucene. It does a lot of the things that I described here, and more. Don't reinvent the wheel.
Winston Gutkowski
Bartender

Joined: Mar 17, 2011
Posts: 7639
    

Yohan Weerasinghe wrote:How should it be handled? The 500MB file is a test case; the real file is 4 terabytes...

Then your problem is roughly 8,000 times as big. And (I suspect) it will only get bigger until somebody can explain to you, in simple terms, WHY you are parsing all this text. I've worked on some very big systems (airlines and telecoms) and I've never dealt with that volume of text on an ad hoc basis. About the only thing I can think of that might is the Google search engine.

Sounds to me like someone has said to you: "Noah, we have a flood. Deal with it." (Note: not fix it)

Noah had to deal with an old-testament God. You don't.

Winston
 