wood burning stoves 2.0*
The moose likes Performance and the fly likes Performance: Search for a string in large text file of 1 GB size and print the line when match found Big Moose Saloon
  Search | Java FAQ | Recent Topics | Flagged Topics | Hot Topics | Zero Replies
Register / Login


Win a copy of Murach's Java Servlets and JSP this week in the Servlets forum!
JavaRanch » Java Forums » Java » Performance
Bookmark "Performance: Search for a string in large text file of 1 GB size and print the line when match found" Watch "Performance: Search for a string in large text file of 1 GB size and print the line when match found" New topic
Author

Performance: Search for a string in large text file of 1 GB size and print the line when match found

srinu pearl
Greenhorn

Joined: Jan 08, 2014
Posts: 7
I have a large text file of 1 GB size. I need to print the line when a matching word is found in a particular line. Below is the code I am using. But if there are many lines that has the matching word, it's taking lot of time. Can anyone suggest a solution to print the lines much faster.


Scanner scanner = new Scanner(file);
while (scanner.hasNextLine()) {
String line = scanner.nextLine();
if(line.contains("xyz")) {

System.out.println(line);

}
}
Paul Clapham
Bartender

Joined: Oct 14, 2005
Posts: 18541
    
    8

Here's an experiment you can do: Take out the if-statement and just leave the code which reads the file but does nothing else. How much time does that take?
Campbell Ritchie
Sheriff

Joined: Oct 13, 2005
Posts: 37937
    
  22
Then work out how much of the time is spent looking for the line and how much printing it.

Moving this discussion to our performance forum.
Ulf Dittmer
Marshal

Joined: Mar 22, 2005
Posts: 41068
    
  43
Why do you want to do this in Java? A tool like grep -available on any Unix/Linux-derived OS, and also on Windows via UnxUtils- likely would be much faster. It could be invoked from Java code via Runtime.exec or ProcessBuilder if need be.


Ping & DNS - my free Android networking tools app
Winston Gutkowski
Bartender

Joined: Mar 17, 2011
Posts: 7505
    
  18

srinu pearl wrote:I have a large text file of 1 GB size. I need to print the line when a matching word is found in a particular line. Below is the code I am using. But if there are many lines that has the matching word, it's taking lot of time.

So: it's only taking a lot of time when there are lots of lines containing the string? That suggests that maybe the print side is what's slowing you down; but as Paul and the others have said, you really need to do some tests to work out where your bottlenecks are.

A few thoughts:
  • I suspect a BufferedReader might be faster than a Scanner, but I don't know.
  • If you want the mental exercise, and your searches are simply for fixed strings (no "wildcard" characters) you could try writing a B-M-H search. However, it's more of a curio than anything else, and you may find that it makes very little difference at all.
  • Alternatively, have a look at the Pattern (java.util.regex.Pattern) and Matcher (java.util.regex.Matcher) classes and have a go at implementing a more 'grep'-like search. This may well actually make things slower, but you might have a more useful program as a result.

  • However, whatever you decide, efficiency should come a distant second to correctness. Slow or not, what you have right now is about as simple as it gets, so you can be pretty darn sure that it's right.

    HIH

    Winston

    Isn't it funny how there's always time and money enough to do it WRONG?
    Articles by Winston can be found here
    Luan Cestari
    Ranch Hand

    Joined: Feb 07, 2010
    Posts: 162

    Scanner is very slow. You need to optimize the whole algorithm there. First try to minimize the IO, moving the whole string into the memory. If it is too large, as the Java String follows UTF-8, if you file is just an ascii text, you may create a byte[] to load the whole text. You may also try different approach of IO/NIO/Selector to read the file. A more simple approach is to use bufferedreader. Then you might run the whole string looking for the pattern. You may use regex or stringtokenizer (this one is faster in my tests) (or comparing the bytes, if you choose to go extreme to low level).

    As people suggested, you may also rely on some operation system feature, as it is low level code and it is also very stable (in case of linux). You would just need to to integrate, capturing the output of the other process. Refer to http://docs.oracle.com/javase/7/docs/api/java/lang/ProcessBuilder.html to create a process from Java code.

    Regards


    Please, visit me for some cool tech post at www.ourdailycodes.com
    Jayesh A Lalwani
    Bartender

    Joined: Jan 17, 2008
    Posts: 2271
        
      28

    Searching for a random string in a 1GB text file is like giving a cat a bath. You can do it, but there's no good way of doing it. The larger question is why are you giving this cat a bath at all? Most applications don't really need to search for random strings in 1GB files. If you are building an application that stores data in files, you would design your files so that they would be indexed. If you are building an application that needs to do text searches on text files, you would generally put them in a Lucene index.

    Is this a real requirement or just an exercise? If someone came to me and told me to build a system that searches random strings in unindexed files, my first question will be "how much money do you have?" I can build a map-reduce application that spreads the file on 100 machines and searches it. Or if the file is going to be searched many times, I might load the file in a Lucene Index. However, a map-reduce/Lucene index might be overkill for some applications, and it would require investment in infrastructure.

    SO, the question is:- why are you searching for random strings in 1 GB unindexed text file? Why are you giving the cat a bath?
    Campbell Ritchie
    Sheriff

    Joined: Oct 13, 2005
    Posts: 37937
        
      22
    Luan Cestari wrote: . . . the Java String follows UTF-8, . . .
    UTF-16, surely?
    fred rosenberger
    lowercase baba
    Bartender

    Joined: Oct 02, 2003
    Posts: 11153
        
      16

    "a lot of time" isn't a very good spec. How fast do you NEED this to go? Without a defined metric, you're chasing a ghost...no matter how much you improve it, someone can say "yeah, but can it go FASTER?" You need to define how fast is fast enough so that you know when you are done.


    There are only two hard things in computer science: cache invalidation, naming things, and off-by-one errors
    Luan Cestari
    Ranch Hand

    Joined: Feb 07, 2010
    Posts: 162

    Campbell Ritchie wrote:
    Luan Cestari wrote: . . . the Java String follows UTF-8, . . .
    UTF-16, surely?


    I mean, the default encoding is UTF-8, almost sure of this like described in http://stackoverflow.com/questions/1749064/how-to-find-default-charset-encoding-in-java it may vary but the size of bytes a lot of more than ascii (I already saw some people need to do this so save memory to put into the hot memory and not waste space)
    Mike Simmons
    Ranch Hand

    Joined: Mar 05, 2008
    Posts: 2986
        
        9
    In addition to the other excellent advice above:

    If you're printing to System.out, there's a good chance that the slowness is because your terminal takes time to show you the text. Try writing to a file instead and see how that affects things. One quick & easy way to do this with no code changes is to simply redirect your output to a file. E.g. from the command line, if you're using something like

    java -cp $MY_CLASSPATH my.package.MyProgram

    change it to

    java -cp $MY_CLASSPATH my.package.MyProgram > outfile.txt

    and look for your results in outfile.txt.

    Or, in the Java program, replace the System.out with a PrintWriter wrapped around a BufferedWriter wrapped around a FileOutputStream. Or, try just a PrintWriter that writes directly to a file (new PrintWrtier(file)) - nowadays that may be optimized well enough.
    srinu pearl
    Greenhorn

    Joined: Jan 08, 2014
    Posts: 7
    This is a interview question. I tried using pattern, it worked fast. But I am just able to search the word, but not able to print the line when match found.

    int line = 0;

    Pattern pattern = Pattern.compile("invalid holding");

    txtscan = new Scanner(new File("C://logs/LM.2014-02-14.log"));

    while (txtscan.findWithinHorizon(pattern,0) != null) {
    System.out.println(line);
    line++;
    }
    Jayesh A Lalwani
    Bartender

    Joined: Jan 17, 2008
    Posts: 2271
        
      28

    Gah! I would answer the question with "It's a stupid idea to search for a random string in a random 1GB file. If someone comes to me with these requirements, I would go back to them and figure out what the real requirement that requires us to search a random string in a 1 GB file, and see if we can come up with a better solution"

    1 GB files don;t fall out of the sky. There is some system that is producing the 1GB file. There has to be a better way to interface with that system.
    T.A. Nguyen
    Ranch Hand

    Joined: Sep 02, 2008
    Posts: 36

    srinu pearl wrote:This is a interview question. I tried using pattern, it worked fast. But I am just able to search the word, but not able to print the line when match found.

    int line = 0;

    Pattern pattern = Pattern.compile("invalid holding");

    txtscan = new Scanner(new File("C://logs/LM.2014-02-14.log"));

    while (txtscan.findWithinHorizon(pattern,0) != null) {
    System.out.println(line);
    line++;
    }

    the method findWithinHorizon() return a line of data, you just need to assigned to a variable and display it.


    That should do the trick!


    T.A. Nguyen
    http://ta.cnci.org
    http://www.linkedin.com/in/nguyenta
    Mike Simmons
    Ranch Hand

    Joined: Mar 05, 2008
    Posts: 2986
        
        9
    Well, findWithinHorizon() doesn't return the whole line, just the part that matches the pattern you're searching for. If the requirement really is to print the whole line, as stated, then this doesn't quite work. You can modify the pattern to account for this, but it slows things down.

    In comparison, using a BufferedReader and contains() runs notably faster, on my machine.

    Of course, using grep is still much faster.
    Mike Simmons
    Ranch Hand

    Joined: Mar 05, 2008
    Posts: 2986
        
        9
    It's vexing, because using a regex to search for the raw text is pretty fast (pretty much the same as using a BufferedReader and contains):

    And adding quotes and a multiline flag doesn't really change that:

    You can then grab the rest of the line, without really impacting performance:

    But when you add the bit to grab the first part of the line, before your target pattern, that's what slows it down:

    I guess it's got a lot of false matches of the start of a line, that it has to deal with.

    It seems like it should be possible for a regex Pattern to improve on String.contains(), using the Boyer-Moore algorithm for example. But in practice, it doesn't seem to do any better, that I can see. It's usually slower. Of course, a regex is much more flexible, if you need something other than an exact match. It's too bad there's no method like findWithinHorizon that exposes the underlying Matcher, or MatchResult, telling you the position of the find and giving you a way of looking back at what came just before it.

    It's also vexing that all pure Java-based solutions that I see here still pale in effectiveness next to grep.

    Winston Gutkowski
    Bartender

    Joined: Mar 17, 2011
    Posts: 7505
        
      18

    srinu pearl wrote:This is a interview question.

    So presumably you're now trying to implement this after the fact; and if so, I don't see how you can possibly know whether your program is "slow" or not.

    Personally, what you wrote in your opening post seems extremely simple to me and does exactly what was asked; so I'm not quite sure what the issue is. Any program is going to take time to read a gigabyte of text; and the likelihood is that most of that time will be spent in I/O, not in your program.

    Other than that, I'm with Jayesh: You should be asking yourself why you're having to trawl a gigabyte of text to begin with; not working out how to shave a few seconds off the process.

    Winston
     
    I agree. Here's the link: http://aspose.com/file-tools
     
    subject: Performance: Search for a string in large text file of 1 GB size and print the line when match found
     
    Similar Threads
    Reading multiple txt files
    Writing to a file
    How to restrict file writing to unique occurrence of a matched word that can occur multiple times?
    Searching a large text file...
    Read Line From file