• Post Reply Bookmark Topic Watch Topic
  • New Topic
programming forums Java Mobile Certification Databases Caching Books Engineering Micro Controllers OS Languages Paradigms IDEs Build Tools Frameworks Application Servers Open Source This Site Careers Other Pie Elite all forums
this forum made possible by our volunteer staff, including ...
Marshals:
  • Campbell Ritchie
  • Jeanne Boyarsky
  • Ron McLeod
  • Paul Clapham
  • Liutauras Vilda
Sheriffs:
  • paul wheaton
  • Rob Spoor
  • Devaka Cooray
Saloon Keepers:
  • Stephan van Hulst
  • Tim Holloway
  • Carey Brown
  • Frits Walraven
  • Tim Moores
Bartenders:
  • Mikalai Zaikin

Performance: Search for a string in large text file of 1 GB size and print the line when match found

 
Greenhorn
Posts: 7
  • Mark post as helpful
  • send pies
    Number of slices to send:
    Optional 'thank-you' note:
  • Quote
  • Report post to moderator
I have a large text file of 1 GB size. I need to print the line when a matching word is found in a particular line. Below is the code I am using. But if there are many lines that has the matching word, it's taking lot of time. Can anyone suggest a solution to print the lines much faster.


Scanner scanner = new Scanner(file);
while (scanner.hasNextLine()) {
String line = scanner.nextLine();
if(line.contains("xyz")) {

System.out.println(line);

}
}
 
Marshal
Posts: 28193
95
Eclipse IDE Firefox Browser MySQL Database
  • Likes 1
  • Mark post as helpful
  • send pies
    Number of slices to send:
    Optional 'thank-you' note:
  • Quote
  • Report post to moderator
Here's an experiment you can do: Take out the if-statement and just leave the code which reads the file but does nothing else. How much time does that take?
 
Marshal
Posts: 79178
377
  • Likes 1
  • Mark post as helpful
  • send pies
    Number of slices to send:
    Optional 'thank-you' note:
  • Quote
  • Report post to moderator
Then work out how much of the time is spent looking for the line and how much printing it.

Moving this discussion to our performance forum.
 
Rancher
Posts: 43081
77
  • Likes 1
  • Mark post as helpful
  • send pies
    Number of slices to send:
    Optional 'thank-you' note:
  • Quote
  • Report post to moderator
Why do you want to do this in Java? A tool like grep -available on any Unix/Linux-derived OS, and also on Windows via UnxUtils- likely would be much faster. It could be invoked from Java code via Runtime.exec or ProcessBuilder if need be.
 
Bartender
Posts: 10780
71
Hibernate Eclipse IDE Ubuntu
  • Mark post as helpful
  • send pies
    Number of slices to send:
    Optional 'thank-you' note:
  • Quote
  • Report post to moderator

srinu pearl wrote:I have a large text file of 1 GB size. I need to print the line when a matching word is found in a particular line. Below is the code I am using. But if there are many lines that has the matching word, it's taking lot of time.


So: it's only taking a lot of time when there are lots of lines containing the string? That suggests that maybe the print side is what's slowing you down; but as Paul and the others have said, you really need to do some tests to work out where your bottlenecks are.

A few thoughts:
  • I suspect a BufferedReader might be faster than a Scanner, but I don't know.
  • If you want the mental exercise, and your searches are simply for fixed strings (no "wildcard" characters) you could try writing a B-M-H search. However, it's more of a curio than anything else, and you may find that it makes very little difference at all.
  • Alternatively, have a look at the Pattern (java.util.regex.Pattern) and Matcher (java.util.regex.Matcher) classes and have a go at implementing a more 'grep'-like search. This may well actually make things slower, but you might have a more useful program as a result.

  • However, whatever you decide, efficiency should come a distant second to correctness. Slow or not, what you have right now is about as simple as it gets, so you can be pretty darn sure that it's right.

    HIH

    Winston
     
    Ranch Hand
    Posts: 172
    Redhat Ruby C++
    • Mark post as helpful
    • send pies
      Number of slices to send:
      Optional 'thank-you' note:
    • Quote
    • Report post to moderator
    Scanner is very slow. You need to optimize the whole algorithm there. First try to minimize the IO, moving the whole string into the memory. If it is too large, as the Java String follows UTF-8, if you file is just an ascii text, you may create a byte[] to load the whole text. You may also try different approach of IO/NIO/Selector to read the file. A more simple approach is to use bufferedreader. Then you might run the whole string looking for the pattern. You may use regex or stringtokenizer (this one is faster in my tests) (or comparing the bytes, if you choose to go extreme to low level).

    As people suggested, you may also rely on some operation system feature, as it is low level code and it is also very stable (in case of linux). You would just need to to integrate, capturing the output of the other process. Refer to http://docs.oracle.com/javase/7/docs/api/java/lang/ProcessBuilder.html to create a process from Java code.

    Regards
     
    Rancher
    Posts: 2759
    32
    Eclipse IDE Spring Tomcat Server
    • Likes 3
    • Mark post as helpful
    • send pies
      Number of slices to send:
      Optional 'thank-you' note:
    • Quote
    • Report post to moderator
    Searching for a random string in a 1GB text file is like giving a cat a bath. You can do it, but there's no good way of doing it. The larger question is why are you giving this cat a bath at all? Most applications don't really need to search for random strings in 1GB files. If you are building an application that stores data in files, you would design your files so that they would be indexed. If you are building an application that needs to do text searches on text files, you would generally put them in a Lucene index.

    Is this a real requirement or just an exercise? If someone came to me and told me to build a system that searches random strings in unindexed files, my first question will be "how much money do you have?" I can build a map-reduce application that spreads the file on 100 machines and searches it. Or if the file is going to be searched many times, I might load the file in a Lucene Index. However, a map-reduce/Lucene index might be overkill for some applications, and it would require investment in infrastructure.

    SO, the question is:- why are you searching for random strings in 1 GB unindexed text file? Why are you giving the cat a bath?
     
    Campbell Ritchie
    Marshal
    Posts: 79178
    377
    • Mark post as helpful
    • send pies
      Number of slices to send:
      Optional 'thank-you' note:
    • Quote
    • Report post to moderator

    Luan Cestari wrote: . . . the Java String follows UTF-8, . . .

    UTF-16, surely?
     
    lowercase baba
    Posts: 13089
    67
    Chrome Java Linux
    • Mark post as helpful
    • send pies
      Number of slices to send:
      Optional 'thank-you' note:
    • Quote
    • Report post to moderator
    "a lot of time" isn't a very good spec. How fast do you NEED this to go? Without a defined metric, you're chasing a ghost...no matter how much you improve it, someone can say "yeah, but can it go FASTER?" You need to define how fast is fast enough so that you know when you are done.
     
    Luan Cestari
    Ranch Hand
    Posts: 172
    Redhat Ruby C++
    • Mark post as helpful
    • send pies
      Number of slices to send:
      Optional 'thank-you' note:
    • Quote
    • Report post to moderator

    Campbell Ritchie wrote:

    Luan Cestari wrote: . . . the Java String follows UTF-8, . . .

    UTF-16, surely?



    I mean, the default encoding is UTF-8, almost sure of this like described in http://stackoverflow.com/questions/1749064/how-to-find-default-charset-encoding-in-java it may vary but the size of bytes a lot of more than ascii (I already saw some people need to do this so save memory to put into the hot memory and not waste space)
     
    Master Rancher
    Posts: 4806
    72
    • Likes 1
    • Mark post as helpful
    • send pies
      Number of slices to send:
      Optional 'thank-you' note:
    • Quote
    • Report post to moderator
    In addition to the other excellent advice above:

    If you're printing to System.out, there's a good chance that the slowness is because your terminal takes time to show you the text. Try writing to a file instead and see how that affects things. One quick & easy way to do this with no code changes is to simply redirect your output to a file. E.g. from the command line, if you're using something like

    java -cp $MY_CLASSPATH my.package.MyProgram

    change it to

    java -cp $MY_CLASSPATH my.package.MyProgram > outfile.txt

    and look for your results in outfile.txt.

    Or, in the Java program, replace the System.out with a PrintWriter wrapped around a BufferedWriter wrapped around a FileOutputStream. Or, try just a PrintWriter that writes directly to a file (new PrintWrtier(file)) - nowadays that may be optimized well enough.
     
    srinu pearl
    Greenhorn
    Posts: 7
    • Mark post as helpful
    • send pies
      Number of slices to send:
      Optional 'thank-you' note:
    • Quote
    • Report post to moderator
    This is a interview question. I tried using pattern, it worked fast. But I am just able to search the word, but not able to print the line when match found.

    int line = 0;

    Pattern pattern = Pattern.compile("invalid holding");

    txtscan = new Scanner(new File("C://logs/LM.2014-02-14.log"));

    while (txtscan.findWithinHorizon(pattern,0) != null) {
    System.out.println(line);
    line++;
    }
     
    Jayesh A Lalwani
    Rancher
    Posts: 2759
    32
    Eclipse IDE Spring Tomcat Server
    • Likes 1
    • Mark post as helpful
    • send pies
      Number of slices to send:
      Optional 'thank-you' note:
    • Quote
    • Report post to moderator
    Gah! I would answer the question with "It's a stupid idea to search for a random string in a random 1GB file. If someone comes to me with these requirements, I would go back to them and figure out what the real requirement that requires us to search a random string in a 1 GB file, and see if we can come up with a better solution"

    1 GB files don;t fall out of the sky. There is some system that is producing the 1GB file. There has to be a better way to interface with that system.
     
    Ranch Hand
    Posts: 36
    Eclipse IDE Java ME Oracle
    • Mark post as helpful
    • send pies
      Number of slices to send:
      Optional 'thank-you' note:
    • Quote
    • Report post to moderator

    srinu pearl wrote:This is a interview question. I tried using pattern, it worked fast. But I am just able to search the word, but not able to print the line when match found.

    int line = 0;

    Pattern pattern = Pattern.compile("invalid holding");

    txtscan = new Scanner(new File("C://logs/LM.2014-02-14.log"));

    while (txtscan.findWithinHorizon(pattern,0) != null) {
    System.out.println(line);
    line++;
    }


    the method findWithinHorizon() return a line of data, you just need to assigned to a variable and display it.


    That should do the trick!
     
    Mike Simmons
    Master Rancher
    Posts: 4806
    72
    • Mark post as helpful
    • send pies
      Number of slices to send:
      Optional 'thank-you' note:
    • Quote
    • Report post to moderator
    Well, findWithinHorizon() doesn't return the whole line, just the part that matches the pattern you're searching for. If the requirement really is to print the whole line, as stated, then this doesn't quite work. You can modify the pattern to account for this, but it slows things down.

    In comparison, using a BufferedReader and contains() runs notably faster, on my machine.

    Of course, using grep is still much faster.
     
    Mike Simmons
    Master Rancher
    Posts: 4806
    72
    • Mark post as helpful
    • send pies
      Number of slices to send:
      Optional 'thank-you' note:
    • Quote
    • Report post to moderator
    It's vexing, because using a regex to search for the raw text is pretty fast (pretty much the same as using a BufferedReader and contains):

    And adding quotes and a multiline flag doesn't really change that:

    You can then grab the rest of the line, without really impacting performance:

    But when you add the bit to grab the first part of the line, before your target pattern, that's what slows it down:

    I guess it's got a lot of false matches of the start of a line, that it has to deal with.

    It seems like it should be possible for a regex Pattern to improve on String.contains(), using the Boyer-Moore algorithm for example. But in practice, it doesn't seem to do any better, that I can see. It's usually slower. Of course, a regex is much more flexible, if you need something other than an exact match. It's too bad there's no method like findWithinHorizon that exposes the underlying Matcher, or MatchResult, telling you the position of the find and giving you a way of looking back at what came just before it.

    It's also vexing that all pure Java-based solutions that I see here still pale in effectiveness next to grep.

     
    Winston Gutkowski
    Bartender
    Posts: 10780
    71
    Hibernate Eclipse IDE Ubuntu
    • Mark post as helpful
    • send pies
      Number of slices to send:
      Optional 'thank-you' note:
    • Quote
    • Report post to moderator

    srinu pearl wrote:This is a interview question.


    So presumably you're now trying to implement this after the fact; and if so, I don't see how you can possibly know whether your program is "slow" or not.

    Personally, what you wrote in your opening post seems extremely simple to me and does exactly what was asked; so I'm not quite sure what the issue is. Any program is going to take time to read a gigabyte of text; and the likelihood is that most of that time will be spent in I/O, not in your program.

    Other than that, I'm with Jayesh: You should be asking yourself why you're having to trawl a gigabyte of text to begin with; not working out how to shave a few seconds off the process.

    Winston
     
    Don't get me started about those stupid light bulbs.
    reply
      Bookmark Topic Watch Topic
    • New Topic