Extract all integers from a text file, how to do?

Ellen Zhao
Ranch Hand

Joined: Sep 17, 2002
Posts: 581
I am writing a class that extracts integers from a text file, counts them, sums them and calculates the average. The part that works so far is below:

Could anyone tell me how to implement the extraction step? Also, is there a more sophisticated way to do this than using StringTokenizer and BufferedReader? Thank you very much in advance.

Regards,
Ellen
[ January 01, 2003: Message edited by: Ellen Fu ]
Avi Abrami
Ranch Hand

Joined: Oct 11, 2000
Posts: 1134

Hi Ellen,
To convert a "String" to an "int", you can use the method "parseInt" in class "java.lang.Integer".
Good Luck,
Avi.
Nayanjyoti Talukdar
Ranch Hand

Joined: Feb 12, 2002
Posts: 71
Hi Ellen,
The code will give a compiler error because you have passed a BufferedReader object to the constructor of StringTokenizer, which expects a String. To solve your problem, you can read the whole file into a StringBuffer, convert it to a String with toString(), pass that to StringTokenizer and check each token. If the token is an integer, increment the counter and add the value to the running sum.
Hope this helps.
----------------
Nayan.
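The approach described above (read everything into a StringBuffer, tokenize the resulting String, test each token) could be sketched roughly as follows. The class name TokenStats and its members are made up for illustration, and the way a token is tested (attempt Integer.parseInt and skip the token if it throws) is only one assumed option, not necessarily what the poster had in mind.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.Reader;
import java.util.StringTokenizer;

// Hypothetical helper: counts and sums the integer tokens in a text.
class TokenStats {
    int count;
    long sum;

    // Reads all text from the reader into one String via a StringBuffer.
    static String readAll(Reader in) throws IOException {
        BufferedReader br = new BufferedReader(in);
        StringBuffer sb = new StringBuffer();
        String line;
        while ((line = br.readLine()) != null) {
            sb.append(line).append('\n');
        }
        return sb.toString();
    }

    // Tokenizes on the default whitespace delimiters and keeps integer tokens.
    static TokenStats scan(String text) {
        TokenStats stats = new TokenStats();
        StringTokenizer st = new StringTokenizer(text);
        while (st.hasMoreTokens()) {
            String token = st.nextToken();
            try {
                stats.sum += Integer.parseInt(token);
                stats.count++;
            } catch (NumberFormatException e) {
                // Assumption: tokens that parseInt rejects are simply skipped.
            }
        }
        return stats;
    }

    double average() {
        return count == 0 ? 0.0 : (double) sum / count;
    }
}
```

With whitespace as the only delimiter, a token such as "2003-1-2" is rejected as a whole, so dates written that way need extra delimiters or a different extraction technique.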
Ellen Zhao
Ranch Hand

Joined: Sep 17, 2002
Posts: 581
Hi,
Thank you very much. I have modified the code according to your suggestion, as below:

Thanks, Avi. I know Integer.parseInt() can be used to convert a String object to an integer. My question is this: the text file contains both text and integers, for example " Today is 2003-1-2 ". How can I tell whether the next token is an integer or not? In the example, I want to extract 2003, 1 and 2, but not " Today is ". I looked at the Java documentation, and there is no isInteger() method. Do you have any idea? Thank you very much in advance.

Regards,
Ellen
Barry Gaunt
Ranch Hand

Joined: Aug 03, 2002
Posts: 7729
Character.isDigit() ???
-Barry


Ask a Meaningful Question and HowToAskQuestionsOnJavaRanch
Getting someone to think and try something out is much more useful than just telling them the answer.
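Since the standard library has no isInteger() method, a check built on Character.isDigit() might look like the minimal sketch below. The class and method names are hypothetical, and the sketch only recognises unsigned integers.

```java
// Hypothetical helper built on Character.isDigit().
class DigitCheck {
    // Returns true if s is non-empty and every character is a digit.
    // Signs such as '-' are deliberately not accepted in this sketch.
    static boolean isUnsignedInteger(String s) {
        if (s == null || s.length() == 0) {
            return false;
        }
        for (int i = 0; i < s.length(); i++) {
            if (!Character.isDigit(s.charAt(i))) {
                return false;
            }
        }
        return true;
    }
}
```

A token that passes this check can still be too large for an int, so Integer.parseInt() may throw NumberFormatException for very long digit strings.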
Barry Gaunt
Ranch Hand

Joined: Aug 03, 2002
Posts: 7729
Regular expressions?
That's the way to do it!
Here.
-Barry
[ January 02, 2003: Message edited by: Barry Gaunt ]
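As a rough illustration of the regular expression route, the sketch below uses java.util.regex to pull every run of digits out of a string. The class name RegexInts is invented for the example, the pattern \d+ is just one reasonable choice (it ignores signs), and the generics syntax needs a newer Java than JDK 1.4.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper: extracts all unsigned integers from a string.
class RegexInts {
    // One or more consecutive digits.
    private static final Pattern DIGITS = Pattern.compile("\\d+");

    static List<Integer> extract(String text) {
        List<Integer> found = new ArrayList<>();
        Matcher m = DIGITS.matcher(text);
        while (m.find()) {
            // m.group() is the digit run that was just matched.
            found.add(Integer.parseInt(m.group()));
        }
        return found;
    }
}
```

For the input " Today is 2003-1-2 " this yields 2003, 1 and 2. A digit run too long for an int would still make Integer.parseInt() throw.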
Avi Abrami
Ranch Hand

Joined: Oct 11, 2000
Posts: 1134

Ellen,
Excuse me for misinterpreting your question. I would like to point out, though, that the regular expression package that Barry mentioned is only available with the latest JDK version -- 1.4. I don't know if that's suitable for you or not. In any case, there are also third party regular expression packages that will work with earlier JDK versions.
I would also like to mention (hopefully something you aren't already aware of ;-) that the second argument in your invocation of the "StringTokenizer" constructor:
StringTokenizer st = new StringTokenizer(br, " ");
is a list of delimiters. This means that in your code, you are only using a space character as a delimiter. Using your example of '2003-1-2', you would be able to parse it by invoking the following:
StringTokenizer st = new StringTokenizer(br," -");
in other words, a space _and_ a hyphen. Of course, you can include as many different characters as you like in the delimiter list -- but each delimiter consists of a single character only.
Hope this has helped you (a little more than my previous effort :-)
Good Luck,
Avi.
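To make the delimiter-list behaviour concrete, here is a small sketch (the class and method names are invented for illustration) that collects the tokens StringTokenizer produces for a given delimiter string. Each character in the delimiter string acts as a delimiter on its own, and runs of adjacent delimiters are skipped.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.StringTokenizer;

// Hypothetical demo of how the delimiter argument is interpreted.
class DelimiterDemo {
    static List<String> tokens(String text, String delimiters) {
        List<String> out = new ArrayList<>();
        StringTokenizer st = new StringTokenizer(text, delimiters);
        while (st.hasMoreTokens()) {
            out.add(st.nextToken());
        }
        return out;
    }
}
```

Calling tokens("Today is 2003-1-2", " -") gives Today, is, 2003, 1, 2, and the same tokens come back for "Today is 2003 -1 -2". With " " alone as the delimiter list, "2003-1-2" stays a single token.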
Nayanjyoti Talukdar
Ranch Hand

Joined: Feb 12, 2002
Posts: 71
Hi Ellen,
The code is as follows. I tried it and it works, so have a look!

Code is:

I think this will help you.
---------------
Nayan
[ January 02, 2003: Message edited by: Nayanjyoti Talukdar ]
Ellen Zhao
Ranch Hand

Joined: Sep 17, 2002
Posts: 581
Barry,
Thank you very much.

Avi,
Originally posted by Avi Abrami:
the regular expression package that Barry mentioned is only available with the latest JDK version -- 1.4.

My JDK version is 1.4.1, so the regular expression package is fine for me.
Originally posted by Avi Abrami:
This means that in your code, you are only using a space character as a delimiter. Using your example of '2003-1-2', you would be able to parse it by invoking the following:

Yes, you are right! My mistake. After reading your correction, I revisited the JDK documentation on StringTokenizer more closely. Please look at this:
StringTokenizer
public StringTokenizer(String str)
Constructs a string tokenizer for the specified string. The tokenizer uses the default delimiter set, which is " \t\n\r\f": the space character, the tab character, the newline character, the carriage-return character, and the form-feed character. Delimiter characters themselves will not be treated as tokens.

Now I wonder: are the " " and "-" in " -" treated as a sequence, or does each character act as a delimiter on its own (an "or" relationship)? For example, for a line like "Today is 2003 -1 -2" the " -" parameter will of course work, but for a line like "Today is 2003-1-2",
will the " -" parameter work or not?

Nayanjyoti,
I tried your code and it worked perfectly. Thank you very much for your help; the class is now complete.
I think your approach is very elegant for this case.
Regards,
Ellen
[ January 02, 2003: Message edited by: Ellen Fu ]
Barry Gaunt
Ranch Hand

Joined: Aug 03, 2002
Posts: 7729
Ellen, there is a problem with Nayanjyoti's solution, even though it works for you. The problem is that the program is relying upon exception handling to perform flow control. This use of exception handling is a definite NO-NO.
Sorry, no time to explain in detail, but read up on the philosophy of exceptions if you have time.
-Barry
Pauline McNamara
Sheriff

Joined: Jan 19, 2001
Posts: 4012
    
Originally posted by Barry Gaunt:
Regular expressions?
That's the way to do it!
Here.
-Barry
[ January 02, 2003: Message edited by: Barry Gaunt ]

If you go with regular expressions, Dirk's article in JavaRanch's newsletter might help.
Maulin Vasavada
Ranch Hand

Joined: Nov 04, 2001
Posts: 1871
hi all,
i feel StreamTokenizer is the best way out, isn't it?
i mean reg exp and all would be too much work for nothing...
regards
maulin
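For comparison, a minimal StreamTokenizer sketch is shown below. The class and method names are hypothetical. With the default syntax table, numbers come back as doubles in nval, '.' and '-' are treated as parts of a number, and '/', '"' and ''' have special meaning (comment and quote characters), which matters for ordinary prose.

```java
import java.io.IOException;
import java.io.StreamTokenizer;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper: collects every numeric token StreamTokenizer reports.
class StreamTokenNumbers {
    static List<Double> numbers(String text) {
        List<Double> found = new ArrayList<>();
        StreamTokenizer st = new StreamTokenizer(new StringReader(text));
        try {
            while (st.nextToken() != StreamTokenizer.TT_EOF) {
                if (st.ttype == StreamTokenizer.TT_NUMBER) {
                    found.add(st.nval); // nval is always a double
                }
            }
        } catch (IOException e) {
            // Not expected for an in-memory reader.
            throw new IllegalStateException(e);
        }
        return found;
    }
}
```

For "Today is 12 and 30" this reports 12.0 and 30.0. An input such as "123.546" is reported as one decimal number rather than two integers, and "A424242XXX" is reported as a single word, so neither behaves like a plain digit-run extractor.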
David Weitzman
Ranch Hand

Joined: Jul 27, 2001
Posts: 1365
Originally posted by Maulin Vasavada:

i mean reg exp and all would be too much work for nothing...

If by work, you mean the work of learning regular expressions, it's definitely worth the effort. They're useful in a lot of situations, from file maintenance to programming to search and replace in a modern IDE.
[ January 07, 2003: Message edited by: David Weitzman ]
Nayanjyoti Talukdar
Ranch Hand

Joined: Feb 12, 2002
Posts: 71
Hi Barry,
Can you please tell me what the problem is in my solution? Do you mean this is not the right approach?
Regards
Nayan.
Barry Gaunt
Ranch Hand

Joined: Aug 03, 2002
Posts: 7729
Hi Nayan, the problem is in this snippet:

You are using this try/catch construct in a while loop to control the flow of the loop. You are using an unchecked exception in a way it is not meant for.
Getting an unchecked exception is a BAD THING and should only be used to indicate that an unexpected, unrecoverable programming error has occurred. A program that incurs an unchecked exception should terminate, and the programming error corrected.
In Ellen's data file it is perfectly normal for non-numerical data to occur. So the program should be written using normal string processing techniques to separate numerical from non-numerical data. If a string has been selected as being numeric but Integer.parseInt() then throws a NumberFormatException, that is a valid use of unchecked exceptions because it indicates a programming error.
Using exceptions the way you have done is 1) against the philosophy of exception handling, and 2) extremely inefficient because of the way exceptions are implemented in the JVM.
In some programming assignments I have also used exceptions in an incorrect way.
For example, in a program that expects a single integer as a parameter, I have caught an ArrayIndexOutOfBoundsException or a NumberFormatException when the user entered no data or bad data. I printed a message and terminated the program. But this is not really the correct way of doing things. A missing parameter or a badly entered number is predictable, and should be programmed for. I took the lazy way, not the correct way.
BTW do not take all I have said too seriously, I am learning too. I am sure a Java Guru will correct or add to what I have written. If I am corrected I will learn from it as well as you will.
-Barry
PS I will add some references to some books I have that discuss this topic a little later.
[ January 07, 2003: Message edited by: Barry Gaunt ]
corrected names of exceptions NumericFormatException, InvalidNumericFormatException -> NumberFormatException.
[ January 08, 2003: Message edited by: Barry Gaunt ]
Maulin Vasavada
Ranch Hand

Joined: Nov 04, 2001
Posts: 1871
hi david,
Well, I didn't mean regular expressions in general. I know they are very useful. But what I thought was:
1. Regular expressions are not part of the standard Java API until JDK 1.4, so if you use JDK 1.3 you have to use some off-the-shelf package, which might bring in many classes.
I think this unnecessarily increases the code size and the requirements (having the appropriate jars etc.) if we are not going to use regular expressions anywhere else.
2. If we have JDK 1.3 and StreamTokenizer, then why not use it?
(Of course, if we had JDK 1.4 I might go with regular expressions as well.)
I am also thinking from the implementation perspective. It's easier to use regular expressions, but we have to weigh the processing they do against the processing done by StreamTokenizer to achieve the same goal. This would lead us both to the "Performance" forum, though; I just pointed out what I had in mind...
Anyway, I don't want to start a 'sub thread' in the main thread arguing 'why'. It's important that Ellen gets something working.
regards
maulin
Barry Gaunt
Ranch Hand

Joined: Aug 03, 2002
Posts: 7729
Maulin, I considered StreamTokenizer but rejected it. Is it not for parsing source code? (Too lazy to check in detail right now.) If it is, how do you decide between an integer token and an identifier like A424242XXX?
And shouldn't 123.546 tokenize as 123 and 546, not as a decimal number?
-Barry
[ January 07, 2003: Message edited by: Barry Gaunt ]
Jim Yingst
Wanderer
Sheriff

Joined: Jan 30, 2000
Posts: 18671
Originally posted by Barry Gaunt:
Getting an unchecked exception is a BAD THING and should only be used to indicate that an unexpected, unrecoverable programming error has occurred. A program that incurs an unchecked exception should terminate, and the programming error corrected.

Sounds like you're overstating the case here. There are many unchecked exceptions that are recoverable. Perhaps you're thinking of Errors? It's also true that many RuntimeExceptions indicate programming errors, e.g. ArrayIndexOutOfBoundsException, or even NullPointerException. The only reason these should be thrown is if the programmer has screwed up. I refer to these as SPEs - StupidProgrammerExceptions. However there are other RuntimeExceptions which are caused by things outside the programmer's control, and which it's perfectly valid to catch and deal with. NumberFormatException is just one such example. Consider the following:
How would you rewrite this code to accomplish the same thing without catching the NumberFormatException?
Another consideration is that even for StupidProgrammerExceptions, exiting the process may not be the best solution. During initial development, sure. But what if you're delivering a finished program to a customer? Can you be 100% sure you've found and removed all stupid programmer bugs? Probably not. Yet in many cases it's not desirable to shut down a program entirely. Many applications will choose to log the error, possibly notify the user about it, and try to move on. Certainly the next release of the software should fix the bug - but the customer may need to continue to use the software in the meantime.
Anyway, back to the particular problem at hand. I think that the appropriateness of catching NumberFormatException here depends primarily on whether an invalid input is considered normal, or an unusual occurrence. In this problem, it's apparently quite normal, and so using an exception here may be confusing to programmers who assume that it implies an unusual circumstance. And it will certainly be a bit slower (though this won't actually matter for many programs, as the difference between, say, 1 millisecond and 5 milliseconds, is not a big deal). So I'd agree, it's not the best solution here. But I wanted to trim some of the overstatement I saw in your reply.
Ellen- I think you need to consider all the different types of input you might see. What about something like this:
hgj36yuitog350ugog4w4er1
Should this yield 5 numbers (36, 350, 4, 4, 1)? Or is it something you shouldn't have to worry about? Can you make any assumptions or guarantees about the delimiters between numbers, or should you assume that any non-numeric character acts as a delimiter? If it's the latter, you might be better off just looping through all the chars in a line and counting how many times the character type changes from numeric to non-numeric and vice versa. Something to think about...
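If any non-numeric character is treated as a delimiter, the character loop suggested above might be sketched like this. The class name DigitRuns is invented, and the sketch collects the digit runs themselves rather than only counting transitions.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper: every maximal run of digits becomes one integer.
class DigitRuns {
    static List<Integer> extract(String line) {
        List<Integer> found = new ArrayList<>();
        int start = -1; // index where the current digit run began, or -1 if none
        // Iterate one position past the end so a trailing run is flushed too.
        for (int i = 0; i <= line.length(); i++) {
            boolean digit = i < line.length() && Character.isDigit(line.charAt(i));
            if (digit && start < 0) {
                start = i; // non-numeric -> numeric transition
            } else if (!digit && start >= 0) {
                // numeric -> non-numeric transition: the run is complete
                found.add(Integer.parseInt(line.substring(start, i)));
                start = -1;
            }
        }
        return found;
    }
}
```

For "hgj36yuitog350ugog4w4er1" this returns 36, 350, 4, 4 and 1.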


"I'm not back." - Bill Harding, Twister
David Weitzman
Ranch Hand

Joined: Jul 27, 2001
Posts: 1365
This is actually a good occasion for using finite automata (I just mentioned them in another thread, so I'm in a String parsing mood). All you need is a state machine that looks something like this. Note that this state machine only looks for integers separated by whitespace, so ab213 and 1-2-3 wouldn't count. You can design your own specs:

I could have done a cleaner job of organizing this state machine logically, but this should be comprehensible.
If you're interested, I can reply later with a description of how to turn this gunk above (or similar gunk) into working code.
[ January 07, 2003: Message edited by: David Weitzman ]
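A hand-written state machine for the same specification (integers delimited by whitespace, so ab213 and 1-2-3 do not count) could look roughly like the sketch below. The three states and all names are assumptions made for illustration and are not taken from the diagram in the post.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical three-state machine: BETWEEN tokens, inside a NUMBER, inside a WORD.
class IntStateMachine {
    private enum State { BETWEEN, NUMBER, WORD }

    static List<Integer> extract(String text) {
        List<Integer> found = new ArrayList<>();
        State state = State.BETWEEN;
        int start = 0; // start index of the number currently being read
        // One extra iteration with a virtual trailing space flushes the last token.
        for (int i = 0; i <= text.length(); i++) {
            char c = i < text.length() ? text.charAt(i) : ' ';
            boolean space = Character.isWhitespace(c);
            boolean digit = c >= '0' && c <= '9';
            switch (state) {
                case BETWEEN:
                    if (digit) {
                        state = State.NUMBER;
                        start = i;
                    } else if (!space) {
                        state = State.WORD;
                    }
                    break;
                case NUMBER:
                    if (space) {
                        found.add(Integer.parseInt(text.substring(start, i)));
                        state = State.BETWEEN;
                    } else if (!digit) {
                        state = State.WORD; // e.g. "1-2-3" or "12ab" is not an integer
                    }
                    break;
                case WORD:
                    if (space) {
                        state = State.BETWEEN;
                    }
                    break;
            }
        }
        return found;
    }
}
```

For "Today 12 ab213 1-2-3 7" only 12 and 7 are reported. Signs are not handled, and a digit run too long for an int would make Integer.parseInt() throw.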
Barry Gaunt
Ranch Hand

Joined: Aug 03, 2002
Posts: 7729
Nayan, pay attention to Jim's comment. He's one of the Java Gurus I told you about.
Thanks Jim, I understand your example, and have done a similar thing myself to handle erroneous input; but only to handle "occasional errors" like errors in program startup parameters.
BTW I promised the references:
Arnold, Gosling, Holmes "The Java Programming Language", 3rd Ed Sun/Addison Wesley
Bloch "Effective Java, Programming Language Guide", Sun/Addison Wesley
Nayanjyoti Talukdar
Ranch Hand

Joined: Feb 12, 2002
Posts: 71
Thanks, Jim, for the nice explanation. I was also wondering about what Barry said regarding unchecked exceptions (i.e. NumberFormatException). I agree that unchecked exceptions are generally left to the JVM, since they are outside the programmer's control. If we do have control over one, we programmers can also handle it, as I did in my code. And I agree that, as far as design is concerned, that may not be the optimal solution, as Jim said. I only raised it because you said I had some problems in my code. Anyway, that was a nice discussion.
-------------
Nayan.
Ellen Zhao
Ranch Hand

Joined: Sep 17, 2002
Posts: 581
Hi, all the people above,
I didn't come back to this thread for more than a week (I was busy with Java concurrent programming). It's astonishing and a great pleasure to see so much good advice offered by you. Thank you very much!
To Maulin and Barry: I think I prefer using regular expressions to modify my code. I don't know how to use StreamTokenizer (or StringTokenizer?) to analyse the data in my situation.
To David: Good idea to implement a finite state automaton! I think that may actually be how regular expressions are handled at a low level in their implementation. Recently my algorithms course reached String Matching, and I have been building several kinds of finite state automata for a while. I will be very glad to learn from you how to convert them efficiently to Java code.
To Jim: Good consideration! I forgot to think about it. I should separate strings like NH76dse23 from integers like 23872. Maybe I should implement the term "property" to avoid some unwanted invalid conditions.
Thanks to you all again. I'm always glad to learn from you.

Best Regards,
Ellen
[ January 13, 2003: Message edited by: Ellen Fu ]
David Weitzman
Ranch Hand

Joined: Jul 27, 2001
Posts: 1365
First, I implemented the state machine using a nifty little tool called SMC (State Machine Compiler) which is built off Robert C. Martin's idea and code. Unless you are seriously concerned with speed, that's a good way to go.
NumFinder.sm (I used tabs for alignment, which has gotten a bit mucked up in this presentation. Sorry about that.):

Using that description as a guide, I also wrote by hand a more optimized version
NumStateMachine.java:

This code exhibits some bad decisions when it comes to good design (the if statements don't make sense unless you're looking at the definition above), but it should run pretty fast. It may contain mistakes, which is the risk of writing such nonsense by hand without corresponding unit tests.
The controller will take either an SMC generated state machine or the hand generated one (although you have to uncomment one line and comment out another to switch between them). Note that I used a cheap trick of multiplying by 10 and adding instead of building a String and calling Integer.parseInt(), which may lead to silent overflows if a large number like 230938173898927137273432 should happen to show up in the text being parsed.
NumFinder.java:

That's my share of hacking for the day.
[ January 14, 2003: Message edited by: David Weitzman ]
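The multiply-by-10-and-add trick mentioned above, together with one way of turning the silent overflow into an explicit error, might be sketched as follows. The names are hypothetical, and Math.multiplyExact and Math.addExact require Java 8 or later, which is well beyond the JDK 1.4 used in this thread.

```java
// Hypothetical helper: accumulates decimal digits without building a String for parseInt.
class DigitAccumulator {
    // Throws ArithmeticException instead of silently wrapping around on overflow.
    // Note: an empty string yields 0 in this sketch.
    static int parseDigits(String digits) {
        int value = 0;
        for (int i = 0; i < digits.length(); i++) {
            int d = digits.charAt(i) - '0';
            if (d < 0 || d > 9) {
                throw new IllegalArgumentException("not a digit: " + digits.charAt(i));
            }
            // value = value * 10 + d, with overflow checks on both steps
            value = Math.addExact(Math.multiplyExact(value, 10), d);
        }
        return value;
    }
}
```

parseDigits("2003") returns 2003, while a digit string such as 230938173898927137273432 raises ArithmeticException rather than producing a wrapped value.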
Ellen Zhao
Ranch Hand

Joined: Sep 17, 2002
Posts: 581
Hi David,
Brilliant code! Collected into my util.jar. The SMC resource is also definitely useful to me. Thank you very much!

Best Regards,
Ellen
[ January 16, 2003: Message edited by: Ellen Fu ]
 