Extracting sentences from a text file

Ayan Biswas
Ranch Hand

Joined: Jul 10, 2010
Posts: 104
I need to write a program that will extract sentences from a text file. If I use '.' as a delimiter and split the text on it, then each part of an acronym becomes a sentence! How do I solve this problem?


AyanBiswas
Henry Wong
author
Sheriff

Joined: Sep 28, 2004
Posts: 18544
    

Ayan Biswas wrote: I need to write a program that will extract sentences from a text file. If I use '.' as a delimiter and split the text on it, then each part of an acronym becomes a sentence! How do I solve this problem?



One option is to further qualify your definition of what is a sentence. For example, if a sentence must be longer than one word, or longer than two letters, wouldn't that take care of your false positives from acronyms?
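To make the idea concrete, here is a rough sketch of that kind of qualification. The class name, the two-word threshold, and the choice to glue a too-short fragment onto the previous sentence are all assumptions for illustration; only the OP can say what the real rule should be.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not from any library: splits on '.' and refuses to
// treat very short fragments as sentences of their own.
class NaiveSentenceSplitter {
    // Assumed threshold: a fragment needs at least two words to be a sentence.
    static final int MIN_WORDS = 2;

    static List<String> split(String text) {
        List<String> sentences = new ArrayList<>();
        for (String fragment : text.split("\\.")) {
            String trimmed = fragment.trim();
            if (trimmed.isEmpty()) {
                continue; // consecutive dots or stray whitespace
            }
            int words = trimmed.split("\\s+").length;
            if (words < MIN_WORDS && !sentences.isEmpty()) {
                // Too short to stand alone: glue it onto the previous sentence.
                int last = sentences.size() - 1;
                sentences.set(last, sentences.get(last) + trimmed + ".");
            } else {
                // Simplification: a dot is re-added even if the input lacked a final one.
                sentences.add(trimmed + ".");
            }
        }
        return sentences;
    }
}
```

Calling `NaiveSentenceSplitter.split("I like tea. It is hot.")` gives the two sentences you would expect. An acronym in the middle of longer text will still trip it up, so treat this as a starting point rather than a complete answer.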

Henry


Books: Java Threads, 3rd Edition, Jini in a Nutshell, and Java Gems (contributor)
Ayan Biswas
Ranch Hand

Joined: Jul 10, 2010
Posts: 104
Henry Wong wrote: One option is to further qualify your definition of what is a sentence. For example, if a sentence must be longer than one word, or longer than two letters, wouldn't that take care of your false positives from acronyms?


Here is the problem if I follow that suggestion.
Suppose the sentence looks like this: "<some text> U.S.A <some text>". The problem will persist in that case.
Ayan Biswas
Ranch Hand

Joined: Jul 10, 2010
Posts: 104
"<some text> U" will be the first sentence. "S" will be the next sentence (which I can append to "U", since its word count is 1), and "A <some text>" will be the last sentence. So the problem persists in the last sentence.
Rob Spoor
Sheriff

Joined: Oct 27, 2005
Posts: 19655
    

Your definition of a sentence end is not correct. A dot (or question mark, exclamation mark, etc.) doesn't necessarily end a sentence. You could regard a dot, question mark or exclamation mark as the end of a sentence, but only if it is followed by whitespace (space, line break, tab, etc.) or by nothing at all (end of the String). This is the approach that Javadoc also uses.
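A minimal sketch of that rule, using a regular expression with a lookbehind so the punctuation stays attached to its sentence. The class and method names are made up for this example, and this is not how Javadoc itself is implemented, just the same idea.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper: a sentence ends at '.', '?' or '!' when that character
// is followed by whitespace or by the end of the text.
class WhitespaceSentenceSplitter {
    static List<String> split(String text) {
        List<String> sentences = new ArrayList<>();
        String trimmed = text.trim();
        if (trimmed.isEmpty()) {
            return sentences;
        }
        // Split on runs of whitespace that come right after a terminator.
        // The lookbehind (?<=...) matches without consuming the terminator.
        for (String s : trimmed.split("(?<=[.?!])\\s+")) {
            sentences.add(s);
        }
        return sentences;
    }
}
```

With this, "Visit the U.S.A today. Is it far? No!" comes out as three sentences, and the dots inside "U.S.A" are left alone because none of them is followed by whitespace.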

That's still flawed, however, as the sentence would end with U.S.A. even if there's something after it. Javadoc also has this problem; I've seen several Javadoc comments in the summary list end with "i.e.". We need to redefine what a sentence end is. You can expand the previous definition to require that the next word start with an uppercase letter. However, that will still be incorrect if a name, or anything else that starts with an uppercase letter, follows an acronym. It becomes evident that full sentence recognition is still not trivial (or even possible?) to do from code.
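Here is a sketch of the expanded definition, where a terminator only ends a sentence if the next word starts with an uppercase letter. Again the names are invented for the example, and `\p{Lu}` (any Unicode uppercase letter) is my choice for "uppercase".

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper: a sentence ends at '.', '?' or '!' followed by
// whitespace, but only when the next word starts with an uppercase letter.
class CapitalizedSentenceSplitter {
    static List<String> split(String text) {
        List<String> sentences = new ArrayList<>();
        String trimmed = text.trim();
        if (trimmed.isEmpty()) {
            return sentences;
        }
        // Lookbehind keeps the terminator; lookahead demands an uppercase letter.
        // A sentence that opens with a digit or a quote mark will not be detected.
        for (String s : trimmed.split("(?<=[.?!])\\s+(?=\\p{Lu})")) {
            sentences.add(s);
        }
        return sentences;
    }
}
```

"It is in the U.S.A. now. Really?" is now split correctly into two sentences. "He works for the U.S. Navy now." is still cut in two after "U.S.", which is exactly the name-after-acronym case.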


SCJP 1.4 - SCJP 6 - SCWCD 5 - OCEEJBD 6
Henry Wong
author
Sheriff

Joined: Sep 28, 2004
Posts: 18544
    

This is why my response was to further qualify your definition of what is a sentence -- and the rest of the response was just examples.

Only the OP knows the exact definition of what a sentence is, and hence only the OP is able to qualify it correctly. Now, of course, if the definition is the one used in any generic text, then it is very difficult, if not impossible.

Henry
Ayan Biswas
Ranch Hand

Joined: Jul 10, 2010
Posts: 104
Thanks for all the replies.
 