Hi friends, I am developing a Java application in which I need to extract the text content from web pages and then summarize it based on a keyword given by the user. I have already extracted the text content, but I still need to summarize it based on the given keyword. Are there any Java tools that can help me with this problem, or could someone send me some code that breaks the text into smaller pieces? Thanks in advance, Pradeep
Jan Groth
Ranch Hand
Joined: Feb 03, 2004
Posts: 456
There is no easy way to achieve this. It sounds like you need a search engine that indexes the text for you.
BTW: if it is not a must, you can save yourself the detour of extracting the text from the web page...
I'm not aware of a text summarization API in Java. Lucene lets you index and search text, but it does not address summarization. I'm also not sure what you mean by "summarize it based on a keyword" - do you want to extract those parts of the text that deal with that particular keyword?
William Brogden
Author and all-around good cowpoke
Rancher
Joined: Mar 22, 2000
Posts: 12269
You need to parse the text into units that make sense to humans: phrases, sentences and paragraphs. Next, score those units according to the presence of the keyword(s). Then select the best of the units that are "hits" according to typical writing principles and the size of the summary you are aiming for.
What do I mean by writing principles? Think about how you yourself scan text. For example, you expect the first sentence of a paragraph to be meaningful in terms of the content of that paragraph. You also expect a good chance that the last paragraph of an article summarizes the article.
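Here is a rough sketch of how the parse, score and select steps could look in plain Java. Everything in it is an assumption made for illustration: the class name KeywordSummarizer, the naive regex sentence splitter (java.text.BreakIterator would be a more robust choice), and the arbitrary 0.5 bonus given to a hit that opens a paragraph or sits in the last paragraph. Treat it as a starting point, not a finished summarizer.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.Locale;
import java.util.stream.Collectors;
import java.util.stream.IntStream;

class KeywordSummarizer {

    // Naive sentence splitter: breaks after '.', '!' or '?' followed by whitespace.
    static List<String> splitSentences(String paragraph) {
        List<String> out = new ArrayList<>();
        for (String s : paragraph.trim().split("(?<=[.!?])\\s+")) {
            if (!s.isEmpty()) out.add(s);
        }
        return out;
    }

    // Counts whole-word, case-insensitive occurrences of the keyword.
    static int countHits(String sentence, String keyword) {
        String key = keyword.toLowerCase(Locale.ROOT);
        int hits = 0;
        for (String word : sentence.toLowerCase(Locale.ROOT).split("\\W+")) {
            if (word.equals(key)) hits++;
        }
        return hits;
    }

    // Returns up to maxSentences "hit" sentences, best scores first,
    // then restored to their original order in the text.
    static List<String> summarize(String text, String keyword, int maxSentences) {
        List<String> sentences = new ArrayList<>();
        List<Double> scores = new ArrayList<>();
        String[] paragraphs = text.trim().split("\\n\\s*\\n"); // blank line = paragraph break
        for (int p = 0; p < paragraphs.length; p++) {
            List<String> units = splitSentences(paragraphs[p]);
            for (int s = 0; s < units.size(); s++) {
                int hits = countHits(units.get(s), keyword);
                double score = hits;
                // Assumed weights: favour hits that open a paragraph
                // or that appear in the closing paragraph.
                if (hits > 0 && s == 0) score += 0.5;
                if (hits > 0 && p == paragraphs.length - 1) score += 0.5;
                sentences.add(units.get(s));
                scores.add(score);
            }
        }
        return IntStream.range(0, sentences.size())
                .filter(i -> scores.get(i) > 0)          // keep only hits
                .boxed()
                .sorted(Comparator.comparingDouble((Integer i) -> -scores.get(i)))
                .limit(maxSentences)                     // size of the summary
                .sorted()                                // back to reading order
                .map(sentences::get)
                .collect(Collectors.toList());
    }
}
```

Calling KeywordSummarizer.summarize(text, "Lucene", 3) returns at most three sentences that mention the keyword, in the order they appear in the text. The weights are the part you would want to tune against real articles.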
In the prehistoric era of computers (showing my age now) there was an indexing technique called KWIC - Key Word In Context. It created a listing with the n words preceding a key word plus the n words following. This put a burden on the reader to recognize a significant context versus a trivial one.
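For anyone who has not seen it, a KWIC-style listing is easy to approximate. The sketch below is my own minimal take, not a reconstruction of any historical tool: the class name Kwic and the method contexts are made up, words are split on whitespace only, and punctuation is stripped just for the keyword comparison.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

class Kwic {

    // Returns one line per occurrence of the keyword: up to n words before it,
    // the keyword itself, and up to n words after it.
    static List<String> contexts(String text, String keyword, int n) {
        String[] words = text.trim().split("\\s+");
        List<String> lines = new ArrayList<>();
        for (int i = 0; i < words.length; i++) {
            // Strip surrounding punctuation before comparing, e.g. "fox," -> "fox".
            String bare = words[i].replaceAll("^\\W+|\\W+$", "");
            if (bare.equalsIgnoreCase(keyword)) {
                int from = Math.max(0, i - n);                 // clamp at start of text
                int to = Math.min(words.length, i + n + 1);    // clamp at end of text
                lines.add(String.join(" ", Arrays.copyOfRange(words, from, to)));
            }
        }
        return lines;
    }
}
```

With n = 2, the text "The quick brown fox jumps over the lazy dog while another fox sleeps." and the keyword "fox" give two lines: "quick brown fox jumps over" and "while another fox sleeps.". As noted above, it is still up to the reader to decide which contexts are significant.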
This is a topic of continued interest to me, so let us know what you come up with. Bill