HTML to plain text parser

Chetan Parekh

Ranch Hand

Posts: 3640

posted 18 years ago

I have HTML content stored in a database.

I want to retrieve it and display it as unformatted text.

Is there any utility that parses HTML content into plain text?

e.g.
What I have in the database:
<p><b>Chetan Parekh</b></p>

What I need:
Chetan Parekh

My blood is tested +ve for Java.

Anoop Chandran

Greenhorn

Posts: 4

posted 18 years ago

You may need to write a parser that looks for HTML tags and strips them from the given String. I hope you are getting the data from the DB as a Blob.
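
A hand-written stripper along those lines could look roughly like the sketch below. It is only an illustration: the class and method names are made up, and it assumes that everything between a '<' and the next '>' is a tag. It does not decode entities such as &amp; and does not treat comments or script blocks specially.

```java
final class TagStripper {

    /** Copies every character that is not between '<' and the next '>'. */
    static String strip(String html) {
        StringBuilder out = new StringBuilder(html.length());
        boolean inTag = false;
        for (int i = 0; i < html.length(); i++) {
            char c = html.charAt(i);
            if (c == '<') {
                inTag = true;            // start skipping at an opening bracket
            } else if (c == '>' && inTag) {
                inTag = false;           // tag is closed, resume copying
            } else if (!inTag) {
                out.append(c);           // ordinary text outside any tag
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(strip("<p><b>Chetan Parekh</b></p>")); // prints "Chetan Parekh"
    }
}
```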

Chetan Parekh

Ranch Hand

Posts: 3640

posted 18 years ago

Originally posted by Anoop Chandran:
You may need to write a parser that looks for HTML tags and strips them from the given String.

I am looking for a ready-made parser that does the same. Are there any?

I hope you are getting the data from the DB as a Blob.

You are right.
[ December 16, 2005: Message edited by: Chetan Parekh ]

Ulf Dittmer

Rancher

Posts: 43081

posted 18 years ago

NekoHTML is an HTML parser which produces a DOM tree. I'm not sure if it can export the plain text, but it should be a good and easy starting point.

I don't think you need to store HTML as a Blob; a Clob should be sufficient, and it would be easier to work with.
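
If adding a library is not an option, the HTML parser that ships with the JDK (in javax.swing.text.html) can also hand back just the text. The sketch below uses that parser, not NekoHTML; the class and method names are invented for the example, and joining the text chunks with a single space is an assumption about the desired output.

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import javax.swing.text.html.HTMLEditorKit;
import javax.swing.text.html.parser.ParserDelegator;

final class HtmlToText {

    /** Parses the HTML and returns only the text nodes, separated by spaces. */
    static String toPlainText(String html) {
        final StringBuilder text = new StringBuilder();
        HTMLEditorKit.ParserCallback callback = new HTMLEditorKit.ParserCallback() {
            @Override
            public void handleText(char[] data, int pos) {
                if (text.length() > 0) {
                    text.append(' ');    // keep words from adjacent elements apart
                }
                text.append(data);
            }
        };
        try {
            // true = ignore any charset declared inside the document
            new ParserDelegator().parse(new StringReader(html), callback, true);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return text.toString();
    }

    public static void main(String[] args) {
        System.out.println(toPlainText("<p><b>Chetan Parekh</b></p>"));
    }
}
```

Because this is a real parser and not a pattern match, it should also cope with attributes that contain angle brackets and with tags that span several lines.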

Chetan Parekh

Ranch Hand

Posts: 3640

posted 18 years ago

Only this will do

Michael Duffy

Ranch Hand

Posts: 163

posted 18 years ago

I'd wonder why HTML is stored in a database at all. Sounds like a design where the view layer has penetrated all the way back to persistence - not a sound idea in my opinion.

Chetan Parekh

Ranch Hand

Posts: 3640

posted 18 years ago

Originally posted by Michael Duffy:
I'd wonder why HTML is stored in a database at all. Sounds like a design where the view layer has penetrated all the way back to persistence - not a sound idea in my opinion.

We are developing a content management system where users can submit formatted content that we need to store in the database.

William Brogden

Author and all-around good cowpoke

Posts: 13078

posted 18 years ago

You might find the open source JTidy utility to be helpful. You might even want to run the submitted formatted content through JTidy before accepting it to keep bad HTML out of your database.
Bill

Ulf Dittmer

Rancher

Posts: 43081

posted 18 years ago

String thisStringHasNoHtml = stringWithHtml.replaceAll("\\<.*?\\>","");

This can fail. With greedy matching, which is the default in most regexp packages, "<abc>text</abc>" is reduced to nothing: the match extends as far to the right as possible and doesn't stop at the first possible match if a longer one is available. (In java.util.regex the "?" after "*" requests non-greedy matching, so the pattern above depends on that option.)
Either use the non-greedy option if it is available, or a pattern like "\\<[^<]*?\\>", which prevents another opening angle bracket from being part of the match. It's probably better to replace with a space, not the empty string, so that words don't get joined inadvertently.
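
The three variants are easy to compare side by side. The following is a rough sketch with made-up class and method names: in java.util.regex the greedy "<.*>" swallows the whole string, the reluctant "<.*?>" stops at the first ">", and the negated character class does the same while also matching tags that contain a line break (which "." does not match by default). Collapsing the leftover whitespace afterwards is an extra step assumed here, not something from the posts above.

```java
final class RegexStripDemo {

    /** Greedy: one match runs from the first '<' to the last '>' on a line. */
    static String stripGreedy(String html) {
        return html.replaceAll("<.*>", " ");
    }

    /** Reluctant: each match stops at the first '>' it can reach. */
    static String stripReluctant(String html) {
        return html.replaceAll("<.*?>", " ");
    }

    /** Negated class: a match may not contain another '<', line breaks are fine. */
    static String stripNegated(String html) {
        return html.replaceAll("<[^<]*?>", " ");
    }

    /** Collapses runs of whitespace left behind by the replacement spaces. */
    static String tidy(String s) {
        return s.replaceAll("\\s+", " ").trim();
    }

    public static void main(String[] args) {
        String html = "<abc>text</abc>";
        System.out.println("[" + tidy(stripGreedy(html)) + "]");     // prints "[]"
        System.out.println("[" + tidy(stripReluctant(html)) + "]");  // prints "[text]"
        System.out.println("[" + tidy(stripNegated(html)) + "]");    // prints "[text]"
    }
}
```

Replacing with a space and tidying afterwards turns "<b>one</b><i>two</i>" into "one two" instead of "onetwo".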