
HTML to plain text parser

 
Chetan Parekh
Ranch Hand
Posts: 3640
I have HTML content stored in a database.

I want to retrieve it and display it as unformatted text.

Is there any utility that parses HTML content into plain text?

For example, what I have in the database:
<p><b>Chetan Parekh</b></p>

What I need:
Chetan Parekh
 
Anoop Chandran
Greenhorn
Posts: 4
You may need to write a parser that looks for HTML tags and strips them out of the given String. I assume you are getting the data from the database as a Blob.
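
A hand-written stripper along the lines suggested here could look roughly like the sketch below. The class name TagStripper is made up for illustration, and the code assumes that '<' and '>' occur only as tag delimiters; comments, script blocks and entities such as &amp; are not handled.

```java
/** Hypothetical helper; the name TagStripper does not come from the thread. */
final class TagStripper {

    private TagStripper() {}

    /**
     * Copies every character that lies outside a <...> pair.
     * Assumption: '<' always opens a tag and the next '>' closes it.
     */
    static String strip(String html) {
        StringBuilder text = new StringBuilder(html.length());
        boolean insideTag = false;
        for (int i = 0; i < html.length(); i++) {
            char c = html.charAt(i);
            if (c == '<') {
                insideTag = true;          // start skipping
            } else if (c == '>' && insideTag) {
                insideTag = false;         // tag closed, resume copying
            } else if (!insideTag) {
                text.append(c);            // ordinary text
            }
        }
        return text.toString();
    }

    public static void main(String[] args) {
        // prints "Chetan Parekh"
        System.out.println(strip("<p><b>Chetan Parekh</b></p>"));
    }
}
```

For anything beyond simple fragments, a real HTML parser is likely the safer choice.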
 
Chetan Parekh
Ranch Hand
Posts: 3640

Originally posted by Anoop Chandran:
You may need to write a parser which looks for html tags and will take off if that is contained in the specified String.



I am looking for a ready-made parser that does the same. Are there any?

Originally posted by Anoop Chandran:
Hope you are getting the data from db as Blob.


You are right.
[ December 16, 2005: Message edited by: Chetan Parekh ]
 
Ulf Dittmer
Rancher
Posts: 43081
NekoHTML is an HTML parser which produces a DOM tree. I'm not sure if it can export the plain text, but it should be a good and easy starting point.

I don't think you need to store HTML as Blob - Clob should be sufficient, which would make it easier to work with.
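
Once any parser has produced a W3C DOM tree, the plain text can be read with the standard Node.getTextContent() method. The sketch below does not use NekoHTML at all: it builds the DOM with the JDK's own XML parser, which only works because the sample fragment happens to be well-formed. The class name DomText is made up for illustration, and whether NekoHTML's DOM supports getTextContent() in a given version is an assumption to verify.

```java
import java.io.IOException;
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;

/** Hypothetical helper; stands in for a real HTML parser such as NekoHTML. */
final class DomText {

    private DomText() {}

    /** Parses well-formed markup into a DOM and returns the text of all descendants. */
    static String textOf(String wellFormedMarkup) {
        try {
            DocumentBuilder builder =
                    DocumentBuilderFactory.newInstance().newDocumentBuilder();
            Document doc = builder.parse(
                    new InputSource(new StringReader(wellFormedMarkup)));
            // getTextContent() concatenates the text nodes below the element.
            return doc.getDocumentElement().getTextContent();
        } catch (ParserConfigurationException | SAXException | IOException e) {
            throw new IllegalArgumentException("Markup could not be parsed", e);
        }
    }

    public static void main(String[] args) {
        // prints "Chetan Parekh"
        System.out.println(textOf("<p><b>Chetan Parekh</b></p>"));
    }
}
```

Real-world HTML is rarely well-formed XML, which is the reason to put a tolerant HTML parser in front of this step.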
 
Chetan Parekh
Ranch Hand
Posts: 3640
Only this will do:

String thisStringHasNoHtml = stringWithHtml.replaceAll("\\<.*?\\>","");
 
Michael Duffy
Ranch Hand
Posts: 163
I'd wonder why HTML is stored in a database at all. Sounds like a design where the view layer has penetrated all the way back to persistence - not a sound idea in my opinion.
 
Chetan Parekh
Ranch Hand
Posts: 3640

Originally posted by Michael Duffy:
I'd wonder why HTML is stored in a database at all. Sounds like a design where the view layer has penetrated all the way back to persistence - not a sound idea in my opinion.



We are developing a content management system where users can submit formatted content that we need to store in the database.
 
Author and all-around good cowpoke
Posts: 13078
You might find the open source JTidy utility to be helpful. You might even want to run the submitted formatted content through JTidy before accepting it to keep bad HTML out of your database.
Bill
 
Ulf Dittmer
Rancher
Posts: 43081

String thisStringHasNoHtml = stringWithHtml.replaceAll("\\<.*?\\>","");



This needs care. With a greedy pattern such as "\\<.*\\>", "<abc>text</abc>" is reduced to nothing, because greedy quantifiers match as far to the right as possible and don't stop at the first possible match if a longer one is available. The ".*?" above is the non-greedy (reluctant) form in java.util.regex, so it does stop at the first ">", but "." does not match line breaks by default, so a tag that is split across two lines survives.
A string like "\\<[^<]*?\\>" is more robust: it prevents another opening angle bracket from being part of the match, and the character class also matches line breaks. It's probably better to replace with a space, and not the empty string, so that words don't get joined inadvertently.
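
To compare the patterns discussed here under java.util.regex, a small sketch follows. In Java, ".*" is greedy and ".*?" is the reluctant (non-greedy) form. The class and method names are made up for illustration, and toPlainText is an extra convenience that is not part of the thread.

```java
/** Hypothetical demo class comparing the tag-stripping patterns from the thread. */
final class RegexTagDemo {

    private RegexTagDemo() {}

    /** Greedy: one match runs from the first '<' to the last '>' on a line. */
    static String greedy(String html) {
        return html.replaceAll("\\<.*\\>", " ");
    }

    /** Reluctant, as quoted in the thread: stops at the first '>'. */
    static String reluctant(String html) {
        return html.replaceAll("\\<.*?\\>", " ");
    }

    /** Negated class, as suggested in the thread: also matches across line breaks. */
    static String negatedClass(String html) {
        return html.replaceAll("\\<[^<]*?\\>", " ");
    }

    /** Strips tags, then collapses the whitespace left behind by the replacement. */
    static String toPlainText(String html) {
        return negatedClass(html).replaceAll("\\s+", " ").trim();
    }

    public static void main(String[] args) {
        String sample = "<abc>text</abc>";
        System.out.println("[" + greedy(sample) + "]");       // prints "[ ]"
        System.out.println("[" + reluctant(sample) + "]");    // prints "[ text ]"
        System.out.println("[" + negatedClass(sample) + "]"); // prints "[ text ]"
        // Replacing with a space keeps adjacent words apart.
        System.out.println(toPlainText("<p><b>Chetan</b><i>Parekh</i></p>")); // prints "Chetan Parekh"
    }
}
```

None of these patterns understands comments, script blocks or a '>' inside an attribute value, so a real parser remains the safer option for arbitrary user-submitted HTML.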
 