
Convert html file to normal text

Harish Ponduri
Greenhorn

Joined: Jun 08, 2009
Messages: 14

I have a String that holds the contents of an HTML file, and I need to convert it to plain text by removing all the HTML tags.
Details:

One of my Java classes reads the entire HTML page into a single String, and I now want all the HTML tags removed from that String.


Thank you very much in advance. Hari


sandeep lokhande
Ranch Hand

Joined: Jan 25, 2010
Messages: 52

You have a String containing the HTML file and you want to convert it to text?
If you save the file with a .txt extension it will be saved as text, so what is the problem?
Do you want to remove the HTML tags and extract only the text?
Does your HTML contain images, etc.?
Please tell me exactly what you want.

<I want the Best>
Harish Ponduri
Greenhorn

Joined: Jun 08, 2009
Messages: 14

Hi Sandeep,
Thank you for your reply

My problem: I have a job that takes all the mails in my mailbox and puts them in a database.
People may send HTML email. Until now I was just concatenating all the Multipart data of the body into a single String and posting it to the database.

But for some HTML mails, the HTML tags are also concatenated and saved in the DB.

So now I just want to remove all the HTML tags and save only the actual content.

thanking you,
hari
Ulf Dittmer
Sheriff

Joined: Mar 22, 2005
Messages: 26684

This should help: http://forums.sun.com/thread.jspa?threadID=634120

Henry Wong
author
Bartender

Joined: Sep 28, 2004
Messages: 9915

Harish Ponduri wrote:
so now i just want to remove all the html tags and save only the actual content/information of that...


Well, it depends on how accurate the removal needs to be. If you don't need 100% accuracy and can tolerate parts of some tags being left behind, then regular expressions are probably the easiest route.
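
A minimal sketch of the regular-expression route, assuming a rough result is acceptable. The class and method names (`HtmlRegexStripper`, `stripTags`) are made up for illustration. Each tag is replaced by a space so that adjacent paragraphs do not run together; the trade-off is that a tag in the middle of a word splits that word.

```java
import java.util.regex.Pattern;

public class HtmlRegexStripper {
    // Anything that looks like a tag: '<', a run of non-'>' characters, then '>'.
    private static final Pattern TAG = Pattern.compile("<[^>]*>");
    // Runs of whitespace, including the spaces left behind by removed tags.
    private static final Pattern WHITESPACE = Pattern.compile("\\s+");

    /** Removes tag-like substrings and collapses whitespace. Hypothetical helper. */
    public static String stripTags(String html) {
        String noTags = TAG.matcher(html).replaceAll(" ");
        return WHITESPACE.matcher(noTags).replaceAll(" ").trim();
    }

    public static void main(String[] args) {
        String html = "<html><body><p>Hello <b>world</b></p></body></html>";
        System.out.println(stripTags(html)); // prints "Hello world"
    }
}
```

This does not decode entities such as `&amp;`, it keeps the contents of `<script>` and `<style>` elements, and a `>` inside an attribute value will leave part of a tag behind.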

If you need to be accurate, and your HTML is well formed (that is, it parses as XML), then using JAXP (or JAXB if you are using Java 6) would work.
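
A rough sketch of the JAXP route, assuming the input really is well-formed XML. `XhtmlTextExtractor` and `extractText` are hypothetical names. It parses the String into a DOM and asks the root element for its text content. DOCTYPE declarations are rejected here as a precaution for untrusted mail, which is a choice made for this sketch and not something required by JAXP.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

public class XhtmlTextExtractor {
    /** Returns the text of a well-formed XHTML string. Hypothetical helper. */
    public static String extractText(String xhtml) {
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            // Assumption for this sketch: refuse DOCTYPEs so no external DTD is fetched.
            factory.setFeature("http://apache.org/xml/features/disallow-doctype-decl", true);
            DocumentBuilder builder = factory.newDocumentBuilder();
            Document doc = builder.parse(new InputSource(new StringReader(xhtml)));
            // getTextContent() concatenates every text node below the root element.
            String text = doc.getDocumentElement().getTextContent();
            return text.replaceAll("\\s+", " ").trim();
        } catch (Exception e) {
            // Malformed input (an unclosed <br>, an undefined entity, ...) ends up here.
            throw new IllegalArgumentException("Input is not well-formed XML", e);
        }
    }

    public static void main(String[] args) {
        String xhtml = "<html><body><p>Hello <b>world</b> &amp; all</p></body></html>";
        System.out.println(extractText(xhtml)); // prints "Hello world & all"
    }
}
```

Typical mail HTML with an unclosed `<br>` or an `&nbsp;` entity will make the parser throw, which is why this only suits input known to be well formed.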


If you need to be accurate, and the HTML can be anything, then you'll need a third-party parser. Just search for an HTML parser; there are lots of open-source ones.
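
Before adding a library, it may be worth trying the lenient HTML parser that ships with the JDK in `javax.swing.text.html`. Note this is a swapped-in alternative to the third-party route, shown only because it needs no extra dependency; it targets old HTML and may cope poorly with modern markup. `LenientHtmlText` and `extractText` are hypothetical names.

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import javax.swing.text.html.HTMLEditorKit;
import javax.swing.text.html.parser.ParserDelegator;

public class LenientHtmlText {
    /** Collects the text chunks reported by the JDK's HTML parser. Hypothetical helper. */
    public static String extractText(String html) {
        final StringBuilder out = new StringBuilder();
        HTMLEditorKit.ParserCallback callback = new HTMLEditorKit.ParserCallback() {
            @Override
            public void handleText(char[] data, int pos) {
                // Called once per run of text between tags; tags themselves never arrive here.
                out.append(data).append(' ');
            }
        };
        try {
            // The third argument tells the parser to ignore any charset declared in the document.
            new ParserDelegator().parse(new StringReader(html), callback, true);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return out.toString().replaceAll("\\s+", " ").trim();
    }

    public static void main(String[] args) {
        // Deliberately sloppy input: unclosed <p>, bare <br>.
        System.out.println(extractText("<p>Hello <b>world</b><br>Bye"));
    }
}
```

If the results are not good enough for real-world mail, that is the point at which a dedicated open-source HTML parser is the better choice.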

Henry

Books: Java Threads, 3rd Edition, Jini in a Nutshell, and Java Gems (contributor)
David Newton
Author
Bartender

Joined: Sep 29, 2008
Messages: 6626

Please see EaseUp. Never mind, Ulf edited the subject.



Consultant/Trainer | Polyglottal Developer | Struts Committer/PMC | Struts 2 Web Application Development
Harish Ponduri
Greenhorn

Joined: Jun 08, 2009
Messages: 14

Hello Sandeep,

Thank you for your valuable reply.

Can you briefly explain what JAXP is and where I can find information on it?
Is there a tutorial or a PDF for it?
Could you please share some documentation related to JAXP?

thanking you.
David Newton
Author
Bartender

Joined: Sep 29, 2008
Messages: 6626

Did you consider searching on the web? Invariably faster than waiting for an answer here.

Consultant/Trainer | Polyglottal Developer | Struts Committer/PMC | Struts 2 Web Application Development