Convert html file to normal text

Harish Ponduri
Greenhorn

Joined: Jun 08, 2009
Posts: 19
I have a String that holds the contents of an HTML file, and I need to convert it to plain text by removing all the HTML tags.
Details:

One of my Java classes reads an entire HTML page into a single String, and now I want all the HTML tags removed from that String.


Thank you very much in advance. Hari
sandeep lokhande
Ranch Hand

Joined: Jan 25, 2010
Posts: 118

You have a String containing the HTML file and you want to convert it to text?
If you save the file as .txt it will be saved as text, so what is the problem?
Do you want to remove the HTML tags and extract only the text?
Does your HTML contain images, etc.?
Please tell me exactly what you want.


Thanks and Regards,
Sandeep Lokhande.
Harish Ponduri
Greenhorn

Joined: Jun 08, 2009
Posts: 19
Hi Sandeep,
Thank you for your reply.

My problem is this: I have a job that takes all the mails in my mailbox and puts them in a database.
Some people may send HTML email. Previously I was just concatenating all the Multipart data of the message body into one String and storing that in the database.

But for HTML mails, the HTML tags are also concatenated and saved in the DB.

So now I want to remove all the HTML tags and save only the actual content.

Thank you,
Hari
Ulf Dittmer
Marshal

Joined: Mar 22, 2005
Posts: 42039
    
This should help: http://forums.sun.com/thread.jspa?threadID=634120


Henry Wong
author
Sheriff

Joined: Sep 28, 2004
Posts: 18875
    

Harish Ponduri wrote:
so now i just want to remove all the html tags and save only the actual content/information of that...


Well, it depends on how you want it removed. If you don't need to be 100% accurate and can tolerate parts of some tags being left behind, then regular expressions are probably the easiest route.
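
A minimal sketch of the regular expression route, assuming that "a tag" is anything between a `<` and the next `>`. The class name `RegexTagStripper` and the method `strip` are made up for illustration. As noted above, this is not accurate for every input: entities such as `&amp;` are left as they are, the contents of script and style elements stay in the output, and a `>` inside an attribute value will cut a tag short.

```java
import java.util.regex.Pattern;

final class RegexTagStripper {
    // Anything that looks like a tag: '<', then any run of characters other than '>', then '>'.
    private static final Pattern TAG = Pattern.compile("<[^>]*>");
    // Runs of whitespace, collapsed after the tags have been removed.
    private static final Pattern WHITESPACE = Pattern.compile("\\s+");

    /** Replaces every tag-like run with a space, then collapses and trims whitespace. */
    static String strip(String html) {
        String withoutTags = TAG.matcher(html).replaceAll(" ");
        return WHITESPACE.matcher(withoutTags).replaceAll(" ").trim();
    }

    public static void main(String[] args) {
        // prints "Hello world"
        System.out.println(strip("<html><body><p>Hello <b>world</b></p></body></html>"));
    }
}
```

Tags are replaced with a space rather than an empty string so that words separated only by markup (for example `one<br>two`) do not run together; the cost is that a tag in the middle of a word splits that word.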

If you need to be accurate, and your HTML is well formed (that is, it parses as XML), then using JAXP (or JAXB if you are using Java 6) would work.
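
A minimal sketch of the JAXP route, assuming the input is well-formed XML (XHTML) without a DOCTYPE declaration. The class name `JaxpTextExtractor` and the method `extractText` are made up for illustration. The DOCTYPE is rejected on purpose, since the text comes from incoming mail and should not be allowed to pull in external DTDs or entities; input that is not well formed is reported as an `IllegalArgumentException`.

```java
import java.io.IOException;
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;

final class JaxpTextExtractor {
    /** Parses well-formed XHTML and returns the concatenated text of all elements. */
    static String extractText(String xhtml) {
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            // Assumption: untrusted input, so documents that declare a DOCTYPE are refused.
            factory.setFeature("http://apache.org/xml/features/disallow-doctype-decl", true);
            DocumentBuilder builder = factory.newDocumentBuilder();
            Document doc = builder.parse(new InputSource(new StringReader(xhtml)));
            // getTextContent() joins the text of all descendant nodes and resolves entities.
            return doc.getDocumentElement().getTextContent();
        } catch (ParserConfigurationException | SAXException | IOException e) {
            throw new IllegalArgumentException("Input is not well-formed XML", e);
        }
    }

    public static void main(String[] args) {
        // prints "Hello world & friends"
        System.out.println(extractText(
                "<html><body><p>Hello <b>world</b> &amp; friends</p></body></html>"));
    }
}
```

Typical HTML mail is not well formed (an unclosed `<br>` is enough to make the parse fail), so this only fits when you control how the HTML is produced.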


If you need to be accurate, and the HTML can be anything, then you'll need a third-party parser. Just search the web for an HTML parser. There are lots of open source ones.
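
Before pulling in a third-party library, it may be worth trying the lenient HTML parser that ships with the JDK as part of Swing. This swaps in a different tool from the one suggested above: it is not a third-party parser, and it targets an old HTML dialect, so treat the following as a rough sketch only. The class name `SwingHtmlTextExtractor` and the method `extractText` are made up for illustration.

```java
import java.io.IOException;
import java.io.StringReader;
import javax.swing.text.html.HTMLEditorKit;
import javax.swing.text.html.parser.ParserDelegator;

final class SwingHtmlTextExtractor {
    /** Collects the text chunks reported by the JDK's built-in HTML parser. */
    static String extractText(String html) {
        final StringBuilder text = new StringBuilder();
        HTMLEditorKit.ParserCallback callback = new HTMLEditorKit.ParserCallback() {
            @Override
            public void handleText(char[] data, int pos) {
                // Called once per run of text between tags; chunks are joined with a space.
                if (text.length() > 0) {
                    text.append(' ');
                }
                text.append(data);
            }
        };
        try {
            // The third argument (ignoreCharSet) tells the parser to ignore charset declarations.
            new ParserDelegator().parse(new StringReader(html), callback, true);
        } catch (IOException e) {
            // Not expected when reading from an in-memory String.
            throw new IllegalStateException(e);
        }
        return text.toString();
    }

    public static void main(String[] args) {
        // The closing tags for <b> and <html> are deliberately missing.
        System.out.println(extractText("<html><body><p>Hello <b>world</p></body>"));
    }
}
```

Unlike the XML route, this parser does not give up on unclosed or misnested tags. A dedicated open source HTML parser will generally cope better with modern markup.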

Henry


Books: Java Threads, 3rd Edition, Jini in a Nutshell, and Java Gems (contributor)
David Newton
Author
Rancher

Joined: Sep 29, 2008
Posts: 12617

Please see EaseUp. Never mind, Ulf edited the subject.
Harish Ponduri
Greenhorn

Joined: Jun 08, 2009
Posts: 19
Hello Sandeep,

Thank you for your valuable reply.

Can you briefly tell me what JAXP is and where I can get information about it?
Is there a tutorial or a PDF for it?
Could you please share some documentation on JAXP?

Thank you.
David Newton
Author
Rancher

Joined: Sep 29, 2008
Posts: 12617

Did you consider searching on the web? Invariably faster than waiting for an answer here.
 