
Convert HTML file to normal text

 
Harish Ponduri
Greenhorn
Posts: 19
I need to convert a String that contains an HTML file into plain text by removing all HTML tags.
Detail:

One of my Java classes reads an entire HTML page into a single String, and now I want all the HTML tags removed from that String.


Thank you very much in advance. Hari
 
sandeep lokhande
Ranch Hand
Posts: 120
You have a String containing the HTML file and you want to convert it to text?
If we save the file as .txt, it will be saved as text, so what is the problem?
Do you want to remove the HTML tags and extract only the text?
Does your HTML have images etc.?
Please tell me exactly what you want.
 
Harish Ponduri
Greenhorn
Posts: 19
Hi Sandeep,
Thank you for your reply.

My problem is that I have a job that takes all the mails in my mailbox and puts them in a database.
People may send HTML email. Previously I was just concatenating all the Multipart data in the body into a String and storing that in the database.

But for HTML mails, the HTML tags are also concatenated and saved in the DB.

So now I just want to remove all the HTML tags and save only the actual content.

Thanks,
Hari
 
Ulf Dittmer
Rancher
Posts: 42967
This should help: http://forums.sun.com/thread.jspa?threadID=634120
 
Henry Wong
author
Marshal
Posts: 21004
Harish Ponduri wrote:
So now I just want to remove all the HTML tags and save only the actual content.


Well, it depends on how you want them removed. If you don't need 100% accuracy and can tolerate parts of some tags being left behind, then regular expressions are probably the easiest route.
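
As a rough illustration of the regular-expression route, here is a minimal sketch. The class and method names (HtmlStripper, stripTags) are made up for this example. It simply deletes anything that looks like <...> and collapses whitespace, so it does not decode entities such as &amp;, it keeps the contents of script and style elements, and it can misfire when a literal < or > appears in the text or inside an attribute value.

```java
import java.util.regex.Pattern;

// Hypothetical helper for illustration only; not a complete HTML-to-text converter.
class HtmlStripper {
    // Anything from '<' up to the next '>' is treated as a tag (or comment).
    private static final Pattern TAG = Pattern.compile("<[^>]*>");
    private static final Pattern WHITESPACE = Pattern.compile("\\s+");

    static String stripTags(String html) {
        // Replace tags with a space so that words separated only by markup
        // (for example "</p><p>") do not run together.
        String withoutTags = TAG.matcher(html).replaceAll(" ");
        // Collapse the runs of whitespace left behind and trim the ends.
        return WHITESPACE.matcher(withoutTags).replaceAll(" ").trim();
    }

    public static void main(String[] args) {
        String html = "<html><body><p>Hello <b>world</b></p></body></html>";
        System.out.println(stripTags(html));
    }
}
```

Replacing each tag with a space is a deliberate trade-off: it avoids gluing adjacent paragraphs together, at the cost of sometimes leaving a stray space before punctuation that directly follows a closing tag.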

If you need to be accurate, and your HTML is well formed, then using JAXP (or JAXB if you are using Java 6) would work.
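
For the well-formed case, a JAXP-based sketch might look like the following. It assumes the input is valid XML (XHTML) with no DOCTYPE; the class name XhtmlTextExtractor is invented for this example. Since the markup comes from incoming mail, the sketch also switches off DOCTYPE declarations, which means HTML-only entities such as &nbsp; will be rejected rather than resolved.

```java
import java.io.IOException;
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;

// Hypothetical helper for illustration only; works only on well-formed markup.
class XhtmlTextExtractor {
    static String extractText(String xhtml) {
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            // Assumption: untrusted input, so refuse DOCTYPE declarations
            // (and with them external DTDs and entity definitions).
            factory.setFeature("http://apache.org/xml/features/disallow-doctype-decl", true);
            DocumentBuilder builder = factory.newDocumentBuilder();
            Document doc = builder.parse(new InputSource(new StringReader(xhtml)));
            // getTextContent() concatenates the text of all descendant nodes,
            // skipping comments and processing instructions.
            return doc.getDocumentElement().getTextContent();
        } catch (ParserConfigurationException | SAXException | IOException e) {
            throw new IllegalArgumentException("Input is not well-formed XML", e);
        }
    }

    public static void main(String[] args) {
        String xhtml = "<html><body><p>Hello <b>world</b>!</p></body></html>";
        System.out.println(extractText(xhtml));
    }
}
```

Anything that is not well formed, such as an unclosed br tag, makes the parser throw, so this is only suitable when you control or can trust the shape of the HTML. Text inside script or style elements is also returned, since the XML parser has no idea those are special.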


If you need to be accurate, and the HTML can be anything, then you'll need a third-party parser. Just google for an HTML parser; there are lots of open source ones.
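
To show the general shape of the parser-based route without pulling in a dependency, the sketch below uses the lenient HTML parser that ships with the JDK's Swing classes instead of a third-party library. That is a substitution made for this example, not a recommendation over the open source parsers; the class name LenientHtmlText is invented, and the Swing parser only understands fairly old HTML. The idea is the same with any parser: let it cope with the messy markup and collect only the text callbacks.

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import javax.swing.text.html.HTMLEditorKit;
import javax.swing.text.html.parser.ParserDelegator;

// Hypothetical helper for illustration only; uses the JDK's bundled Swing parser.
class LenientHtmlText {
    static String extractText(String html) {
        final StringBuilder text = new StringBuilder();
        HTMLEditorKit.ParserCallback callback = new HTMLEditorKit.ParserCallback() {
            @Override
            public void handleText(char[] data, int pos) {
                // Called for each run of text between tags; add a separator
                // so neighbouring runs do not get glued together.
                text.append(data).append(' ');
            }
        };
        try {
            // The third argument tells the parser to ignore charset directives,
            // which is fine here because the input is already a Java String.
            new ParserDelegator().parse(new StringReader(html), callback, true);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        // Normalise whitespace in the collected text.
        return text.toString().replaceAll("\\s+", " ").trim();
    }

    public static void main(String[] args) {
        // Note the second <p> is never closed; the parser tolerates that.
        System.out.println(extractText("<p>Hello <b>world</b><p>second paragraph"));
    }
}
```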

Henry
 
David Newton
Author
Rancher
Posts: 12617
Please see EaseUp. Never mind, Ulf edited the subject.
 
Harish Ponduri
Greenhorn
Posts: 19
Hello Sandeep,

Thank you for your valuable reply.

Can you briefly explain what JAXP is and where I can get information about it?
Is there a tutorial or a PDF for it?
Could you please share some documentation related to JAXP?

Thank you.
 
David Newton
Author
Rancher
Posts: 12617
Did you consider searching on the web? Invariably faster than waiting for an answer here.
 