How do I read an HTML file and remove script tags and their content?

Tran Tuan Hung
Ranch Hand

Joined: Apr 08, 2007
Posts: 59
I have an HTML file, and I would like to remove all tags and their contents from it. Please help me.
Thank you very much.
Ulf Dittmer
Marshal

Joined: Mar 22, 2005
Posts: 42930
Libraries like TagSoup, JTidy or NekoXNI can convert HTML into a DOM document, which makes it relatively easy to recurse through the DOM tree and extract all text.
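As a rough illustration of the "recurse through the DOM tree" step, here is a minimal sketch. It assumes the HTML has already been turned into well-formed markup (the job of the libraries named above) and uses the JDK's own XML parser as a stand-in for that step. The class and method names (`DomTextExtractor`, `collectText`, `extractText`) are made up for this example.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

class DomTextExtractor {

    // Recursively appends every text node below 'node' to 'out'.
    static void collectText(Node node, StringBuilder out) {
        short type = node.getNodeType();
        if (type == Node.TEXT_NODE || type == Node.CDATA_SECTION_NODE) {
            out.append(node.getNodeValue());
            return;
        }
        NodeList children = node.getChildNodes();
        for (int i = 0; i < children.getLength(); i++) {
            collectText(children.item(i), out);
        }
    }

    // Assumption: the input is well-formed; real-world HTML would first
    // have to go through a tidying library.
    static String extractText(String wellFormedMarkup) {
        try {
            DocumentBuilder builder =
                    DocumentBuilderFactory.newInstance().newDocumentBuilder();
            Document doc = builder.parse(
                    new InputSource(new StringReader(wellFormedMarkup)));
            StringBuilder out = new StringBuilder();
            collectText(doc, out);
            return out.toString();
        } catch (Exception e) {
            throw new IllegalArgumentException("Input is not well-formed", e);
        }
    }
}
```

Calling `DomTextExtractor.extractText(...)` on a small well-formed page returns the concatenated text of all elements, with the tags dropped.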
Tran Tuan Hung
Ranch Hand

Joined: Apr 08, 2007
Posts: 59
I am sorry for my bad English; let me correct that. I have an HTML file that includes script tags. I want to read this file from my Java program and remove only the script tags and their contents. Finally, I want to save the result to another file. Can you suggest how to do it?
I only want to remove the script tags and the content inside them.
Thanks in advance.
[ April 12, 2007: Message edited by: Tran Tuan Hung ]
Ulf Dittmer
Marshal

Joined: Mar 22, 2005
Posts: 42930
Now I'm confused. You want to remove all "script" tags and their inner content? So in effect remove all JavaScript from a page? What about event handlers in the HTML - without the scripts they won't work - should they remain on the page?

I think it would be best if you posted a short example HTML page before and after the transformation you have in mind.
Tran Tuan Hung
Ranch Hand

Joined: Apr 08, 2007
Posts: 59
For example:
I have this code (the HTML page before):

From my Java program, I will read this HTML file and remove all script tags and the content inside "<script>.....content here....</script>". Finally, I will save the result to another HTML file; the original file is kept unchanged.
My idea is to copy the contents of the HTML file, stop copying when I reach a "<script>" tag, and resume copying after the "</script>" tag. But this approach takes a lot of time, so I want to use a regex, and I am not good at regexes.
Please suggest how to solve my problem.
The HTML page after the transformation:


Sorry for my bad English.
Thanks and best regards,
[ April 12, 2007: Message edited by: Tran Tuan Hung ]
Tran Tuan Hung
Ranch Hand

Joined: Apr 08, 2007
Posts: 59
Hi there,
I coded it as follows:
import java.io.*;
import java.util.regex.*;
import java.lang.StringBuffer;

public class TagsRemover {
    private static FileReader fr = null;
    private StringBuffer sb, sb1 = null;
    private String line = null;

    public void readFile() throws IOException {
        FileReader fr = new FileReader("Test.html");
        BufferedReader br = new BufferedReader(fr);
        StringBuffer sb = new StringBuffer();
        String line = br.readLine();
        try {
            while (line != null) {
                sb.append(line).append("\\n");
                line = br.readLine();
            }
            fr.close();
        } catch (IOException e) {
            // TODO: handle exception
            System.out.println("Can not read the file");
        }
    }

    public String removeTag(String data) {
        StringBuilder regex = new StringBuilder("<script[^>]*>(.*?)</script>");
        int flags = Pattern.MULTILINE | Pattern.DOTALL | Pattern.CASE_INSENSITIVE;
        Pattern pattern = Pattern.compile(regex.toString(), flags);
        Matcher matcher = pattern.matcher(data);
        return matcher.replaceAll("");
    }

    public void saveFile(String str) throws IOException {
        try {
            FileWriter fw = new FileWriter("Test1.html");
            fw.write(str, 0, str.length());
            fw.close();
        } catch (Exception e) {
            // TODO: handle exception
        }
    }

    public static void main(String[] args) throws IOException {
        TagsRemover tr = new TagsRemover();
        tr.readFile();
        String remover = tr.removeTag("script");
        tr.saveFile("Test1.html");
    }
}

But the code does not work as I want. Please help me correct it!
Thanks a lot and best regards,
[ April 17, 2007: Message edited by: Tran Tuan Hung ]
Joseph Sweet
Ranch Hand

Joined: Jan 29, 2005
Posts: 327
Hi,

Have you looked into JAXB?

http://java.sun.com/javaee/5/docs/tutorial/doc/JAXB.html#wp100322


We must know, we will know. -- David Hilbert
Tran Tuan Hung
Ranch Hand

Joined: Apr 08, 2007
Posts: 59
Thanks for the link, my friend, but I would like to use a regular expression.
Someone? Anyone? Please!
Ulf Dittmer
Marshal

Joined: Mar 22, 2005
Posts: 42930
But the code does not work as I want


That's not a very useful problem description. What should the code do, and what does it do right now? Give some short examples.
Tran Tuan Hung
Ranch Hand

Joined: Apr 08, 2007
Posts: 59
Maybe my English is bad; I am sorry.
I have created three methods:
- readFile() reads the file and saves its contents into a StringBuffer.
- removeTag() removes the script tags and their contents (only the content inside these tags).
- saveFile() saves the result (after the script tags and their contents have been removed) to another file (Test1.html).
Ulf Dittmer
Marshal

Joined: Mar 22, 2005
Posts: 42930
Yes, but what is the problem? Give us an example of an HTML file/snippet that you are working on, and what it looks like after the code transforms it. Then show us the code that does the transformation.
Tran Tuan Hung
Ranch Hand

Joined: Apr 08, 2007
Posts: 59
OK, my code is still confused; I am sorry.
The HTML input file is "Test.html":
<html>
<head>
<title> i am noob </title>
<script type="javascript">
function a () {
}

</script>
</head>
<body>
</div> abcd</div>
</body>
</html>

After deleting the script tag and its content, I would like to get a file Test1.html with the following content, but my code is confused, so it does not work.
Here is the output file I want, "Test1.html":
<html>
<head>
<title> i am noob </title>
</head>
<body>
</div> abcd</div>
</body>
</html>

Thank you for listening!
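For reference, the TagsRemover code posted earlier has three wiring problems: readFile() builds its StringBuffer in a local variable and never returns it, removeTag() is called with the literal word "script" instead of the file contents, and saveFile() is given the file name as the text to write. The sketch below shows one way the three steps could be connected. It is not a definitive fix: the class name `ScriptTagRemover`, the UTF-8 encoding, and the use of `java.nio.file.Files` are assumptions made for this example, and a regex can still misfire, for instance on a "</script>" that appears inside a JavaScript string.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.regex.Pattern;

class ScriptTagRemover {

    // An opening <script ...> tag, then as little as possible up to the
    // nearest closing </script>. DOTALL lets '.' match line breaks too.
    private static final Pattern SCRIPT = Pattern.compile(
            "<script\\b[^>]*>.*?</script\\s*>",
            Pattern.DOTALL | Pattern.CASE_INSENSITIVE);

    // Reads the whole file into a String (assumes UTF-8 encoding).
    static String readFile(String name) throws IOException {
        return new String(Files.readAllBytes(Paths.get(name)), StandardCharsets.UTF_8);
    }

    // Returns the HTML with every script element removed.
    static String removeScripts(String html) {
        return SCRIPT.matcher(html).replaceAll("");
    }

    // Writes the text to the named file, replacing any existing content.
    static void saveFile(String name, String text) throws IOException {
        Files.write(Paths.get(name), text.getBytes(StandardCharsets.UTF_8));
    }

    public static void main(String[] args) throws IOException {
        String html = readFile("Test.html");   // 1. read the input file
        String cleaned = removeScripts(html);  // 2. pass the content, not the word "script"
        saveFile("Test1.html", cleaned);       // 3. write the result, not the file name
    }
}
```

Only the script element itself is removed, so the line break that followed it stays behind as an empty line in the output.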
Ulf Dittmer
Marshal

Joined: Mar 22, 2005
Posts: 42930
I wouldn't use regexps in this case. The following is pseudocode that removes script tags. Be sure to make it case-insensitive.

[ April 17, 2007: Message edited by: Ulf Dittmer ]
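The pseudocode from this post is not preserved in this copy of the thread. What follows is only one possible regex-free approach, not necessarily the one originally posted: scan for "<script", find the matching "</script" and its closing ">", and copy everything outside those ranges. The names `ScriptStripper`, `indexOfIgnoreCase` and `stripScripts` are invented for this sketch, and as a simplification any tag whose name merely starts with "script" is treated as a script tag.

```java
class ScriptStripper {

    // Case-insensitive indexOf built on String.regionMatches.
    static int indexOfIgnoreCase(String text, String needle, int from) {
        for (int i = Math.max(from, 0); i + needle.length() <= text.length(); i++) {
            if (text.regionMatches(true, i, needle, 0, needle.length())) {
                return i;
            }
        }
        return -1;
    }

    // Copies the input to the output, skipping each <script ...>...</script> range.
    static String stripScripts(String html) {
        StringBuilder out = new StringBuilder(html.length());
        int pos = 0; // start of the text that has not been copied yet
        while (true) {
            int open = indexOfIgnoreCase(html, "<script", pos);
            if (open < 0) break;
            int close = indexOfIgnoreCase(html, "</script", open);
            if (close < 0) break;           // unterminated script: leave the rest as-is
            int end = html.indexOf('>', close);
            if (end < 0) break;
            out.append(html, pos, open);    // keep the text before the script
            pos = end + 1;                  // continue after the closing tag
        }
        out.append(html, pos, html.length());
        return out.toString();
    }
}
```

Using `regionMatches` with `ignoreCase` set to true handles the case-insensitivity requirement without lower-casing a copy of the input, which could shift character positions for some characters.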
Tran Tuan Hung
Ranch Hand

Joined: Apr 08, 2007
Posts: 59
Thank you very much, Ulf Dittmer.
Joseph Sweet
Ranch Hand

Joined: Jan 29, 2005
Posts: 327
I think you are trying to reinvent the wheel (which is what they let you do in school, you know, parsing manually every byte).

This is not C here. You can learn C in 2 days because it comes with no libraries.

There are so many XML packages in Java, so why don't you learn to use some of them? Sure, it takes time.

How about trying SAX2?
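A minimal sketch of what a SAX-based approach might look like, using the parser that ships with the JDK. It assumes the input is well-formed XML (ordinary HTML often is not, so it would need tidying first), and it only collects the text outside script elements rather than rewriting the document. The class name `SaxScriptSkipper` and its members are hypothetical.

```java
import java.io.StringReader;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

class SaxScriptSkipper {

    // Collects character data, except while the parser is inside a <script> element.
    static class Handler extends DefaultHandler {
        final StringBuilder text = new StringBuilder();
        private int scriptDepth = 0;

        @Override
        public void startElement(String uri, String localName, String qName, Attributes atts) {
            if ("script".equalsIgnoreCase(qName)) scriptDepth++;
        }

        @Override
        public void endElement(String uri, String localName, String qName) {
            if ("script".equalsIgnoreCase(qName)) scriptDepth--;
        }

        @Override
        public void characters(char[] ch, int start, int length) {
            if (scriptDepth == 0) text.append(ch, start, length);
        }
    }

    // Assumption: the markup is well-formed; otherwise the parser rejects it.
    static String textWithoutScripts(String wellFormedMarkup) {
        try {
            Handler handler = new Handler();
            SAXParserFactory.newInstance().newSAXParser()
                    .parse(new InputSource(new StringReader(wellFormedMarkup)), handler);
            return handler.text.toString();
        } catch (Exception e) {
            throw new IllegalArgumentException("Input is not well-formed XML", e);
        }
    }
}
```

The handler never builds a tree; it reacts to start, end and character events as the parser streams through the input, which is the main difference from the DOM approach.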
Van Cam
Greenhorn

Joined: Aug 05, 2006
Posts: 13
Hi,
Try this code:

[ May 03, 2007: Message edited by: Van Cam ]
Campbell Ritchie
Sheriff

Joined: Oct 13, 2005
Posts: 40052
From Joseph Sweet:
You can learn C in 2 days
The 1st April forum has been shut!
Joseph Sweet
Ranch Hand

Joined: Jan 29, 2005
Posts: 327
Yes, because ANSI C comes without thousands of APIs.

Not talking about POSIX or Visual C++ or .Net libraries. Just ANSI C.

98% of the time I spend on Java is to find out what API I should actually use. To understand how to use this or that API. How do I find my way in that labyrinth of alternative APIs. Not on issues that have to do with the pure language itself.

What Should a Good Enterprise Java Developer Know
[ April 20, 2007: Message edited by: Joseph Sweet ]
Joanne Neal
Rancher

Joined: Aug 05, 2005
Posts: 3742
Originally posted by Joseph Sweet:
Yes, because ANSI C comes without thousands of APIs.

Not talking about POSIX or Visual C++ or .Net libraries. Just ANSI C.

98% of the time I spend on Java is to find out what API I should actually use. To understand how to use this or that API. How do I find my way in that labyrinth of alternative APIs. Not on issues that have to do with the pure language itself.

What Should a Good Enterprise Java Developer Know

[ April 20, 2007: Message edited by: Joseph Sweet ]


I think you need to differentiate between learning to program in Java and learning to develop enterprise applications using Java. They are two totally separate things. I doubt if anyone would be able to write an enterprise application in C after 2 days.
[ April 20, 2007: Message edited by: Joanne Neal ]

Joanne
 
With a little knowledge, a cast iron skillet is non-stick and lasts a lifetime.
 