Libraries like TagSoup, JTidy or NekoXNI can convert HTML into a DOM document, which makes it relatively easy to recurse through the DOM tree and extract all text.
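To illustrate the "recurse through the DOM tree" part, here is a minimal sketch that uses only the JDK's built-in XML parser. It assumes the markup is already well-formed, which is what a library such as TagSoup or JTidy would give you for real-world HTML; the class name DomTextExtractor and its methods are hypothetical, and none of the libraries named above is actually called here.

```java
import java.io.IOException;
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;

// Hypothetical helper: walks a DOM tree and collects text, skipping <script>.
final class DomTextExtractor {

    // Recursively appends the text found below `node`, ignoring script subtrees.
    static void collectText(Node node, StringBuilder out) {
        if (node.getNodeType() == Node.ELEMENT_NODE
                && "script".equalsIgnoreCase(node.getNodeName())) {
            return; // do not descend into <script> elements
        }
        if (node.getNodeType() == Node.TEXT_NODE) {
            out.append(node.getNodeValue());
        }
        for (Node child = node.getFirstChild(); child != null; child = child.getNextSibling()) {
            collectText(child, out);
        }
    }

    // Assumption: the input is well-formed markup (e.g. already cleaned up).
    static String textOf(String wellFormedMarkup) {
        try {
            DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
            Document doc = builder.parse(new InputSource(new StringReader(wellFormedMarkup)));
            StringBuilder out = new StringBuilder();
            collectText(doc.getDocumentElement(), out);
            return out.toString();
        } catch (ParserConfigurationException | SAXException | IOException e) {
            throw new IllegalArgumentException("Could not parse markup", e);
        }
    }

    public static void main(String[] args) {
        String page = "<html><head><script>var a = 1;</script></head>"
                + "<body><div>abcd</div></body></html>";
        System.out.println(textOf(page));
    }
}
```

With one of the HTML-cleaning libraries in front, only the parsing step in textOf would change; the recursion stays the same.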
I am sorry for my bad English; I will try to correct it. I have an HTML file that contains script tags. I want to read this file from my Java program, remove only the script tags and their contents, and finally save the result to another file. Can you suggest how to do this? I only want to remove the script tags and the content inside them. Thanks in advance [ April 12, 2007: Message edited by: Tran Tuan Hung ]
Ulf Dittmer
Marshal
Joined: Mar 22, 2005
Posts: 35443
Now I'm confused. You want to remove all "script" tags and their inner content? So in effect remove all JavaScript from a page? What about event handlers in the HTML? Without the scripts they won't work. Should they remain on the page?
I think it would be best if you posted a short example HTML page before and after the transformation you have in mind.
Tran Tuan Hung
Ranch Hand
Joined: Apr 08, 2007
Posts: 59
For example, I have this code (the HTML page before):
Now, from my Java file, I will read this HTML file and remove all script tags and the content inside "<script>.....content here....</script>". Finally, I will save the result to another HTML file; the original file is kept. My idea is to copy the contents of the HTML file, stop copying when I reach a "<script>" tag, skip everything up to "</script>", and then resume copying. But this way takes a lot of time, so I want to use a regex. I am not good with regexes, so please suggest how to solve my problem. The HTML page after the transformation:
Sorry for my bad English. Thanks and best regards, [ April 12, 2007: Message edited by: Tran Tuan Hung ]
public class TagsRemover {
    private static FileReader fr = null;
    private StringBuffer sb, sb1 = null;
    private String line = null;

    public void readFile() throws IOException {
        FileReader fr = new FileReader("Test.html");
        BufferedReader br = new BufferedReader(fr);
        StringBuffer sb = new StringBuffer();
        String line = br.readLine();
        try {
            while (line != null) {
                sb.append(line).append("\\n");
                line = br.readLine();
            }
            fr.close();
        } catch (IOException e) {
            // TODO: handle exception
            System.out.println("Can not read the file");
        }
    }

    public String removeTag(String data) {
        StringBuilder regex = new StringBuilder("<script[^>]*>(.*?)</script>");
        int flags = Pattern.MULTILINE | Pattern.DOTALL | Pattern.CASE_INSENSITIVE;
        Pattern pattern = Pattern.compile(regex.toString(), flags);
        Matcher matcher = pattern.matcher(data);
        return matcher.replaceAll("");
    }

    public void saveFile(String str) throws IOException {
        try {
            FileWriter fw = new FileWriter("Test1.html");
            fw.write(str, 0, str.length());
            fw.close();
        } catch (Exception e) {
            // TODO: handle exception
        }
    }

    public static void main(String[] args) throws IOException {
        TagsRemover tr = new TagsRemover();
        tr.readFile();
        String remover = tr.removeTag("script");
        tr.saveFile("Test1.html");
    }
}
But the code does not work as I want. Please help me correct it! Thanks a lot and best regards, [ April 17, 2007: Message edited by: Tran Tuan Hung ]
Thanks for the link, my friend, but I would like to use a regular expression. Someone? Anyone? Please!
Ulf Dittmer
Marshal
Joined: Mar 22, 2005
Posts: 35443
But the code is not works as i want
That's not a very useful problem description. What should the code do, and what does it do right now? Give some short examples.
Tran Tuan Hung
Ranch Hand
Joined: Apr 08, 2007
Posts: 59
Maybe my English is bad; I am sorry. I have created three methods:
- readFile() reads the file and saves its contents into a StringBuffer.
- removeTags() removes the script tags and their contents (only the content inside these tags).
- saveFile() saves the result, after the script tags and their contents have been removed, to another file (Test1.html).
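The three steps described above could be wired together roughly as follows. This is a minimal sketch, not the code posted in the thread: the class name ScriptTagRemover, the UTF-8 encoding, and the use of java.nio.file.Files are assumptions. The key point is that each step hands its result to the next one, so the file contents (not a literal string) go into removeTags, and its return value (not the file name) goes into saveFile. A regex like this is only a rough approximation of real HTML parsing.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.regex.Pattern;

// Hypothetical class: read a file, strip <script> blocks, write the result.
final class ScriptTagRemover {

    // DOTALL lets '.' span line breaks; CASE_INSENSITIVE also matches <SCRIPT>.
    private static final Pattern SCRIPT = Pattern.compile(
            "<script\\b[^>]*>.*?</script\\s*>",
            Pattern.DOTALL | Pattern.CASE_INSENSITIVE);

    // Reads the whole file into a String (UTF-8 is an assumption).
    static String readFile(String name) throws IOException {
        return new String(Files.readAllBytes(Paths.get(name)), StandardCharsets.UTF_8);
    }

    // Returns the input with every <script>...</script> block removed.
    static String removeTags(String html) {
        return SCRIPT.matcher(html).replaceAll("");
    }

    // Writes the content to the given file, replacing it if it exists.
    static void saveFile(String name, String content) throws IOException {
        Files.write(Paths.get(name), content.getBytes(StandardCharsets.UTF_8));
    }

    public static void main(String[] args) throws IOException {
        String html = readFile("Test.html");      // the original file is left untouched
        saveFile("Test1.html", removeTags(html)); // the cleaned copy goes to a second file
    }
}
```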
Ulf Dittmer
Marshal
Joined: Mar 22, 2005
Posts: 35443
Yes, but what is the problem? Give us an example of an HTML file/snippet that you are working on, and what it looks like after the code transforms it. Then show us the code that does the transformation.
Tran Tuan Hung
Ranch Hand
Joined: Apr 08, 2007
Posts: 59
OK, my code is still confusing; I am sorry. The input HTML file is "Test.html":
<html> <head> <title> i am noob </title> <script type="javascript"> function a () { }
After deleting the script tag and its content, I would like to have a file Test1.html with the following content, but my code is confused, so it does not work. Here is the output file I want: "Test1.html"
<html> <head> <title> i am noob </title> </head> <body> <div> abcd</div> </body> </html>
Thank you for listening!
Ulf Dittmer
Marshal
Joined: Mar 22, 2005
Posts: 35443
I wouldn't use regexps in this case. The following is pseudocode that removes script tags. Be sure to make it case-insensitive.
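The pseudocode itself is not reproduced here. As an illustration only, one non-regex approach is a case-insensitive search-and-skip loop: copy everything up to the next "<script", jump past the matching "</script>", and repeat. The Java sketch below is one possible reading of that advice, not necessarily what was posted; the class and method names are hypothetical.

```java
// Hypothetical helper: removes <script>...</script> blocks without regexes.
final class ScriptStripper {

    private static final String OPEN = "<script";
    private static final String CLOSE = "</script>";

    // Case-insensitive indexOf; returns -1 if `needle` does not occur at or after `from`.
    static int indexOfIgnoreCase(String text, String needle, int from) {
        for (int i = from; i <= text.length() - needle.length(); i++) {
            if (text.regionMatches(true, i, needle, 0, needle.length())) {
                return i;
            }
        }
        return -1;
    }

    static String stripScripts(String html) {
        StringBuilder out = new StringBuilder(html.length());
        int pos = 0; // start of the text not yet copied
        while (true) {
            int open = indexOfIgnoreCase(html, OPEN, pos);
            if (open < 0) {
                break; // no more script tags
            }
            int close = indexOfIgnoreCase(html, CLOSE, open);
            if (close < 0) {
                break; // unclosed tag: leave the remainder unchanged
            }
            out.append(html, pos, open);  // copy the text before the script block
            pos = close + CLOSE.length(); // continue after </script>
        }
        out.append(html, pos, html.length()); // copy whatever is left
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(stripScripts(
                "<head><SCRIPT type=\"javascript\">function a() { }</Script></head>"));
    }
}
```

Using regionMatches with ignoreCase avoids lower-casing the whole page first, so the indices always refer to the original string.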
[ April 17, 2007: Message edited by: Ulf Dittmer ]
Tran Tuan Hung
Ranch Hand
Joined: Apr 08, 2007
Posts: 59
Thank you very much, Ulf Dittmer.
Joseph Sweet
Ranch Hand
Joined: Jan 29, 2005
Posts: 327
I think you are trying to reinvent the wheel (which is what they let you do in school, you know, parsing manually every byte).
This is not C here. You can learn C in 2 days because it comes with no libraries.
There are so many XML packages in Java; why don't you learn to use some of them? Sure, it takes time.
Yes, because ANSI C comes without thousands of APIs.
Not talking about POSIX or Visual C++ or .Net libraries. Just ANSI C.
98% of the time I spend on Java is to find out what API I should actually use. To understand how to use this or that API. How do I find my way in that labyrinth of alternative APIs. Not on issues that have to do with the pure language itself.
Originally posted by Joseph Sweet: Yes, because ANSI C comes without thousands of APIs.
Not talking about POSIX or Visual C++ or .Net libraries. Just ANSI C.
98% of the time I spend on Java is to find out what API I should actually use. To understand how to use this or that API. How do I find my way in that labyrinth of alternative APIs. Not on issues that have to do with the pure language itself.
[ April 20, 2007: Message edited by: Joseph Sweet ]
I think you need to differentiate between learning to program in Java and learning to develop enterprise applications using Java. They are two totally separate things. I doubt if anyone would be able to write an enterprise application in C after 2 days. [ April 20, 2007: Message edited by: Joanne Neal ]