Reading HTML file

Hema Sukumar
Greenhorn

Joined: Dec 22, 2000
Posts: 22
Hi,
I'm trying to read an HTML page and print out the email IDs in
that page. I have a problem using StringTokenizer.
I pass "mailto:", the prefix used in HTML links to identify email IDs, as the delimiter string to the StringTokenizer.
Here's the sample HTML page and the Java code I used.
HTML Page
------------------------------------------------
<HTML>

<BODY LINK="#FFFF00" VLINK="#FFFF00" BGCOLOR="#000000">

<FONT FACE="Comic Sans MS" SIZE=5 Color="#FFFFFF">
<CENTER>

Hema's Page


<A HREF="mailto:hemasu@hotmail.com">Mail me...

</CENTER>

</BODY>
</HTML>
---------------------------------------------------------------
When I used "mailto" as the delimiter while reading this file,
I expected the tokenizer to split at "mailto" and print everything after it,
so that I could take the substring between the : and the " as the email ID,
i.e.
-----------------------------------------------
:hemasu@hotmail.com">Mail me...

</CENTER>
</BODY>
</HTML>
-----------------------------------------------
But what I got instead was every line split at each individual character of the delimiter string,
i.e. something like this...
---------------------------------------------------
<HTML>

<BODY LINK="#FFFF00" VLINK="#FFFF00" BGCOLOR="#000000">

<FONT FACE="C
c S
ns MS" SIZE=5 C
r="#FFFFFF">
<CENTER>

He
's P
ge


<A HREF="
he
su@h
.c
">M
e...

</CENTER>
</BODY>
</HTML>
--------------------------------------------------------------

The Java code I used is:
-------------------------------------------------------------
import java.io.BufferedReader;
import java.io.FileReader;
import java.io.FileWriter;
import java.io.PrintWriter;
import java.util.StringTokenizer;

// Class name is illustrative; the original post showed only main().
public class ReadHtml
{
    public static void main(String[] args)
    {
        try {
            String filename = "hema.html";
            BufferedReader br = new BufferedReader(new FileReader(filename));
            // Opened once, outside the loops, so it is still in scope for close() below
            PrintWriter pout = new PrintWriter(new FileWriter("email.txt"));

            String s;
            while ((s = br.readLine()) != null)
            {
                //System.out.println(s);
                String set = "mailto:";
                StringTokenizer st = new StringTokenizer(s, set);
                while (st.hasMoreTokens())
                {
                    String token = st.nextToken();
                    System.out.println(token);

                    int start = token.indexOf(':');
                    System.out.println(start);

                    int end = token.indexOf('"');
                    System.out.println(end);

                    String email = token.substring(start, end) + "," + "\n";
                    System.out.println(email);
                    pout.print(email);
                }
            }
            br.close();
            pout.close();
        }
        catch (Exception e) {
            System.err.println(e.getMessage());
        }
    }
}

It's reading the HTML page and printing every line correctly, but I have trouble printing and processing the tokens.
Any help will be highly appreciated.
Thanks,
Hema
Peter Tran
Bartender

Joined: Jan 02, 2001
Posts: 783
Hema,
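The problem is that StringTokenizer doesn't treat your delimiter string as a single delimiter, but rather as a set of single-character delimiters: each of m, a, i, l, t, o and : splits the line on its own.
Please read the following article on the pitfalls of the StringTokenizer class.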
-Peter
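
For anyone landing on this thread later, here is a minimal sketch of the behaviour described above and of one possible way around it. It is not taken from the linked article. The class name MailtoDemo and the helper names tokenize and extractEmails are made up for illustration, and the sketch assumes each address sits between "mailto:" and the next double quote on the same line, as in the sample page.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.StringTokenizer;

// Hypothetical demo class; none of these names come from the original post.
public class MailtoDemo {

    // Splits a line the way the posted code does: every character of
    // "mailto:" acts as an independent delimiter.
    static List<String> tokenize(String line) {
        List<String> tokens = new ArrayList<>();
        StringTokenizer st = new StringTokenizer(line, "mailto:");
        while (st.hasMoreTokens()) {
            tokens.add(st.nextToken());
        }
        return tokens;
    }

    // Searches for "mailto:" as a whole substring and collects the text
    // between it and the next double quote.
    static List<String> extractEmails(String line) {
        final String prefix = "mailto:";
        List<String> emails = new ArrayList<>();
        int from = 0;
        while (true) {
            int start = line.indexOf(prefix, from);
            if (start < 0) {
                break;              // no further mailto: on this line
            }
            start += prefix.length();
            int end = line.indexOf('"', start);
            if (end < 0) {
                break;              // unterminated attribute, give up on this line
            }
            emails.add(line.substring(start, end));
            from = end + 1;
        }
        return emails;
    }

    public static void main(String[] args) {
        String line = "<A HREF=\"mailto:hemasu@hotmail.com\">Mail me...";
        System.out.println(tokenize(line));       // fragments split at m, a, i, l, t, o and :
        System.out.println(extractEmails(line));  // the address found via indexOf
    }
}
```

On the sample link line, tokenize produces the same kind of fragments as in the output posted above ("he", "su@h", ".c", ...), while extractEmails yields hemasu@hotmail.com. Using String.indexOf with the whole prefix is only one option; a regular expression or a real HTML parser would be more robust for pages where the attribute is quoted differently or spans several lines.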
Hema Sukumar
Greenhorn

Joined: Dec 22, 2000
Posts: 22
Thanks Peter..
What will I ever do without Java Ranch ..
-Hema
 
 