
HTML parser

Flora Ng

Joined: Jul 05, 2001
Posts: 11
I'm stuck on this problem...
I'm writing a program that behaves like a parser. It reads the HTML, ignores everything inside the tags, and extracts the numbers that appear outside the tags (the numbers that are visible when the page is viewed in a browser).
The problem is that the program reads the input character by character. When it reaches a '<', it assumes a tag is opening and ignores everything until a '>' comes up.
So, for example, in the following sentence:
three < five
the program will keep looking for a '>' and never terminate.
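To make the problem concrete, here is a minimal sketch of the character-by-character approach described above. The names (NaiveNumberScanner, extractNumbers) and the sample inputs with digits are made up for illustration and are not taken from the actual program.

```java
import java.util.ArrayList;
import java.util.List;

class NaiveNumberScanner {

    // Collects runs of digits found outside tags.
    // Every '<' is assumed to open a tag, which is exactly the weakness.
    static List<String> extractNumbers(String html) {
        List<String> numbers = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        boolean insideTag = false;

        for (int i = 0; i < html.length(); i++) {
            char c = html.charAt(i);
            if (insideTag) {
                // Skip everything until the tag is closed.
                if (c == '>') {
                    insideTag = false;
                }
                continue;
            }
            if (c == '<') {
                insideTag = true;
            }
            if (Character.isDigit(c)) {
                current.append(c);
            } else if (current.length() > 0) {
                // A non-digit ends the number collected so far.
                numbers.add(current.toString());
                current.setLength(0);
            }
        }
        if (current.length() > 0) {
            numbers.add(current.toString());
        }
        return numbers;
    }

    public static void main(String[] args) {
        // Well-formed markup: both visible numbers are found.
        System.out.println(extractNumbers("<b>3</b> and <i>12</i>")); // [3, 12]
        // A literal '<' in the text: everything after it is swallowed.
        System.out.println(extractNumbers("3 < 5 and 12"));           // [3]
    }
}
```

On a finite string the loop simply runs off the end, but on a stream the scanner would keep waiting for a '>' that never arrives.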
What's the best solution to that? Is there any way to identify an HTML tag?
Thanks in advance.
Cindy Glass
"The Hood"

Joined: Sep 29, 2000
Posts: 8521
The traditional way around it is to use "&lt;" and "&gt;" if you want to display less-than and greater-than signs and have them recognized as not being HTML. This of course only works if YOU get to control the input into the HTML page.
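As a rough sketch of that idea, assuming you are the one generating the page text: the helper below is hypothetical (HtmlTextEscaper and escape are made-up names), and it also escapes '&' as "&amp;", which isn't mentioned above but is needed so that the entities themselves stay unambiguous.

```java
class HtmlTextEscaper {

    // Hypothetical helper: replaces the characters that HTML would
    // otherwise read as markup with their character entities.
    static String escape(String text) {
        StringBuilder out = new StringBuilder(text.length());
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            switch (c) {
                case '&':
                    out.append("&amp;");
                    break;
                case '<':
                    out.append("&lt;");
                    break;
                case '>':
                    out.append("&gt;");
                    break;
                default:
                    out.append(c);
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // Escape only the visible text, then add the real markup around it.
        String page = escape("three < five") + "<br />";
        System.out.println(page); // three &lt; five<br />
    }
}
```

Once the text is written this way, the only '<' characters left in the page are real tag openers, so a simple scanner that skips from '<' to '>' no longer gets stuck.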

[This message has been edited by Cindy Glass (edited August 16, 2001).]

"JavaRanch, where the deer and the Certified play" - David O'Meara