Best Java API for balancing HTML tags

Ajay Dhar
Ranch Hand

Joined: Jan 26, 2011
Posts: 30
Does anyone know of a really good API for balancing HTML tags? Say for example I have the following HTML snippet:

Feedback control is the basic mechanism by which systems, whether mechanical, electrical, or biological, maintain their equilibrium or homeostasis. In the higher life forms, the conditions under which life can continue are quite narrow. A change in body temperature of half a degree is generally a sign of illness. The homeostasis of the body is maintained through the use of feedback control [Wiener 1948]. A primary contribution of C.R. Darwin during the last century was the theory that feedback over long time periods is responsible for the evolution of species. In 1931 V. Volterra explained the balance between two populations of fish in a closed pond using the theory of feedback.</P>
Feedback control may be defined as the use of difference signals, determined by comparing the actual values of system variables to their desired values, as a means of controlling a system. An everyday example of a feedback control system is an automobile speed control, which uses the difference between the actual and the desired speed to vary the fuel flow rate. Since the system output is used to regulate its input, such a device is said to be a <em>closed-loop control system</em>.</P>
In this book we shall show how to use <em>modern control theory</em> to design feedback control systems. Thus, we are concerned not with natural control systems, such as those that occur in living organisms or in society, but with man-made control systems such as those used to control aircraft, automobilies, satellites, robots, and industrial processes.</P>
Realizing that the best way to understand an area is to examine its evolution and the reasons for its existence, we shall first provide a short history of automatic control theory. Then, we give a brief discussion of the philosophies of classical and modern control theory.</P>
The references for Chapter 1 are at the end of this chapter. The references for the remainder of the book appear at the end of the book.</P>

Each paragraph ends with a closing </P> tag, but none begins with an opening <p> tag. I'm looking for an API that balances the <p> tags accurately, so that I can convert the HTML page to XML and extract the paragraphs with XPath.


OCPJP 6, OCEEJBD 6, GIAC Secure Software Programmer-Java (GSSP-Java)
Tim Moores

Joined: Sep 21, 2011
Posts: 2413
Check out NekoHTML, JTidy, TagSoup and HtmlCleaner.
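
To make the goal in the question concrete, here is a minimal sketch that uses only the JDK. It is not how any of the four libraries above work; they are real HTML parsers and cope with far more kinds of broken markup. The class name ParagraphBalancer, the methods balance and paragraphs, and the synthetic <root> element are all made up for this illustration. The sketch assumes the only defect is the missing opening <p>, and that the rest of the markup is already well-formed XML (no undefined entities such as &nbsp;, no unclosed <br>, no attributes on <p>).

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical helper, for illustration only: a naive stand-in for a real HTML cleaner.
class ParagraphBalancer {

    // Treats every closing </p> (any case) as a paragraph boundary and
    // re-emits each chunk as <p>...</p> under a synthetic <root> element.
    // Any text left after the last </p> is wrapped as a paragraph too.
    static String balance(String html) {
        StringBuilder out = new StringBuilder("<root>");
        for (String chunk : html.split("(?i)</p\\s*>")) {
            String text = chunk.trim();
            if (text.isEmpty()) {
                continue; // nothing between two closing tags
            }
            // Drop a bare opening <p> if one is already there, to avoid nesting.
            text = text.replaceFirst("(?i)^<p\\s*>", "");
            out.append("<p>").append(text).append("</p>");
        }
        return out.append("</root>").toString();
    }

    // Parses the balanced markup as XML and returns the text of each /root/p node.
    static List<String> paragraphs(String xml) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
            NodeList nodes = (NodeList) XPathFactory.newInstance()
                    .newXPath()
                    .evaluate("/root/p", doc, XPathConstants.NODESET);
            List<String> result = new ArrayList<>();
            for (int i = 0; i < nodes.getLength(); i++) {
                result.add(nodes.item(i).getTextContent());
            }
            return result;
        } catch (Exception e) {
            // Input that is still not well-formed XML ends up here.
            throw new IllegalArgumentException("Markup is not well-formed XML", e);
        }
    }

    public static void main(String[] args) {
        String html = "Such a device is a <em>closed-loop control system</em>.</P>\n"
                + "The references for Chapter 1 are at the end of this chapter.</P>";
        for (String p : paragraphs(balance(html))) {
            System.out.println(p);
        }
    }
}
```

For anything beyond a controlled snippet like this one, one of the libraries above is the safer route: a regex split cannot recover from unclosed inline tags or HTML-only entities, whereas a tolerant parser can hand you a DOM or SAX stream that XPath works on directly.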