Parsing html text.

Tad Dicks
Ranch Hand

Joined: Nov 16, 2004
Posts: 264
I have to write a class that parses some programmatically generated HTML input and puts items that logically belong in lists into lists. For example, if a piece of text starts with:
A. or 1. or (1) or (a) or a.

the class should assume the start of a list and then try to pick up lists inside of lists. The main difficulty I see is deciding where a list ends, since these lists sit inside larger documents. The class is going to have to make some best guesses. I'm just wondering whether there is already something similar out there and, if not, what's the best way to tackle the problem.

-Tad
Jody Brown
Ranch Hand

Joined: Nov 09, 2005
Posts: 43
Without seeing a sample of the HTML you are trying to parse, my suggestions are limited, but off the top of my head you could consider the following. If your lists are stored in regular HTML structures (dropdown boxes, HTML lists, etc.), you could tokenise on the tags (the <option> and </option> tags in a select box, for example) and extract the strings between the two tokens for storage in your Java data structure. Alternatively, you could write a utility class that searches for the common list identifiers you gave as examples, using the String.indexOf() method to find the opening and closing brackets, and then extracts the rest of the string from that point onwards for storage.

Hope this helps.
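
For readers following along, the tag-tokenising idea above could look roughly like the sketch below. The class and method names (TagTokenizer, textBetween) are made up for illustration, and the sketch assumes tags without attributes that are not nested inside themselves; real generated HTML may need more care.

```java
import java.util.ArrayList;
import java.util.List;

final class TagTokenizer {
    /** Returns the text found between each open/close pair, in document order. */
    static List<String> textBetween(String html, String open, String close) {
        List<String> parts = new ArrayList<>();
        int from = 0;
        while (true) {
            int start = html.indexOf(open, from);
            if (start < 0) {
                break; // no more opening tags
            }
            int contentStart = start + open.length();
            int end = html.indexOf(close, contentStart);
            if (end < 0) {
                break; // unbalanced tag: stop rather than guess
            }
            parts.add(html.substring(contentStart, end).trim());
            from = end + close.length();
        }
        return parts;
    }

    public static void main(String[] args) {
        String html = "<select><option>Red</option><option>Green</option></select>";
        System.out.println(textBetween(html, "<option>", "</option>")); // prints [Red, Green]
    }
}
```

Because the search is an exact string match, an opening tag written as <option value="1"> would not be found; that limitation is part of the assumption stated above.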
Tad Dicks
Unfortunately the HTML doesn't include things like dropdown boxes. Most of it is tagged with p, span, and div tags (and table tags). I was thinking along the same lines: using indexOf to find the character sequences that start a list.

-Tad
Jody Brown
Well, tokenising might still be worth considering. You can tokenise at any level and use, for example, the <p> and </p> tags to grab everything inside a paragraph. The same goes for the <table> and </table> tags. This might cut out a lot of the fluff before you get down to the dirty job of parsing the strings using indexOf. This might be useful if you have nested lists. Be aware that you are liable to run into some processing overhead if your lists are nested fairly deeply, especially if you use something like recursion to dig down into them automatically.
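
As one possible way to follow nested lists without recursion, the sketch below keeps the styles of the currently open lists on a stack and guesses a nesting depth for each marker. Everything here is an assumption for illustration: the names ListNesting, styleOf and depths are invented, and the rule "a marker style already seen closes the inner lists, an unseen style opens a new one" is only a best-guess heuristic, not something the thread settles on. It needs Java 9 or later for List.of.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;

final class ListNesting {
    /** Classifies a marker such as "A.", "1.", "(1)", "(a)" or "a." by its shape. */
    static String styleOf(String marker) {
        boolean paren = marker.startsWith("(");
        char c = marker.charAt(paren ? 1 : 0);
        String kind = Character.isDigit(c) ? "digit"
                : Character.isUpperCase(c) ? "upper" : "lower";
        return kind + (paren ? "-paren" : "-dot");
    }

    /** Returns the guessed nesting depth (1 = outermost) for each marker, in order. */
    static List<Integer> depths(List<String> markers) {
        Deque<String> open = new ArrayDeque<>(); // styles of the lists currently open
        List<Integer> result = new ArrayList<>();
        for (String m : markers) {
            String style = styleOf(m);
            if (open.contains(style)) {
                // Style already open at some level: close inner lists back to it.
                while (!open.peek().equals(style)) {
                    open.pop();
                }
            } else {
                open.push(style); // unseen style: assume a new nested list starts
            }
            result.add(open.size());
        }
        return result;
    }

    public static void main(String[] args) {
        // "1." opens depth 1, "(a)" and "(b)" sit at depth 2, "2." returns to depth 1.
        System.out.println(depths(List.of("1.", "(a)", "(b)", "2.", "(a)"))); // prints [1, 2, 2, 1, 2]
    }
}
```

The loop does a fixed amount of stack work per marker, so deep nesting costs memory for the stack only, not a chain of recursive calls. It does not decide when the outermost list ends; that still needs a separate rule, such as a paragraph with no marker.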
Tad Dicks
I think I'm going to delve into using the Pattern/Matcher classes to do it... the span, p, and similar tags show up everywhere in the text, splitting things in some odd places.


-Tad
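
A minimal sketch of the Pattern/Matcher approach, for anyone landing on this thread later. It is not the code Tad ended up with; the class name ListMarkerFinder, the choice of which closing tags count as line breaks, and the marker regex are all assumptions made for illustration. The idea is to turn block-level tags into line breaks, strip the remaining tags so a span cannot split a marker from its text, and then match the marker shapes from the first post at the start of each line.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class ListMarkerFinder {
    // Any remaining tag; removed because span/p tags can split a marker from its text.
    private static final Pattern TAG = Pattern.compile("<[^>]+>");
    // "(1)" or "(a)", or "1." / "A." / "a.", at the start of a line, followed by whitespace.
    private static final Pattern MARKER = Pattern.compile(
            "(?m)^\\s*(\\((?:\\d+|[A-Za-z])\\)|(?:\\d+|[A-Za-z])\\.)\\s+");

    /** Returns the list markers found at the start of lines, in document order. */
    static List<String> findMarkers(String html) {
        // Assumption: closing p/div/tr tags and <br> mark the end of a line of text.
        String text = html.replaceAll("(?i)</(p|div|tr)>|<br\\s*/?>", "\n");
        text = TAG.matcher(text).replaceAll("");
        List<String> markers = new ArrayList<>();
        Matcher m = MARKER.matcher(text);
        while (m.find()) {
            markers.add(m.group(1));
        }
        return markers;
    }

    public static void main(String[] args) {
        String html = "<p><span>A.</span> First</p>"
                + "<p>(1) <span>Nested</span></p>"
                + "<p>Plain text.</p>";
        System.out.println(findMarkers(html)); // prints [A., (1)]
    }
}
```

The pattern will also pick up lines that merely look like markers (an initial such as "J. Smith" at the start of a paragraph, for instance), so the results are candidates for the best-guess logic rather than confirmed list items.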
 