| Author |
Regex Servlet issue
|
Dean O'olish
Greenhorn
Joined: Mar 03, 2009
Posts: 16
|
|
I am receiving a string of HTML that I need to parse and write out as a text file. Everything else is working except the parsing of the HTML string. There are a handful of tags that need some formatting; for example, a table row (<tr> or <tr class=""..) needs to be replaced with a "\r".
I am having a hard time finding a regular expression that can accomplish this.
Below is exactly what I am trying to do
should be parsed to
Can anyone help me out with this?
Thanks!!!
|
 |
Bear Bibeault
Author and ninkuma
Marshal
Joined: Jan 10, 2002
Posts: 56162
|
|
|
Nothing to do with servlets really. Moved to Java in General.
|
|
 |
Henry Wong
author
Sheriff
Joined: Sep 28, 2004
Posts: 16681
|
|
Can anyone help me out with this?
Well, I guess you can extract each row, with a find() on "<tr>(.*?)</tr>". And once you have the row, you can extract the columns, with a find() on "<td>(.*?)</td>". In both cases, the extracted text is in group 1, of course.
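A minimal sketch of the two nested find() loops described above. The class name, method name, and sample HTML are made up for illustration; only the two patterns come from the post. Note that these patterns match only bare <tr> and <td> tags on a single line.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper class; the name is not from the thread.
final class TableRegexSketch {
    // The two patterns suggested in the post, unchanged.
    private static final Pattern ROW = Pattern.compile("<tr>(.*?)</tr>");
    private static final Pattern CELL = Pattern.compile("<td>(.*?)</td>");

    // Returns one list of cell texts per matched row.
    static List<List<String>> extractRows(String html) {
        List<List<String>> rows = new ArrayList<>();
        Matcher rowMatcher = ROW.matcher(html);
        while (rowMatcher.find()) {
            List<String> cells = new ArrayList<>();
            // Group 1 holds everything between <tr> and </tr>.
            Matcher cellMatcher = CELL.matcher(rowMatcher.group(1));
            while (cellMatcher.find()) {
                cells.add(cellMatcher.group(1));
            }
            rows.add(cells);
        }
        return rows;
    }

    public static void main(String[] args) {
        // Sample input invented for this sketch.
        String html = "<table><tr><td>Alice</td><td>30</td></tr>"
                    + "<tr><td>Bob</td><td>25</td></tr></table>";
        for (List<String> row : extractRows(html)) {
            System.out.println(String.join("\t", row));
        }
    }
}
```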
Henry
|
|
 |
Piet Verdriet
Ranch Hand
Joined: Feb 25, 2006
Posts: 266
|
|
Henry Wong wrote:
Can anyone help me out with this?
Well, I guess you can extract each row, with a find() on "<tr>(.*?)</tr>". ...
@OP: be sure to use the DOTALL option (Pattern.DOTALL) when using Henry's suggestion: by default, the DOT meta character matches any character except line terminators, so a row that spans several lines would not match.
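A rough sketch of how the DOTALL flag could be applied. Beyond what was suggested in this thread, it assumes a few things: the `\b[^>]*` part is added so that tags with attributes (such as a class) also match, CASE_INSENSITIVE is added for upper-case tags, and the tab and "\r\n" separators are just one possible output format. The class and method names are hypothetical.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper class; the name is not from the thread.
final class DotAllSketch {
    // DOTALL lets '.' match line terminators; \b[^>]* tolerates attributes.
    private static final Pattern ROW = Pattern.compile(
            "<tr\\b[^>]*>(.*?)</tr>", Pattern.DOTALL | Pattern.CASE_INSENSITIVE);
    private static final Pattern CELL = Pattern.compile(
            "<td\\b[^>]*>(.*?)</td>", Pattern.DOTALL | Pattern.CASE_INSENSITIVE);

    // Joins the cells of each row with tabs and ends each row with "\r\n".
    static String toText(String html) {
        StringBuilder out = new StringBuilder();
        Matcher row = ROW.matcher(html);
        while (row.find()) {
            Matcher cell = CELL.matcher(row.group(1));
            boolean first = true;
            while (cell.find()) {
                if (!first) {
                    out.append('\t');
                }
                out.append(cell.group(1).trim());
                first = false;
            }
            out.append("\r\n");
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // Sample input invented for this sketch: one row spread over several lines.
        String html = "<tr class=\"odd\">\n<td>a</td>\n<td align=\"left\">b</td>\n</tr>";
        System.out.print(toText(html));
    }
}
```

Without the flag, a pattern such as "<tr>(.*?)</tr>" finds nothing in "<tr>\n<td>a</td>\n</tr>", because the dot stops at the first line break.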
Also note that when there's a (small) mistake in your HTML, the regex can easily break and produce strange (or unexpected) output. Parsing HTML is best done with a dedicated HTML parser, which can recover better from improperly formed HTML.
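For illustration only, here is a sketch of the parser-based approach using the HTML parser that ships with the JDK (javax.swing.text.html.parser.ParserDelegator). This is one possible choice, not something recommended in this thread; third-party parsers such as jsoup are another option. The class name, method name, and the tab and "\r\n" separators are assumptions made for the example.

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import javax.swing.text.html.HTML;
import javax.swing.text.html.HTMLEditorKit;
import javax.swing.text.html.parser.ParserDelegator;

// Hypothetical helper class; the name is not from the thread.
final class ParserSketch {
    // Collects text content, appending a tab after each cell and "\r\n" after each row.
    static String toText(String html) {
        final StringBuilder out = new StringBuilder();
        HTMLEditorKit.ParserCallback callback = new HTMLEditorKit.ParserCallback() {
            @Override
            public void handleText(char[] data, int pos) {
                out.append(data);
            }

            @Override
            public void handleEndTag(HTML.Tag tag, int pos) {
                if (tag == HTML.Tag.TD) {
                    out.append('\t');
                } else if (tag == HTML.Tag.TR) {
                    out.append("\r\n");
                }
            }
        };
        try {
            // The third argument tells the parser to ignore charset declarations.
            new ParserDelegator().parse(new StringReader(html), callback, true);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // Sample input invented for this sketch.
        String html = "<table><tr class=\"odd\"><td>Alice</td><td>30</td></tr>"
                    + "<tr><td>Bob</td><td>25</td></tr></table>";
        System.out.print(toText(html));
    }
}
```

The callback never looks at the tag's attributes, so a row written as <tr class=""..> is handled the same way as a plain <tr>.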
My 2 cents.
|