
Regex Servlet issue

Dean O'olish

Joined: Mar 03, 2009
Posts: 16
I am obtaining a string of HTML and I need to parse this HTML and turn it into a text file. I have everything else working except a way to parse the HTML string. There are a handful of tags that I need to do some formatting on; for example a table row (<tr> or <tr class=""..) needs to be replaced with a "\r".

I am having a hard time finding a regular expression that can accomplish this.

Below is exactly what I am trying to do

should be parsed to

Can anyone help me out with this?

Bear Bibeault
Author and ninkuma

Joined: Jan 10, 2002
Posts: 63844

Nothing to do with servlets really. Moved to Java in General.

Henry Wong

Joined: Sep 28, 2004
Posts: 20517

Can anyone help me out with this?

Well, I guess you can extract each row, with a find() on "<tr>(.*?)</tr>". And once you have the row, you can extract the columns, with a find() on "<td>(.*?)</td>". In both cases, the extracted text is in group 1, of course.
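A minimal sketch of the two-level find() approach described above, in case it helps. The class and method names (TableRowExtractor, extractRows) are made up for illustration. The patterns are Henry's, with two additions that are assumptions on my part: the DOTALL flag, and an optional attribute part so that rows like <tr class=""> also match.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of any library.
public class TableRowExtractor {

    // Assumption: DOTALL so '.' also matches line breaks, and \b[^>]* so
    // that attributes such as class="..." on the tag are tolerated.
    private static final Pattern ROW =
            Pattern.compile("<tr\\b[^>]*>(.*?)</tr>",
                    Pattern.DOTALL | Pattern.CASE_INSENSITIVE);
    private static final Pattern CELL =
            Pattern.compile("<td\\b[^>]*>(.*?)</td>",
                    Pattern.DOTALL | Pattern.CASE_INSENSITIVE);

    // Returns one list of cell texts per table row found in the HTML.
    public static List<List<String>> extractRows(String html) {
        List<List<String>> rows = new ArrayList<>();
        Matcher rowMatcher = ROW.matcher(html);
        while (rowMatcher.find()) {
            List<String> cells = new ArrayList<>();
            // Group 1 of the row match holds everything between <tr> and </tr>.
            Matcher cellMatcher = CELL.matcher(rowMatcher.group(1));
            while (cellMatcher.find()) {
                cells.add(cellMatcher.group(1).trim());
            }
            rows.add(cells);
        }
        return rows;
    }

    public static void main(String[] args) {
        String html = "<table><tr class=\"odd\"><td>a</td><td>b</td></tr>\n"
                + "<tr><td>c</td></tr></table>";
        for (List<String> row : extractRows(html)) {
            // Join the cells of each row with a tab; the separator is arbitrary here.
            System.out.println(String.join("\t", row));
        }
    }
}
```

This only works on reasonably clean markup; nested tables or unclosed tags will confuse it.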


Piet Verdriet
Ranch Hand

Joined: Feb 25, 2006
Posts: 266
Henry Wong wrote:
Can anyone help me out with this?

Well, I guess you can extract each row, with a find() on "<tr>(.*?)</tr>". ...

@OP: be sure to use the DOT-ALL option when using Henry's suggestion: by default, the DOT meta character matches any character except a new line character.
Also note that when there's a (small) mistake in your HTML, the regex can easily break and produce strange (or unexpected) output. Parsing HTML is best done with a dedicated HTML parser, which can recover much better from improperly formed HTML.

My 2 cents.
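
To illustrate the DOT-ALL point, here is a small sketch. DotAllDemo and firstRow are hypothetical names used only for this example; Pattern.DOTALL is the standard java.util.regex flag, and the inline form (?s) at the start of the pattern has the same effect.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical demo class, only meant to show the effect of the flag.
public class DotAllDemo {

    // Returns the contents of the first <tr>...</tr> pair, or null if none matches.
    public static String firstRow(String html, int flags) {
        Matcher m = Pattern.compile("<tr>(.*?)</tr>", flags).matcher(html);
        return m.find() ? m.group(1) : null;
    }

    public static void main(String[] args) {
        // A row that spans several lines, as HTML often does.
        String html = "<tr>\n<td>cell</td>\n</tr>";

        // Without DOTALL, '.' stops at the line break, so nothing is found.
        System.out.println(firstRow(html, 0) == null);

        // With DOTALL, '.' also matches the line breaks and the row is found.
        System.out.println(firstRow(html, Pattern.DOTALL) != null);
    }
}
```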