Regex Servlet issue

Dean O'olish

Joined: Mar 03, 2009
Posts: 16
I am obtaining a string of HTML and I need to parse this HTML and turn it into a text file. I have everything else working except a way to parse the HTML string. There are a handful of tags that I need to do some formatting on; for example a table row (<tr> or <tr class=""..) needs to be replaced with a "\r".

I am having a hard time finding a regular expression that can accomplish this.

Below is exactly what I am trying to do

should be parsed to

Can anyone help me out with this?

Bear Bibeault
Author and ninkuma

Joined: Jan 10, 2002
Posts: 63540

Nothing to do with servlets really. Moved to Java in General.

Henry Wong

Joined: Sep 28, 2004
Posts: 20370

Dean O'olish wrote:
Can anyone help me out with this?

Well, I guess you can extract each row with a find() on "<tr>(.*?)</tr>". Once you have the row, you can extract the columns with a find() on "<td>(.*?)</td>". In both cases, the extracted text is in group 1, of course.
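
For illustration, the two nested find() loops could look roughly like the sketch below. The class and method names (RowExtractor, extractRows) are made up for this example, and it assumes the tags carry no attributes and each row sits on a single line, exactly as the two patterns above are written.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper class, not part of any library.
class RowExtractor {
    // Reluctant quantifiers (.*?) make each match stop at the nearest closing tag.
    private static final Pattern ROW = Pattern.compile("<tr>(.*?)</tr>");
    private static final Pattern CELL = Pattern.compile("<td>(.*?)</td>");

    /** Returns one list of cell texts for every <tr>...</tr> found in html. */
    static List<List<String>> extractRows(String html) {
        List<List<String>> rows = new ArrayList<>();
        Matcher rowMatcher = ROW.matcher(html);
        while (rowMatcher.find()) {
            List<String> cells = new ArrayList<>();
            // Group 1 of the row match is the text between <tr> and </tr>.
            Matcher cellMatcher = CELL.matcher(rowMatcher.group(1));
            while (cellMatcher.find()) {
                cells.add(cellMatcher.group(1));
            }
            rows.add(cells);
        }
        return rows;
    }

    public static void main(String[] args) {
        String html = "<table><tr><td>a</td><td>b</td></tr><tr><td>c</td></tr></table>";
        // Print one line per row, cells separated by tabs.
        for (List<String> cells : extractRows(html)) {
            System.out.println(String.join("\t", cells));
        }
    }
}
```

From there, writing the text file is a matter of joining the cells and rows with whatever separators the output format needs.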


Piet Verdriet
Ranch Hand

Joined: Feb 25, 2006
Posts: 266
Henry Wong wrote:
Can anyone help me out with this?

Well, I guess you can extract each row, with a find() on "<tr>(.*?)</tr>". ...

@OP: be sure to use the DOT-ALL option (Pattern.DOTALL, or the embedded flag (?s)) when using Henry's suggestion: by default, the DOT meta character matches any character except a line terminator.
Also note that even a small mistake in your HTML can easily break the regex and produce strange or unexpected output. Parsing HTML is best done with a dedicated HTML parser, which can recover better from improperly formed HTML.

My 2 cents.
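
To make the DOT-ALL point concrete, here is a small sketch comparing the same row pattern with and without Pattern.DOTALL on a row that spans several lines. The class name DotAllDemo and the method countRows are invented for this example. The pattern is also loosened to "<tr\b[^>]*>" so that an opening tag with attributes (such as class="...") still matches, which is an addition to the pattern quoted above, not something from the earlier posts.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical demo class, not part of any library.
class DotAllDemo {
    // Default mode: '.' does not match line terminators.
    static final Pattern ROW_DEFAULT =
            Pattern.compile("<tr\\b[^>]*>(.*?)</tr>");
    // DOTALL mode: '.' matches line terminators too.
    static final Pattern ROW_DOTALL =
            Pattern.compile("<tr\\b[^>]*>(.*?)</tr>", Pattern.DOTALL);

    /** Counts how many rows the given pattern finds in html. */
    static int countRows(Pattern rowPattern, String html) {
        Matcher m = rowPattern.matcher(html);
        int count = 0;
        while (m.find()) {
            count++;
        }
        return count;
    }

    public static void main(String[] args) {
        // One row, with an attribute and line breaks inside it.
        String html = "<tr class=\"odd\">\n<td>a</td>\n</tr>";
        System.out.println(countRows(ROW_DEFAULT, html)); // prints 0
        System.out.println(countRows(ROW_DOTALL, html));  // prints 1
    }
}
```

This only addresses line breaks and attributes. Unclosed or nested tags will still trip the regex, which is where a real HTML parser earns its keep.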