
Regex Servlet issue

Dean O'olish
Greenhorn

Joined: Mar 03, 2009
Posts: 16
I am obtaining a string of HTML and I need to parse this HTML and turn it into a text file. I have everything else working except a way to parse the HTML string. There are a handful of tags that I need to do some formatting on; for example a table row (<tr> or <tr class=""..) needs to be replaced with a "\r".

I am having a hard time finding a regular expression that can accomplish this.

Below is exactly what I am trying to do



should be parsed to



Can anyone help me out with this?

Thanks!!!
Bear Bibeault
Author and ninkuma
Marshal

Joined: Jan 10, 2002
Posts: 60752

Nothing to do with servlets really. Moved to Java in General.


Henry Wong
author
Sheriff

Joined: Sep 28, 2004
Posts: 18504

Dean O'olish wrote:
Can anyone help me out with this?


Well, I guess you can extract each row with a find() on "<tr>(.*?)</tr>". Once you have the row, you can extract the columns with a find() on "<td>(.*?)</td>". In both cases, the extracted text is in group 1, of course.
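For reference, a minimal sketch of the two-level find() approach described above. The class name RowExtractor, the method extractRows and the sample HTML are made up for illustration, and the sketch assumes each row sits on a single line with no attributes on the tags.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not taken from the original poster's code.
class RowExtractor {
    // Reluctant quantifiers (.*?) stop at the first closing tag instead of the last one.
    private static final Pattern ROW = Pattern.compile("<tr>(.*?)</tr>");
    private static final Pattern CELL = Pattern.compile("<td>(.*?)</td>");

    // Returns one list of cell texts per <tr>...</tr> found in the input.
    static List<List<String>> extractRows(String html) {
        List<List<String>> rows = new ArrayList<>();
        Matcher rowMatcher = ROW.matcher(html);
        while (rowMatcher.find()) {
            List<String> cells = new ArrayList<>();
            // Group 1 holds the text between <tr> and </tr>.
            Matcher cellMatcher = CELL.matcher(rowMatcher.group(1));
            while (cellMatcher.find()) {
                cells.add(cellMatcher.group(1));
            }
            rows.add(cells);
        }
        return rows;
    }

    public static void main(String[] args) {
        // Assumed sample input; the real HTML from the question is not shown in the thread.
        String html = "<table><tr><td>a</td><td>b</td></tr><tr><td>c</td></tr></table>";
        for (List<String> row : extractRows(html)) {
            // Join the cells of each row with tabs, one row per output line.
            System.out.println(String.join("\t", row));
        }
    }
}
```

From there, each row can be written to the text file with whatever line terminator the output format needs.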

Henry


Piet Verdriet
Ranch Hand

Joined: Feb 25, 2006
Posts: 266
Henry Wong wrote:
Can anyone help me out with this?


Well, I guess you can extract each row, with a find() on "<tr>(.*?)</tr>". ...


@OP: be sure to use the DOTALL option (Pattern.DOTALL) when using Henry's suggestion: by default, the DOT meta character matches any character except a newline character.
Also note that when there's a (small) mistake in your HTML, the regex can easily break and produce strange (or unexpected) output. Parsing HTML is best done with a dedicated HTML parser, which can recover better from improperly formed HTML.
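To make the DOTALL point concrete, here is a small sketch comparing the same pattern with and without the flag on a row that spans several lines. The class name DotAllDemo and the sample strings are invented for the example. The third pattern, which also accepts attributes such as class="..." on the opening tag, is my own assumption about what the original question needs and not something suggested earlier in the thread.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical demo class, for illustration only.
class DotAllDemo {
    // Default mode: '.' does not match line terminators.
    static final Pattern ROW_DEFAULT = Pattern.compile("<tr>(.*?)</tr>");

    // DOTALL mode: '.' matches any character, including line terminators.
    static final Pattern ROW_DOTALL = Pattern.compile("<tr>(.*?)</tr>", Pattern.DOTALL);

    // Assumed variant: tolerates attributes on <tr> and ignores tag case.
    static final Pattern ROW_WITH_ATTRS =
            Pattern.compile("<tr\\b[^>]*>(.*?)</tr>", Pattern.DOTALL | Pattern.CASE_INSENSITIVE);

    // Returns the content of the first row matched by the pattern, or null if none is found.
    static String firstRow(Pattern pattern, String html) {
        Matcher m = pattern.matcher(html);
        return m.find() ? m.group(1) : null;
    }

    public static void main(String[] args) {
        String multiLine = "<tr>\n<td>x</td>\n</tr>";
        // Without DOTALL the row content cannot span the newlines, so nothing is found.
        System.out.println(firstRow(ROW_DEFAULT, multiLine) == null);
        // With DOTALL the same pattern finds the row.
        System.out.println(firstRow(ROW_DOTALL, multiLine) != null);
    }
}
```

Even with these flags the pattern still trips over nested tables or unclosed tags, which is why a real HTML parser is the safer choice.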

My 2 cents.
 
 