
Regex Servlet issue

 
Dean O'olish
Greenhorn
Posts: 16
I am obtaining a string of HTML and I need to parse this HTML and turn it into a text file. I have everything else working except a way to parse the HTML string. There are a handful of tags that I need to do some formatting on; for example a table row (<tr> or <tr class=""..) needs to be replaced with a "\r".

I am having a hard time finding a regular expression that can accomplish this.

Below is exactly what I am trying to do



should be parsed to



Can anyone help me out with this?

Thanks!!!
 
Bear Bibeault
Author and ninkuma
Marshal
Pie
Posts: 64715
86
IntelliJ IDE Java jQuery Mac Mac OS X
Nothing to do with servlets really. Moved to Java in General.
 
Henry Wong
author
Marshal
Pie
Posts: 21003
77
C++ Chrome Eclipse IDE Firefox Browser Java jQuery Linux VI Editor Windows
Dean O'olish wrote:
Can anyone help me out with this?


Well, I guess you can extract each row, with a find() on "<tr>(.*?)</tr>". And once you have the row, you can extract the columns, with a find() on "<td>(.*?)</td>". In both cases, the extracted text is in group 1, of course.
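
A minimal sketch of that two-step find() approach, for illustration only. The class name, the tab between columns, and the "\r" after each row are assumptions rather than anything specified in the thread, and the patterns only match bare <tr> and <td> tags on a single line.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class TableTextExtractor {
    // Reluctant (.*?) stops at the first closing tag instead of the last one.
    private static final Pattern ROW = Pattern.compile("<tr>(.*?)</tr>");
    private static final Pattern CELL = Pattern.compile("<td>(.*?)</td>");

    /** Returns the group 1 text of every match of the pattern in the input. */
    public static List<String> findAll(Pattern pattern, String input) {
        List<String> results = new ArrayList<>();
        Matcher m = pattern.matcher(input);
        while (m.find()) {
            results.add(m.group(1));
        }
        return results;
    }

    /** Assumed output format: cells joined by tabs, each row ended by "\r". */
    public static String toText(String html) {
        StringBuilder out = new StringBuilder();
        for (String row : findAll(ROW, html)) {
            out.append(String.join("\t", findAll(CELL, row))).append("\r");
        }
        return out.toString();
    }

    public static void main(String[] args) {
        String html = "<table><tr><td>a</td><td>b</td></tr><tr><td>c</td></tr></table>";
        // Make the carriage returns visible when printing.
        System.out.println(toText(html).replace("\r", "\\r"));
    }
}
```

Rows with attributes or line breaks inside them would need a more tolerant pattern than the one used here.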

Henry
 
Piet Verdriet
Ranch Hand
Posts: 266
Henry Wong wrote:
Can anyone help me out with this?


Well, I guess you can extract each row, with a find() on "<tr>(.*?)</tr>". ...


@OP: be sure to use the DOTALL option (Pattern.DOTALL) when using Henry's suggestion: by default, the dot metacharacter matches any character except line terminators.
Also note that even a small mistake in your HTML can easily break the regex and produce strange or unexpected output. Parsing HTML is best done with a dedicated HTML parser, which can recover better from improperly formed HTML.
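
To illustrate the DOTALL point, here is a small sketch comparing the default pattern with a more tolerant one. The class and field names are hypothetical, and the `<tr\b[^>]*>` opening tag (to allow attributes such as class) is an assumption added for the example, not something proposed in the thread. It is still a regex and will still break on malformed HTML.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class DotAllRows {
    // Default flags: '.' does not match line terminators.
    public static final Pattern DEFAULT_ROW = Pattern.compile("<tr>(.*?)</tr>");

    // DOTALL lets '.' match newlines; [^>]* tolerates attributes in the opening tag.
    public static final Pattern TOLERANT_ROW = Pattern.compile(
            "<tr\\b[^>]*>(.*?)</tr>", Pattern.DOTALL | Pattern.CASE_INSENSITIVE);

    /** Collects group 1 of every row match. */
    public static List<String> rows(Pattern pattern, String html) {
        List<String> result = new ArrayList<>();
        Matcher m = pattern.matcher(html);
        while (m.find()) {
            result.add(m.group(1));
        }
        return result;
    }

    public static void main(String[] args) {
        String html = "<tr>\n<td>a</td>\n</tr>";
        // The default pattern finds nothing because the row spans several lines.
        System.out.println(rows(DEFAULT_ROW, html).size());
        // The DOTALL pattern finds the single row.
        System.out.println(rows(TOLERANT_ROW, html).size());
    }
}
```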

My 2 cents.
 