I receive a string of HTML that I need to parse and write out as a text file. Everything else works except parsing the HTML string. A handful of tags need special formatting; for example, a table row (<tr> or <tr class=""..) needs to be replaced with a "\r".
I am having a hard time finding a regular expression that can accomplish this.
Well, I guess you can extract each row, with a find() on "<tr>(.*?)</tr>". And once you have the row, you can extract the columns, with a find() on "<td>(.*?)</td>". In both cases, the extracted text is in group 1, of course.
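A minimal sketch of that two-step extraction, assuming java.util.regex and that each row sits on a single line. The class name TableText, the method names, and the tab between cells are made up for illustration; the "\\b[^>]*" part is also an addition, so that rows written as <tr class=""..> match too.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not from the thread: pulls rows, then cells, via find() and group(1).
class TableText {
    // "\\b[^>]*" tolerates attributes such as class="..." without matching other tag names.
    private static final Pattern ROW = Pattern.compile("<tr\\b[^>]*>(.*?)</tr>");
    private static final Pattern CELL = Pattern.compile("<td\\b[^>]*>(.*?)</td>");

    // Collects group 1 of every match of p in input.
    static List<String> groups(Pattern p, String input) {
        List<String> out = new ArrayList<>();
        Matcher m = p.matcher(input);
        while (m.find()) {
            out.add(m.group(1));
        }
        return out;
    }

    // Joins the cells of each row with a tab (an assumption) and ends each row with "\r".
    static String toText(String html) {
        StringBuilder sb = new StringBuilder();
        for (String row : groups(ROW, html)) {
            sb.append(String.join("\t", groups(CELL, row))).append("\r");
        }
        return sb.toString();
    }
}
```

With the input `<tr class="odd"><td>a</td><td>b</td></tr><tr><td>c</td></tr>`, toText returns "a\tb\rc\r".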
Well, I guess you can extract each row, with a find() on "<tr>(.*?)</tr>". ...
@OP: be sure to use the DOT-ALL option (Pattern.DOTALL) when using Henry's suggestion: by default, the DOT meta character matches any character except line terminators such as \n and \r.
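To see the difference, here is a small sketch assuming java.util.regex; the class DotAllDemo and the method firstRow are hypothetical names used only for this comparison. It runs the same pattern against a row that spans several lines, once with no flags and once with Pattern.DOTALL.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical demo class: same pattern, with and without DOTALL.
class DotAllDemo {
    // Returns group 1 of the first match, or null when nothing matches.
    static String firstRow(String html, int flags) {
        Matcher m = Pattern.compile("<tr>(.*?)</tr>", flags).matcher(html);
        return m.find() ? m.group(1) : null;
    }

    public static void main(String[] args) {
        String html = "<tr>\n<td>a</td>\n</tr>";
        // Without DOTALL the dot stops at the first "\n", so there is no match.
        System.out.println(firstRow(html, 0) == null);
        // With DOTALL the dot also matches line terminators, so the row is found.
        System.out.println(firstRow(html, Pattern.DOTALL) != null);
    }
}
```

The same effect can be had by putting the embedded flag (?s) at the start of the pattern instead of passing Pattern.DOTALL.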
Also note that when there's a (small) mistake in your HTML, the regex can easily break and produce strange (or unexpected) output. Parsing HTML is best done with a dedicated HTML parser, which can recover better from improperly formed HTML.
My 2 cents.
I’ve looked at a lot of different solutions, and in my humble opinion Aspose is the way to go. Here’s the link: http://aspose.com