
HTML to Text via Screen scrape

 
Dean O'olish
Greenhorn
Posts: 16
hello,

I am tasked with giving my client the ability to view a web page as text (think of a "T" icon in the top right corner of a JSP that they can click). The client would like to simply copy and paste what they see in front of them somewhere else. I bet many will say to just highlight the page and copy it, which is what we originally proposed; however, they say they are also getting hidden markup, which isn't what they want.

In any case, I need to find a way to get the HTML into text. I was thinking of screen scraping, but am having a hard time passing the information via AJAX since a web page is pretty large and the querystring can only have so much information.

Does anyone know of a good library to do this (whether it is via screen scraping or not)?

Thanks for your help,

- DTE
 
Bear Bibeault
Author and ninkuma
Marshal
Pie
Posts: 64718
86
IntelliJ IDE Java jQuery Mac Mac OS X
"Dean Ooooh", please check your private messages for an important administrative matter.

I've also moved this post to a more appropriate location.
 
Bear Bibeault
Author and ninkuma
Marshal
Pie
Posts: 64718
86
IntelliJ IDE Java jQuery Mac Mac OS X
Dean Oo wrote:but am having a hard time passing the information via AJAX since a web page is pretty large and the querystring can only have so much information.

Then don't pass it in the query string.
 
Dean O'olish
Greenhorn
Posts: 16
I am not, nor would I want to... I am just explaining what I have tried or was thinking about.

Bear Bibeault wrote:
Dean Oo wrote:but am having a hard time passing the information via AJAX since a web page is pretty large and the querystring can only have so much information.

Then don't pass it in the query string.
 
Paul Clapham
Sheriff
Posts: 20980
31
Eclipse IDE Firefox Browser MySQL Database
First step would be to determine the requirements.

It doesn't make sense to say you want to "get the HTML into text", because HTML is already text. And there's this "hidden markup"... what is that?

And when your client is viewing a web page, what are they using to do that? A browser connected to the web page? A text editor looking at the source code... the HTML?

And are these web pages just anything they can get to on the web, or just the web pages served by your site?
 
Dean O'olish
Greenhorn
Posts: 16
I am trying to get the HTML and parse it into plain text; however, I am having a hard time figuring out how to get that HTML content from either the front end or the back end. I tried capturing the dynamic content (it depends on request and session information) in the servlet and then parsing it, but I cannot figure that out.

Next I thought I would screen scrape and pass the HTML content as a string to the server through a POST, but the content was too big, and Ajax is just annoying with something like this.

Any suggestions on how to get the rendered content of a JSP into my servlet action (using Struts)?

Thanks



Paul Clapham wrote:First step would be to determine the requirements.

It doesn't make sense to say you want to "get the HTML into text", because HTML is already text. And there's this "hidden markup"... what is that?

And when your client is viewing a web page, what are they using to do that? A browser connected to the web page? A text editor looking at the source code... the HTML?

And are these web pages just anything they can get to on the web, or just the web pages served by your site?
 
Eric Pascarello
author
Rancher
Posts: 15385
6
"content was too big" What is too big?

Have you gone through basic Ajax tutorials on how to send POST data to the server and display a result on the page? If you have not, it might be a good time to try that rather than jumping right into a big task.

Eric
 
Dean O'olish
Greenhorn
Posts: 16
I have; I know Ajax/JS very well, using MooTools, Prototype, and jQuery as well as my own libraries. I've decided not to go with the idea of screen scraping and am trying to figure out how I can get the HTML content as a String through my Struts Action class. I've seen some posts where others have either suggested or tried creating an HttpServletResponseWrapper subclass, but I have not been able to make it work for me as of yet...

My workplace does not want to use a filter or custom tags, which leaves me with wrapping the response.

I think I may have to brush up on my OutputStream class. What I am trying to accomplish is:

1. view the HTML from the request in a readable form to make sure all the content is there.
2. take that readable content and put it through a proprietary library that turns a String of HTML into a text file
3. display that text file to the user
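
The three steps above can be sketched roughly as follows. This is only an illustration under stated assumptions: `RenderedPage` is a hypothetical class, it uses only JDK classes so that it runs standalone, and the regex-based `toPlainText` is a naive stand-in for the proprietary HTML-to-text library, not a real HTML parser.

```java
import java.io.CharArrayWriter;
import java.io.PrintWriter;

/** Hypothetical helper: buffers rendered output, then reduces it to plain text. */
final class RenderedPage {
    private final CharArrayWriter buffer = new CharArrayWriter();
    private final PrintWriter writer = new PrintWriter(buffer);

    /** The view writes here instead of to the real response. */
    PrintWriter getWriter() {
        return writer;
    }

    /** Step 1: the captured markup, for checking that all the content is there. */
    String html() {
        writer.flush();
        return buffer.toString();
    }

    /** Step 2: naive stand-in (assumption) for the proprietary HTML-to-text library. */
    static String toPlainText(String html) {
        return html
            // drop blocks whose content is never visible
            .replaceAll("(?is)<(script|style)\\b.*?</\\1\\s*>", "")
            // line breaks and common block-level end tags become newlines
            .replaceAll("(?i)(?:<br\\s*/?>|</(?:p|div|h[1-6]|li|tr)\\s*>)", "\n")
            // strip every remaining tag
            .replaceAll("(?s)<[^>]*>", "")
            // decode a handful of entities; &amp; goes last to avoid double decoding
            .replace("&nbsp;", " ")
            .replace("&lt;", "<")
            .replace("&gt;", ">")
            .replace("&amp;", "&")
            // tidy whitespace
            .replaceAll("[ \\t]+", " ")
            .replaceAll(" ?\\n\\s*", "\n")
            .trim();
    }

    /** Step 3: the text to send back to the user, for example as text/plain. */
    String text() {
        return toPlainText(html());
    }
}
```

In a servlet container, the buffering part would presumably live in a subclass of `HttpServletResponseWrapper` whose `getWriter()` returns the buffered writer, with the JSP rendered against that wrapper instead of the real response. That wiring needs the servlet API and is not shown here.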


any suggestions would be greatly appreciated

Thanks




Eric Pascarello wrote:"content was too big" What is too big?

Have you gone through basic Ajax tutorials on how to send POST data to the server and display a result on the page? If you have not, it might be a good time to try that rather than jumping right into a big task.

Eric
 
Eric Pascarello
author
Rancher
Posts: 15385
6
Why are you converting it to a text file?

I would request the page that you are grabbing the contents for and output back the response to the client. No need to store it in a text file unless you plan on using that at some later time and do not want to fetch it again.
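
One possible reading of this suggestion is that the server itself requests the page and relays the body. A minimal sketch of that reading, assuming Java 11 or later for `java.net.http` and a container that tracks sessions with the standard `JSESSIONID` cookie; `PageFetcher`, the URL, and the session id in the usage note are all hypothetical:

```java
import java.io.IOException;
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

/** Hypothetical helper that re-requests a page on behalf of the current user. */
final class PageFetcher {

    /** Builds a GET for the page, forwarding the session cookie so the
     *  dynamic (session-dependent) content renders the same way. */
    static HttpRequest buildRequest(String pageUrl, String sessionId) {
        return HttpRequest.newBuilder(URI.create(pageUrl))
            .header("Cookie", "JSESSIONID=" + sessionId)
            .GET()
            .build();
    }

    /** Sends the request and returns the rendered HTML as a String. */
    static String fetch(String pageUrl, String sessionId)
            throws IOException, InterruptedException {
        HttpResponse<String> response = HttpClient.newHttpClient()
            .send(buildRequest(pageUrl, sessionId), HttpResponse.BodyHandlers.ofString());
        return response.body();
    }
}
```

A call such as `PageFetcher.fetch("http://localhost:8080/app/report.do", sessionId)` would then hand back markup that can be stripped and returned. Note that this costs a second full request per page view.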

Eric
 
Dean O'olish
Greenhorn
Posts: 16
Our client wants to view the JSP as a text file (I don't know why, but I have to build it) with the HTML stripped.

How do you propose that I call the JSP and get hold of the rendered content?


Eric Pascarello wrote:Why are you converting it to a text file?

I would request the page that you are grabbing the contents for and output back the response to the client. No need to store it in a text file unless you plan on using that at some later time and do not want to fetch it again.

Eric
 
Eric Pascarello
author
Rancher
Posts: 15385
6
If you are going to request the file via Ajax, it is not going to be a text file in the responseText.

Eric
 
Dean O'olish
Greenhorn
Posts: 16
Eric,

As I've stated, I'm not doing that anymore; I'm now trying to get the information from the request.

Eric Pascarello wrote:If you are going to request the file via Ajax, it is not going to be a text file in the responseText.

Eric
 
Eric Pascarello
author
Rancher
Posts: 15385
6
Off-topic: can I ask why you type your response before the quoted text? It is a rather backward convention.

Eric
 
Dean O'olish
Greenhorn
Posts: 16
Eric Pascarello wrote:Off-topic: can I ask why you type your response before the quoted text? It is a rather backward convention.

Eric


Easier to read
 
Bear Bibeault
Author and ninkuma
Marshal
Pie
Posts: 64718
86
IntelliJ IDE Java jQuery Mac Mac OS X
Dean O'olish wrote:Easier to read

For who? Certainly not the typical reader of these forums.
 
Dean O'olish
Greenhorn
Posts: 16
Yeah, I enjoy scrolling down to the bottom of a long thread just to read a one-line comment... It's a lot easier on top, Bear.

Bear Bibeault wrote:
Dean O'olish wrote:Easier to read

For who? Certainly not the typical reader of these forums.
 
Christophe Verré
Sheriff
Posts: 14691
16
Eclipse IDE Ubuntu VI Editor
Wouldn't it be even easier not to quote the previous post at all?
 
David Newton
Author
Rancher
Posts: 12617
IntelliJ IDE Ruby
A: Because it messes up the order in which people normally read text.
Q: Why is top-posting such a bad thing?
A: Top-posting.
Q: What is the most annoying thing in e-mail?

Top-posting is wrong, and always has been.

On-topic: use a filter; escape response, change header.
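
The on-topic advice is terse. One plausible reading is that a servlet filter captures the response and then either escapes the markup so the browser shows it literally, or changes the Content-Type header to `text/plain`. The sketch below covers only the escaping half, using JDK classes; `HtmlEscaper` is a hypothetical name, and the filter wiring (which needs the servlet API) is left out.

```java
/** Hypothetical helper: makes markup display literally instead of rendering. */
final class HtmlEscaper {

    /** Replaces the characters that HTML treats as markup with entities. */
    static String escape(String html) {
        StringBuilder out = new StringBuilder(html.length() + 16);
        for (int i = 0; i < html.length(); i++) {
            char c = html.charAt(i);
            switch (c) {
                case '&': out.append("&amp;"); break;
                case '<': out.append("&lt;"); break;
                case '>': out.append("&gt;"); break;
                case '"': out.append("&quot;"); break;
                default:  out.append(c);
            }
        }
        return out.toString();
    }
}
```

For example, `HtmlEscaper.escape("<b>Tom & \"Jerry\"</b>")` returns `&lt;b&gt;Tom &amp; &quot;Jerry&quot;&lt;/b&gt;`. Escaping shows the user the raw source, tags included, so it may not satisfy the "HTML stripped" requirement on its own.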
 