GeeCON Prague 2014*

Indexing dynamically created pages (Javascript) and Flash contents

Raffaele Sgarro
Greenhorn

Joined: Nov 05, 2009
Posts: 5
Hello folks,
I need to extract data from complex websites whose pages are mostly generated by JavaScript, and also to index some Flash content on those sites. What approach do you suggest? I am very comfortable with Java programming: is there a web-scraping framework written in Java? Is it necessary to embed a web browser (Mozilla)?

Best regards =)
Rob Spoor
Sheriff

Joined: Oct 27, 2005
Posts: 19697
    

Can you read a single web page? If so, the next step is to take its contents and parse them. Search this forum, Other JSE/JEE APIs, and Java in General for information on how to parse a web page. You need all src and href attributes to start with, and possibly others as well. I once wrote a link checker that could recursively check the links on a web site, so it shares the basic principle: take a web page and retrieve all links from it. Your program just needs to download them all.
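As a rough illustration of the "retrieve all src and href attributes" step, here is a minimal sketch using only the standard library. The class name LinkExtractor and its extract method are hypothetical, and the regular expression only handles quoted attribute values; a real HTML parser would be more robust than this approximation.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical helper: collects href and src attribute values from raw HTML. */
final class LinkExtractor {

    // Matches href="..." or src='...' (quoted values only), case-insensitively.
    // Assumption: a regex is "good enough" here; it is not a full HTML parser.
    private static final Pattern ATTR = Pattern.compile(
            "\\b(?:href|src)\\s*=\\s*([\"'])(.*?)\\1",
            Pattern.CASE_INSENSITIVE | Pattern.DOTALL);

    /** Returns the attribute values in the order they appear in the markup. */
    static List<String> extract(String html) {
        List<String> links = new ArrayList<>();
        Matcher m = ATTR.matcher(html);
        while (m.find()) {
            links.add(m.group(2)); // group 2 is the text between the quotes
        }
        return links;
    }

    public static void main(String[] args) {
        String html = "<a href=\"page2.html\">next</a><img src='logo.png'>";
        System.out.println(extract(html));
    }
}
```

A recursive checker would resolve each value against the page URL, download it, and repeat, keeping a set of visited URLs to avoid loops.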


SCJP 1.4 - SCJP 6 - SCWCD 5 - OCEEJBD 6
Raffaele Sgarro
Greenhorn

Joined: Nov 05, 2009
Posts: 5
Unfortunately, my problem is not so simple.
My HTML document is simply a bunch of <script> tags and some <object>s. The page is generated by JavaScript in "browser space", so I need some sort of JavaScript engine capable of creating and manipulating the DOM; then I should parse the DOM objects rather than the raw HTML.
Consider a website consisting of a single HTML document, mysite.com/index.php.
All <a href="javascript:void(0)"> elements in that document are bound to some JavaScript function, so navigation is actually a sequence of asynchronous calls. I need an engine capable of executing that code. Mozilla? Does anyone have experience with (XUL? XPCOM?) bindings?
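To make the obstacle concrete, here is a minimal sketch of how a plain crawler could at least detect that a link cannot be followed without executing script. The class ScriptLinkFilter and the method isScriptOnly are hypothetical names, and treating an empty href or "#" as script-only is my own assumption, not something the site guarantees.

```java
import java.util.Locale;

/** Hypothetical helper: tells followable links apart from script-only ones. */
final class ScriptLinkFilter {

    /**
     * Returns true when the href gives a crawler nothing to download:
     * a javascript: pseudo-URL, or (by assumption) a missing, empty or "#" href.
     */
    static boolean isScriptOnly(String href) {
        if (href == null) {
            return true;
        }
        String h = href.trim().toLowerCase(Locale.ROOT);
        return h.isEmpty() || h.equals("#") || h.startsWith("javascript:");
    }

    public static void main(String[] args) {
        System.out.println(isScriptOnly("javascript:void(0)"));
        System.out.println(isScriptOnly("index.php?page=2"));
    }
}
```

If every anchor on the page falls into the script-only group, static link extraction yields nothing to download, which is why the script itself has to be executed.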
Also, there are pages made of a single Flash (SWF) GUI. How do I interact with it? How do I navigate through its "menus" and finally retrieve the information I need?
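For the Flash part, one possible first step is simply recognising a downloaded file as SWF and checking whether its body is compressed. The sketch below reads only the fixed 8-byte header (three signature bytes, one version byte, a little-endian 32-bit length); the class name SwfHeader is hypothetical, and this does not interact with the GUI or extract any text.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;
import java.nio.charset.StandardCharsets;

/** Hypothetical helper: reads the fixed 8-byte header of a SWF file. */
final class SwfHeader {
    final String signature; // "FWS" (uncompressed), "CWS" (zlib) or "ZWS" (LZMA)
    final int version;      // SWF version byte
    final long fileLength;  // length of the uncompressed file, in bytes

    private SwfHeader(String signature, int version, long fileLength) {
        this.signature = signature;
        this.version = version;
        this.fileLength = fileLength;
    }

    static SwfHeader parse(byte[] data) {
        if (data == null || data.length < 8) {
            throw new IllegalArgumentException("need at least 8 bytes");
        }
        String sig = new String(data, 0, 3, StandardCharsets.US_ASCII);
        if (!sig.equals("FWS") && !sig.equals("CWS") && !sig.equals("ZWS")) {
            throw new IllegalArgumentException("not a SWF signature: " + sig);
        }
        int version = data[3] & 0xFF;
        // Bytes 4..7 hold the length as an unsigned little-endian 32-bit integer.
        long length = ByteBuffer.wrap(data, 4, 4)
                .order(ByteOrder.LITTLE_ENDIAN)
                .getInt() & 0xFFFFFFFFL;
        return new SwfHeader(sig, version, length);
    }

    /** True when everything after the 8-byte header is compressed. */
    boolean isCompressed() {
        return !signature.equals("FWS");
    }
}
```

Getting at the strings and menus inside the movie would still require decompressing the body and walking its tags, which this sketch does not attempt.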
 