
Indexing dynamically created pages (Javascript) and Flash contents

 
Raffaele Sgarro
Greenhorn
Posts: 5
Hello folks
I need to extract some data from complex websites whose pages are mostly generated by JavaScript, and I also need to index some Flash content on those sites. What approach do you suggest? I am very comfortable with Java programming: is there any web-scraping framework written in Java? Is it necessary to embed a web browser (Mozilla)?

Best regards =)
 
Rob Spoor
Sheriff
Posts: 20495
Can you read a single web page? If so, the next step is taking its contents and parsing them. Do a search in this forum, Other JSE/JEE APIs and Java in General for information on how to parse a web page. You need all src and href attributes to start with, possibly others as well. I once wrote a link checker that could recursively check the links on a web site, so it shared a basic principle: take a web page and retrieve all links from it. Your program just needs to download them all.
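To make the "retrieve all src and href attributes" step concrete, here is a minimal sketch that uses only the standard library. The class name LinkExtractor, the method extractLinks and the example.com URLs are made up for illustration, and the regular expression is a deliberate simplification that only handles quoted attribute values; a real HTML parser would be more robust.

```java
import java.net.URI;
import java.util.ArrayList;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper for illustration; not the link checker mentioned above.
class LinkExtractor {

    // Matches href="..." or src='...' (quoted values only), ignoring case.
    // The lookbehind keeps attributes such as data-src from matching.
    private static final Pattern LINK_ATTRIBUTE = Pattern.compile(
            "(?<![\\w-])(?:href|src)\\s*=\\s*(?:\"([^\"]*)\"|'([^']*)')",
            Pattern.CASE_INSENSITIVE);

    // Returns the distinct link targets found in html, resolved against base,
    // in the order they first appear.
    static List<String> extractLinks(String html, URI base) {
        Set<String> links = new LinkedHashSet<>();
        Matcher matcher = LINK_ATTRIBUTE.matcher(html);
        while (matcher.find()) {
            String value = matcher.group(1) != null ? matcher.group(1) : matcher.group(2);
            value = value.trim();
            // Skip empty values and javascript: pseudo-links, which cannot be downloaded.
            if (value.isEmpty() || value.regionMatches(true, 0, "javascript:", 0, 11)) {
                continue;
            }
            try {
                links.add(base.resolve(value).toString());
            } catch (IllegalArgumentException e) {
                // Not a syntactically valid URI; ignore it in this sketch.
            }
        }
        return new ArrayList<>(links);
    }

    public static void main(String[] args) {
        String html = "<a href=\"page2.html\">next</a><img src='/img/logo.png'>";
        for (String link : extractLinks(html, URI.create("http://example.com/index.html"))) {
            System.out.println(link);
        }
    }
}
```

A recursive checker would then download each returned URL and run the same extraction on it, keeping a set of visited URLs so that it does not loop. Note that this only sees links present in the downloaded markup; anything a script adds later in the browser is invisible to it.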
 
Raffaele Sgarro
Greenhorn
Posts: 5
Unfortunately, my problem is not so simple.
My HTML document is simply a bunch of <script> tags and some <object>s. The page is generated by JavaScript in "browser space", so I need some sort of JavaScript engine capable of creating and manipulating the DOM; then I should parse the DOM objects rather than the raw HTML.
Consider a website consisting of a single HTML document, mysite.com/index.php.
All <a href="javascript:void(0)">s in that document are bound to some JavaScript function, so navigation is actually a sequence of asynchronous calls. I need an engine capable of executing that code... Mozilla? Does anyone have experience with (XUL? XPCOM?) bindings?
Also, there are pages made of a single Flash (SWF) GUI. How do I interact with it? How do I navigate through its "menus" and finally retrieve the information I need?
 