Indexing dynamically created pages (Javascript) and Flash contents

Raffaele Sgarro

Joined: Nov 05, 2009
Posts: 5
Hello folks
I need to extract some data from complex websites whose pages are mostly generated by JavaScript, and also to index some Flash content on those sites. What approach do you suggest? I am very comfortable with Java programming: is there any web-scraping framework written in Java? Is it necessary to embed a web browser (Mozilla)?

Best regards =)
Rob Spoor

Joined: Oct 27, 2005
Posts: 19655

Can you read a single web page? If so, the next step is to take its contents and parse them. Search this forum, Other JSE/JEE APIs and Java in General for information on how to parse a web page. You need all src and href attributes to start with, possibly others as well. I once wrote a link checker that could recursively check the links on a web site, so it shared a basic principle: take a web page and retrieve all links from it. Your program just needs to download them all.
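
A minimal sketch of that first step, assuming the page has already been downloaded into a String. The class name LinkExtractor and the regular expression are illustrative only, not the link checker mentioned above. It collects the quoted values of href and src attributes, and it only sees what is in the static HTML text.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper: collects href/src attribute values from raw HTML text.
class LinkExtractor {

    // Deliberately naive: matches href="..." or src='...' with quoted values only.
    // Group 1 is the opening quote, group 2 is the attribute value.
    private static final Pattern LINK_ATTR = Pattern.compile(
            "\\b(?:href|src)\\s*=\\s*([\"'])(.*?)\\1",
            Pattern.CASE_INSENSITIVE | Pattern.DOTALL);

    // Returns the attribute values in the order they appear in the document.
    static List<String> extractLinks(String html) {
        List<String> links = new ArrayList<>();
        Matcher m = LINK_ATTR.matcher(html);
        while (m.find()) {
            links.add(m.group(2));
        }
        return links;
    }

    public static void main(String[] args) {
        String html = "<a href=\"page2.html\">next</a><img src='logo.png'>";
        for (String link : extractLinks(html)) {
            System.out.println(link);
        }
    }
}
```

For anything beyond a quick check, a real HTML parser is a safer choice than a regular expression, since unquoted attributes and links inside comments or scripts are not handled here.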

Raffaele Sgarro

Joined: Nov 05, 2009
Posts: 5
Unfortunately, my problem is not that simple.
My HTML document is simply a bunch of <script> tags and some <object>s... The page is generated by JavaScript in "browser space", so I need some sort of JavaScript engine capable of creating and manipulating the DOM; then I would parse the DOM objects rather than the raw HTML.
Consider a website consisting of a single HTML document.
All <a href="javascript:void(0)">s in that document are bound to some JavaScript function, so navigation is actually a sequence of asynchronous calls. I need an engine capable of executing that code... Mozilla? Any experience with (XUL? XPCOM?) bindings?
Also, there are pages made of a single Flash (SWF) GUI... How do I interact with it? How do I navigate through its "menus" and finally retrieve the information I need?