crawler

Sunil Baboo
Greenhorn

Joined: Aug 12, 2010
Posts: 16
Can a web crawler extract the HTML source of a URL that only works in a cookie-enabled browser?
Lester Burnham
Rancher

Joined: Oct 14, 2008
Posts: 1337
Some can, some can't, so it depends on which one in particular you're asking about. If you haven't settled on a specific one, start here: http://java-source.net/open-source/crawlers
Sunil Baboo
Greenhorn

Joined: Aug 12, 2010
Posts: 16
Hi,

Thanks for your quick response. Well, it's java spider. I know it works great for other web links, but I am not sure whether it can work for a link that requires a cookie-enabled browser.


Thank you


Lester Burnham
Rancher

Joined: Oct 14, 2008
Posts: 1337
What is "java spider" - some software you wrote? Downloaded? Bought?
Naren Chivukula
Ranch Hand

Joined: Feb 03, 2004
Posts: 576

@Lester
"Well, it's java spider."

Sunil's referring to JSpider in the link you provided.


Cheers,
Naren
(OCEEJBD6, SCWCD5, SCDJWS, SCJP1.4 and Oracle SQL 1Z0-051)
Sunil Baboo
Greenhorn

Joined: Aug 12, 2010
Posts: 16
Actually, my concern is:
I have an application that crawls web links. If a link requires login, it logs in as well. But some sites require a cookie-enabled browser. When I attempt to log in to such a site, I am always sent back to the login page. In my application, the crawler first requests the login page, retrieves the cookie information from the response headers, and then requests another page after login, attaching the cookies the server sent. This has been working for most sites, but not for sites that ask for a cookie-enabled browser. Am I missing something that would let my application crawl those pages after login?

Thanks in advance.
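
For illustration, the cookie step described above (read the cookies the server sends, attach them to the next request) might look like the following. This is only a sketch, not the code from the post: the class name `CookieHeaderSketch`, the method `toCookieHeader`, and the sample header values are all made up. It uses `java.net.HttpCookie` from the JDK to parse the `Set-Cookie` values.

```java
import java.net.HttpCookie;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.StringJoiner;

// Hypothetical helper for illustration; not part of any crawler library.
class CookieHeaderSketch {

    // Converts the Set-Cookie values of one response into the value of the
    // Cookie header for the next request, e.g. "name1=value1; name2=value2".
    // Attributes such as Path or HttpOnly are dropped, because a client
    // sends back only the name=value pairs.
    static String toCookieHeader(List<String> setCookieValues) {
        Map<String, String> cookies = new LinkedHashMap<>();
        for (String headerValue : setCookieValues) {
            for (HttpCookie cookie : HttpCookie.parse(headerValue)) {
                // A later cookie with the same name replaces the earlier one.
                cookies.put(cookie.getName(), cookie.getValue());
            }
        }
        StringJoiner header = new StringJoiner("; ");
        for (Map.Entry<String, String> entry : cookies.entrySet()) {
            header.add(entry.getKey() + "=" + entry.getValue());
        }
        return header.toString();
    }

    public static void main(String[] args) {
        // Sample values, as a login page might send them (assumed, not real).
        List<String> fromLoginPage = Arrays.asList(
                "JSESSIONID=abc123; Path=/; HttpOnly",
                "lang=en; Path=/");
        System.out.println(toCookieHeader(fromLoginPage));
        // prints: JSESSIONID=abc123; lang=en
    }
}
```

With `HttpURLConnection`, the resulting string would go into the next request with `setRequestProperty("Cookie", ...)`. One thing that may be worth checking, as a guess only: if the login response is a redirect that the connection follows automatically, cookies set on that intermediate response never reach code like this, and the site sends you back to the login page.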
Amit Ghorpade
Bartender

Joined: Jun 06, 2007
Posts: 2716
    

Sites that need cookies enabled usually store the session ID in a cookie and do not fall back to URL-based session IDs (URL encoding).
So if your client does not handle cookies, you don't get a session, and hence you get the login page again.
Hope that helps


SCJP, SCWCD.
|Asking Good Questions|
Naren Chivukula
Ranch Hand

Joined: Feb 03, 2004
Posts: 576

Hi Sunil,
Ideally, web applications should handle sessions with URL rewriting when cookies are not enabled. But it seems the sites you are accessing do not do that. You can get into the sites you are unable to log in to by enabling cookies in your browser.
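
To show what URL rewriting means here: a servlet container that falls back to it carries the session ID in the URL itself, as a `;jsessionid=...` path parameter, instead of in a cookie. The sketch below only imitates that URL shape so you can recognise it in crawled links. The class `UrlRewritingSketch` and the method `withSessionId` are hypothetical names, and a real container's `HttpServletResponse.encodeURL` does more checking than this.

```java
// Hypothetical helper for illustration; it only imitates the URL shape that
// servlet-style URL rewriting produces.
class UrlRewritingSketch {

    // Inserts ";jsessionid=<id>" at the end of the path, before any query
    // string or fragment.
    static String withSessionId(String url, String sessionId) {
        int cut = url.length();
        int query = url.indexOf('?');
        int fragment = url.indexOf('#');
        if (query >= 0) {
            cut = query;
        }
        if (fragment >= 0 && fragment < cut) {
            cut = fragment;
        }
        return url.substring(0, cut) + ";jsessionid=" + sessionId + url.substring(cut);
    }

    public static void main(String[] args) {
        // example.com and the session ID are placeholders.
        System.out.println(withSessionId("http://example.com/account?page=2", "abc123"));
        // prints: http://example.com/account;jsessionid=abc123?page=2
    }
}
```

If the links on a site never look like this when cookies are off, the site most likely relies on cookies alone for its sessions.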
Sunil Baboo
Greenhorn

Joined: Aug 12, 2010
Posts: 16
Hello Amit,

Yes, you are right: the sites that need cookies enabled store the session in a cookie, so the crawler fails on sites that demand a cookie-enabled browser. Am I right?

I know that the crawler does not have cookies enabled. Can we enable cookies on the crawler in code? Something like a crawler with an embedded web browser?
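
One possible way to enable cookies in code, assuming the crawler makes its requests through `java.net.HttpURLConnection` (the thread does not say what it uses): since Java 6 the JDK has `java.net.CookieManager`, which stores cookies from responses and sends them back on later requests once it is installed as the default `CookieHandler`. The class `CrawlerCookies` and its methods below are made-up names, and this is a sketch rather than a tested fix for any particular site.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.net.CookieHandler;
import java.net.CookieManager;
import java.net.CookiePolicy;
import java.net.HttpURLConnection;
import java.net.URI;
import java.nio.charset.StandardCharsets;

// Hypothetical class for illustration; assumes the crawler uses HttpURLConnection.
class CrawlerCookies {

    // Installs a JVM-wide cookie store. After this call, every
    // HttpURLConnection keeps the cookies it receives and sends them back on
    // later requests to the same site, including across redirects.
    static CookieManager enableCookies() {
        // null = default in-memory store; ACCEPT_ALL = keep every cookie.
        CookieManager manager = new CookieManager(null, CookiePolicy.ACCEPT_ALL);
        CookieHandler.setDefault(manager);
        return manager;
    }

    // Downloads a page as text; cookies are handled by the manager above.
    static String fetch(String url) throws IOException {
        HttpURLConnection conn =
                (HttpURLConnection) URI.create(url).toURL().openConnection();
        StringBuilder page = new StringBuilder();
        try (BufferedReader in = new BufferedReader(
                new InputStreamReader(conn.getInputStream(), StandardCharsets.UTF_8))) {
            String line;
            while ((line = in.readLine()) != null) {
                page.append(line).append('\n');
            }
        } finally {
            conn.disconnect();
        }
        return page.toString();
    }
}
```

The crawler would call `CrawlerCookies.enableCookies()` once at startup, then request the login page, submit the login, and fetch the protected pages through the same JVM. The cookies collected so far can be inspected with `getCookieStore().getCookies()` on the returned manager. A crawler built on a different HTTP client would need that client's own cookie support instead.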
 