
crawler

 
Sunil Baboo
Greenhorn
Posts: 16
Can a web crawler extract the HTML source of a URL that only works in a cookie-enabled browser?
 
Lester Burnham
Rancher
Posts: 1337
Some can, some can't, so it depends on which one in particular you're asking about. If you haven't settled on a specific one, start here: http://java-source.net/open-source/crawlers
 
Sunil Baboo
Greenhorn
Posts: 16
Hi,

Thanks for your quick response. Well, it's java spider. I know it works great for other web links, but I am not sure it can work for links that require a cookie-enabled browser.


Thank you


 
Lester Burnham
Rancher
Posts: 1337
What is "java spider" - some software you wrote? Downloaded? Bought?
 
Naren Chivukula
Ranch Hand
Posts: 577
@Lester
Well, its java spider.

Sunil's referring to JSpider in the link you provided.
 
Sunil Baboo
Greenhorn
Posts: 16
Actually, my concern is this:
I have an application that crawls web links. If a link requires a login, the crawler logs in as well. But some sites require a cookie-enabled browser, and when I try to log in to those sites I am always sent back to the login page. My crawler first requests the login page, reads the cookie information from the response headers, and then requests the next page after login with the cookies the server sent attached. This has worked for most sites, but not for sites that ask for a cookie-enabled browser. Am I missing something that would let my application crawl those pages after login?

Thanks in advance.
 
Amit Ghorpade
Bartender
Posts: 2854
Sites that need cookies enabled usually track the session through a cookie and do not support URL-based session IDs (URL rewriting).
So if your crawler doesn't handle cookies, you don't get a session, and hence you get the login page again.
Hope that helps.
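To make the mechanism above concrete, here is a minimal sketch of what "cookies enabled" means for a Java client, using the standard `java.net.CookieManager`. It does not touch the network: the `example.com` URLs and the `JSESSIONID=abc123` value are made up for illustration, and the class and method names are hypothetical.

```java
import java.net.CookieManager;
import java.net.CookiePolicy;
import java.net.URI;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class CookieJarSketch {

    // Simulates one response/request round trip through a cookie jar.
    // The URIs and the cookie value are placeholders, not a real site.
    static List<String> cookiesSentBack() throws Exception {
        CookieManager jar = new CookieManager(null, CookiePolicy.ACCEPT_ALL);

        // Pretend the login response carried this Set-Cookie header.
        URI login = URI.create("http://example.com/login");
        Map<String, List<String>> responseHeaders = new HashMap<>();
        responseHeaders.put("Set-Cookie", List.of("JSESSIONID=abc123; Path=/"));
        jar.put(login, responseHeaders);

        // Ask the jar which Cookie header values belong on the next request.
        URI next = URI.create("http://example.com/members");
        Map<String, List<String>> toSend = jar.get(next, new HashMap<>());
        return toSend.getOrDefault("Cookie", List.of());
    }

    public static void main(String[] args) throws Exception {
        System.out.println(cookiesSentBack());
    }
}
```

A client that never stores the `Set-Cookie` value, or that drops it on a later request, looks to the server like a browser with cookies disabled, which would explain being bounced back to the login page.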
 
Naren Chivukula
Ranch Hand
Posts: 577
Hi Sunil,
Ideally, web applications should fall back to URL rewriting for session handling when cookies are not enabled. But it seems the sites you are accessing don't do that. You can get into the sites you are unable to log in to by enabling cookies in your browser.
 
Sunil Baboo
Greenhorn
Posts: 16
Hello Amit,

Yes, you are right: the sites that need cookies enabled store the session in a cookie, so the crawler fails on sites that demand a cookie-enabled browser. Am I right?

I know that my crawler does not have cookies enabled. Can we enable cookies in a crawler through code, something like a crawler with an embedded web browser?
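
One possible way to enable cookies in code, without embedding a browser, is to give the HTTP client a cookie jar. The sketch below assumes Java 11 or later and the standard `java.net.http.HttpClient`; it is not how JSpider works internally. The class name, method names, URLs, and form body are hypothetical and would come from the site being crawled.

```java
import java.net.CookieManager;
import java.net.CookiePolicy;
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.nio.charset.StandardCharsets;

public class CookieAwareCrawler {

    // One client, one cookie jar: Set-Cookie headers from every response
    // are stored and replayed on later requests made through this client.
    static HttpClient newCookieAwareClient() {
        CookieManager jar = new CookieManager(null, CookiePolicy.ACCEPT_ALL);
        return HttpClient.newBuilder()
                .cookieHandler(jar)
                .followRedirects(HttpClient.Redirect.NORMAL)
                .build();
    }

    // Fetches the login page, posts the login form, then requests a
    // protected page, all with the same client so the cookies carry over.
    // loginUrl, formBody and pageUrl are placeholders supplied by the caller.
    static String loginAndFetch(HttpClient client, String loginUrl,
                                String formBody, String pageUrl) throws Exception {
        // Step 1: GET the login page so the server can set its first cookie.
        HttpRequest loginPage = HttpRequest.newBuilder(URI.create(loginUrl)).GET().build();
        client.send(loginPage, HttpResponse.BodyHandlers.discarding());

        // Step 2: POST the credentials; the jar adds the Cookie header.
        HttpRequest login = HttpRequest.newBuilder(URI.create(loginUrl))
                .header("Content-Type", "application/x-www-form-urlencoded")
                .POST(HttpRequest.BodyPublishers.ofString(formBody, StandardCharsets.UTF_8))
                .build();
        client.send(login, HttpResponse.BodyHandlers.discarding());

        // Step 3: GET the page that requires the logged-in session.
        HttpRequest page = HttpRequest.newBuilder(URI.create(pageUrl)).GET().build();
        return client.send(page, HttpResponse.BodyHandlers.ofString()).body();
    }
}
```

A call might look like `loginAndFetch(newCookieAwareClient(), "https://example.com/login", "user=me&pass=secret", "https://example.com/members")`, with all four values replaced by the real ones. Sites that set or check cookies from JavaScript would still need a real or headless browser, since this client does not run scripts.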
 