What resources would be required for a java based web crawler

Mohit G Gupta
Ranch Hand

Joined: May 18, 2010
Posts: 634

I am thinking of building a web crawler that fetches not the whole internet but only academic material: documents, Word files, and PowerPoint presentations.
So, what resources are required?
Can I implement it on my own PC, or would I need a separate machine for it?

OCPJP 6.0 93%
OCPJWCD 5.0 98%
Ulf Dittmer
Marshal

Joined: Mar 22, 2005
Posts: 41188
    
The first resource I would put to use is a web browser in order to google for existing crawlers, or head straight to java-source.net.


Ping & DNS - my free Android networking tools app
William Brogden
Author and all-around good cowpoke
Rancher

Joined: Mar 22, 2000
Posts: 12761
    
The standard Java library has all you need to get started.
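
As a rough illustration of that point, here is a minimal sketch that pulls links out of an HTML string using only java.util.regex and java.net.URI. The class name LinkExtractor and the method extractLinks are made up for this example, and the regular expression is only a heuristic stand-in for real HTML parsing; it will miss links that JavaScript builds at run time.

```java
import java.net.URI;
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class LinkExtractor {
    // Rough heuristic for href="..." or href='...' inside <a> tags; not a full HTML parser.
    private static final Pattern HREF =
            Pattern.compile("<a\\s[^>]*?href\\s*=\\s*[\"']([^\"']+)[\"']", Pattern.CASE_INSENSITIVE);

    /** Returns absolute http/https URLs for the links found in the given HTML. */
    public static List<String> extractLinks(String baseUrl, String html) {
        URI base = URI.create(baseUrl);
        List<String> links = new ArrayList<>();
        Matcher m = HREF.matcher(html);
        while (m.find()) {
            String href = m.group(1).trim();
            // Drop the fragment so page.html#a and page.html#b count as one page.
            int hash = href.indexOf('#');
            if (hash >= 0) {
                href = href.substring(0, hash);
            }
            if (href.isEmpty()) {
                continue;
            }
            try {
                URI resolved = base.resolve(href);
                String scheme = resolved.getScheme();
                // Keep only web links; this skips mailto:, javascript: and similar.
                if ("http".equals(scheme) || "https".equals(scheme)) {
                    links.add(resolved.toString());
                }
            } catch (IllegalArgumentException e) {
                // Skip hrefs that are not valid URIs.
            }
        }
        return links;
    }

    public static void main(String[] args) {
        String html = "<p><a href=\"/courses/cs101.pdf\">Notes</a> "
                + "<a href='http://example.org/x'>X</a></p>";
        System.out.println(extractLinks("http://example.edu/index.html", html));
    }
}
```

Downloading the page text is left out so the sketch runs offline; the standard library's java.net.URL can open a stream for that step.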

I feel that web crawling has gotten a lot more complicated as people create more complex pages using more JavaScript to dynamically build a page.

The "semantic web" represents an attempt to allow better tagging of resources in a more academic style.

If I were writing a web crawler now, I would use Google searches as a front-end to locate potential sites of interest.

Bill
Mohit G Gupta
Ranch Hand

Joined: May 18, 2010
Posts: 634

Thanks, William Brogden.
But how can the semantic web be useful for a web crawler?
And you suggested using Google as a front end.
How is that possible?
My main goal is to build a web crawler and then use it for a search engine that helps users find documents and presentations related to academics.

Please help, I am getting confused as web crawlers are new to me.
William Brogden
Author and all-around good cowpoke
Rancher

Joined: Mar 22, 2000
Posts: 12761
    
Google has all sorts of services for developers.

Semantic Web - the intent of the "semantic web" is to provide ways to tag resources (such as HTML pages) so that search engines do a better job. This is a huge topic, jump right in!

Due to the massive interconnectedness of the web, a web crawler running on a single computer gets bogged down quickly once it follows links about 4 or 5 levels deep. The computing power Google applies to continuous web crawling is the single most mind-boggling fact of the web today.

Crawling for specific topics may still be feasible but you will need a way to start in the most useful spots and to discard the connections which are less likely to be useful.
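
One way to picture this is a breadth-first crawl that starts from a handful of seed pages, stops a fixed number of links away from them, and throws away links that fail a relevance test before they are ever queued. The sketch below is only an illustration under those assumptions: FocusedCrawler, crawl, linksOf and looksUseful are hypothetical names, and the "web" in main is a tiny in-memory map standing in for real page fetches.

```java
import java.util.ArrayDeque;
import java.util.Arrays;
import java.util.Collections;
import java.util.Deque;
import java.util.HashMap;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.function.Function;
import java.util.function.Predicate;

public class FocusedCrawler {
    /**
     * Breadth-first crawl from the seeds. Only links accepted by looksUseful
     * are followed, and nothing more than maxDepth hops from a seed is visited.
     */
    public static Set<String> crawl(List<String> seeds,
                                    Function<String, List<String>> linksOf,
                                    Predicate<String> looksUseful,
                                    int maxDepth) {
        Set<String> visited = new LinkedHashSet<>(seeds);
        Deque<String> frontier = new ArrayDeque<>(visited);
        for (int depth = 0; depth < maxDepth && !frontier.isEmpty(); depth++) {
            Deque<String> next = new ArrayDeque<>();
            for (String url : frontier) {
                for (String link : linksOf.apply(url)) {
                    // Discard unpromising or already seen links before they grow the frontier.
                    if (looksUseful.test(link) && visited.add(link)) {
                        next.add(link);
                    }
                }
            }
            frontier = next;
        }
        return visited;
    }

    public static void main(String[] args) {
        // Stand-in for the real web: page -> links found on that page.
        Map<String, List<String>> web = new HashMap<>();
        web.put("a.edu", Arrays.asList("a.edu/cs", "ads.com"));
        web.put("a.edu/cs", Arrays.asList("a.edu/cs/notes", "a.edu"));
        web.put("a.edu/cs/notes", Arrays.asList("a.edu/cs/deep"));

        Set<String> found = crawl(
                Arrays.asList("a.edu"),
                url -> web.getOrDefault(url, Collections.<String>emptyList()),
                url -> url.contains(".edu"),   // toy relevance test
                2);
        System.out.println(found);
    }
}
```

The depth limit keeps the frontier from exploding, and the predicate is where a real crawler would put its topic test, for example checking the host, the file extension, or keywords in the link text.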

Bill
Mohit G Gupta
Ranch Hand

Joined: May 18, 2010
Posts: 634

code.google.com
I checked the site, but I was unable to find out how to use Google as a web crawler.
How can I use it as a front end?
Please help.
William Brogden
Author and all-around good cowpoke
Rancher

Joined: Mar 22, 2000
Posts: 12761
    
I was not trying to say that you could use a Google API "as a web crawler" - my suggestion is that you could find usable starting web addresses with a Google API based on the kind of academic topics you appear to be interested in.

You are certainly not going to be able to crawl the entire web, so it seems to me you would want to start on pages that are already in your area.
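
To make the idea of finding starting addresses concrete, here is a small sketch that only builds the request URL for a topic search. It assumes the Google Custom Search JSON API, with its customsearch/v1 endpoint and the key, cx, q and fileType query parameters; please check the current Google documentation before relying on any of that. The class SeedQuery, the method buildUrl, and the placeholder key and engine ID are invented for this example.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;

public class SeedQuery {
    // Assumed endpoint of the Custom Search JSON API; verify against the current docs.
    private static final String ENDPOINT = "https://www.googleapis.com/customsearch/v1";

    /** Builds a search URL for the topic, restricted to one file type such as "pdf" or "ppt". */
    public static String buildUrl(String apiKey, String engineId, String topic, String fileType) {
        try {
            return ENDPOINT
                    + "?key=" + URLEncoder.encode(apiKey, "UTF-8")
                    + "&cx=" + URLEncoder.encode(engineId, "UTF-8")
                    + "&q=" + URLEncoder.encode(topic, "UTF-8")
                    + "&fileType=" + URLEncoder.encode(fileType, "UTF-8");
        } catch (UnsupportedEncodingException e) {
            // Every Java platform is required to support UTF-8.
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        // YOUR_API_KEY and YOUR_ENGINE_ID are placeholders, not real credentials.
        System.out.println(buildUrl("YOUR_API_KEY", "YOUR_ENGINE_ID",
                "computer science lecture notes", "pdf"));
    }
}
```

As far as I know, the response is JSON whose result entries each carry a link field; those links would be the starting pages handed to the crawler, not a replacement for it.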

Bill

(Note the edit, "not trying to say" stupid fingers....)
Mohit G Gupta
Ranch Hand

Joined: May 18, 2010
Posts: 634

I checked the Google Custom Search API.
So should I use that one for the search engine?

I am making this for my final-year project.
So, is it sufficient?
Ulf Dittmer
Marshal

Joined: Mar 22, 2005
Posts: 41188
    
Shouldn't a school project involve some research of your own? It sounds a bit as if you're not researching much in between asking here.
Mohit G Gupta
Ranch Hand

Joined: May 18, 2010
Posts: 634

I have done research on web crawlers and got a suggestion to use a Google API,
but now I am unable to work out how to use it.
As William Brogden said:

I was not trying to say that you could use a Google API "as a web crawler" - my suggestion is that you could find usable starting web addresses with a Google API based on the kind of academic topics you appear to be interested in.

You are certainly not going to be able to crawl the entire web, so it seems to me you would want to start on pages that are already in your area.


Say I want everything related to computer science; how can this Google API help me?
I added a project on the Google Custom Search API.
 