How does Yandex do its trick?

Bill Thompson
Greenhorn

Joined: Jul 24, 2005
Posts: 6

I want to programmatically get the contents of a Yandex.com search result.

The problem is that the search page URL does not change when you run a search on yandex.com and advance to see more pages. It must be done somehow by JavaScript. Any ideas?
Jeanne Boyarsky
internet detective
Marshal

Joined: May 26, 2003
Posts: 29219
    

Bill,
Yes, it is done via JavaScript. Facebook does something similar. When you scroll down (or click show posts), it uses AJAX to fetch more data and paint it at the bottom of the page.

Why do you want to scrape Yandex specifically? I ask because Google provides an API for programmatic searching.


Bill Thompson
Greenhorn

Joined: Jul 24, 2005
Posts: 6
Jeanne Boyarsky wrote:
Bill,
Yes, it is done via JavaScript. Facebook does something similar. When you scroll down (or click show posts), it uses AJAX to fetch more data and paint it at the bottom of the page.

Why do you want to scrape Yandex specifically? I ask because Google provides an API for programmatic searching.


Each search engine provides a slightly different set of results. I have found that if you do not use multiple search engines, you will miss out. Google has results that Bing does not have and Yandex has results that Google and Bing do not have.

Over time, I have gotten tired of copying content out of browsers and pasting it into word processors for my records. So I decided to write a program.

I have figured out how I can modify the URL for Yandex to go to different pages. I was able to navigate to a particular page by adding a "&p=" parameter to the URL.
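
For anyone following along, here is a minimal Java sketch of the "&p=" approach described above. Only the "p" parameter comes from my own testing as reported in this post; the base URL, the "text" query parameter name, and the zero-based page numbering are assumptions made for illustration and should be checked against what the browser's address bar actually shows.

```java
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

public class PagedSearchUrls {

    // Assumed for illustration: the base URL and the query parameter name.
    // Only the "p" page parameter is taken from the post.
    static final String BASE = "https://yandex.com/search/";
    static final String QUERY_PARAM = "text";

    // Builds the URL for one results page of the given query.
    static String pageUrl(String query, int page) {
        String encoded = URLEncoder.encode(query, StandardCharsets.UTF_8);
        return BASE + "?" + QUERY_PARAM + "=" + encoded + "&p=" + page;
    }

    // Builds URLs for `count` consecutive pages starting at `firstPage`.
    static List<String> pageUrls(String query, int firstPage, int count) {
        List<String> urls = new ArrayList<>();
        for (int i = 0; i < count; i++) {
            urls.add(pageUrl(query, firstPage + i));
        }
        return urls;
    }

    public static void main(String[] args) {
        // Prints one URL per page; fetching them is a separate step.
        for (String url : pageUrls("java ranch", 0, 3)) {
            System.out.println(url);
        }
    }
}
```

Keeping URL construction separate from fetching makes it easy to check the generated URLs by hand in a browser before automating anything.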

The DuckDuckGo search engine does not work this way, though.

So far, I can programmatically get all the search results of 7 search engines. I want to get the content from DuckDuckGo next. But, like Yandex, it uses AJAX to fetch content without changing the URL of the search results page.

I have asked the DuckDuckGo people and they have not been helpful. I was surprised. They do not want people to have the ability to take content from their search results. Their response did not make a lot of sense. I asked them what sets them apart from other search engines that have a different business model and actually welcome users who programmatically take content. In fact, that is exactly what a meta search engine does. They have yet to respond.

But, as I think about this, their secretive approach is not in line with the spirit of the Internet. All HTML pages can be read. And I think they are actually abusing what AJAX is intended to do. AJAX does not exist so that people can be secretive and hide things.

Either I will just pass on including DuckDuckGo in my list of search engines, or I will go ahead and find a way to decipher the AJAX from their HTML page. Since I am doing all this for my own private use, I may try the second option before giving up.

This shouldn't be rocket science. I am able to see the HTML code. Shouldn't there be some sort of AJAX tag in the HTML that I can find to make use of?

The HTML content of DuckDuckGo does not have the HTML tags that I expect from AJAX, namely XMLHttpRequest and XMLHTTP.
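
One likely reason the search comes up empty: XMLHttpRequest is a JavaScript object rather than an HTML tag, and the code that uses it often lives in external script files that the page only references. The rough Java sketch below lists the script files a page points to and checks a piece of text for the two names mentioned above. The class name, method names, and sample HTML are made up for illustration, and the regular expression is a quick approximation, not a real HTML parser.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class ScriptSourceFinder {

    // Rough match for <script ... src="..."> or src='...'.
    // A regex is only an approximation of HTML parsing.
    private static final Pattern SCRIPT_SRC = Pattern.compile(
            "<script\\b[^>]*\\bsrc\\s*=\\s*[\"']([^\"']+)[\"']",
            Pattern.CASE_INSENSITIVE);

    // Returns the src values of all external scripts referenced by the page.
    static List<String> scriptSources(String html) {
        List<String> sources = new ArrayList<>();
        Matcher m = SCRIPT_SRC.matcher(html);
        while (m.find()) {
            sources.add(m.group(1));
        }
        return sources;
    }

    // True if the text mentions either of the names searched for in the post.
    static boolean mentionsXhr(String text) {
        return text.contains("XMLHttpRequest") || text.contains("XMLHTTP");
    }

    public static void main(String[] args) {
        // Made-up sample page: one external script, one inline script.
        String html = "<html><head><script src=\"/assets/app.js\"></script></head>"
                + "<body><script>var x = 1;</script></body></html>";
        System.out.println(scriptSources(html));
        System.out.println(mentionsXhr(html));
    }
}
```

Each script file found this way would have to be downloaded and searched separately, and minified code can make the calls hard to spot even then.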



Eric Pascarello
author
Rancher

Joined: Nov 08, 2001
Posts: 15376
    
Use Fiddler to see what HTTP request is sent to the server when you ask for more results. TADA, you know what they call to get the other pages.
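
Once Fiddler shows the request, replaying it from Java could look roughly like the sketch below, which uses the java.net.http client from Java 11 and later. The endpoint and the "q" and "page" parameter names are placeholders, not anything a real search engine is known to use; substitute whatever URL, parameters, and headers the capture actually shows.

```java
import java.net.URI;
import java.net.URLEncoder;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.nio.charset.StandardCharsets;

public class ReplayCapturedRequest {

    // Builds a GET request shaped like the one seen in the capture.
    // "q" and "page" are hypothetical parameter names.
    static HttpRequest buildRequest(String endpoint, String query, int page) {
        String q = URLEncoder.encode(query, StandardCharsets.UTF_8);
        URI uri = URI.create(endpoint + "?q=" + q + "&page=" + page);
        return HttpRequest.newBuilder(uri)
                .header("Accept", "text/html,application/json")
                .GET()
                .build();
    }

    // Sends the request and returns the response body as text.
    static String fetch(HttpRequest request) throws Exception {
        HttpClient client = HttpClient.newHttpClient();
        HttpResponse<String> response =
                client.send(request, HttpResponse.BodyHandlers.ofString());
        return response.body();
    }

    public static void main(String[] args) throws Exception {
        // Placeholder endpoint; replace with the URL seen in the capture.
        HttpRequest request =
                buildRequest("https://search.example.com/more", "java ranch", 2);
        System.out.println(request.method() + " " + request.uri());
        // Call fetch(request) to actually send it once the URL is real.
    }
}
```

The response may be JSON or an HTML fragment instead of a full page, so the parsing step depends on what the capture shows coming back.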
 