RQ2: Are there popular hosts, domains, and content types preferred by a certain search engine?
RQ3: Is there a difference between search engines regarding the results types presented?
RQ4: How many specially displayed results are on the first results page?
RQ6: What is the difference between search engines regarding the questions above?
Research design
Data collection
For our comparison, we selected four search engines based on the restriction that they should provide
their own index and that they are of significance (as expressed in a considerable market share). We
chose Google, Yahoo, MSN/Live.com, and Ask.com. This selection corresponds with our other
empirical studies on certain aspects of search engine quality [18, 19, 21, 22, 25].
We drew 500 queries from the 100k top queries and 500 from the last 100k queries of the long tail
of an Ask.com query log from July 2008. As we had access to the aggregated list of search queries with
their occurrences in the query log, it was possible to get a sample containing both popular and rare
queries. We sorted the list of all queries by frequency and alphabetically. Then, we selected every
two-hundredth query in each of the two blocks, which yields 500 queries per block. With that, we make
sure that we have a representative selection over the popular and over the very rare search queries in
the long tail.
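The sampling step described above can be sketched as follows. This is a minimal illustration, not the script used in the study: the function names are hypothetical, and it assumes that "by frequency and alphabetically" means descending frequency with alphabetical order as the tie-breaker, and that the log holds at least twice as many distinct queries as the block size.

```python
def rank_queries(counts):
    """Sort queries by descending frequency, breaking ties alphabetically.

    `counts` maps each query string to its number of occurrences.
    The tie-breaking rule is an assumption about the original procedure.
    """
    return sorted(counts, key=lambda q: (-counts[q], q))


def systematic_sample(ranked, step=200):
    """Return every `step`-th item: the 200th, 400th, ... for step=200."""
    return ranked[step - 1::step]


def sample_head_and_tail(counts, block=100_000, step=200):
    """Sample the `block` most frequent and the `block` least frequent queries."""
    ranked = rank_queries(counts)
    popular = systematic_sample(ranked[:block], step)
    rare = systematic_sample(ranked[-block:], step)
    return popular, rare
```

With a block of 100,000 queries and a step of 200, each block contributes 100,000 / 200 = 500 queries, which matches the sample sizes reported above.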
In the next step, we wrote a script to automatically download the search engines’ results pages. We
then developed programs for every search engine mentioned above to analyse the HTML code. We
extracted patterns in the HTML code, which gave us the ability to categorise every single result in
those pages into organic, paid advertisements, snippets, etc. Some problems occurred with Yahoo,
which blocked us after approximately 30 search queries. We then changed the program so that it sent
only 25 search requests and then waited 2 hours before the next request was started. Search
engines do not permit automated requests. Their policies state that they may block
a client permanently or temporarily whenever they detect such an automatic process. This is
understandable, since these major search engines have to protect themselves, especially the usage of
sponsored links. They have to interdict automatic requests by robots and scripts; otherwise, it would
be possible for one to write a robot that clicks on all sponsored links. We used the data for this
research project only, and we did not generate any automatic clicks on any links in the results screen.
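The batching described above (25 requests, then a 2-hour pause) might look roughly like the sketch below. It is an assumption-laden illustration rather than the original program: `fetch` is a hypothetical callable standing in for whatever downloads one results page, and the `sleep` parameter exists only so the pause can be replaced in a dry run.

```python
import time


def run_in_batches(queries, fetch, batch_size=25,
                   pause_seconds=2 * 60 * 60, sleep=time.sleep):
    """Call fetch(query) for every query, pausing after each full batch.

    No pause is inserted after the final query, since nothing follows it.
    `fetch` is a placeholder supplied by the caller; this sketch performs
    no network access itself.
    """
    queries = list(queries)
    results = {}
    for i, query in enumerate(queries, start=1):
        results[query] = fetch(query)
        if i % batch_size == 0 and i < len(queries):
            sleep(pause_seconds)
    return results
```

Passing a recording function as `sleep` makes it possible to check the pause schedule without waiting; any such collection still has to respect the search engines' usage policies.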
For Google, we needed to use a proxy in the United States. Google always tracks the IP of its users
and always brings up sponsored links of the country identified by the IP. Since we are located in
Germany, only German sponsored links would have come up if we did not use the proxy. That is why
we used a US proxy and the US web search interface for all search engines. A search from a different
country shows a different number of results: English search queries would not bring up as many
sponsored results in a German web interface, and vice versa. For every URL (sponsored or organic) on
Position of Organic Results: position of the URL within the set of organic results
Absolute Position Within Complete Set: the position considering the whole set of elements
(including additional results, such as sponsored results or shortcuts shown above the organic
results), while results position only considers results in the organic results set
Absolute Position of Adwords: position of sponsored results (above, under, or on the right side)
URL: all URLs of organic results have been extracted and most URLs² from sponsored results,
too.
Type of Result: as stated in Table 2, e.g., organic, sponsored, and shortcut (whenever we found a
shortcut, we also extracted the category of this shortcut, e.g., books, flight, and dictionary)
Ask 3D: Ask.com was slightly different, since we also have the information on which terms
triggered an Ask 3D result
Therefore, we are able to reconstruct all elements shown on the results list, including their position.
We can model the elements shown for the different screen and browser window sizes, respectively.
We extracted all data and shortcuts found in our examination. We will compare the results regarding
the top queries and the queries from the heavy tail.
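The per-URL fields listed above can be pictured as one record per element on the results page. The following is a minimal sketch under stated assumptions: the class and field names are hypothetical, and it assumes the elements of a page are available as a list in display order, which simplifies the layout of sponsored results shown on the right side.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class ResultEntry:
    url: Optional[str]                         # None where no URL was extracted
    result_type: str                           # e.g. "organic", "sponsored", "shortcut"
    absolute_position: int = 0                 # position within the complete set
    organic_position: Optional[int] = None     # position among organic results only
    shortcut_category: Optional[str] = None    # e.g. "books", "flight", "dictionary"
    sponsored_placement: Optional[str] = None  # "above", "under", or "right"


def assign_positions(entries: List[ResultEntry]) -> List[ResultEntry]:
    """Number every element, and separately number the organic results."""
    organic_rank = 0
    for absolute, entry in enumerate(entries, start=1):
        entry.absolute_position = absolute
        if entry.result_type == "organic":
            organic_rank += 1
            entry.organic_position = organic_rank
        else:
            entry.organic_position = None
    return entries
```

Keeping both positions on the same record shows how far additional elements push an organic result down: an organic result ranked first may sit in third place on the page when a sponsored result and a shortcut precede it.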
² Google’s Adwords had been masked by the proxy, which is why we did not extract the URLs of sponsored links for
Results
Since every search engine in our experiment presents different modules, as
discussed above, we will present the analysis of results screens for every search engine separately. We
will also analyse the most popular hosts, domains, top-level domains (TLDs), and file types. Those
will be compared directly. The overview of sponsored and organic results in popular and rare queries
The Google result set contains 499 popular search queries and 498 queries from the heavy tail.
Those search queries generated 12,522 results in total. The popular search queries come up with 6,731
results, and the rare ones, with 5,791. Only 16 rare search queries generated no results at all.
Yahoo produced 463 results sets for the popular queries and 492 for the rare queries,
respectively. In total, 9,436 results from Yahoo were processed, where 5,232 were generated from
popular queries, and 4,204, from heavy-tail queries. Sixty-four of these produced no results at all,
We obtained 11,752 results from the MSN search engine. All 500 popular search queries
produced results (6,685). Only 457 of the rare queries are valid; all other search queries had been
blocked or did not go through for some other reason. Of those rare queries, only 12 did not generate
From Ask.com, we obtained a total of 9,127 URLs. All popular queries were processed, while
2 rare queries could not be processed. Popular queries produced a total of 5,224 results, and rare
queries, a total of 3,903 results. Five popular queries led to an empty results set, while for the rare
queries, the number was 79. Table 3 gives an overview of the results sets of all search engines under
investigation. It also clearly shows how many sponsored links are on the first results screen.
Table 3: Overview of the Results Sets for the Search Engines Investigated
                               Google    Yahoo   MSN/Live      Ask
Valid popular queries             499      463        500      500
Valid rare queries                498      492        457      498
URLs in results screens        12,522    9,436     11,700    9,127
URLs (from popular)             6,731    5,232      6,685    5,224
URLs (from rare)                5,791    4,204      5,065    3,903
Organic URLs                    9,641    8,454      9,177    8,183
Organic URLs (from popular)     5,041    4,543      4,996    4,661