
Scribd Users

The document discusses a research study that compared six major search engines across six research questions. It outlines the research design, data collection process, and results. 500 popular and 500 long-tail search queries were executed on each search engine and the results pages were analyzed. Data on sponsored links, result types, domains, and shortcuts were collected from over 100,000 results overall. Key findings included the number of results obtained from each search engine and breakdown of organic versus sponsored links for popular versus rare queries.

Uploaded by

kmrfrom
Copyright
© © All Rights Reserved

RQ1: How many sponsored links are on the results screen?

RQ2: Are there popular hosts, domains, and content types preferred by a certain search engine?

RQ3: Is there a difference between search engines regarding the results types presented?

RQ4: How many specially displayed results are on the first results page?

RQ5: To what extent are shortcuts used on the results pages?

RQ6: What is the difference between search engines regarding the questions above?

Research design

Data collection

For our comparison, we selected four search engines on the condition that each maintains its own index and is of significance (as expressed in a considerable market share). We chose Google, Yahoo, MSN/Live.com, and Ask.com. This selection corresponds with our other empirical studies on certain aspects of search engine quality [18, 19, 21, 22, 25].

We drew 500 queries from the 100,000 most frequent queries and 500 from the 100,000 least frequent (long-tail) queries in an Ask.com query log from July 2008. As we had access to the aggregated list of search queries with their occurrences in the query log, it was possible to get a sample containing both popular and rare queries. We sorted the list of all queries by frequency and alphabetically. Then, we selected every two-hundredth query from each of the two ranges. This ensures a representative selection over both the popular queries and the very rare queries in the long tail.
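The sampling step described above can be illustrated with a short sketch. It is not the authors' script: the function name `systematic_sample`, the tie-breaking rule (descending frequency first, then alphabetical order), and the input format (a mapping from query string to frequency) are assumptions made for illustration only.

```python
def systematic_sample(query_counts, step=200, block=100_000):
    """Pick every `step`-th query from the top and bottom `block` of a ranked list.

    query_counts maps each query string to its frequency in the log
    (an assumed input format, not the format of the original log).
    """
    # Assumed ordering: most frequent first, ties broken alphabetically.
    ranked = sorted(query_counts.items(), key=lambda item: (-item[1], item[0]))
    queries = [query for query, _ in ranked]

    head = queries[:block]    # the most frequent queries
    tail = queries[-block:]   # the least frequent (long-tail) queries

    # Every `step`-th element: positions step, 2*step, ... (1-based).
    popular = head[step - 1::step]
    rare = tail[step - 1::step]
    return popular, rare


# Toy example: 40 queries, q00 the most frequent and q39 the least frequent.
toy_log = {f"q{i:02d}": 40 - i for i in range(40)}
toy_popular, toy_rare = systematic_sample(toy_log, step=5, block=20)
```

With the default values, a block of 100,000 queries and a step of 200 gives 100,000 / 200 = 500 queries per block, which matches the sample sizes reported in the text.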

In the next step, we wrote a script to automatically download the search engines’ results pages. We then developed a program for each of the search engines mentioned above to analyse the HTML code. We extracted patterns in the HTML code that allowed us to categorise every single result on those pages as organic, paid advertisement, snippet, etc. Some problems occurred with Yahoo, which blocked us after approximately 30 search queries. We therefore changed the program so that it sent only 25 search requests and then waited 2 hours before starting the next request. Search engines do not permit machine requests. They state in their policies that they may block a user permanently or temporarily whenever they detect such an automatic process. This is understandable, since these major search engines have to protect themselves, especially with regard to sponsored links. They have to prohibit automatic requests by robots and scripts; otherwise, it would be possible to write a robot that clicks on all sponsored links. We used the data for this research project only, and we did not generate any automatic clicks on any links in the results screen.
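The batching scheme described above (25 requests, then a 2-hour pause) might look roughly like the following. This is a minimal sketch under stated assumptions, not the original program: `fetch_in_batches` is a hypothetical name, and `fetch` is a placeholder callable supplied by the caller that returns the results page for one query. The sketch itself performs no network access.

```python
import time


def fetch_in_batches(queries, fetch, batch_size=25,
                     pause_seconds=2 * 60 * 60, sleep=time.sleep):
    """Call fetch(query) for each query, pausing after every full batch.

    `fetch` and `sleep` are injected so the pacing logic can be tried out
    without real requests or real waiting (both are assumptions of this sketch).
    """
    pages = {}
    for index, query in enumerate(queries):
        # Pause before starting each new batch, but not before the first one.
        if index > 0 and index % batch_size == 0:
            sleep(pause_seconds)
        pages[query] = fetch(query)
    return pages


# Dry run with a fake fetcher and a sleep function that only records pauses.
recorded_pauses = []
demo_pages = fetch_in_batches(
    [f"query {n}" for n in range(60)],
    fetch=lambda q: f"<html>results for {q}</html>",
    sleep=recorded_pauses.append,
)
```

In the dry run, 60 queries are split into batches of 25, 25, and 10, so the recording sleep function is called twice with 7,200 seconds each.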

For Google, we needed to use a proxy in the United States. Google always tracks the IP address of its users and shows the sponsored links for the country identified by that address. Since we are located in Germany, only German sponsored links would have come up had we not used the proxy. That is why we used a US proxy and the US web search interface for all search engines. Searching through a different country’s interface yields a different number of results: English search queries would not bring up as many sponsored results in a German web interface, and vice versa.

For every URL (sponsored or organic) on the results pages, we stored the following information:

• Position of Organic Results: position of the URL within the set of organic results
• Absolute Position Within Complete Set: position of the URL within the whole set of elements, including additional results such as sponsored results or shortcuts shown above the organic results (the organic position, by contrast, counts organic results only)
• Absolute Position of Adwords: position of sponsored results (above, below, or to the right of the organic results)
• URL: all URLs of organic results were extracted, and most URLs of sponsored results as well (see footnote 2)
• Type of Result: as stated in Table 2, e.g., organic, sponsored, or shortcut (whenever we found a shortcut, we also extracted its category, e.g., books, flight, or dictionary)
• Ask 3D: Ask.com was slightly different, since we also have the information on which terms triggered an Ask 3D result
Therefore, we are able to reconstruct all elements shown on the results list, including their position. We can model the elements shown for different screen and browser window sizes. We extracted all data and shortcuts found in our examination. We will compare the results for the top queries with those for the queries from the heavy tail.
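The difference between the organic position and the absolute position of a result can be made concrete with a small sketch. The record layout below is an assumption for illustration and not the authors' data format: `ResultRecord` and `assign_positions` are hypothetical names, the elements are assumed to arrive in a single linear on-screen order, and fields such as the ad region, shortcut category, and Ask 3D trigger terms are left out for brevity.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class ResultRecord:
    url: Optional[str]               # None where no URL could be extracted
    result_type: str                 # "organic", "sponsored", or "shortcut"
    absolute_position: int           # 1-based rank among all elements
    organic_position: Optional[int]  # 1-based rank among organic results only


def assign_positions(elements: List[Tuple[str, Optional[str]]]) -> List[ResultRecord]:
    """Turn (result_type, url) pairs in on-screen order into positioned records."""
    records = []
    organic_seen = 0
    for absolute, (result_type, url) in enumerate(elements, start=1):
        organic_position = None
        if result_type == "organic":
            # Only organic results advance the organic counter.
            organic_seen += 1
            organic_position = organic_seen
        records.append(ResultRecord(url, result_type, absolute, organic_position))
    return records


# Hypothetical results page: one ad and one shortcut above two organic results.
page = assign_positions([
    ("sponsored", None),
    ("shortcut", "http://example.com/shortcut"),
    ("organic", "http://example.com/a"),
    ("organic", "http://example.com/b"),
])
```

Here the first organic result has organic position 1 but absolute position 3, because the sponsored link and the shortcut are counted in the complete set.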

2 Google’s Adwords had been masked by the proxy, which is why we did not extract the URLs of sponsored links for Google, but only the position and number.

Results

Since every search engine in our experiment presents different modules, as discussed above, we will present the analysis of results screens for every search engine separately. We will also analyse the most popular hosts, domains, top-level domains (TLDs), and file types; those will be compared directly. Finally, we will discuss the shares of sponsored and organic results for popular and rare queries.

The Google results set contains 499 popular search queries and 498 queries from the heavy tail. Those search queries generated 12,522 results in total: 6,731 from the popular queries and 5,791 from the rare ones. Only 16 rare search queries generated no results at all.

Yahoo produced 463 results sets for the popular queries and 492 for the rare queries. In total, 9,436 results from Yahoo were processed, of which 5,232 were generated from popular queries and 4,204 from heavy-tail queries. Sixty-four of the rare queries produced no results at all, while only 2 popular queries had no results.

We obtained 11,752 results from the MSN search engine. All 500 popular search queries produced results (6,685). Only 457 of the rare queries are valid; all other rare queries had been blocked or did not go through for some other reason. Of those valid rare queries, only 12 did not generate any results. We got 5,065 results from rare queries.

From Ask.com, we obtained a total of 9,127 URLs. All popular queries were processed, while 2 rare queries could not be processed. Popular queries produced a total of 5,224 results, and rare queries a total of 3,903 results. Five popular queries led to an empty results set, while for the rare queries the number was 79. Table 3 gives an overview of the results sets of all search engines under investigation. It also shows how many sponsored links are on the first results screen.
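The per-engine counts quoted in the paragraphs above can be cross-checked with a few lines of arithmetic. The sketch below only restates figures from the text; the dictionary layout and the names `REPORTED` and `find_mismatches` are illustrative assumptions.

```python
# Result counts as quoted in the running text for each search engine.
REPORTED = {
    "Google":   {"total": 12_522, "popular": 6_731, "rare": 5_791},
    "Yahoo":    {"total": 9_436,  "popular": 5_232, "rare": 4_204},
    "MSN/Live": {"total": 11_752, "popular": 6_685, "rare": 5_065},
    "Ask":      {"total": 9_127,  "popular": 5_224, "rare": 3_903},
}


def find_mismatches(reported):
    """Return {engine: total - (popular + rare)} for every non-zero difference."""
    mismatches = {}
    for engine, counts in reported.items():
        difference = counts["total"] - (counts["popular"] + counts["rare"])
        if difference != 0:
            mismatches[engine] = difference
    return mismatches


mismatches = find_mismatches(REPORTED)
```

With the figures as quoted, the popular and rare counts add up to the stated total for Google, Yahoo, and Ask. For MSN/Live they add up to 11,750, which is 2 fewer than the total of 11,752 given in the text.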

Table 3: Overview of the Results Sets for the Search Engines Investigated

                                  Google     Yahoo  MSN/Live       Ask
Valid popular queries                499       463       500       500
Valid rare queries                   498       492       457       498
URLs in results screens           12,522     9,436    11,700     9,127
URLs (from popular)                6,731     5,232     6,685     5,224
URLs (from rare)                   5,791     4,204     5,065     3,903
Organic URLs                       9,641     8,454     9,177     8,183
Organic URLs (from popular)        5,041     4,543     4,996     4,661