CS571-Note
1. Search Engine History and Basics : Chronology : Archie, Veronica, Gopher (1980-1993; Gopher protocol over TCP/IP) -> Excite 1993 -> World Wide Web Wanderer 1993 -> ALIWEB 1993 (meta information) -> AltaVista 1995 (natural language queries, advanced searching) -> Lycos 1994 (prefix matching and cataloging) -> Infoseek 1994 (webmaster-submitted pages helped scammers) -> Yahoo 1994 (Dewey-decimal-style directory structure; problem: manual updates instead of crawlers/spiders) -> LookSmart 1995 (pay-per-click fees, dependent on MSN) -> Inktomi 1996 (paid-inclusion model) -> Ask Jeeves 1997 (natural language search engine) -> Google 1999 (AdWords and pay-per-click). Definition : A search engine is a program designed to help find information stored on a computer system such as the World Wide Web, inside a corporate or proprietary network, or on a personal computer. Types : general | enterprise | personal search engine. Web search internal elements : user, web, crawler/spider, indexer, ads, query processor. Spider : collects web pages recursively; Indexer : creates inverted indexes; Query processor : serves query results. Query processing, e.g. semantic analysis : 1. determining the language of the query, 2. filtering unnecessary words from the query (stop words), 3. looking for specific types of queries, e.g. personalities (triggered on names), cities (travel info, maps), medical info (triggered on names and/or results), stock quotes and news (triggered on stock symbols), company info ..., 4. determining the user's location or the target location of the query, 5. remembering previous queries, 6. maintaining a user profile; also NER (named entity recognition). Google behavior : recent query history, holistic results (everything related to that person, place, or hotel query), retained query history (a preprocessing sketch below).
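A minimal sketch of steps 2-3 of the query-processing list above, assuming a toy stop-word list and a toy stock-symbol trigger set (both made up for illustration):

    # Hypothetical query preprocessing: stop-word filtering + type trigger.
    STOP_WORDS = {"the", "a", "an", "of", "in", "for"}
    STOCK_SYMBOLS = {"GOOG", "AAPL"}           # assumption: toy trigger set

    def preprocess(query):
        tokens = query.split()
        kept = [t for t in tokens if t.lower() not in STOP_WORDS]   # step 2
        qtype = ("stock" if any(t.upper() in STOCK_SYMBOLS for t in tokens)
                 else "general")                                    # step 3
        return {"terms": kept, "type": qtype}

    print(preprocess("price of AAPL in 2024"))
    # {'terms': ['price', 'AAPL', '2024'], 'type': 'stock'}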
2. Web Trends and Measurements : growth in the number of connected users, growth in smartphone use, growth in digital data (especially photos and video), growth in social media as an advertising platform, transition from desktop/laptop use to mobile, growth in tablet usage over desktops/laptops, decreased dominance of Microsoft Windows, move away from server farms to cloud computing, growth in voice communication with devices. Web Page Language Diversity : ISO/IEC 10646 Universal Character Set; if the charset is missing, ISO-8859-1 is taken as the default unless a browser setting overrides it; websites in non-western languages typically use UTF-8. Proliferation of content types : roughly 16k to 51k types in total; e.g. Apache Tika unifies other parsers, which is helpful for indexing. Index metadata using Lucene, Solr, or the Google Search Appliance. Properties of the Web Graph : the in-degree and out-degree distributions follow a power law (a sketch below). Manual hierarchies, e.g. Yahoo and Netscape's Open Directory Project DMOZ (ontology in RDF format). Internet Archive Wayback Machine - snapshot history of changes - Apache Nutch.
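A toy illustration of measuring the in-degree distribution; the link graph is made up, but on real web crawls the resulting counts follow a power law:

    # In-degree distribution of a tiny link graph (edges are made up).
    from collections import Counter

    links = [("a", "c"), ("b", "c"), ("d", "c"), ("a", "b"), ("c", "d")]
    in_deg = Counter(dst for _, dst in links)
    dist = Counter(in_deg.values())            # in-degree -> number of pages
    print(dist)                                # Counter({1: 2, 3: 1})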
3. Crawlers and Crawling : crawler/spider/robot, e.g. Google - Googlebot, Yahoo - Slurp, Bing - Bingbot, AdIdxBot, MSNBot, MSNBot-Media, BingPreview. Web Crawling Issues : how to crawl? (quality, efficiency, etiquette) | how much to crawl, how much to index? (coverage, relative coverage) | how often to crawl? (freshness) - depends on the site; policies (selection, revisit, politeness, parallelization), e.g. crawl cnn.com more frequently during elections. How a Crawler Works : plant a seed page -> place the page in the DB -> extract its URLs -> place them in the queue, and repeat until the queue is empty (a minimal sketch below).
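A minimal sketch of that seed/frontier loop using only the Python standard library; the seed URL is a placeholder, and politeness, robots.txt, and error handling are omitted for brevity:

    # Hypothetical minimal BFS crawler: frontier queue + visited set.
    from collections import deque
    from html.parser import HTMLParser
    from urllib.parse import urljoin
    from urllib.request import urlopen

    class LinkExtractor(HTMLParser):
        def __init__(self):
            super().__init__()
            self.links = []
        def handle_starttag(self, tag, attrs):
            if tag == "a":
                self.links += [v for k, v in attrs if k == "href" and v]

    def crawl(seed, limit=10):
        frontier, visited, pages = deque([seed]), set(), {}
        while frontier and len(visited) < limit:
            url = frontier.popleft()
            if url in visited:
                continue
            visited.add(url)
            html = urlopen(url).read().decode("utf-8", "replace")  # fetch
            pages[url] = html                                      # "DB"
            p = LinkExtractor(); p.feed(html)
            frontier.extend(urljoin(url, h) for h in p.links)      # enqueue
        return pages

    # pages = crawl("https://example.com")     # placeholder seed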
Crawling Challenges : handling/avoiding malicious pages (spam, spider traps, CMS dynamic-page URLs); non-malicious page problems (latency/bandwidth, pages blocked via robots.txt, mirror sites/duplicate pages); maintaining politeness (don't hit a server too often). Robots.txt : defines the limits placed on a web crawler; directives are case-sensitive and may include wildcard patterns, e.g. User-agent: *, Disallow: /yoursite/temp/, Allow: /public*/ (a parsing sketch below). Crawling Strategies : BFS (FIFO), DFS (LIFO), focused crawling (priority based on in-degree); BFS tends to find high-quality pages early in the crawl.
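A sketch of checking such rules with Python's stdlib urllib.robotparser; the robots.txt text is a made-up example, and the wildcard is dropped because the stdlib parser does simple prefix matching:

    # Parse robots.txt rules offline and test two URLs against them.
    from urllib.robotparser import RobotFileParser

    robots_txt = """User-agent: *
    Disallow: /yoursite/temp/
    Allow: /public
    """
    rp = RobotFileParser()
    rp.parse(robots_txt.splitlines())
    print(rp.can_fetch("*", "https://example.com/yoursite/temp/x"))  # False
    print(rp.can_fetch("*", "https://example.com/public/page"))      # True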
Crawling check and flow : check the type of the resource object (.gif, .jpeg) -> hashing: skip if already visited -> check it is downloadable (no 404 error) -> index the page -> parse for new links and reiterate. Page Duplication : detect identical and near-identical pages. Link Extraction : two anomalies (anchors with no links, and dynamic pages). Representing URLs : two methods (Trie data structure; lexicographic sort with delta encoding). URL matching : worst case, N URLs with max URL length K gives O(NK) for a linear scan -> binary search averages O(K log N) -> Trie O(K) (a toy Trie below).
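A toy character-level Trie giving the O(K) URL membership test mentioned above (K = URL length); a sketch, not a memory-efficient production structure:

    # Character-level Trie: insert/lookup cost is O(K) in the URL length K.
    class Trie:
        def __init__(self):
            self.root = {}
        def add(self, url):
            node = self.root
            for ch in url:
                node = node.setdefault(ch, {})
            node["$"] = True                   # end-of-URL marker
        def seen(self, url):
            node = self.root
            for ch in url:
                if ch not in node:
                    return False
                node = node[ch]
            return "$" in node

    t = Trie()
    t.add("http://a.com/x")
    print(t.seen("http://a.com/x"), t.seen("http://a.com/y"))   # True False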
Normalizing URLs : reduces the number of distinct hashes; methods to normalize (lowercase the scheme and host, capitalize percent-encoding hex, decode percent-encoded/Unicode characters, remove default port numbers) - sketch at the end of this section. Spider traps : avoid them by stripping unique session IDs before hashing and by stopping on overly long URLs. Spam pages : 1st gen - keyword stuffing, 2nd gen - cloaking, 3rd gen - doorway pages. DNS caching : cache IP/domain-name mappings; prefetching clients (custom DNS handling over UDP); custom crawling. Parallel crawling : multithreaded with distributed URLs; can be implemented in two scenarios (1st - a centralized master/slave crawler, 2nd - an independent, distributed set of crawlers with no cross-communication). Distributed crawling : diverse geographic locations updating the index; benefits (scalability, cost, network load) and issues (overlap, quality, communication bandwidth); strategies (independent, dynamic, static - when inter-partition links exist, three modes: firewall, cross-over, and exchange). Cho and Garcia-Molina, 2000 - revisit policies (uniform policy & proportional policy).
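The URL-normalization sketch referenced above, using urllib.parse; which transforms are safe varies by site, so treat it as illustrative (dropping both port 80 and 443 regardless of scheme is a simplification):

    # Canonicalize a URL: lowercase scheme/host, drop default port & fragment.
    from urllib.parse import urlsplit, urlunsplit

    def normalize(url):
        s = urlsplit(url)
        host = s.hostname or ""
        port = "" if s.port in (None, 80, 443) else f":{s.port}"  # defaults
        path = s.path or "/"
        return urlunsplit((s.scheme.lower(), host + port, path, s.query, ""))

    print(normalize("HTTP://Example.COM:80/a%20b#frag"))
    # http://example.com/a%20b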
4. Search Engine Evaluation : measure the quality of search engines. Precision = #(relevant items retrieved) / #(all retrieved items) = tp/(tp + fp); Recall = #(relevant items retrieved) / #(all relevant items) = tp/(tp + fn); harmonic mean - F-measure : F = 2RP/(R + P); more generally F_beta = (beta^2 + 1)RP / (R + beta^2 * P), where beta is a parameter that controls the relative importance of recall and precision. Mean Average Precision (MAP) : the mean of the average-precision scores over all queries; negative aspect : a relevant document that is never retrieved contributes a precision of 0. Discounted Cumulative Gain (DCG) : a sum of relevance measures discounted by position, rewarding highly relevant results at the beginning of the list: DCG = sum_i rel_i / log2(i + 1). Other evaluation metrics : Precision (P), Average Precision (AP), Cumulative Gain (CG), Discounted Cumulative Gain (DCG), normalized DCG (nDCG). Non-relevance-based measures : click-through on the first result, studies of user behavior, A/B testing (show two page variants, A and B, and compare the results). Google's 4-step process : precision evaluation, side-by-side experiments, live-traffic experiments, full launch (a metrics sketch below).
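A toy computation of precision, recall, F1, and DCG for one ranked result list; the documents and relevance grades are made up:

    # Toy ranked-retrieval metrics for one query; rels marks relevant docs.
    from math import log2

    ranked = ["d3", "d1", "d7", "d2"]          # system output, best first
    rels = {"d1", "d2", "d9"}                  # ground-truth relevant set
    gains = {"d1": 3, "d2": 1, "d9": 2}        # graded relevance for DCG

    tp = sum(1 for d in ranked if d in rels)
    precision = tp / len(ranked)               # tp / all retrieved
    recall = tp / len(rels)                    # tp / all relevant
    f1 = 2 * precision * recall / (precision + recall)

    # DCG: gain at rank i (0-based) is discounted by log2(i + 2).
    dcg = sum(gains.get(d, 0) / log2(i + 2) for i, d in enumerate(ranked))
    print(round(precision, 2), round(recall, 2), round(f1, 2), round(dcg, 2))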
5. Deduplication : identifying duplicate content. Mirroring : Akamai, Apache. Exact duplicates (found with cryptographic hashing - message digest, digital fingerprint, digest, or checksum: MD5, SHA-1/2, RIPEMD-160) & near-duplicates (syntactic similarity, e.g. edit distance). SimHash-style algorithm : 1. define a function f that captures the contents of each document in a number, e.g. a hash function, signature, or fingerprint; 2. create the pair <f(doc_i), ID of doc_i> for every doc_i; 3. sort the pairs; 4. documents that have the same f-value, or f-values within a small threshold, are believed to be duplicates or near-duplicates. Distance measures for similarity : Euclidean, Jaccard, Cosine, Edit, Hamming; distance and similarity are complementary. Shingle : a contiguous subsequence of words. Similarity measures : Jaccard(A,B) (also known as Resemblance) = |S(A,w) ∩ S(B,w)| / |S(A,w) ∪ S(B,w)|; Containment(A,B) = |S(A,w) ∩ S(B,w)| / |S(A,w)|; 0 <= Resemblance <= 1 | 0 <= Containment <= 1. Before comparing contents, apply NLP normalization : whitespace, capitalization, punctuation, stop words. SimHash : documents D1 and D2 are near-duplicates iff Hamming-Distance(SimHash(D1), SimHash(D2)) <= K; find candidate pairs via 'rotate, sort, check adjacent' (a sketch below).
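A toy sketch of w-shingling with Jaccard resemblance/containment, plus the Hamming-distance test SimHash relies on; the shingle size w=3 and the tiny fingerprints are illustrative assumptions:

    # w-shingles and Jaccard resemblance/containment for two tiny "documents".
    def shingles(text, w=3):
        words = text.lower().split()
        return {tuple(words[i:i + w]) for i in range(len(words) - w + 1)}

    A = shingles("a rose is a rose is a rose")
    B = shingles("a rose is a flower which is a rose")
    resemblance = len(A & B) / len(A | B)
    containment = len(A & B) / len(A)
    print(round(resemblance, 2), round(containment, 2))   # 0.43 1.0

    # SimHash near-duplicate test: compare fingerprints bit by bit.
    def hamming(x, y):
        return bin(x ^ y).count("1")

    print(hamming(0b1011, 0b1001) <= 2)        # True: within threshold K=2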
6. Intro to IR : recent databases that support vectors (Neo4j - graph, Postgres - pgvector); pure vector stores (Pinecone, Weaviate, Chroma). Boolean model (AND, OR, NOT) - problems (rigid strict semantics - AND requires all terms, OR accepts any; difficult for complex queries; difficult to rank output or incorporate feedback). Vector model : normalize the vectors and compute cosine similarity; TF-IDF weighting with cosine distance (a sketch below).
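A toy TF-IDF and cosine-similarity computation over a three-document corpus; the weighting scheme (raw tf times log idf) is one common variant, assumed here:

    # Toy TF-IDF vectors and cosine similarity (corpus is made up).
    from math import log, sqrt

    docs = [["web", "search", "engine"],
            ["web", "crawler"],
            ["vector", "search"]]

    def tfidf(doc):
        n = len(docs)
        return {t: doc.count(t) * log(n / sum(t in d for d in docs))
                for t in set(doc)}

    def cosine(u, v):
        dot = sum(u[t] * v.get(t, 0.0) for t in u)
        norm = (sqrt(sum(x * x for x in u.values()))
                * sqrt(sum(x * x for x in v.values())))
        return dot / norm if norm else 0.0

    v0, v2 = tfidf(docs[0]), tfidf(docs[2])
    print(round(cosine(v0, v2), 3))            # overlap only on "search"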
7. Text Processing : Standing queries : prefetched answers for common, frequently searched topics, e.g. the Palestine war. Spam filtering : categorization/classification via supervised learning methods. Document classification : Yahoo did it manually; hard-coded rule bases; supervised learning : Naive Bayes (probabilistic), kNN (nearest neighbors), SVM (binary), logistic regression, neural networks, ensemble methods; BOW : word features; feature selection - PCA : use common terms for selection. Naive Bayes : low storage, very good for domains with many equally important features; unrelated features cancel out without affecting results (sketch at the end of this section). Classification using vector spaces : Voronoi regions - premise 1 : the same class occupies one contiguous region, premise 2 : different classes do not overlap. Rocchio classification : convex Voronoi regions, related to Fisher's linear discriminant, but this method is worse than Naive Bayes and not good for disjunctive, multi-label categories. kNN : lazy learning done at runtime; memory-based, case-based learning; choose odd k. Advantages : no feature selection or training necessary, scales well with many classes, more accurate than NB or Rocchio, handles polymorphic categories better than Rocchio/NB; but expensive at test time. Quadtree - for dividing the space into regions.
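Illustrating the Naive Bayes classifier above: a minimal multinomial NB sketch with add-one smoothing; the training snippets and labels are made up:

    # Tiny multinomial Naive Bayes for spam/ham (training data is made up).
    from collections import Counter
    from math import log

    train = [("buy cheap pills now", "spam"),
             ("cheap flights buy now", "spam"),
             ("meeting notes for class", "ham"),
             ("lecture notes posted", "ham")]

    word_counts = {"spam": Counter(), "ham": Counter()}
    class_counts = Counter()
    for text, label in train:
        class_counts[label] += 1
        word_counts[label].update(text.split())

    def predict(text):
        vocab = set().union(*word_counts.values())
        scores = {}
        for c in word_counts:
            total = sum(word_counts[c].values())
            score = log(class_counts[c] / len(train))          # log prior
            for w in text.split():                             # likelihoods
                score += log((word_counts[c][w] + 1) / (total + len(vocab)))
            scores[c] = score
        return max(scores, key=scores.get)

    print(predict("cheap pills"))              # spam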
8. Inverted Index : define terms (case folding, stemming, stop words); a vector containing all distinct words of the text collection in lexicographical order (called the vocabulary) and, for each word in the vocabulary, a list of all documents (and text positions) in which that word occurs. Pointer to postings list : store the terms in lexical order (the dictionary). Old way : incidence vectors - ANDing the vectors for word A and word B gives the co-occurrence count. An inverted index is naturally sparse, so use a suitable data structure; use skip lists for fast processing. Postings are stored as sorted document IDs; when merging (intersecting) the sorted lists, skip pointers let you skip the nodes in between; time complexity roughly O(sqrt(n)) skips, faster than a plain O(n) walk (an intersection sketch below).
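A toy inverted-index build plus the linear merge of two sorted postings lists; skip pointers would accelerate the marked step, and the documents are made up:

    # Build a toy inverted index, then intersect two postings lists (AND).
    docs = {1: "new home sales top forecasts",
            2: "home sales rise in july",
            3: "new sales numbers"}

    index = {}
    for doc_id, text in docs.items():
        for term in text.split():
            index.setdefault(term, []).append(doc_id)   # IDs arrive sorted

    def intersect(p1, p2):                 # linear merge of sorted postings
        i = j = 0; out = []
        while i < len(p1) and j < len(p2):
            if p1[i] == p2[j]:
                out.append(p1[i]); i += 1; j += 1
            elif p1[i] < p2[j]:
                i += 1                     # a skip pointer would jump here
            else:
                j += 1
        return out

    print(intersect(index["new"], index["sales"]))      # [1, 3]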
Phrase Queries : word order matters inside the quotes. N-gram indexing : the search term becomes a dictionary term; supports longer phrase queries using the Boolean model. Alternate solution : a positional document index - store each term's positions per document, term -> doc_1:positions, doc_2:positions (a sketch below). Another approach : POS tagging with term frequencies, or an n-gram index. Distributed Indexing : using MapReduce - the master splits the work (partitioned by term and by document) across parsers, whose output is assigned to inverters that sort and write the postings lists in parallel.
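A toy positional index answering a two-word phrase query, per the "term -> doc:positions" layout above; the documents are made up:

    # Positional index: term -> {doc_id: [positions]}; a phrase matches
    # where the two terms' positions are consecutive.
    docs = {1: "new home sales top", 2: "sales of a new home"}

    pindex = {}
    for doc_id, text in docs.items():
        for pos, term in enumerate(text.split()):
            pindex.setdefault(term, {}).setdefault(doc_id, []).append(pos)

    def phrase(t1, t2):
        hits = []
        for d in pindex.get(t1, {}).keys() & pindex.get(t2, {}).keys():
            if any(p + 1 in pindex[t2][d] for p in pindex[t1][d]):
                hits.append(d)
        return hits

    print(phrase("home", "sales"))             # [1]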
9. Videos : indexing uses metadata : author, title, creation date, duration, coding quality, tags, description; ranking uses : date of upload, number of views, duration, user rating. Vimeo/Vevo/Dailymotion - high-quality video. Subtitle types : open/closed/SDH/SRT/TTXT - SRT : SubRip subtitle format. Speech recognition : voice recognition (GAudi, Google Audio Indexing) - to locate the words spoken. OCR - text recognition on videos - the TalkMiner system. YouTube issues : video formats, device displays (desktop, iPhone, Android), worldwide video distribution - CDN, monetization - Content ID, next-video suggestion - recommendation system. Ranking videos : ensemble method; ranking filters : upload date, type, duration, sort-by. Ranking metric factors : metadata, video quality, number of views, likes, shares, links, subtitles and closed captions. Recommendation system : co-visitation counts, relatedness r(v_i, v_j) = c_ij / f(v_i, v_j) (a sketch below). False recommendations : AlgoTransparency.org. CDN : Akamai, Limelight & Level 3 Communications. Content ID : video IDs are 11 characters, so the total number of possibilities is 64^11 (base64url alphabet). Identification through Content ID and efficient delivery through CDN caches. Upload video process : user upload > transcoded formats > CDN intermediary transcode format > queue > completion handler > API > load balancer. YouTube delivery system : video-ID space > logical server organization > 3-tier physical cache. Acoustic fingerprint : time, frequency, amplitude (color); uses finite-state transducers for hashing.
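A toy co-visitation computation following r(v_i, v_j) = c_ij / f(v_i, v_j); the sessions are made up, and the normalizer f (taken here as the product of view counts) is an assumption, since the notes leave f unspecified:

    # Co-visitation: count pairs of videos watched in the same session.
    from collections import Counter
    from itertools import combinations

    sessions = [["v1", "v2", "v3"], ["v1", "v2"], ["v2", "v3"]]
    views, co = Counter(), Counter()
    for s in sessions:
        views.update(s)
        co.update(frozenset(p) for p in combinations(s, 2))

    def r(vi, vj):                  # f = product of view counts (assumed)
        return co[frozenset((vi, vj))] / (views[vi] * views[vj])

    print(round(r("v1", "v2"), 2))             # 2 co-visits / (2 * 3)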
10. Query Formulation : query handling - not case-sensitive, ignores punctuation, Boolean search model, supports wildcards. Query modifiers : daterange:, filetype:, inanchor:, intext:, intitle:, inurl:, site:. Alternative query types : cache:, link:, related:, info:. Special information needs : stocks:. Also supports math operations, social handles (@twitter), prices ($400), ip, calculations, and translation. Retired Google features : phonebook, rphonebook, bphonebook, reading level, wonder wheel, Google Code Search. Special search engines : Google Patents, Google Books, Google Scholar. Relevance feedback : autocompletion, spelling correction. Checking suggestion quality : MRR (Mean Reciprocal Rank) - the average of the reciprocal ranks of the first relevant result over a sample of queries (e.g. given 3 guesses, the 1st should be the most similar): MRR = (1/|Q|) * sum_{i=1..|Q|} 1/rank_i (a sketch below).
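A one-line check of the MRR formula above; the ranks are made-up sample data:

    # MRR over three sample queries; rank_i is the position of the first
    # relevant suggestion for query i.
    ranks = [1, 3, 2]
    mrr = sum(1 / r for r in ranks) / len(ranks)
    print(round(mrr, 3))                       # (1 + 1/3 + 1/2) / 3 = 0.611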
11. MapReduce : Building blocks : MapReduce - runs large-scale jobs in parallel on clusters of machines; GFS - a large-scale distributed file system; BigTable - a semi-structured storage system. MapReduce uses : Google - search index, news article clustering, statistical machine learning; Yahoo - search index, spam detection; Facebook - data mining, ad optimization, spam detection. Data collection examples : Dish Network remote clicks, Tesla car-usage data. Parallelization/synchronization problems : more work units than threads, work-unit assignment, aggregating/combining results, detecting when worker threads finish, work that cannot be divided into tasks. MapReduce advantages : automatic parallelization, fault tolerance, I/O scheduling, monitoring & status updates. MapReduce fault tolerance and recovery : task crash (restart the work) / node crash (reschedule its tasks) / task latency, i.e. stragglers (launch a second copy). GFS : a distributed file system for efficient, reliable access on large clusters - files split into 64 MB chunks with replicas; the master server handles the namespace/directory, access control, chunk locations, and GC, while chunk servers hold the replicated chunks that applications read as files; communication through heartbeats and handshakes. BigTable : compressed, high-performance, proprietary data storage built on top of GFS; used by MapReduce, Google Maps, YouTube, Gmail; rows (lexicographically ordered) and columns (two-level name structure); garbage collection - retain only the most recent K values in a cell, or values newer than K seconds (a word-count sketch below).
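A toy single-process word count in the map/reduce style described above; no cluster, GFS, or fault tolerance, just the map -> shuffle -> reduce shape:

    # Toy in-memory map/reduce word count (single process, no cluster).
    from collections import defaultdict

    def mapper(doc):
        for word in doc.split():
            yield (word, 1)                    # emit intermediate pairs

    def reducer(word, counts):
        return (word, sum(counts))

    docs = ["the web is big", "the web is fast"]
    groups = defaultdict(list)                 # shuffle: group by key
    for doc in docs:
        for k, v in mapper(doc):
            groups[k].append(v)
    print(dict(reducer(k, v) for k, v in groups.items()))
    # {'the': 2, 'web': 2, 'is': 2, 'big': 1, 'fast': 1}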
12. Search Engine Advertising : Placement : positional bias. Keyword matching rules : broad match, exact match, phrase match, negative keywords. Capabilities/advantages : analytics (clicks, post-click interactions), frequency capping, sequencing ads, excluding competitive ads, audience targeting, dayparting (peak-time scheduling), auction rules. Expected revenue (AdRank) = bid amount * click probability (a sketch below). AdSense/Ad Manager/AdMob : CPM (cost per thousand impressions), CPE (cost per engagement); embedded using JS; keyword matching based on WordNet synonym sets (synsets). Advertising model : client webpage has an ad box > broker runs an auction > advertisers bid > broker picks the winner and shows the ad > client sees the ad and waits > results are reported to publishers, trackers, advertisers, and the broker.
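A minimal sketch of ranking ads by expected revenue (AdRank = bid * click probability); the bids and click probabilities below are made up:

    # Rank ads by AdRank = bid * estimated click probability.
    ads = [("ad1", 2.00, 0.03), ("ad2", 1.00, 0.08), ("ad3", 3.00, 0.01)]
    ranked = sorted(ads, key=lambda a: a[1] * a[2], reverse=True)
    for name, bid, ctr in ranked:
        print(name, round(bid * ctr, 3))       # ad2 0.08, ad1 0.06, ad3 0.03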
13. Knowledge Graphs : Taxonomy : a classification or organization of complex systems. Ontology : RDF properties and relations between entities. Knowledge Base : a collection for (1) representing information and (2) reasoning about that information. Object model : includes classes, subclasses, and instances; helps the search engine answer queries and show rich widgets (a sketch below).
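A toy triple-store sketch of the class/subclass/instance object model; the entities and predicates are illustrative assumptions, not a real RDF API:

    # Toy knowledge graph as subject-predicate-object triples (made up).
    triples = [("Place", "subclassOf", "Thing"),
               ("City", "subclassOf", "Place"),
               ("Paris", "instanceOf", "City"),
               ("Paris", "population", "2_000_000")]

    def query(subj=None, pred=None):           # None acts as a wildcard
        return [t for t in triples
                if subj in (None, t[0]) and pred in (None, t[1])]

    print(query(subj="Paris"))                 # all facts about Paris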