Search

If you want to develop a search engine yourself from scratch take a look at the Inverted index data structure. Basically for each word or term you are creating a list with URLs of those documents containing the word. The lists can be stored either as files in the file system or in a NoSQL database for persistence. The core of most NoSQL databases is a key value store based on a Hash table or Trie. During indexing and searching you transform the search term to a hash. The hash serves as key, while the inverted lists are the value. This allows a near O(1) lookup performance.

The inverted index allows Boolean searches by intersecting the lists of multiple keywords.

An alternative approach is the Vector space model . Each dimension corresponds to a separate term. If a term occurs in the document, its value in the vector is non-zero. Vector operations can be used to compare documents with queries.

Building blocks of a search engine

The following components are typical building blocks of a search engine:

crawler

select documents to index
select parts of the document you want to index
regularly update the index and rankings if documents have changed, are added or deleted
follow references to other documents
deal with duplicates and canonization

parser

remove HTML markup and extract text
break text into words
normalize capitalization, hyphens and umlauts

ranking

rank the documents by relevance, time etc.

query processing

parse query into terms
parse query for phrase and Boolean search operators
deal with typos
create query suggestions
implement substring search / instant search

search

find documents matching the query and additional filters, sorted relevance
layout the search result lists

index

store the documents in a data structure which allows scalable, fast and Boolean searching

Search

crawler

parser

ranking

query processing

search

index

caching

load balancing

redundancy

analytics

Sub Notes