Search options: separate word search diacritics
Research /

Search

If you want to develop a search engine yourself from scratch take a look at the Inverted index data structure. Basically for each word or term you are creating a list with URLs of those documents containing the word. The lists can be stored either as files in the file system or in a NoSQL database for persistence. The core of most NoSQL databases is a key value store based on a Hash table or Trie. During indexing and searching you transform the search term to a hash. The hash serves as key, while the inverted lists are the value. This allows a near O(1) lookup performance.

The inverted index allows Boolean searches by intersecting the lists of multiple keywords.

An alternative approach is the Vector space model . Each dimension corresponds to a separate term. If a term occurs in the document, its value in the vector is non-zero. Vector operations can be used to compare documents with queries.

Building blocks of a search engine

The following components are typical building blocks of a search engine:

crawler

  1. select documents to index
  2. select parts of the document you want to index
  3. regularly update the index and rankings if documents have changed, are added or deleted
  4. follow references to other documents
  5. deal with duplicates and canonization

parser

  1. remove HTML markup and extract text
  2. break text into words
  3. normalize capitalization, hyphens and umlauts

ranking

  1. rank the documents by relevance, time etc.

query processing

  1. parse query into terms
  2. parse query for phrase and Boolean search operators
  3. deal with typos
  4. create query suggestions
  5. implement substring search / instant search

search

  1. find documents matching the query and additional filters, sorted relevance
  2. layout the search result lists

index

  1. store the documents in a data structure which allows scalable, fast and Boolean searching

caching

load balancing

redundancy

analytics


Sub Notes