Search
If you want to develop a search engine yourself from scratch take a look at the Inverted index data structure. Basically for each word or term you are creating a list with URLs of those documents containing the word. The lists can be stored either as files in the file system or in a NoSQL database for persistence. The core of most NoSQL databases is a key value store based on a Hash table or Trie. During indexing and searching you transform the search term to a hash. The hash serves as key, while the inverted lists are the value. This allows a near O(1) lookup performance.
The inverted index allows Boolean searches by intersecting the lists of multiple keywords.
An alternative approach is the Vector space model . Each dimension corresponds to a separate term. If a term occurs in the document, its value in the vector is non-zero. Vector operations can be used to compare documents with queries.
Building blocks of a search engine
The following components are typical building blocks of a search engine:
crawler
- select documents to index
- select parts of the document you want to index
- regularly update the index and rankings if documents have changed, are added or deleted
- follow references to other documents
- deal with duplicates and canonization
parser
- remove HTML markup and extract text
- break text into words
- normalize capitalization, hyphens and umlauts
ranking
- rank the documents by relevance, time etc.
query processing
- parse query into terms
- parse query for phrase and Boolean search operators
- deal with typos
- create query suggestions
- implement substring search / instant search
search
- find documents matching the query and additional filters, sorted relevance
- layout the search result lists
index
- store the documents in a data structure which allows scalable, fast and Boolean searching