GitHub - Shrinidhi-C/Search-Engine: A simple search engine for Environmental News NLP archive

PROBLEM STATEMENT:

Build a search engine for the Environmental News NLP Archive and compare its metrics with those of available search engines.

DESCRIPTION

The aim is to implement a search engine that answers queries on the Environmental News NLP Archive. In the first part, the corpus is built by indexing the documents and snippets; in the second, search queries are answered using that index. Features such as spell-check and ranking are then added. To grade the implementation, its results are checked for similarity against an existing search engine, Elasticsearch.

Dataset

Environmental News NLP Archive: Contains 418 documents from various news stations. Each document holds the news article URL, Match DateTime, station, source, IAshowId, IA preview Thumb, and snippet.

Inverted Index

An inverted index is a data structure built by parsing the documents against which queries are answered. Given a query, the index is used to return the list of documents and snippets relevant to it. The inverted index maps terms to the documents in which they appear: each term is a key in the index, and its value is its postings list. The postings list of a term is the list of documents in which the term appears.
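The term-to-postings mapping described above can be sketched in a few lines of Python. This is a minimal illustration, not the repository's implementation: the `tokenize` and `build_inverted_index` helpers and the three sample snippets are assumptions made up for this example.

```python
from collections import defaultdict
import re

def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens (assumed tokenizer)."""
    return re.findall(r"[a-z0-9]+", text.lower())

def build_inverted_index(docs):
    """Map each term to a sorted postings list of the document IDs containing it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in tokenize(text):
            index[term].add(doc_id)
    return {term: sorted(doc_ids) for term, doc_ids in index.items()}

# Hypothetical snippets, not taken from the actual dataset.
docs = {
    1: "Climate change drives sea level rise",
    2: "New policy targets climate emissions",
    3: "Sea turtles return to the coast",
}
index = build_inverted_index(docs)
print(index["climate"])  # [1, 2]
print(index["sea"])      # [1, 3]
```

A one-word query is then a single dictionary lookup that returns the term's postings list.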

Query Types

The search engine answers the following query types:

One-word Queries: Queries that consist of a single word.
Free-text Queries: Queries that consist of a sequence of words separated by spaces; the result is the implicit logical ‘OR’ of all the terms in the query.
Phrase Queries: Queries that also consist of a sequence of words separated by spaces but are enclosed in double quotes, so that only documents containing the query terms in exactly the specified order are fetched.
Wild-card Queries: Queries used when the spelling of a term is uncertain or when multiple spelling variants of a term exist.
Proximity Queries: Queries that require the query terms to occur within a given proximity of each other within a snippet.
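The sketch below shows one way these query types could be answered from a positional index, which stores token positions as well as document IDs. It is illustrative only. The function names, the `*` wildcard syntax (matched with Python's `fnmatch`), the order-insensitive proximity check, and the sample snippets are all assumptions, not details taken from the repository.

```python
from collections import defaultdict
from fnmatch import fnmatchcase
import re

def tokenize(text):
    return re.findall(r"[a-z0-9]+", text.lower())

def build_positional_index(docs):
    """Map term -> {doc_id: [positions of the term in that document]}."""
    index = defaultdict(dict)
    for doc_id, text in docs.items():
        for pos, term in enumerate(tokenize(text)):
            index[term].setdefault(doc_id, []).append(pos)
    return index

def free_text_query(index, query):
    """Implicit OR: union of the postings of every query term."""
    result = set()
    for term in tokenize(query):
        result |= set(index.get(term, {}))
    return sorted(result)

def phrase_query(index, phrase):
    """Documents containing the terms consecutively, in the given order."""
    terms = tokenize(phrase)
    if not terms or any(t not in index for t in terms):
        return []
    candidates = set(index[terms[0]])
    for t in terms[1:]:
        candidates &= set(index[t])
    hits = []
    for doc_id in candidates:
        starts = index[terms[0]][doc_id]
        if any(all(s + i in index[t][doc_id] for i, t in enumerate(terms))
               for s in starts):
            hits.append(doc_id)
    return sorted(hits)

def wildcard_query(index, pattern):
    """Expand a '*' pattern over the vocabulary, then OR the matching terms."""
    result = set()
    for term in index:
        if fnmatchcase(term, pattern.lower()):
            result |= set(index[term])
    return sorted(result)

def proximity_query(index, term_a, term_b, k):
    """Documents where the two terms occur within k positions (either order)."""
    pa, pb = index.get(term_a, {}), index.get(term_b, {})
    return sorted(
        d for d in pa.keys() & pb.keys()
        if any(abs(x - y) <= k for x in pa[d] for y in pb[d])
    )

# Hypothetical snippets, not taken from the actual dataset.
docs = {
    1: "Climate change drives sea level rise",
    2: "New policy targets climate emissions",
    3: "Sea turtles return to the coast",
}
pindex = build_positional_index(docs)
print(free_text_query(pindex, "climate coast"))      # [1, 2, 3]
print(phrase_query(pindex, "sea level"))             # [1]
print(wildcard_query(pindex, "clim*"))               # [1, 2]
print(proximity_query(pindex, "climate", "sea", 3))  # [1]
```

Scanning the whole vocabulary for a wildcard is linear in the number of terms; a sorted dictionary or permuterm index would avoid that scan, at the cost of more setup.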

Spell Check

If the query word exists in the vocabulary, we assume it is correct. If it does not, we try to find the most similar words. Candidate words are compared by the Jaccard distance computed over their 2-grams (q-grams with q = 2), and the 3 most similar words are returned, ordered by similarity and probability.
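A minimal sketch of this spell-check step follows. It assumes character 2-grams, and it assumes that "probability" refers to how frequent a term is in the corpus, which is used here only to break ties between equally distant candidates. The helper names and the toy vocabulary with its counts are hypothetical.

```python
def bigrams(word):
    """Set of character 2-grams of a word, e.g. 'sea' -> {'se', 'ea'}."""
    return {word[i:i + 2] for i in range(len(word) - 1)}

def jaccard_distance(a, b):
    """1 - |A ∩ B| / |A ∪ B| over the 2-gram sets of the two words."""
    ga, gb = bigrams(a), bigrams(b)
    union = ga | gb
    if not union:  # both words are shorter than two characters
        return 0.0 if a == b else 1.0
    return 1.0 - len(ga & gb) / len(union)

def suggest(word, vocabulary, k=3):
    """Return the word itself if known, else the k closest vocabulary terms.

    `vocabulary` maps term -> corpus frequency (assumed stand-in for
    the "probability" mentioned in the text).
    """
    if word in vocabulary:
        return [word]
    ranked = sorted(
        vocabulary,
        key=lambda term: (jaccard_distance(word, term), -vocabulary[term]),
    )
    return ranked[:k]

# Hypothetical vocabulary with made-up frequencies.
vocab = {"climate": 12, "climb": 3, "pollution": 7, "policy": 5, "emission": 4}
print(suggest("climat", vocab))  # ['climate', 'climb', 'policy']
print(suggest("policy", vocab))  # ['policy']
```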

COMPARISON

We compare all query types of our IR system with Elasticsearch; the response times are as follows:

Query | Our IR System | Elasticsearch
Free Text Query | 0.01 secs | 0.063 secs
Wildcard Query | 0.87 secs | 0.052 secs
Proximity Query | 0.0014 secs | NA
Phrase Query | 0.012 secs | NA

INTERPRETATION OF EFFICIENCY

We calculated the Precision and Recall of 5 queries for our IR system by measuring its results against the Elasticsearch results, which are treated as the relevant documents. For the Free Text query, Precision is 0.97 and Recall is 1. This means our IR system retrieves all the relevant documents from the database plus a few false positives.
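The two measures can be computed as in the sketch below, which treats the Elasticsearch result set as the ground truth. The `precision_recall` helper and the document IDs are illustrative assumptions, not the project's measured data.

```python
def precision_recall(retrieved, relevant):
    """Precision = TP / |retrieved|, Recall = TP / |relevant|."""
    retrieved, relevant = set(retrieved), set(relevant)
    true_positives = len(retrieved & relevant)
    precision = true_positives / len(retrieved) if retrieved else 0.0
    recall = true_positives / len(relevant) if relevant else 0.0
    return precision, recall

# Hypothetical result sets: ours returns one extra document (a false positive).
ours = [1, 2, 3, 4, 5]
elastic = [1, 2, 3, 4]  # treated as the relevant documents
p, r = precision_recall(ours, elastic)
print(p, r)  # 0.8 1.0
```

A recall of 1 with a precision below 1, as in this toy case, corresponds to the situation reported above: every relevant document is retrieved, along with some extras.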

LEARNING OUTCOME

● Building IR systems for query searching.

● Building postings lists and a dictionary using B-Trees

● Handling Different types of Queries

● Building Spelling Correction using Jaccard Coefficient

● Hands-on experience with Elasticsearch
