Evaluation Measures
Evaluation measures for an information retrieval system are used to assess how well the search results satisfied the user's query intent.
Such metrics are often split into two kinds: online metrics look at users' interactions with the search system, while offline metrics measure relevance, in other words how likely each result, or the search engine results page (SERP) as a whole, is to meet the information needs of the user.
(Wikipedia)
The following list includes the leaf-level RRE built-in metrics which can be used out of the box. "Leaf" because those metrics are computed at leaf level in the domain model, which means they are computed at query level:
- Precision: the fraction of retrieved documents that are relevant
- Recall: the fraction of relevant documents that are retrieved
- Precision at 1: this metric indicates whether the top result in the list is relevant.
- Precision at 2: same as above, but it considers the first two results.
- Precision at 3: same as above, but it considers the first three results.
- Precision at 10: the fraction of the top 10 search results that are relevant.
- Reciprocal Rank: it is the multiplicative inverse of the rank of the first "correct" answer: 1 for first place, 1/2 for second place, 1/3 for third and so on.
- Expected Reciprocal Rank (ERR): an extension of Reciprocal Rank to graded relevance; it measures the expected reciprocal length of time that the user will take to find a relevant document.
- Average Precision: the area under the precision-recall curve.
- NDCG at 10: the Normalized Discounted Cumulative Gain of the top 10 results. Each result's graded relevance is discounted by its rank, summed, and divided by the same sum for the ideal ordering, so a perfect ranking scores 1.
- F-Measure: it measures the effectiveness of retrieval with respect to a user who attaches β times as much importance to recall as precision. RRE provides the three most popular F-Measure instances: F0.5, F1 and F2
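The definitions in the list above can be sketched in a few lines of Python. This is an illustrative sketch only, not RRE's own implementation: the function names, the sample document ids and the gain values are all hypothetical, results are assumed to contain no duplicate ids, and the NDCG variant shown (gain `2^rel - 1`, discount `log2(rank + 1)`) is one common formulation that may differ from the one RRE uses.

```python
import math

def precision(retrieved, relevant):
    """Fraction of retrieved documents that are relevant."""
    if not retrieved:
        return 0.0
    return sum(1 for d in retrieved if d in relevant) / len(retrieved)

def recall(retrieved, relevant):
    """Fraction of relevant documents that are retrieved."""
    if not relevant:
        return 0.0
    return sum(1 for d in retrieved if d in relevant) / len(relevant)

def precision_at_k(retrieved, relevant, k):
    """Relevant results among the top k, divided by k (assumed convention)."""
    return sum(1 for d in retrieved[:k] if d in relevant) / k

def reciprocal_rank(retrieved, relevant):
    """1/rank of the first relevant result, or 0 if there is none."""
    for rank, d in enumerate(retrieved, start=1):
        if d in relevant:
            return 1.0 / rank
    return 0.0

def average_precision(retrieved, relevant):
    """Mean of the precision values taken at each relevant hit."""
    if not relevant:
        return 0.0
    hits, total = 0, 0.0
    for rank, d in enumerate(retrieved, start=1):
        if d in relevant:
            hits += 1
            total += hits / rank
    return total / len(relevant)

def f_measure(p, r, beta=1.0):
    """Weighted harmonic mean; beta > 1 favours recall, beta < 1 precision."""
    if p == 0 and r == 0:
        return 0.0
    b2 = beta * beta
    return (1 + b2) * p * r / (b2 * p + r)

def ndcg_at_k(retrieved, gains, k):
    """DCG of the top k divided by the DCG of the ideal ordering."""
    def dcg(values):
        return sum((2 ** g - 1) / math.log2(i + 1)
                   for i, g in enumerate(values, start=1))
    actual = [gains.get(d, 0) for d in retrieved[:k]]
    ideal = sorted(gains.values(), reverse=True)[:k]
    best = dcg(ideal)
    return dcg(actual) / best if best > 0 else 0.0

# Hypothetical query: four results returned, three documents judged relevant.
retrieved = ["d1", "d2", "d3", "d4"]
relevant = {"d1", "d3", "d5"}
gains = {"d1": 3, "d3": 1, "d5": 2}  # graded judgments, used only by NDCG

p = precision(retrieved, relevant)
r = recall(retrieved, relevant)
f_scores = {beta: f_measure(p, r, beta) for beta in (0.5, 1.0, 2.0)}
```

With this sample, two of the four results are relevant, so precision is 0.5, while recall is 2/3 because `d5` is never retrieved. The `f_scores` dictionary holds the F0.5, F1 and F2 values for the same query.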
On top of those "leaf" metrics computed at query level, RRE computes them at the upper levels of the domain model (e.g. query group, topic, corpus) using an aggregation function. The result is a new set of metrics with several levels of granularity:
- Mean Average Precision: the mean of the average precisions computed at query level.
- Mean Reciprocal Rank: the average of the reciprocal ranks computed at query level.
- all other metrics listed above, aggregated by their arithmetic mean
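As a rough illustration of the aggregation step, the sketch below averages hypothetical query-level values for a single query group. The `aggregate` helper and the sample numbers are assumptions made for this example; how RRE rolls values up beyond one level (for instance from query groups to topics) is not shown here.

```python
from statistics import mean

def aggregate(query_level_values):
    """Arithmetic mean of a metric's query-level values (0.0 if empty)."""
    return mean(query_level_values) if query_level_values else 0.0

# Hypothetical leaf values for the three queries of one query group.
average_precisions = [0.5, 1.0, 0.25]
reciprocal_ranks = [1.0, 0.5, 0.25]

mean_average_precision = aggregate(average_precisions)  # MAP for the group
mean_reciprocal_rank = aggregate(reciprocal_ranks)      # MRR for the group
```

The same helper applies to any other leaf metric, since each of them is aggregated by its arithmetic mean.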