558 Implement Search relevance decay based on date filed #4849

Merged
merged 10 commits into main from 558-implement-es-relevance-decay on Jan 6, 2025

Conversation

@albertisfu (Contributor) commented Dec 21, 2024

This PR introduces decay relevance based on the date filed, as described in #558, for all search types with a date filed.

These search types include RECAP, DOCKETS, RECAP_DOCUMENT (only V4 API), OPINIONS, and ORAL_ARGUMENT.

As pointed out in #558, we aim to combine this decay with the BM25 scores returned by Elasticsearch when sorting by score desc.

The formula described in #558 is:

weight = e^(-t / H)

H is the half-life parameter: it determines the time (t) it takes for the weight to halve, i.e., to reach a decay of 0.5.

My first approach was to use the built-in function scores available in Elasticsearch.

Elasticsearch's exp function is similar to the formula above and looks something like:

weight = exp(λ ⋅ max(0, |date_filed - origin| - offset))

where λ is:

λ = ln(decay) / scale

Both formulas can achieve the same decay behavior; however, e^(-t / H) is directly tied to a decay of 0.5.

Solving for H, we have:

H = -t / ln(weight)

For a decay of 0.5 at time t:

H = -t / ln(0.5)

In contrast, the Elasticsearch approach is:

weight = exp(λ ⋅ max(0, |date_filed - origin| - offset))

Assuming we do not apply an offset and simplifying max(0, |date_filed - origin|) = t, we have:

weight = exp(λ ⋅ t)
λ = ln(decay) / scale

Substituting:

weight = exp((ln(decay) / scale) ⋅ t)

This is a more flexible approach, where decay and scale can be easily adjusted to control the curve shape.
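
To make the equivalence concrete, here is a quick Python check (illustrative only, not part of the PR) that solves H for a 10-year target and evaluates both forms:

import math

# Target behavior: the weight should drop to 0.5 for documents that are 10 years old.
target_years = 10

# Issue #558 form: weight = e^(-t / H), with H solved so that weight(target_years) = 0.5
H = -target_years / math.log(0.5)            # H = -t / ln(0.5)

# Elasticsearch form: weight = exp((ln(decay) / scale) * t)
decay, scale = 0.5, target_years

for t in (0, 5, 10, 20, 50):                 # document age in years
    w_issue = math.exp(-t / H)
    w_es = math.exp((math.log(decay) / scale) * t)
    print(f"t={t:>2}y  e^(-t/H)={w_issue:.4f}  exp((ln(decay)/scale)*t)={w_es:.4f}")

Both columns print the same values, with the weight reaching exactly 0.5 at 10 years.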

However, when using the built-in function:

"functions": [
        {
          "exp": {
            "dateFiled": {
              "origin": "now",
              "scale": "10y",
              "offset": "0d",
              "decay": 0.5
            }
          }
        }
      ]
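
For reference, this functions clause sits inside a function_score query that wraps the main query. The sketch below (a Python dict with an assumed match query; not the PR's actual query builder output) only shows the overall shape and how the weight combines with the BM25 score:

# Sketch of where the "functions" clause lives; the wrapped match query and
# field values are illustrative assumptions.
function_score_query = {
    "query": {
        "function_score": {
            "query": {"match": {"caseName": "apple"}},  # the original BM25 query
            "functions": [
                {
                    "exp": {
                        "dateFiled": {
                            "origin": "now",
                            "scale": "10y",
                            "offset": "0d",
                            "decay": 0.5,
                        }
                    }
                }
            ],
            "boost_mode": "multiply",  # combine the decay weight with the BM25 score
        }
    }
}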
      

I encountered an issue similar to what we found when implementing other custom score functions: if date_filed is None in a document, that document is shown first. This doesn't seem correct, as it prioritizes documents with no date filed over recent documents.

To solve this issue, I opted to implement the same exp function as a custom script in the build_decay_relevance_score method, which accepts a value for a missing date_filed (defaulting to 1600-01-01).
This way, documents with a null date_filed are treated as if they were filed on that date (or any other date we specify).

Additionally, the decay and scale parameters are configurable in this method. The scale is given in years, which makes more sense for our data.
This allows us to say, for instance: achieve a decay of 0.5 (decay) for documents that are 10 years old (scale).

The weight computed by this custom function score is combined with the original BM25 score using boost_mode: "multiply", so the original score is multiplied by the computed weight, which ranges from 1 down to ~0.

For example, if the weight computed for a document is 0.5 and the original score for the document is 100, the new score will be 50.
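
As a rough illustration of the script's logic, here is a Python re-implementation sketch (not the actual Painless code used in build_decay_relevance_score):

import math
from datetime import datetime, timezone

def decay_weight(date_filed, *, scale_years, decay, missing_date="1600-01-01"):
    """Sketch of the decay script: weight = exp((ln(decay) / scale) * t)."""
    now = datetime.now(timezone.utc)
    if date_filed is None:
        # Null dates are treated as very old instead of ranking first.
        date_filed = datetime.fromisoformat(missing_date).replace(tzinfo=timezone.utc)
    t = abs((now - date_filed).total_seconds())   # document age in seconds
    scale = scale_years * 365 * 24 * 60 * 60      # years -> seconds (1 year = 365 days)
    lam = math.log(decay) / scale                 # λ = ln(decay) / scale
    return math.exp(lam * t)

# boost_mode "multiply": new score = original BM25 score * decay weight
bm25_score = 100.0
weight = decay_weight(datetime(2015, 8, 16, tzinfo=timezone.utc), scale_years=10, decay=0.5)
print(round(bm25_score * weight, 2))              # roughly half for a ~10-year-old document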

However, there is a problem with queries that don't return scores, such as when the user doesn't provide a text query (only filters or a match-all query). In these cases, the boost_mode used is "replace", where only the decay relevance weight is used as the score, which is similar to sorting documents by date filed.

  • I also refactored many methods to centralize the application of custom scores for this and previous usages within apply_custom_score_to_main_query. This allows us to easily add new function score methods in the future, such as a different relevance score for courts.

  • Additionally, I applied further refactors to avoid sending the function score for percolator queries, where the function score leads to unexpected behavior.

  • I also tweaked count queries for main documents to avoid sending the function score, which is unnecessary for counts; this also improves performance.

  • Added test classes to confirm that the decay relevance combined with BM25 scores behaves properly for all supported search types in the frontend and API v3 and v4.

  • To fine-tune the decay and scale parameters, I gathered data from Elasticsearch so we can decide which type of decay to apply based on each document's distribution over time.

In the following plots, you can see the document distribution over time and a proposal for the scale and decay parameters, with the curve shown in blue. In this approach the decay curve is adapted proportionally to the document distribution.

Dockets:
scale (years): 20
decay: 0.2
[chart: dockets document distribution and decay curve]

RECAP Documents:
scale (years): 15
decay: 0.15
[chart: RECAP documents document distribution and decay curve]

Case Law:
scale (years): 30
decay: 0.4
[chart: case law document distribution and decay curve]

Oral Arguments:
scale (years): 15
decay: 0.3
[chart: oral arguments document distribution and decay curve]

However, we can propose a different decay behavior if it makes more sense to have a faster or slower decay from a specific date for each type of document.

Let me know what you think.

@albertisfu marked this pull request as ready for review December 23, 2024 16:35
@albertisfu requested a review from mlissner December 23, 2024 16:37
@mlissner (Member) commented Dec 24, 2024

Alberto, this is all very impressive, thank you. Three quick thoughts that will help me understand things:

  1. Can you please share your work for developing the charts, so we have it next time?

  2. It looks like you set the scale and decay to more or less approximate the curve of the data distribution itself. I'm not sure I have an intuition for why that would be the right approach, but my intuition is probably quite poor for this. How did you arrive at this approach to the score values?

  3. Does the score from this wind up in the API results like the bm25 score does?

@albertisfu (Contributor, Author) commented:

Sure, here is the Jupyter notebook containing the code to generate the charts for each search type.

decay_relevance.ipynb.txt

And these are the ES queries used to retrieve the data:

RECAP Documents:

{
   "query":{
      "bool":{
         "filter":[
            {
               "match":{
                  "docket_child":"recap_document"
               }
            }
         ]
      }
   },
   "aggs":{
      "rds_coverage_over_time":{
         "date_histogram":{
            "field":"dateFiled",
            "calendar_interval":"year",
            "min_doc_count":0,
            "format":"yyyy"
         }
      }
   },
   "size":0
}

Dockets:

{
   "query":{
      "bool":{
         "filter":[
            {
               "match":{
                  "docket_child":"docket"
               }
            }
         ]
      }
   },
   "aggs":{
      "dockets_coverage_over_time":{
         "date_histogram":{
            "field":"dateFiled",
            "calendar_interval":"year",
            "min_doc_count":0,
            "format":"yyyy"
         }
      }
   },
   "size":0
}

Opinion Clusters:

{
   "query":{
      "bool":{
         "filter":[
            {
               "match":{
                  "cluster_child":"opinion_cluster"
               }
            }
         ]
      }
   },
   "aggs":{
      "opinions_coverage_over_time":{
         "date_histogram":{
            "field":"dateFiled",
            "calendar_interval":"year",
            "min_doc_count":0,
            "format":"yyyy"
         }
      }
   },
   "size":0
}

Oral Arguments:

{
   "query":{
      "match_all": {}
   },
   "aggs":{
      "oa_coverage_over_time":{
         "date_histogram":{
            "field":"dateArgued",
            "calendar_interval":"year",
            "min_doc_count":0,
            "format":"yyyy"
         }
      }
   },
   "size":0
}
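
As a rough sketch of how the aggregation output feeds the charts (the attached notebook contains the real code; the yearly bucket counts below are placeholders, not real aggregation results):

import math
import matplotlib.pyplot as plt

# Illustrative only: fake, growing yearly doc_count values standing in for the
# date_histogram buckets returned by the queries above.
years = list(range(1950, 2025))
doc_counts = [(y - 1950) ** 2 for y in years]

scale_years, decay = 20, 0.2                                       # proposed Dockets parameters
weights = [math.exp((math.log(decay) / scale_years) * (2024 - y)) for y in years]

fig, ax_counts = plt.subplots()
ax_counts.bar(years, doc_counts, color="lightgray")                # document distribution
ax_counts.set_xlabel("year filed")
ax_counts.set_ylabel("documents per year")

ax_decay = ax_counts.twinx()                                       # decay curve on a second axis
ax_decay.plot(years, weights, color="blue")
ax_decay.set_ylabel("decay weight")
ax_decay.set_ylim(0, 1.05)
plt.show()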

Regarding the proposed scale and decay values shown in the charts:

My reasoning was simple. The most recent documents experience little to no decay (~1), while the oldest documents are heavily penalized with a decay close to 0. From there, I tried to adjust the curve to fit our current document distribution. However, this was just an initial proposal to serve as a starting point for discussion about the best approach.

The currently proposed parameters have both advantages and disadvantages, depending on the relevance logic we aim to achieve.

For example, consider the RECAP Documents chart:

[chart: RECAP documents decay curve with the proposed parameters]

In this chart, documents from 1990 and earlier have a decay close to 0. This means that if they match a search query and are ranked high due to their BM25 score, they should appear first in the results if no decay relevance is applied. However, after applying date based decay, these documents will always appear last, regardless of how well their terms match the query.

This proposed approach represents an extreme case where older documents in the index are penalized with the highest decay possible.

If we want to be more flexible with older documents, we could apply a larger decay value and scale. For instance:

decay: 0.5
scale: 50 years

[chart: RECAP documents decay curve, decay 0.5, scale 50 years]

In this scenario, we observe a much slower decay. Even the oldest documents will never reach a decay close to 0, with the lowest decay for these documents being approximately 0.2.

Additionally, most documents in the index, particularly in the densest region (2000–2024), will have a decay ranging from ~0.7 to 1. However, this narrow decay range could sometimes result in insufficient prioritization of newer documents, especially if the original BM25 scores of the documents are not significantly different or if the documents are closely spaced in time.

Here is an example:

No decay (original BM25 scores):
1. Doc_1, filed 2015-08-16: score 47.537655
2. Doc_2, filed 2016-08-16: score 47.142

Decay 0.5, scale 10 years:
1. Doc_2, filed 2016-08-16: score 4.9690166
2. Doc_1, filed 2015-08-16: score 4.8421617

Decay 0.5, scale 50 years:
1. Doc_1, filed 2015-08-16: score 8.137241
2. Doc_2, filed 2016-08-16: score 7.8987794

In this example, we see how two documents with similar BM25 scores behave:

  • With no decay, Doc_1 is ranked first.
  • With a "fast" decay (scale 10 years), Doc_2 is ranked first.
  • With a "slow" decay (scale 50 years), the decay is insufficient to change the original order, and Doc_1 is ranked first even though Doc_2 is newer.

Thus, it might be better to avoid extreme settings and start with a "medium" speed decay.
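
Here is a small numeric sketch of that effect using hypothetical BM25 scores (not the ones above), showing how the scale decides whether a one-year age difference flips the order:

import math

def weight(age_years, decay, scale_years):
    return math.exp((math.log(decay) / scale_years) * age_years)

# Two hypothetical documents with close BM25 scores, filed about one year apart.
doc_old = {"bm25": 48.5, "age_years": 9.4}   # filed ~2015
doc_new = {"bm25": 47.1, "age_years": 8.4}   # filed ~2016

for scale in (10, 50):                       # "fast" vs "slow" decay, both with decay=0.5
    s_old = doc_old["bm25"] * weight(doc_old["age_years"], 0.5, scale)
    s_new = doc_new["bm25"] * weight(doc_new["age_years"], 0.5, scale)
    first = "newer doc" if s_new > s_old else "older doc"
    print(f"scale={scale}y: older={s_old:.2f} newer={s_new:.2f} -> {first} ranks first")

With these numbers, the 10-year scale reverses the order in favor of the newer document, while the 50-year scale leaves the older document on top.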

Does the score from this wind up in the API results like the bm25 score does?

Yes, because this approach works by multiplying or replacing (in the case of filter-only queries) the original BM25 score generated by Elasticsearch.

@mlissner (Member) commented:

Thanks. Very helpful. Two more thoughts:

  1. How do the decay scores interact in parent-child queries?

  2. What I meant about the API is whether the decay value for each result winds up in the score field of the JSON that we introduced in v4.1 of the API? I was thinking it should have keys for bm25, decay, and composite (combining the two). This could come in a separate PR next sprint, if helpful.

@albertisfu (Contributor, Author) commented:

Good questions!

How do the decay scores interact in parent-child queries?

In parent-child queries, the has_child query uses "score_mode": "max", meaning that the highest score from the matching child documents influences the parent document's score. Consequently, the main document score becomes a combination of the highest-scoring child document's score + the score from the matching parent document.

So, when sorting by relevance (score desc), the introduced function_score is applied to the main query (the has_child query + the parent document query). This means the main score (child + parent score) is then multiplied by the computed decay, ensuring that the decay interacts with both the child and parent scores.

There is one exception: the decay does not interact with the main score if the user does not provide a text query (e.g., in a match_all query or filter-only queries). In these cases, all documents return the same main score, which can be 0 or 1. Thus, it does not make sense to multiply the main score by the computed decay. Instead, in such cases, the decay value directly replaces the main document score.
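
Schematically, the structure described above looks roughly like this (a sketch with illustrative query terms and field names, not the actual query builder output):

# Rough shape of the main query when sorting by score; the query terms, the
# "plain_text" field, and the bool structure are illustrative assumptions.
main_query = {
    "query": {
        "function_score": {
            "query": {
                "bool": {
                    "should": [
                        {
                            "has_child": {
                                "type": "recap_document",
                                "score_mode": "max",                    # best child score bubbles up
                                "query": {"match": {"plain_text": "apple"}},
                            }
                        },
                        {"match": {"caseName": "apple"}},               # parent (docket) match
                    ]
                }
            },
            "functions": [{"script_score": {"script": "...decay script..."}}],
            "boost_mode": "multiply",   # (child + parent) score * decay weight
        }
    }
}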

What I meant about the API is whether the decay value for each result winds up in the score field of the JSON that we introduced in v4.1 of the API? I was thinking it should have keys for bm25, decay, and composite (combining the two). This could come in a separate PR next sprint, if helpful.

Got it. Currently, the value shown in the score field in API v4.1 is the composite score, which is the main score returned by Elasticsearch. By default, when using a function_score to manipulate the document score, the computed value is either multiplied by or replaces the main document score. As a result, it is not possible to directly inspect the "original" main score or the value returned by the function_score.

I did some tests and found a couple of alternatives that might allow us to break down the scores in the API:

  • Using "explain": true:
    By passing this query parameter, detailed information about score computation is included as part of each hit. For example:
{
        "_shard": "[recap_vectors][29]",
        "_node": "zdUGFtNYR3eIfioFwAMU1g",
        "_index": "recap_vectors",
        "_id": "68107056",
        "_score": 161.7806,
        "_source": {
          "docket_slug": "apple",
          "docket_absolute_url": "/docket/68107056/apple/",
          "court_exact": "deb",
          "party_id": [],
          "party": [],
          "attorney_id": [],
          "attorney": [],
          "firm_id": [],
          "firm": [],
          "docket_child": "docket",
          "timestamp": "2024-12-22T12:51:24.205171",
          "docket_id": 68107056,
          "caseName": "APPLE",
          "case_name_full": "",
          "docketNumber": "23-00236",
          "suitNature": "",
          "cause": "",
          "juryDemand": "",
          "jurisdictionType": "",
          "dateArgued": null,
          "dateFiled": "2023-12-21",
          "dateTerminated": null,
          "assignedTo": "Brendan L. Shannon",
          "assigned_to_id": 8738,
          "referredTo": null,
          "referred_to_id": null,
          "court": "United States Bankruptcy Court, D. Delaware",
          "court_id": "deb",
          "court_citation_string": "Bankr. D. Del.",
          "chapter": null,
          "trustee_str": null,
          "date_created": "2023-12-21T13:32:01.517762+00:00",
          "pacer_case_id": "191820"
        },
        "_explanation": {
          "value": 161.7806,
          "description": "function score, product of:",
          "details": [
            {
              "value": 175.60925,
              "description": "sum of:",
              "details": [
              # A bunch of details...
               ]
            },
            {
              "value": 0.92125326,
              "description": "min of:",
              "details": [
                {
                  "value": 0.92125326,
                  "description": "script score function, computed with script:\"Script{type=inline, lang='painless', idOrCode='\n                    def default_missing_date = Instant.parse(params.default_missing_date).toEpochMilli();\n                    def decay = (double)params.decay;\n                    def now = new Date().getTime();\n\n                    // Convert scale parameter into milliseconds.\n                    def scaleStr = params.scale;\n                    double years = (double)params.scale;\n                    // Convert years to milliseconds 1 year = 365 days\n                    long scaleMillis = (long)(years * 365 * 24 * 60 * 60 * 1000);\n\n                    // Retrieve the document date. If missing or null, use default_missing_date\n                    def docDate = default_missing_date;\n                    if (doc['dateFiled'].size() > 0) {\n                        docDate = doc['dateFiled'].value.toInstant().toEpochMilli();\n                    }\n                    // λ = ln(decay)/scale\n                    def lambda = Math.log(decay) / scaleMillis;\n                    // Absolute distance from now\n                    def diff = Math.abs(docDate - now);\n                    // Score: exp( λ * max(0, |docDate - now|) )\n                    return Math.exp(lambda * diff);\n                    ', options={}, params={scale=20, decay=0.2, default_missing_date=1600-01-01T00:00:00Z}}\"",
                  "details": [
                   
                                  ]
                                },
                                {
                                  "value": 0,
                                  "description": "match on required clause, product of:",
                                  "details": [
                                   # A bunch of details...
                                   
                                  ]
                                }
                              ]
                            }
                          ]
                        }
                      ]
                    }
                  ]
                },
                {
                  "value": 3.4028235e+38,
                  "description": "maxBoost",
                  "details": []
                }
              ]
            }
          ]
        },
        "inner_hits": {
          "filter_query_inner_recap_document": {
            "hits": {
              "total": {
                "value": 6,
                "relation": "eq"
              },
              "max_score": 87.80463,
              "hits": [
                {}
              ]

I removed a lot of details from this response for simplicity but the original response can be seen here:

response_explain_example.json

You can see the main score (composite) is 161.7806, which is the product of the original score 175.60925 × the decay value 0.92125326.
From here we can extract the required values as you mentioned: bm25, decay, and composite.

However, the explain parameter can introduce additional overhead to the query in both performance and bandwidth, since the response with the score details is more than double the size of the normal response: the example attached above is 329 KB, while the original response without explain is 122 KB.
Regarding performance, queries with and without explain don't show significant differences in took time. However, I think the real impact would be visible on overall cluster resources if this is enabled. Perhaps we could run a benchmark in the production cluster using Rally to assess the real impact of using the explain parameter.

  • The second alternative is using script_fields: by adding a script_fields entry with the same custom decay script, we can return a new field on each document containing the decay value. Then, using the main score, we can also compute the original score: original_score = main_score / decay (see the sketch after this list).

    The advantage of this approach is that it only adds a single field to the response, unlike the explain approach, which significantly increases the response size.
    The downside, however, is that the custom function would be executed twice for each document: once during the query phase (to compute the score for sorting results) and again during the fetch phase (to add the new field with the decay value). This could also introduce additional overhead, so benchmarking would be required to evaluate its impact.
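
A minimal sketch of what that script_fields clause could look like (assuming the same Painless decay script and parameters shown in the explain output above; not implemented in this PR):

# Sketch only: run the same decay script a second time via script_fields so the
# weight comes back as its own field on each hit.
decay_script = "...same custom decay script used in the function_score..."

body = {
    # the main function_score query would go here, unchanged
    "script_fields": {
        "decay_weight": {
            "script": {
                "lang": "painless",
                "source": decay_script,
                "params": {
                    "scale": 20,
                    "decay": 0.2,
                    "default_missing_date": "1600-01-01T00:00:00Z",
                },
            }
        }
    },
}

# Client side, per hit: bm25 = hit["_score"] / hit["fields"]["decay_weight"][0]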

Let me know what you think.

@mlissner (Member) commented:

OK, sounds like adding the score to the API isn't great, so let's not do that. The composite is fine for now. Maybe in the future, we can add &explain=True to the query if people want that.


For scoring, thanks for all the detailed information. Let's get a PR review done and I'll continue thinking about what the right values are. Feels like we've entered the artisanal stage of relevancy!

@mlissner (Member) commented:

Back on the topic of the values, a couple things come to mind:

  1. We don't want to set the scale so aggressively that the bm25 scores of old content are wiped out. If there's a result with a solid bm25 score from 20 years ago, it should show up somewhere in the top results (but below a result with a good score from yesterday).

  2. Some norms:

    • Case law loses value after about 50 years for most purposes.
    • RECAP really only goes back about 20 years. Folks aren't generally looking for anything older than that. That said, finding an older docket that you know is in PACER should work.
    • Oral arguments don't matter much relevancy-wise. Just not much stuff in there. Probably can just use the same decay numbers as case law even though the data is so different.

I think putting these together, we probably want a case law decay of about 50 years that flattens out at a score of 0.1 or 0.2?

I think we want RECAP to have something similar, but over about 20 years instead?

This is all very seat of the pants!

@albertisfu (Contributor, Author) commented Dec 30, 2024

We don't want to set the scale so aggressively that the bm25 scores of old content are wiped out.

I think putting these together, we probably want a case law decay of about 50 years that flattens out at a score of 0.1 or 0.2?
I think we want RECAP to have something similar, but over about 20 years instead?

Great! Thank you for these insights; they're helpful for determining how to set the scale and decay as you described.

To prevent scores for content older than 50 years (case law/OA) or 20 years (RECAP) from being wiped out entirely, I've introduced a tweak to the decay function. Instead of converging to 0, it now converges to min_score, which is configurable and set to 0.1. This ensures that even the oldest content retains a non-zero score.

decay_score = exp((ln(decay) / scale) ⋅ t)
adjusted_decay_score = min_score + (1 - min_score) ⋅ decay_score
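
A quick numeric check of the adjustment (Python sketch, illustrative only), using the proposed Case Law parameters (scale 50, decay 0.2, min_score 0.1):

import math

def adjusted_decay(age_years, *, decay=0.2, scale_years=50, min_score=0.1):
    decay_score = math.exp((math.log(decay) / scale_years) * age_years)  # exp((ln(decay)/scale) * t)
    return min_score + (1 - min_score) * decay_score                     # floor the curve at min_score

# Even very old opinions keep a weight near 0.1 instead of dropping to 0.
for age in (0, 25, 50, 100, 200):
    print(f"{age:>3} years old -> weight {adjusted_decay(age):.3f}")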

Below are the updated charts that illustrate how the decay function behaves after these adjustments:

Dockets:
scale (years): 20
decay: 0.2
min_score: 0.1
[chart: dockets decay curve with min_score]

RECAP Documents:
scale (years): 20
decay: 0.2
min_score: 0.1
[chart: RECAP documents decay curve with min_score]

Case Law:
scale (years): 50
decay: 0.2
min_score: 0.1
[chart: case law decay curve with min_score]

Oral Arguments:
scale (years): 50
decay: 0.2
min_score: 0.1
[chart: oral arguments decay curve with min_score]

decay_relevance_min_score.ipynb.txt

After these changes, it was necessary to update a few dates in the factories to align with the new scales and minimum score. So far, everything is working as expected.

@mlissner (Member) commented:

Looks great. Once reviewed, let's ship and see how it feels!

@ERosendo (Contributor) left a comment:

Thanks for the details, @albertisfu! The code looks good. Let's merge after addressing my comment.

@ERosendo assigned albertisfu and unassigned ERosendo Jan 6, 2025
@albertisfu (Contributor, Author) commented:

Thanks @ERosendo I've applied the suggested change.

@albertisfu assigned ERosendo and unassigned albertisfu Jan 6, 2025
@ERosendo merged commit b7dac26 into main Jan 6, 2025
15 checks passed
@ERosendo deleted the 558-implement-es-relevance-decay branch January 6, 2025 19:22