Add correct text ordering and column merging to CL #83

Open · wants to merge 4 commits into main
132 changes: 132 additions & 0 deletions courtlistener/process_cl.py
Collaborator:

This should be a modification of the csv_to_dolma.py instead of a new file

Contributor Author:

Yes, will merge.

Collaborator:

We should add a simple docstring to the module
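For illustration, a minimal sketch of such a module docstring (the wording is only a suggestion):

    """Convert the Court Listener opinion data dump to the dolma format."""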

@@ -0,0 +1,132 @@
import argparse
import csv
import logging
Collaborator:

Unneeded import

Contributor Author:

I do not think much of this is needed.

Contributor Author:

Removed import

import os
import re
import sys

import pandas as pd

from licensed_pile.licenses import PermissiveLicenses
from licensed_pile.logs import configure_logging
from licensed_pile.write import to_dolma

SOURCE_NAME = "CourtListenerOpinion"

csv.field_size_limit(sys.maxsize)
Collaborator:

Is this needed now that we aren't using the csv stdlib?

If it isn't, there are several unused imports to remove.

Contributor Author:

Same as above.

Contributor Author:

Deleted this and removed imports.


logger = configure_logging("court-listener-opinion")
Collaborator:

Prefer moving the configure_logging call under if __name__ == "__main__": and then have a logger = logs.get_logger(...) in the process_court_listener function.

Contributor Author:

Will do.

Contributor Author:

I think this is what you meant:

def process_court_listener(file_path):
    logger = logging.getLogger("court-listener-opinion")
    ...

if __name__ == "__main__":
    logger = configure_logging("court-listener-opinion")
    ...



def process_court_listener(file_path):
Collaborator:

Missing Docstring

Contributor Author (@conceptofmind, Jun 4, 2024):

I will add a docstring following something like:

Reads the decompressed CL data dump
Adds metadata and source information
Drops unnecessary columns based on Harvard CAP input
Combines columns in the correct order based on CL documentation
Drops repeated XML/HTML columns
Extracts HTML/XML following Harvard CAP
Merges extracted text and plain text properly
Handles null values based on text row
Writes out to dolma format

Contributor Author:

How about something like:

def process_court_listener(file_path):
    """
    This function performs several operations on a given file:

    1. Reads the decompressed Court Listener (CL) data dump from the provided file path.
    2. Adds metadata and source information to the data.
    3. Drops unnecessary columns based on Harvard CAP input.
    4. Selects the first non-null value when iterating over columns based on CL documentation.
    5. Drops repeated XML/HTML columns.
    6. Extracts HTML/XML following Harvard CAP guidelines.
    7. Selects the first non-null value when iterating over extracted text and plain text columns.
    8. Handles null values based on text row.
    9. Writes out the processed data to dolma format.

    Parameters:
    file_path (str): The path to the file to be processed.
    """

    df = pd.read_csv(file_path)

    # add metadata column
    df["metadata"] = str(PermissiveLicenses.PD)
Collaborator:

The metadata value should be a dict in the dolma format, not a string

Contributor Author:

Will create a metadata dict as noted above.

Contributor Author:

Added to metadata dict:

            "license": str(PermissiveLicenses.PD),


    # add source column
    df["source"] = SOURCE_NAME
Collaborator:

Let's add a source_name argument (which we can default to SOURCE_NAME) and set that instead of using a global.

Contributor Author:

Will do.

Contributor Author (@conceptofmind, Jun 4, 2024):

def process_court_listener(file_path):
    ...
    df["source"] = args.source_name

parser.add_argument(
    "--source_name",
    default=SOURCE_NAME,
    help="The name of the source.",
)


"""
Collaborator:

This comment doesn't add anything; the list of columns to remove is in the code below.

Contributor Author:

Removed comment

def process_court_listener(file_path):
    ...
    df["source"] = args.source_name

    Remove columns:
    date_modified
    author_str
Collaborator:

We should include the author_str as an "author" field in the "metadata" dict.

Contributor Author:

Will do.

Contributor Author:

Added this:

"author": x["author_str"],

    per_curiam
    joined_by_str
    type
    sha1
    page_count
    local_path
    extracted_by_ocr
    author_id
    cluster_id
    """
    df = df.drop(
        columns=[
Collaborator:

Not something to be addressed, but I hate that pandas would do something totally different if this was a tuple instead of a list (and it would be better as a tuple as it is read only lol)

Contributor Author:

LOL

"date_modified",
"author_str",
"per_curiam",
"joined_by_str",
"type",
"sha1",
"page_count",
"local_path",
"extracted_by_ocr",
"author_id",
"cluster_id",
]
)

"""
Collaborator:

Prefer multiple single line comments over multiline strings as comments (syntax highlighting doesn't handle the string as a comment)

https://peps.python.org/pep-0008/#block-comments

Contributor Author:

Will change.

Contributor Author:

    # For each row, select the first non-null value when iterating over columns in this order:
    # html_with_citations
    # html_columbia
    # html_lawbox
    # xml_harvard
    # html_anon_2020
    # html
    # plain_text
    # This follows the procedure in the Court Listener documentation.

    df["text"] = (
        df["html_with_citations"]
        .combine_first(df["html_columbia"])
        .combine_first(df["html_lawbox"])
        .combine_first(df["xml_harvard"])
        .combine_first(df["html_anon_2020"])
        .combine_first(df["html"])
    )

    Merge columns based on Court Listener documentation:
Collaborator:

We should re-write this comment. "Merge" isn't the name for this operation. Something like:

For each row, select the first non-null value when iterating over columns in this order: ...
This follows the procedure in the Court Listener documentation.

Contributor Author:

Coalesce is what is used in Spark.

Contributor Author:

I have no opinion on this. We can add that comment.

Contributor Author:

    # For each row, select the first non-null value when iterating over columns in this order:
    # html_with_citations
    # html_columbia
    # html_lawbox
    # xml_harvard
    # html_anon_2020
    # html
    # plain_text
    # This follows the procedure in the Court Listener documentation.

    df["text"] = (
        df["html_with_citations"]
        .combine_first(df["html_columbia"])
        .combine_first(df["html_lawbox"])
        .combine_first(df["xml_harvard"])
        .combine_first(df["html_anon_2020"])
        .combine_first(df["HTML"])
    )

    html_with_citations
    html_columbia
    html_lawbox
    xml_harvard
    html_anon_2020
    html
    plain_text
    """
df["text"] = (
df["html_with_citations"]
.combine_first(df["html_columbia"])
.combine_first(df["html_lawbox"])
.combine_first(df["xml_harvard"])
.combine_first(df["html_anon_2020"])
.combine_first(df["html"])
)

    # keep only the text columns and drop null values
    df = df.drop(
        columns=[
            "html",
            "html_anon_2020",
            "html_lawbox",
            "html_columbia",
            "xml_harvard",
            "html_with_citations",
        ]
    ).dropna(subset=["text"])
Collaborator:

Drop NA is a very different operation than dropping columns. We should separate them into two different lines so the dropna doesn't get lost in a refactor.

Contributor Author:

Will do and add comments.

Contributor Author:

How about something like this:

    # drop the columns used in combine_first to avoid redundancy
    df = df.drop(
        columns=[
            "html",
            "html_anon_2020",
            "html_lawbox",
            "html_columbia",
            "xml_harvard",
            "html_with_citations",
        ]
    )

    # drop null values in the text column
    df = df.dropna(subset=["text"])


    # extract text from html and xml following Harvard CAP
    # They used r"<.+?>", ""
    df["text"] = df["text"].apply(lambda x: re.sub(r"<.+?>", "", x))
Collaborator:

This can get vectorized via built-in pandas instead of using apply, i.e.:

df["text"] = df["text"].str.replace(r"<.+?>", "", regex=True)

Contributor Author:

This is dependent on the extractor we choose.

Collaborator:

Do we have any auditing tools for the output? It can be OK if you have a strict, well-defined subset of things you are looking for, but generally you shouldn't parse HTML/XML with a regex: https://stackoverflow.com/a/1732454
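For illustration, a sketch of tag stripping with a real HTML parser instead of a regex (assumes BeautifulSoup is available; a hypothetical alternative, not what the PR uses):

    # Hypothetical alternative: let an HTML parser strip the markup
    from bs4 import BeautifulSoup

    df["text"] = df["text"].apply(lambda x: BeautifulSoup(x, "html.parser").get_text())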

Contributor Author:

Trafilatura is the extractor that is commonly used.

I left what Harvard CAP did with regex.

Contributor Author:

This is what it would be with trafilatura:

    # extract text from html and xml using trafilatura
    df["text"] = df["text"].apply(lambda x: extract(x))


    # combine merge plain text and extracted text
Collaborator:

Again, merge is a bad name for this operation.

Contributor Author (@conceptofmind, Jun 4, 2024):

I can list it as coalesce instead.

Collaborator:

I'm not sure coalesce is a good name either; that also implies you are creating one value from multiple column values. I hate that SQL picked such a bad name lol

Contributor Author:

Yes LOL. It is a bad name. I will comment after thinking about it a bit.

Contributor Author (@conceptofmind, Jun 4, 2024):

Maybe something like this:

    # Combine DataFrame objects by filling null values 
    # with non-null values from the other selected DataFrames.

Adapted from the pandas docs.

Contributor Author:

    # Combine DataFrame objects by filling null values 
    # with non-null values from the other selected DataFrames.
    df["text"] = df["text"].combine_first(df["plain_text"])

df["text"] = df["text"].combine_first(df["plain_text"])
Collaborator:

This line doesn't do anything, there is already a .dropna(subset=["text"]) call so there won't be any line where the text column is None and the plain_text column is non-null.

Additionally, .combine_first doesn't consider an empty string as null so even if there was a text column that was totally removed based on the regex above, it wouldn't get replaced by the plain_text column.

The plain_text column should be part of the initial .combine_first chain or the .dropna above should be removed and rows with an empty string for text column should be converted to a None so the plain_text column will actually replace the text column
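For illustration, a minimal sketch of the second option (hypothetical, not the PR's code): null out empty strings before the fallback so plain_text can actually fire.

    # Sketch: treat strings emptied by extraction as missing
    df["text"] = df["text"].replace("", pd.NA)
    # with the empty strings nulled out, the plain_text fallback applies
    df["text"] = df["text"].combine_first(df["plain_text"])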

Contributor Author:

This is an artifact of using a different extractor.

Contributor Author:

Trafilatura will over-filter and cause null if not handled explicitly.

Contributor Author:

This is not the case with standard regex.

Collaborator:

If Trafilatura is so overzealous that it turns real data into null, that sounds like a red flag. Also, if the plain_text column is good enough for those cases, is there a reason it isn't good enough to always use it?

Contributor Author:

I agree that it is a red flag. I am writing my own explicit extractor for this dataset, but that will not be out until later.

Trafilatura ranks as the highest-quality extractor given a general task.

The extractor should be run on every column which is not plain text. Plain text is considered the lowest-quality data in terms of coalescing. It should only be used if the other columns do not exist and cannot be extracted.
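For illustration, a sketch of that priority order (assumes trafilatura's extract is in scope; a hypothetical outline, not the PR's code):

    # Hypothetical sketch: extract every markup column, then take the first
    # non-null value in documentation order, with plain_text as the last resort
    markup_columns = [
        "html_with_citations", "html_columbia", "html_lawbox",
        "xml_harvard", "html_anon_2020", "html",
    ]
    for column in markup_columns:
        df[column] = df[column].apply(lambda x: extract(x) if pd.notna(x) else x)
    df["text"] = df[markup_columns[0]]
    for column in markup_columns[1:]:
        df["text"] = df["text"].combine_first(df[column])
    df["text"] = df["text"].combine_first(df["plain_text"])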


    # drop plain text column and text null values
    df = df.drop(columns=["plain_text"]).dropna(subset=["text"])
Collaborator:

Again, these are super different operations so we should explicitly separate them.

Contributor Author:

Can split and add comments.

Contributor Author:

    # drop the plain text column
    df = df.drop(columns=["plain_text"])
    
    # drop null values in the text column
    df = df.dropna(subset=["text"])


    # return a dictionary for each row - dolma format
    return df.to_dict(orient="records")
Collaborator:

This doesn't result in the dolma formatted data. See https://github.com/allenai/dolma/blob/main/docs/data-format.md#dolma-document-format

Issues include:

  • It has a "download_url" key top-level. This should be removed or moved into metadata as "url"
  • It has "date_created" instead of "created"
  • It is missing an "added" field
  • It has the value of "metadata" as the license string. It should be a dict with a "license" key instead.
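Roughly, the target record shape per those docs (values elided; "url" and "author" are the metadata keys suggested in this review):

    {
        "id": ...,
        "text": ...,
        "source": ...,
        "added": ...,
        "created": ...,
        "metadata": {"license": ..., "url": ..., "author": ...},
    }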

Contributor Author:

I will create a metadata dict for the recommended changes.

Contributor Author:

Let me know if this is what you were looking for:

    df = pd.read_csv(file_path)

    df["source"] = args.source_name
    df["added"] = datetime.utcnow().isoformat()

    # rename the column date_created -> created
    df = df.rename(columns={"date_created": "created"})

    # add the metadata column including the author_str, license, and url
    df["metadata"] = df.apply(
        lambda x: {
            "license": str(PermissiveLicenses.PD),
            "url": x["download_url"],
            "author": x["author_str"],
        },
        axis=1,
    )



def main(args):
    example = process_court_listener(args.input_file)
    output_file_base_name = os.path.basename(args.input_file).replace(
Collaborator:

Now that we don't have multiple dated files, we should have the base filename as a CLI argument with a default like courtlistener.jsonl.gz instead of basing it on the input file.

Contributor Author:

Will do.

Contributor Author:

to_dolma(example, args.output_dir, args.output_file_base_name, args.shard_size)

...

parser.add_argument(
  "--output_file_base_name",
  default="courtlistener.jsonl.gz",
  help="The base name of the output file.",
)

".csv", ".jsonl.gz"
)
to_dolma(example, args.output_dir, output_file_base_name, args.shard_size)
logger.info(f"Saved {args.input_file} as dolma shared files at {args.output_dir}")


if __name__ == "__main__":
parser = argparse.ArgumentParser(description="Convert csv data to dolma.")
parser.add_argument(
"--output_dir",
default=f"data/courtlistener/v0",
Collaborator:

This should be "data/courtlistener/v0/documents"

Contributor Author:

Got it.

Contributor Author:

    parser.add_argument(
        "--output_dir",
        default="data/courtlistener/v0/documents",
        help="Where the dolma formatted data goes.",
    )

Collaborator:

This doesn't need to be an f-string

Contributor Author:

Got it.

Contributor Author:

    parser.add_argument(
        "--output_dir",
        default="data/courtlistener/v0/documents",
        help="Where the dolma formatted data goes.",
    )

help="Where the dolma formatted data goes.",
)
parser.add_argument(
"--shard_size",
default=1000,
help="The number of documents to store in each shard.",
)
parser.add_argument(
"--input_file",
default="./data/courtlistener/raw/opinions-2022-08-02.csv",
Collaborator:

We shouldn't have a default for this, especially as the date doesn't match the latest date. Let's use required=True and remove the default.

Contributor Author:

Will do.

Contributor Author:

parser.add_argument(
    "--input_file",
    required=True,
    help="The path to the csv file to convert.",
)

help="The path to the csv file to convert.",
)
args = parser.parse_args()
main(args)