Create an overview doc of Ex. 21 formats/layouts #3447

katie-lamb · 2024-03-06T22:28:13Z

As a first step in the extraction process, we need to scope the different formats that Ex. 21 documents come in, so we better understand which layouts to try extracting from first. Create a document with the following:

Tasks

Target variables and which ones we can access through each layout
Screenshots of each common layout
Rough distribution of documents to each layout
Experiment with off-the-shelf extraction tools
List modeling method pros and cons and create issues for model experiments to try
Create an issue for extracting company name and address information from 10-Ks (the main filing, not Ex. 21)
Figure out how filings differ between quarters. What's in the Q1 filing that isn't in the other quarters?
Once we've conducted this overview, we'll have a better idea of how to create a representative sample of documents to begin extracting from, or which layout to start with.

katie-lamb · 2024-03-08T00:23:03Z

Began creating a Google Doc to track this. So far the main issue is that sometimes Ex. 21 tables are inside an HTML table, while other times they're part of the HTML body. Even when using pandas read_html to read from an HTML table, serious reformatting is needed to get the schema right. This might require categorizing documents into the different formats/schemas.
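
A rough sketch of how that categorization could start, using only the Python standard library. It labels a document "html_table" when it contains at least a couple of table rows with text in them, and "html_body" otherwise, then tallies the labels to get a rough layout distribution. The function names, the two labels, the `min_rows` threshold, and the sample company names are all hypothetical, made up for illustration; real Ex. 21 filings will likely need more categories than this.

```python
from collections import Counter
from html.parser import HTMLParser


class _TableStats(HTMLParser):
    """Count table rows that contain at least one non-empty cell."""

    def __init__(self):
        super().__init__()
        self._in_cell = False
        self._row_has_text = False
        self.data_rows = 0

    def _flush_row(self):
        # Close out the current row, counting it only if a cell had text.
        if self._row_has_text:
            self.data_rows += 1
        self._row_has_text = False

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            # Filings often omit </tr>, so a new <tr> also ends the old row.
            self._flush_row()
        elif tag in ("td", "th"):
            self._in_cell = True

    def handle_endtag(self, tag):
        if tag in ("td", "th"):
            self._in_cell = False
        elif tag in ("tr", "table"):
            self._flush_row()

    def handle_data(self, data):
        if self._in_cell and data.strip():
            self._row_has_text = True


def classify_layout(html, min_rows=2):
    """Hypothetical heuristic: 'html_table' if enough populated rows exist."""
    parser = _TableStats()
    parser.feed(html)
    parser.close()
    return "html_table" if parser.data_rows >= min_rows else "html_body"


def layout_distribution(docs):
    """Tally how many documents fall under each (assumed) layout label."""
    return Counter(classify_layout(doc) for doc in docs)


# Made-up sample documents, not taken from real filings.
sample_table = (
    "<html><body><table>"
    "<tr><th>Name</th><th>State</th></tr>"
    "<tr><td>Acme Sub LLC</td><td>Delaware</td></tr>"
    "</table></body></html>"
)
sample_body = (
    "<html><body><p>Acme Sub LLC (Delaware)</p>"
    "<p>Acme Holdings Inc. (Ohio)</p></body></html>"
)

distribution = layout_distribution([sample_table, sample_body, sample_body])
```

Documents labeled "html_table" would be candidates for pandas read_html plus schema cleanup, while "html_body" documents would need a text-based approach. Tables used purely for page layout would be mislabeled by this heuristic, so the labels are a starting point for manual review rather than a final categorization.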

katie-lamb added this to Catalyst Megaproject Mar 6, 2024

katie-lamb converted this from a draft issue Mar 6, 2024

katie-lamb self-assigned this Mar 6, 2024

katie-lamb added the mozilla_sec_to_eia label (Mozilla AI for EJ grant to link SEC utility ownership data to EIA operational data) Mar 6, 2024

jdangerx moved this from In progress to Done in Catalyst Megaproject Mar 25, 2024

katie-lamb closed this as completed Apr 4, 2024
