ELM-based ordinance retrieval and extraction #13

Merged 170 commits on May 1, 2024
170 commits
8139ab8
Add "dev" option that includes pytest install
ppinchuk Jan 8, 2024
df11ebd
Add service queues + tests
ppinchuk Jan 8, 2024
016077c
Restructure
ppinchuk Jan 8, 2024
b8c128a
Add custom exceptions + tests
ppinchuk Jan 8, 2024
5f85781
Add coverage setup to tests
ppinchuk Jan 8, 2024
66c65da
Add base service class and provider + tests
ppinchuk Jan 8, 2024
0910e28
Add missing exception
ppinchuk Jan 8, 2024
57d3b45
rename file
ppinchuk Jan 9, 2024
56c948e
Add usage utilities + tests
ppinchuk Jan 9, 2024
ecd58cf
Add init file
ppinchuk Jan 9, 2024
bb2c346
Add rate limited service base class + tests
ppinchuk Jan 9, 2024
22495c7
Add more exceptions
ppinchuk Jan 9, 2024
ec4ae71
Add retry decorators
ppinchuk Jan 10, 2024
8ff88da
rename class
ppinchuk Jan 10, 2024
e1ea0b2
Rename utilities module
ppinchuk Jan 11, 2024
aefb9fe
Move usage functions to services
ppinchuk Jan 11, 2024
a456910
Fix import
ppinchuk Jan 11, 2024
45abe58
Add `UsageTracker` class
ppinchuk Jan 11, 2024
b8ac41a
Remove extra import
ppinchuk Jan 11, 2024
b5886b1
Add `llm_response_as_json` function
ppinchuk Jan 11, 2024
9649fb1
Fill in missing docstring
ppinchuk Jan 11, 2024
7d067d8
Minor formatting
ppinchuk Jan 11, 2024
34fbfba
Add basic OpenAI service utilities
ppinchuk Jan 11, 2024
7065c06
Update usage to allow `None` model responses
ppinchuk Jan 11, 2024
dca6b0b
Rename parameter
ppinchuk Jan 12, 2024
dd18f05
Add property for bool
ppinchuk Jan 12, 2024
c9febce
Use fixture for tests
ppinchuk Jan 12, 2024
6b54855
New base interface
ppinchuk Jan 12, 2024
3248fa9
Test for error
ppinchuk Jan 12, 2024
b541bb0
Add `sample_openai_response`
ppinchuk Jan 13, 2024
5d6b38f
Add `OpenAIService` + tests
ppinchuk Jan 13, 2024
8bcafca
Add integrated test for calling openai
ppinchuk Jan 13, 2024
128e431
Add test for provider to submit jobs while it can
ppinchuk Jan 18, 2024
202d653
Fix tests
ppinchuk Jan 18, 2024
b39eddd
Typo fix
ppinchuk Jan 18, 2024
d050406
Move `service_base_class` to conftest
ppinchuk Jan 18, 2024
a9ec87b
Add missing import
ppinchuk Jan 18, 2024
536f479
Add queued logging functionality
ppinchuk Jan 18, 2024
e5c2a34
use time.monotonic()
ppinchuk Jan 18, 2024
3b90640
Fix queue handler cleanup
ppinchuk Jan 19, 2024
d7642ea
Fix test with monotonic time
ppinchuk Jan 21, 2024
a26c1b5
Add beginning of web module
ppinchuk Jan 21, 2024
688085d
Add google search functionality
ppinchuk Jan 21, 2024
d13adca
Moved retry to top-level utilities
ppinchuk Jan 31, 2024
bd0ecc3
Start parsing utilities
ppinchuk Jan 31, 2024
ebb5e13
Add `remove_blank_pages` function
ppinchuk Jan 31, 2024
e4a7dc1
Add `format_html_tables` function
ppinchuk Jan 31, 2024
ae33fbc
Update reqs to match repo status
ppinchuk Jan 31, 2024
7f5dc64
Moved `clean_headers` to utilities
ppinchuk Jan 31, 2024
78c3a7e
Add more utility functions
ppinchuk Jan 31, 2024
6c43bc8
Added `remove_empty_lines_or_page_footers`
ppinchuk Jan 31, 2024
f47ad02
Rename function
ppinchuk Jan 31, 2024
fe3e693
Updated docs
ppinchuk Jan 31, 2024
84418b5
Add `html_to_text` parsing function
ppinchuk Feb 2, 2024
c9056e6
Add document class
ppinchuk Feb 2, 2024
a4ec891
source -> metadata
ppinchuk Feb 4, 2024
4233403
Add `read_pdf` function
ppinchuk Feb 4, 2024
0783b71
Add test for bad file
ppinchuk Feb 4, 2024
daf1f16
Add class properties to documents
ppinchuk Feb 5, 2024
976a425
Add read HTML with PW utility
ppinchuk Feb 5, 2024
c7f8a36
Add file loader class
ppinchuk Feb 5, 2024
86c2393
Bring reqs up-to-date
ppinchuk Feb 5, 2024
f751199
Cached fn now stored in doc metadata
ppinchuk Feb 5, 2024
89f5b01
add `compute_fn_from_url` function
ppinchuk Feb 5, 2024
f1585a9
Add `write_url_doc_to_file`
ppinchuk Feb 5, 2024
7811aec
Add max concurrent jobs per service
ppinchuk Feb 5, 2024
9b97f6f
Allow services to acquire and release resources
ppinchuk Feb 5, 2024
f72aa8f
Add `TempFileCache`
ppinchuk Feb 5, 2024
ca7f9a4
Added OCR methods
ppinchuk Feb 5, 2024
c95fd55
Add encoding to read call
ppinchuk Feb 5, 2024
7275c62
Add option for text splitter
ppinchuk Feb 12, 2024
0a66e94
Add county validation logic
ppinchuk Feb 14, 2024
0b791cf
Add `StructuredLLMCaller` and `ValidationWithMemory`
ppinchuk Feb 14, 2024
b4f8b81
Fix logging calls
ppinchuk Feb 14, 2024
ae8b07f
Add `possibly_mentions_wind` heuristic check
ppinchuk Feb 14, 2024
3069c65
Restructured
ppinchuk Feb 14, 2024
1443060
Add `merge_overlapping_texts` function
ppinchuk Feb 14, 2024
258b699
Add extraction logic
ppinchuk Feb 27, 2024
86a9b69
Add ordinance download function
ppinchuk Feb 28, 2024
cc9a41b
Typo fix
ppinchuk Feb 28, 2024
a482574
Refactor some download functions
ppinchuk Feb 28, 2024
00735ad
Add `replace_excessive_newlines` function
ppinchuk Feb 28, 2024
6842f9f
Moved `StructuredLLMCaller` class
ppinchuk Feb 28, 2024
eebbbe0
`usage_sub_label` now passed during function call
ppinchuk Feb 28, 2024
4ce1d66
Rename class/function
ppinchuk Feb 28, 2024
6d07c05
Add `OrdinanceExtractor`
ppinchuk Feb 28, 2024
bd740f7
Separate LLM call from tree parsing
ppinchuk Feb 28, 2024
103c7ec
Update logic for `_close_autofill_suggestions`
ppinchuk Mar 1, 2024
404c24f
Add kwargs to pass through to init Playwright Google Search
ppinchuk Mar 1, 2024
cd54db1
Use `kwargs` instead of LLMCaller instances
ppinchuk Mar 1, 2024
3a3a1f4
Add `AsyncDecisionTree` + graph setup functions
ppinchuk Mar 2, 2024
5b8248e
Minor refactor of graph setup functions
ppinchuk Mar 2, 2024
012ad9e
Add `StructuredOrdinanceParser`
ppinchuk Mar 2, 2024
81f46ee
Add `extract_ordinance_values` and refactor other functions to use **…
ppinchuk Mar 2, 2024
cb22559
Add ngram check and retry for hallucinations
ppinchuk Mar 4, 2024
95580da
Add county info data
ppinchuk Mar 8, 2024
b9321de
Add `load_counties_from_fp` function
ppinchuk Mar 8, 2024
834bc96
Renamed module to threaded services
ppinchuk Mar 8, 2024
ee23f32
`add_to` method now includes totals
ppinchuk Mar 8, 2024
c448310
Extend functionality and add tests for `County` class
ppinchuk Mar 8, 2024
35c5744
Fix import
ppinchuk Mar 8, 2024
8db3235
Fix logger message
ppinchuk Mar 21, 2024
ae6c142
Add threaded file writing and moving
ppinchuk Mar 21, 2024
8f80778
`County` can now hold a FIPS code
ppinchuk Mar 21, 2024
33aa5ed
Totals now won't crash if dict has entries that are not dicts
ppinchuk Mar 21, 2024
6855dac
Add CPU-bound processors to run in process pool
ppinchuk Mar 21, 2024
e6a14ae
Formatting update
ppinchuk Mar 21, 2024
cb284d4
Add option for `file_loader_kwargs`
ppinchuk Mar 21, 2024
b6d0ad8
Typo fix
ppinchuk Mar 21, 2024
a2a7d81
Add full processing logic
ppinchuk Mar 21, 2024
8ae1278
Add cli
ppinchuk Mar 21, 2024
a2c45da
Add ords cli
ppinchuk Mar 22, 2024
68596a8
Add missing nltk dependency for ngram check
ppinchuk Mar 22, 2024
812f51e
Add elm ords install instructions
ppinchuk Mar 22, 2024
86f1cf8
Bug fix
ppinchuk Mar 22, 2024
2de5844
Add guarded import
ppinchuk Mar 22, 2024
e3f6e7b
Bump min Python version to 3.9
ppinchuk Mar 22, 2024
02e76f7
Add ords tests with their own installation
ppinchuk Mar 22, 2024
4bb2b79
Prompt and response messages are now logged together
ppinchuk Mar 22, 2024
a9eaa45
Base graph no longer expected to return JSON
ppinchuk Mar 22, 2024
b7c4f5e
Total time now logged to terminal
ppinchuk Mar 22, 2024
726730f
Move prompt to global var
ppinchuk Mar 22, 2024
6b9e119
fix tree execution bug
ppinchuk Mar 22, 2024
5fbd0d4
Format exception message
ppinchuk Mar 22, 2024
416f810
Fix participating vs non-participating bug that was in validation
ppinchuk Mar 22, 2024
c26b8ff
Merge remote-tracking branch 'origin/main' into pp/ords
ppinchuk Mar 22, 2024
e047376
Add back missing argument
ppinchuk Mar 25, 2024
3505600
Linter fix
ppinchuk Mar 25, 2024
ce99e93
Linter fix
ppinchuk Mar 25, 2024
aaabdd3
Linter fix
ppinchuk Mar 25, 2024
070d706
Sphinx doc updates
ppinchuk Mar 26, 2024
121d621
Bump version
ppinchuk Mar 26, 2024
b0ed6e8
Add verbose option to `read_pdf` functions
ppinchuk Mar 26, 2024
b351919
Add `NoLocationFilter`
ppinchuk Mar 26, 2024
fbf6bce
pytesseract command now propagated to sub processes
ppinchuk Mar 26, 2024
06cf166
Docstring cleanup and better logging
ppinchuk Mar 26, 2024
3a08d7f
Remove excessive logging
ppinchuk Mar 26, 2024
b482493
Add `empty` prop to check if doc has pages
ppinchuk Mar 26, 2024
53b92b2
Failures during load now return empty doc
ppinchuk Mar 26, 2024
84a97f3
Download function now discards empty docs
ppinchuk Mar 26, 2024
f932a7f
Loading HTML with PW can now be limited using semaphores
ppinchuk Mar 26, 2024
68bb741
Updated creation of null semaphore
ppinchuk Mar 26, 2024
e64b482
Download can now limit PW browser instances via `browser_semaphore` i…
ppinchuk Mar 26, 2024
09e55f9
Change logging verbosity for PDF loaders
ppinchuk Mar 26, 2024
a724ea9
Handle `IndexError` when parsing headers
ppinchuk Mar 26, 2024
9d2aed2
Add logging statement
ppinchuk Mar 26, 2024
f2ed051
Add minor logging
ppinchuk Mar 26, 2024
ff030d9
Add control for max number of browser instances
ppinchuk Mar 26, 2024
13dfbfb
Double timeout if detected in kwargs
ppinchuk Mar 27, 2024
1ddc0d0
Add `OrdDBFileWriter`
ppinchuk Mar 27, 2024
58ef1e0
No processing if expected text is empty
ppinchuk Mar 27, 2024
56479b2
More robust ngram validation
ppinchuk Mar 27, 2024
bf05e84
Protect against divide by zero
ppinchuk Mar 27, 2024
83691c0
Heuristic check now uses ngrams
ppinchuk Mar 27, 2024
27fb4b2
Write intermediate county dbs to file
ppinchuk Mar 27, 2024
ba31b8f
Add main logging file for uncaught exceptions during processing
ppinchuk Mar 27, 2024
11362ed
Fix typo bug
ppinchuk Mar 27, 2024
f3fe5b7
Add better check for ordinance values
ppinchuk Mar 27, 2024
5c61553
Revised prompts
ppinchuk Mar 27, 2024
8d4390b
Add to ords example
ppinchuk Mar 27, 2024
5be301b
Ords tests use conda
ppinchuk Mar 27, 2024
6ae9bb8
Fix test for unix
ppinchuk Mar 27, 2024
6842f6b
Update README.rst based on PR review
ppinchuk Apr 29, 2024
dc00eca
Update README.md based on PR review
ppinchuk Apr 29, 2024
cbfdc77
Update README.md based on PR review
ppinchuk Apr 29, 2024
ec22094
Fix mac tests
ppinchuk Apr 29, 2024
99deafe
Log note about `AttributeError`
ppinchuk Apr 29, 2024
9554168
Log message about number of ords found
ppinchuk Apr 29, 2024
4194cf6
Ordinance documents now stored in separate sub-directory
ppinchuk Apr 29, 2024
8c32394
Add extension guidance
ppinchuk Apr 29, 2024
29 changes: 29 additions & 0 deletions .coveragerc
@@ -0,0 +1,29 @@
[run]
branch = True

[report]
# Regexes for lines to exclude from consideration
exclude_lines =
    # Have to re-enable the standard pragma
    pragma: no cover

    # Don't complain about missing debug-only code:
    def __repr__
    if self\.debug

    # Don't complain if tests don't hit defensive assertion code:
    raise AssertionError
    raise NotImplementedError

    # Don't complain if non-runnable code isn't run:
    if __name__ == .__main__.:

    # Don't complain about abstract methods, they aren't run:
    @(abc\.)?abstractmethod


omit =
    # omit test files
    tests/*
    # omit setup file
    setup.py
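The exclusion entries above are regular expressions, which is why `if __name__ == .__main__.:` uses `.` in place of the quote characters and why `@(abc\.)?abstractmethod` covers both decorator spellings. The following is a minimal sketch of that matching behaviour, assuming each pattern is searched for anywhere within a source line. The helper `is_excluded` is a hypothetical name used only for illustration; it is not part of coverage.py or of this PR.

```python
import re

# Patterns copied from the exclude_lines entries in the .coveragerc above.
EXCLUDE_PATTERNS = [
    r"pragma: no cover",
    r"def __repr__",
    r"if self\.debug",
    r"raise AssertionError",
    r"raise NotImplementedError",
    r"if __name__ == .__main__.:",
    r"@(abc\.)?abstractmethod",
]


def is_excluded(line):
    """Return True if any exclusion pattern is found anywhere in the line.

    Hypothetical helper: it only illustrates the regex semantics and does
    not call into coverage.py.
    """
    return any(re.search(pattern, line) for pattern in EXCLUDE_PATTERNS)


if __name__ == "__main__":
    samples = [
        'if __name__ == "__main__":',
        "if __name__ == '__main__':",
        "    @abc.abstractmethod",
        "    @abstractmethod",
        "    return x + 1",
    ]
    for sample in samples:
        print(repr(sample), is_excluded(sample))
```

Under this assumption, both quote styles of the `__main__` guard and both forms of the abstract-method decorator are excluded, while an ordinary statement such as `return x + 1` is still counted.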
18 changes: 4 additions & 14 deletions .github/workflows/pytest.yml
@@ -9,12 +9,12 @@ jobs:
       fail-fast: false
       matrix:
         os: [ubuntu-latest, macos-latest, windows-latest]
-        python-version: ['3.10']
+        python-version: [3.11]
         include:
           - os: ubuntu-latest
-            python-version: 3.9
+            python-version: '3.10'
           - os: ubuntu-latest
-            python-version: 3.8
+            python-version: 3.9

     steps:
     - uses: actions/checkout@v2
@@ -34,14 +34,4 @@ jobs:
         python -m pip install .
     - name: Run pytest and Generate coverage report
       run: |
-        python -m pytest -v --disable-warnings --cov=./ --cov-report=xml:coverage.xml
-    - name: Upload coverage to Codecov
-      uses: codecov/codecov-action@v1
-      with:
-        token: ${{ secrets.CODECOV_TOKEN }}
-        file: ./coverage.xml
-        flags: unittests
-        env_vars: OS,PYTHON
-        name: codecov-umbrella
-        fail_ci_if_error: false
-        verbose: true
+        python -m pytest --ignore=tests/ords --ignore=tests/utilities --ignore=tests/web -v --disable-warnings
49 changes: 49 additions & 0 deletions .github/workflows/pytest_ords.yml
@@ -0,0 +1,49 @@
name: pytests-ords

on: pull_request

jobs:
  build:
    runs-on: ${{ matrix.os }}
    strategy:
      fail-fast: false
      matrix:
        os: [ubuntu-latest, macos-latest, windows-latest]
        python-version: [3.11]

    steps:
    - uses: actions/checkout@v2
      with:
        ref: ${{ github.event.pull_request.head.ref }}
        fetch-depth: 1
    - name: Set up Python ${{ matrix.python-version }}
      uses: conda-incubator/setup-miniconda@v2
      with:
        auto-update-conda: true
        python-version: ${{ matrix.python-version }}
        miniconda-version: "latest"
    - name: Install dependencies
      shell: bash -l {0}
      run: |
        conda install -c conda-forge poppler
        python -m pip install --upgrade pip
        python -m pip install pdftotext
        python -m pip install pytest
        python -m pip install pytest-mock
        python -m pip install pytest-cov
        python -m pip install .
        playwright install
    - name: Run pytest and Generate coverage report
      shell: bash -l {0}
      run: |
        python -m pytest -v --disable-warnings --cov=./ --cov-report=xml:coverage.xml
    - name: Upload coverage to Codecov
      uses: codecov/codecov-action@v1
      with:
        token: ${{ secrets.CODECOV_TOKEN }}
        file: ./coverage.xml
        flags: unittests
        env_vars: OS,PYTHON
        name: codecov-umbrella
        fail_ci_if_error: false
        verbose: true
3 changes: 3 additions & 0 deletions README.rst
@@ -33,6 +33,9 @@ Installing ELM

 .. inclusion-install

+NOTE: If you are installing ELM to run ordinance scraping and extraction,
+see the `ordinance-specific installation instructions <https://github.com/NREL/elm/blob/main/elm/ords/README.md>`_.
+
 Option #1 (basic usage):

 #. ``pip install NREL-elm``
8 changes: 8 additions & 0 deletions docs/source/_cli/cli.rst
@@ -0,0 +1,8 @@
.. _cli-docs:

Command Line Interfaces (CLIs)
==============================

.. toctree::

   elm
3 changes: 3 additions & 0 deletions docs/source/_cli/elm.rst
@@ -0,0 +1,3 @@
.. click:: elm.cli:main
   :prog: elm
   :nested: full
2 changes: 2 additions & 0 deletions docs/source/examples.ordinance_gpt.rst
@@ -0,0 +1,2 @@
.. include:: ../../examples/ordinance_gpt/README.rst
   :start-line: 0
1 change: 1 addition & 0 deletions docs/source/examples.rst
@@ -3,3 +3,4 @@ Examples
 .. toctree::

    examples.energy_wizard.rst
+   examples.ordinance_gpt.rst
1 change: 1 addition & 0 deletions docs/source/index.rst
@@ -5,5 +5,6 @@
    Installation <installation.rst>
    Examples <examples.rst>
    API reference <_autosummary/elm>
+   CLI reference <_cli/cli>

 .. include:: ../../README.rst
47 changes: 47 additions & 0 deletions elm/cli.py
@@ -0,0 +1,47 @@
# -*- coding: utf-8 -*-
# fmt: off
"""ELM Ordinances CLI."""
import sys
import json
import click
import asyncio
import logging

from elm.version import __version__
from elm.ords.process import process_counties_with_openai


@click.group()
@click.version_option(version=__version__)
@click.pass_context
def main(ctx):
    """ELM ordinances command line interface."""
    ctx.ensure_object(dict)


@main.command()
@click.option("--config", "-c", required=True, type=click.Path(exists=True),
              help="Path to ordinance configuration JSON file. This file "
                   "should contain any/all the arguments to pass to "
                   ":func:`elm.ords.process.process_counties_with_openai`.")
@click.option("-v", "--verbose", is_flag=True,
              help="Flag to show logging on the terminal. Default is not "
                   "to show any logs on the terminal.")
def ords(config, verbose):
    """Download and extract ordinances for a list of counties."""
    with open(config, "r") as fh:
        config = json.load(fh)

    if verbose:
        logger = logging.getLogger("elm")
        logger.addHandler(logging.StreamHandler(stream=sys.stdout))
        logger.setLevel(config.get("log_level", "INFO"))

    # asyncio.run(...) doesn't throw exceptions correctly for some reason...
    loop = asyncio.get_event_loop()
    loop.run_until_complete(process_counties_with_openai(**config))


if __name__ == "__main__":
    # pylint: disable=no-value-for-parameter
    main(obj={})
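The `ords` command above reads a JSON file and forwards every key as a keyword argument to an async entry point. The following is a minimal, stdlib-only sketch of that pattern. The coroutine `process_counties`, the helper `run_from_config`, and the `out_dir` key are hypothetical stand-ins chosen for illustration; the real parameters of `process_counties_with_openai` are not shown in this diff. The sketch also uses `asyncio.run` for brevity, whereas the PR deliberately drives the event loop with `run_until_complete`.

```python
import asyncio
import json
import tempfile
from pathlib import Path


async def process_counties(out_dir, log_level="INFO"):
    """Hypothetical stand-in for the real processing coroutine."""
    await asyncio.sleep(0)  # yield control to the event loop once
    return {"out_dir": out_dir, "log_level": log_level}


def run_from_config(config_fp):
    """Load a JSON config and pass every key as a keyword argument."""
    with open(config_fp, "r", encoding="utf-8") as fh:
        config = json.load(fh)
    # Every top-level key must be a parameter name accepted by the
    # coroutine; an unknown key raises TypeError at call time.
    return asyncio.run(process_counties(**config))


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        config_fp = Path(tmp) / "config.json"
        config_fp.write_text(json.dumps({"out_dir": "./ords_out"}),
                             encoding="utf-8")
        print(run_from_config(config_fp))
```

One consequence of this design is that the config file and the function signature have to stay in sync: a misspelled key fails immediately with a `TypeError` rather than being silently ignored.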
10 changes: 10 additions & 0 deletions elm/exceptions.py
@@ -0,0 +1,10 @@
# -*- coding: utf-8 -*-
"""Custom Exceptions and Errors for ELM. """


class ELMError(Exception):
    """Generic ELM Error."""


class ELMRuntimeError(ELMError, RuntimeError):
    """ELM RuntimeError."""
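Because `ELMRuntimeError` inherits from both `ELMError` and the built-in `RuntimeError`, callers can catch it either as a library-specific error or as a generic runtime error. The sketch below repeats the two class definitions from the diff so that it runs on its own; the `classify` helper is a hypothetical name added only to make the behaviour visible.

```python
class ELMError(Exception):
    """Generic ELM Error."""


class ELMRuntimeError(ELMError, RuntimeError):
    """ELM RuntimeError."""


def classify(exc):
    """Report which handler types would catch the given exception.

    Hypothetical helper used only for this illustration.
    """
    return {
        "ELMError": isinstance(exc, ELMError),
        "RuntimeError": isinstance(exc, RuntimeError),
    }


if __name__ == "__main__":
    try:
        raise ELMRuntimeError("something went wrong")
    except RuntimeError as err:
        # Code that only knows about built-in exceptions still catches it.
        print(type(err).__name__, classify(err))
```

Existing code written against `RuntimeError` keeps working, while ELM-aware code can use `except ELMError` to handle every error the package raises.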
27 changes: 27 additions & 0 deletions elm/ords/README.md
@@ -0,0 +1,27 @@
# Welcome to Energy Language Model - OrdinanceGPT

The ordinance web scraping and data extraction portion of this codebase requires a few extra dependencies that do not come out-of-the-box with the base ELM software.
To set up ELM for ordinances, first create a conda environment. Then, _before installing ELM_, run the poppler installation:

    $ conda install -c conda-forge poppler

Then, install `pdftotext`:

    $ pip install pdftotext

(OPTIONAL) If you want access to Optical Character Recognition (OCR) for PDF parsing, also install `pytesseract` and `pdf2image` during this step:

    $ pip install pytesseract pdf2image

At this point, you can install ELM per the [front-page README](https://github.com/NREL/elm/blob/main/README.rst) instructions, e.g.:

    $ pip install -e .

After ELM installs successfully, you must download the browser binaries for Playwright, which is used for web scraping.
To do so, simply run:

    $ playwright install

Now you are ready to run ordinance retrieval and extraction. See the [example](https://github.com/NREL/elm/blob/main/examples/ordinance_gpt/README.rst) to get started. If you get additional import errors, install the missing packages as necessary, e.g.:

    $ pip install beautifulsoup4 html5lib
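Rather than waiting for an import error at run time, the extra dependencies named above can be checked up front. The following is a minimal sketch, assuming the usual import names for those packages (`bs4` for `beautifulsoup4`). The `missing_modules` helper and the split into core and optional groups are illustrative assumptions and are not part of ELM.

```python
from importlib.util import find_spec

# Import names for the packages mentioned in the instructions above.
# The core/optional grouping is an assumption made for this sketch.
CORE_MODULES = ["pdftotext", "playwright"]
OPTIONAL_MODULES = ["pytesseract", "pdf2image", "bs4", "html5lib"]


def missing_modules(names):
    """Return the subset of top-level module names that cannot be found."""
    return [name for name in names if find_spec(name) is None]


if __name__ == "__main__":
    print("Missing core modules:", missing_modules(CORE_MODULES))
    print("Missing optional modules:", missing_modules(OPTIONAL_MODULES))
```

Note that this only confirms the Python packages are importable; it does not verify that poppler or the Playwright browser binaries are installed.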
1 change: 1 addition & 0 deletions elm/ords/__init__.py
@@ -0,0 +1 @@
"""ELM ordinance document download and structured data extraction. """