Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Sp/researcher hub #15

Merged
merged 32 commits into from
May 15, 2024
Merged

Sp/researcher hub #15

merged 32 commits into from
May 15, 2024

Conversation

spodgorny9
Copy link
Collaborator

Add rhub.py and osti.py to web directory

elm/web/rhub.py Outdated

logger = logging.getLogger(__name__)
init_logger(__name__, log_level='DEBUG')
init_logger('elm', log_level='INFO')
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

you should never initialize loggers in module files. Our logging theory is this: every module file includes a logger object (line 13 here) and does a bunch of logging. Anyone building software on our repo or running scripts with our code can then selectively choose which messages to turn on. We should not turn on logging by default for others, that is annoying.

"""Class to handle publications portion of the NREL researcher hub."""
BASE_URL = "https://research-hub.nrel.gov/en/publications/?page=0"

def __init__(self, url, n_pages=1, txt_dir='./ew_txt'):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please include a docstring for dunder methods. It's not required by pylint because dunder methods are "hidden" but they're often the most important methods. For init, you don't need the one line description, just:

"""
Parameters
----------
url : str
    description
...
"""

elm/web/rhub.py Outdated

Parameters
----------
soup_inst : obj
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What kind of object is this? Run type(soup_inst) if you don't know

elm/web/rhub.py Outdated

Returns
-------
author names (str): all authors that contributed to publication
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Returns doc strings need to follow the same format, so this should be:

        Returns
        -------
        authors : list
            A list of all authors (strings) that contributed to publication

elm/web/rhub.py Outdated
"""
Description
----------
Downloads a pdf for a given link
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You don't need the Description header, just have the one-line description on the first line of the docstring immediately following the """

The Energy Wizard - Research Hub
********************************

This example demonstrates how to scrape publication abstracts and researcher profiles,
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Add a note that this is for the NREL researcher hub only (remember our repo is open source!)

elm/web/rhub.py Outdated
for p in range(0, n_pages):
url = url + f"?page={p}"
with urlopen(url) as page:
html = page.read().decode("utf-8")
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Okay here's the tricky test request. Can you add a basic test that mocks the URL page read response, gives back some basic html data, and then verifies that the ResearchOutputs class and ResearcherProfiles classes can retrieve some basic info? I can help you do this but here are the basics:

  1. make this page.read() call into a separate function that can be mocked
  2. Run it in real life and copy the output text then save to a .txt file
  3. You can then make a dummy function in your test file that returns this cached data and then do a "mock patch" which will swap the real function for your dummy function with cached data. See example here: https://github.com/NREL/elm/blob/main/tests/test_embed.py#L36
  4. Given the cached good html response, you can then run a test to make sure the classes you wrote can parse things out appropriately.
  5. Make sure you don't put any sensitive information into the cached html test data (phone numbers, emails, other stuff).

elm/web/rhub.py Outdated
Returns
-------
doi link (str)
pdf link (str): link to pdf if it exists
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There was a comment that the returns doc string should have a different format. That comment applied to all Returns docstrings. Please update everywhere to this format:

doi : str
    Description of doi 

elm/web/rhub.py Outdated

def scrape_abstract(self, out_dir, fn, soup_inst):
"""
Description
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There was a comment that you dont need a description header with ------- break. That comment applied everywhere. Please update all Description header things to just have the description sentence on the first line of the docstring.

elm/web/rhub.py Outdated
fn: str
File name for saving the file.
soup_inst : bs4.BeautifulSoup
Active beautiful soup instance used for scraping.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The correct term is "instantiated" not "active". Recommended fixing everywhere.

assert 'title' in meta.columns
assert 'url' in meta.columns
assert meta['title'].isna().sum() == 0
assert meta['url'].isna().sum() == 0
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

good!

@spodgorny9 spodgorny9 merged commit 05320bc into main May 15, 2024
12 checks passed
@spodgorny9 spodgorny9 deleted the sp/researcher_hub branch May 15, 2024 17:32
github-actions bot pushed a commit that referenced this pull request May 15, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants