-
Notifications
You must be signed in to change notification settings - Fork 8
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Sp/researcher hub #15
Conversation
elm/web/rhub.py
Outdated
|
||
logger = logging.getLogger(__name__) | ||
init_logger(__name__, log_level='DEBUG') | ||
init_logger('elm', log_level='INFO') |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
you should never initialize loggers in module files. Our logging theory is this: every module file includes a logger object (line 13 here) and does a bunch of logging. Anyone building software on our repo or running scripts with our code can then selectively choose which messages to turn on. We should not turn on logging by default for others, that is annoying.
"""Class to handle publications portion of the NREL researcher hub.""" | ||
BASE_URL = "https://research-hub.nrel.gov/en/publications/?page=0" | ||
|
||
def __init__(self, url, n_pages=1, txt_dir='./ew_txt'): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please include a docstring for dunder methods. It's not required by pylint because dunder methods are "hidden" but they're often the most important methods. For init, you don't need the one line description, just:
"""
Parameters
----------
url : str
description
...
"""
elm/web/rhub.py
Outdated
|
||
Parameters | ||
---------- | ||
soup_inst : obj |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What kind of object is this? Run type(soup_inst)
if you don't know
elm/web/rhub.py
Outdated
|
||
Returns | ||
------- | ||
author names (str): all authors that contributed to publication |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Returns doc strings need to follow the same format, so this should be:
Returns
-------
authors : list
A list of all authors (strings) that contributed to publication
elm/web/rhub.py
Outdated
""" | ||
Description | ||
---------- | ||
Downloads a pdf for a given link |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You don't need the Description header, just have the one-line description on the first line of the docstring immediately following the """
The Energy Wizard - Research Hub | ||
******************************** | ||
|
||
This example demonstrates how to scrape publication abstracts and researcher profiles, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Add a note that this is for the NREL researcher hub only (remember our repo is open source!)
elm/web/rhub.py
Outdated
for p in range(0, n_pages): | ||
url = url + f"?page={p}" | ||
with urlopen(url) as page: | ||
html = page.read().decode("utf-8") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Okay here's the tricky test request. Can you add a basic test that mocks the URL page read response, gives back some basic html data, and then verifies that the ResearchOutputs class and ResearcherProfiles classes can retrieve some basic info? I can help you do this but here are the basics:
- make this page.read() call into a separate function that can be mocked
- Run it in real life and copy the output text then save to a .txt file
- You can then make a dummy function in your test file that returns this cached data and then do a "mock patch" which will swap the real function for your dummy function with cached data. See example here: https://github.com/NREL/elm/blob/main/tests/test_embed.py#L36
- Given the cached good html response, you can then run a test to make sure the classes you wrote can parse things out appropriately.
- Make sure you don't put any sensitive information into the cached html test data (phone numbers, emails, other stuff).
elm/web/rhub.py
Outdated
Returns | ||
------- | ||
doi link (str) | ||
pdf link (str): link to pdf if it exists |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There was a comment that the returns doc string should have a different format. That comment applied to all Returns docstrings. Please update everywhere to this format:
doi : str
Description of doi
elm/web/rhub.py
Outdated
|
||
def scrape_abstract(self, out_dir, fn, soup_inst): | ||
""" | ||
Description |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There was a comment that you dont need a description header with ------- break. That comment applied everywhere. Please update all Description header things to just have the description sentence on the first line of the docstring.
elm/web/rhub.py
Outdated
fn: str | ||
File name for saving the file. | ||
soup_inst : bs4.BeautifulSoup | ||
Active beautiful soup instance used for scraping. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The correct term is "instantiated" not "active". Recommended fixing everywhere.
assert 'title' in meta.columns | ||
assert 'url' in meta.columns | ||
assert meta['title'].isna().sum() == 0 | ||
assert meta['url'].isna().sum() == 0 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
good!
Add rhub.py and osti.py to web directory