Create WAF traversal function #4558

Closed
1 task
rshewitt opened this issue Dec 15, 2023 · 4 comments
Labels
H2.0/Harvest-Runner Harvest Source Processing for Harvesting 2.0

Comments

@rshewitt
Contributor

User Story

In order to collect files from a WAF (Web Accessible Folder), datagov wants to create a function in the harvesting logic repo that can traverse a WAF and pull all of its files.

Acceptance Criteria

  • GIVEN a WAF
    WHEN the function is invoked
    THEN all the files in the tree are collected and downloaded.

Background

[Any helpful contextual notes or links to artifacts/evidence, if needed]

Security Considerations (required)

[Any security concerns that might be implicated in the change. "None" is OK, just be explicit here!]

Sketch

def traverse_waf(url, files=None, to_find=".xml"):
  # start a fresh list per call instead of sharing a mutable default
  if files is None:
    files = []
  # resp = requests.get(url)
  # parse the response with bs4
  # collect every "to_find" file and store it in "files"
  # if you find a folder, call traverse_waf(folder_url, files, to_find)

  return files

def download_files(files):
  for file in files:
    # download the file
    pass
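
One way the sketch above could be filled in is shown below. This is only a sketch under assumptions, not the code from the PR: it swaps requests and bs4 for the standard library (`urllib` and `html.parser`) so it runs without extra dependencies, it assumes every folder URL ends with "/", and the `LinkParser` class, the `fetch_html` helper, and the `fetch` and `dest_dir` parameters are hypothetical names added for illustration.

```python
import os
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse
from urllib.request import urlopen


class LinkParser(HTMLParser):
    """Collects the href of every <a> tag on a directory listing page."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.hrefs.append(href)


def fetch_html(url):
    """Hypothetical default fetcher: return the page body as text."""
    with urlopen(url, timeout=30) as resp:
        return resp.read().decode("utf-8", errors="replace")


def traverse_waf(url, files=None, to_find=".xml", fetch=fetch_html):
    """Recursively collect URLs ending in `to_find` below `url`.

    Assumes `url` and every folder link end with "/".
    """
    files = [] if files is None else files
    parser = LinkParser()
    parser.feed(fetch(url))
    for href in parser.hrefs:
        link = urljoin(url, href)
        if link.endswith(to_find):
            files.append(link)
        elif link.endswith("/") and link.startswith(url) and link != url:
            # Only descend into folders below the current one.
            traverse_waf(link, files, to_find, fetch)
    return files


def download_files(files, dest_dir="."):
    """Save each URL into dest_dir, named after its last path segment.

    Files with the same name in different folders overwrite each other.
    """
    os.makedirs(dest_dir, exist_ok=True)
    for file_url in files:
        name = os.path.basename(urlparse(file_url).path)
        with urlopen(file_url, timeout=30) as resp:
            with open(os.path.join(dest_dir, name), "wb") as out:
                out.write(resp.read())
```

Passing `fetch` as a parameter means the traversal can be exercised against canned HTML strings instead of a live server.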
@rshewitt rshewitt added the H2.0/Harvest-General General Harvesting 2.0 Issues label Dec 15, 2023
@rshewitt rshewitt self-assigned this Dec 15, 2023
@rshewitt rshewitt moved this to 🏗 In Progress [8] in data.gov team board Dec 15, 2023
@rshewitt
Contributor Author

draft pr

@jbrown-xentity
Contributor

Not that we want to do it the way it's currently done necessarily, but the current code lives here: https://github.com/ckan/ckanext-spatial/blob/master/ckanext/spatial/harvesters/waf.py#L52

@jbrown-xentity
Contributor

Watch out for looping: many WAFs have a folder structure that lets you go "up" a level, and if you don't exclude those links you can end up in a search that never terminates. It doesn't matter whether we do a breadth-first search or a depth-first search, but we do need to be able to look into sub-folders and list/download their files as well.
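
A minimal sketch of the loop guard, under assumptions: it does a breadth-first walk, remembers every folder it has queued, and ignores folder links outside the starting URL. The function name `crawl_folders` and the `list_links` callable (which returns the raw hrefs on a listing page) are hypothetical, and folder URLs are assumed to end with "/".

```python
from collections import deque
from urllib.parse import urljoin


def crawl_folders(root, list_links):
    """Breadth-first walk that visits each folder URL at most once.

    `list_links(url)` is a hypothetical callable returning the raw hrefs
    found on the listing page at `url`.
    """
    visited = {root}
    queue = deque([root])
    files = []
    while queue:
        folder = queue.popleft()
        for href in list_links(folder):
            link = urljoin(folder, href)
            if link.endswith("/"):
                # Skip "up" links and anything outside the starting folder,
                # and never queue the same folder twice.
                if link.startswith(root) and link not in visited:
                    visited.add(link)
                    queue.append(link)
            else:
                files.append(link)
    return files
```

Replacing `queue.popleft()` with `queue.pop()` turns this into a depth-first walk; the `visited` set prevents looping either way.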

@rshewitt
Contributor Author

rshewitt commented Dec 16, 2023

I'm checking for the parent here. The PR is a work in progress. I added a filter list to the function as a way to exclude directories we don't want to open, which could be source specific.
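
A rough illustration of how a parent check and a filter list could combine. This is not the PR's code: the function name `folders_to_open`, the `filters` parameter, and the substring matching are assumptions made for this example.

```python
from urllib.parse import urljoin


def folders_to_open(base_url, hrefs, filters=()):
    """Return absolute folder URLs from `hrefs` that are worth descending into.

    `filters` is a hypothetical list of substrings; any folder URL that
    contains one of them is skipped.
    """
    keep = []
    for href in hrefs:
        link = urljoin(base_url, href)
        if not link.endswith("/"):
            continue  # not a folder
        if not link.startswith(base_url) or link == base_url:
            continue  # parent link or link back to the current folder
        if any(pattern in link for pattern in filters):
            continue  # excluded for this source
        keep.append(link)
    return keep
```

For example, with `filters=["archive"]` a listing that links to `2023/` and `archive/` would only yield the `2023/` folder.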

@rshewitt rshewitt moved this from 🏗 In Progress [8] to 👀 Needs Review [2] in data.gov team board Dec 18, 2023
@rshewitt rshewitt moved this from 👀 Needs Review [2] to ✔ Done in data.gov team board Dec 19, 2023
@btylerburton btylerburton moved this from ✔ Done to 🗄 Closed in data.gov team board Dec 21, 2023
@btylerburton btylerburton added H2.0/Harvest-Runner Harvest Source Processing for Harvesting 2.0 and removed H2.0/Harvest-General General Harvesting 2.0 Issues H2.0/Extract labels Jan 5, 2024
@github-project-automation github-project-automation bot moved this from 🗄 Closed to ✔ Done in data.gov team board Feb 16, 2024
@btylerburton btylerburton moved this from ✔ Done to 🗄 Closed in data.gov team board Feb 16, 2024
3 participants