Version 0.42 - add search functionality, tests and documentation
mcarans committed Sep 15, 2016
1 parent 296042d commit 864cb2b
Showing 8 changed files with 520 additions and 279 deletions.
535 changes: 284 additions & 251 deletions .idea/workspace.xml

Large diffs are not rendered by default.

25 changes: 21 additions & 4 deletions README.md
@@ -17,6 +17,7 @@ For more about the purpose and design philosophy, please visit [HDX Python Libra
- [Operations on HDX Objects](#operations-on-hdx-objects)
- [Dataset Specific Operations](#dataset-specific-operations)
- [Working Example](#working-example)
- [ACLED Example](#acled-example)

## Usage
The library has detailed API documentation:
@@ -44,7 +45,7 @@ The first task is to create an API key file. By default this is assumed to be ca

To include the HDX Python library in your project, pass the line below to `pip install` or add it to your `requirements.txt` file:

git+git://github.com/ocha-dap/hdx-python-api.git@v0.41#egg=hdx-python-api
git+git://github.com/ocha-dap/hdx-python-api.git@v0.42#egg=hdx-python-api

If you get errors, the dependencies of the cryptography package are probably missing, e.g. on Ubuntu: python-dev, libffi-dev and libssl-dev. See [cryptography dependencies](https://cryptography.io/en/latest/installation/#building-cryptography-on-linux).

@@ -67,7 +68,7 @@ Let's start with a simple example that also ensures that the library is working
source test/bin/activate
4. Install the HDX Python library:

pip install git+git://github.com/ocha-dap/hdx-python-api.git@v0.41#egg=hdx-python-api
pip install git+git://github.com/ocha-dap/hdx-python-api.git@v0.42#egg=hdx-python-api
5. If you get errors, the [dependencies of the cryptography package](#installing-the-library) are probably missing
6. Launch python:

@@ -92,7 +93,11 @@ Let's start with a simple example that also ensures that the library is working

dataset['dataset_date'] = '06/25/2016'
dataset.update_in_hdx()
12. Exit and remove virtualenv:
12. You can search for datasets on HDX:

datasets = Dataset.search_in_hdx(configuration, 'ACLED')
print(datasets)
13. Exit and remove virtualenv:

exit()
deactivate
@@ -212,6 +217,12 @@ You can read an existing HDX object with the static `read_from_hdx` method whi

dataset = Dataset.read_from_hdx(configuration, 'DATASET_ID_OR_NAME')

You can search for datasets and resources in HDX using the `search_in_hdx` method, which takes a configuration and a query parameter and returns a list of objects of the appropriate HDX object type, e.g. `list[Dataset]`. For example:

datasets = Dataset.search_in_hdx(configuration, 'QUERY')

The query parameter takes a different format depending upon whether it is for a [dataset](http://lucene.apache.org/core/3_6_0/queryparsersyntax.html) or a [resource](http://docs.ckan.org/en/ckan-2.3.4/api/index.html#ckan.logic.action.get.resource_search).
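
As a rough illustration of the two query formats, the sketch below builds a Lucene-style string for a dataset search and a `field:term` string for a resource search. The helper names `dataset_query` and `resource_query` are hypothetical and not part of the library; only the strings they return would be passed on as the query.

```python
from typing import Dict


def dataset_query(terms: Dict[str, str]) -> str:
    """Join field/value pairs into a Lucene-style query string.

    Hypothetical helper: values containing spaces are quoted as phrases
    and the pairs are combined with AND.
    """
    parts = []
    for field, value in terms.items():
        if ' ' in value:
            value = '"%s"' % value
        parts.append('%s:%s' % (field, value))
    return ' AND '.join(parts)


def resource_query(field: str, term: str) -> str:
    """Build the 'field:term' string used by CKAN's resource_search."""
    return '%s:%s' % (field, term)


print(dataset_query({'title': 'ACLED', 'tags': 'conflict'}))
print(resource_query('name', 'ACLED'))
```

The resulting strings could then be supplied as the second argument of `Dataset.search_in_hdx` or `Resource.search_in_hdx` respectively.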

You can create an HDX Object, such as a dataset, resource or gallery item by calling the constructor with a configuration, which is required, and an optional dictionary containing metadata. For example:

from hdx.data.dataset import Dataset
@@ -354,6 +365,12 @@ Create a file `my_code.py` and copy into it the code below:

You can then fill out the function `generate_dataset` as required.

## ACLED Example

A complete example can be found here: [https://github.com/mcarans/hdxscraper-acled-africa](https://github.com/mcarans/hdxscraper-acled-africa)

In particular, take a look at the files `run.py`, `acled_africa.py` and the `config` folder.
In particular, take a look at the files `run.py`, `acled_africa.py` and the `config` folder.

The ACLED scraper creates a dataset in HDX for [ACLED realtime data](https://data.humdata.org/dataset/acled-conflict-data-for-africa-realtime-2016) if it doesn't already exist, populating all the required metadata. It then creates resources that point to urls of [Excel and csv files for Realtime 2016 All Africa data](http://www.acleddata.com/data/realtime-data-2016/) (or updates the links and metadata if the resources already exist). Finally it creates a gallery item that points to these [dynamic maps and graphs](http://www.acleddata.com/visuals/maps/dynamic-maps/).

The first iteration of the ACLED scraper was written without the HDX Python library. Looking at it and at previous work by others, it became clear that some operations are needed frequently and add unnecessary complexity to the task of coding against HDX. Simplifying the interface to HDX drove the development of the Python library, and the second iteration of the scraper was built using it. Because the interface uses HDX terminology and maps directly onto datasets, resources and gallery items, the ACLED scraper was faster to develop and is much easier for a newcomer to understand, both in how it works and in what it does. The challenge with ACLED is that the URLs the resources point to have sometimes not been updated and hence do not work. In this situation, the extensive logging and transparent communication of errors are invaluable, enabling action to be taken to resolve the issue as quickly as possible. The static metadata for ACLED is held in human-readable files, so modifying it is straightforward. This is another feature of the HDX Python library that makes putting data into HDX programmatically a breeze.
54 changes: 43 additions & 11 deletions hdx/data/dataset.py
@@ -6,7 +6,6 @@
"""
import logging
from os.path import join

from typing import Any, List, Optional

from hdx.configuration import Configuration
@@ -47,7 +46,8 @@ def actions() -> dict:
'show': 'package_show',
'update': 'package_update',
'create': 'package_create',
'delete': 'package_delete'
'delete': 'package_delete',
'search': 'package_search'
}

def __setitem__(self, key: Any, value: Any) -> None:
@@ -255,6 +255,20 @@ def read_from_hdx(configuration: Configuration, identifier: str) -> Optional['Da
            return dataset
        return None

    def _dataset_create_resources_gallery(self) -> None:
        """Creates resource and gallery item objects in dataset
        """

        if 'resources' in self.data:
            self.old_data['resources'] = self._copy_hdxobjects(self.resources, Resource)
            self.separate_resources()
        if self.include_gallery:
            success, result = self._read_from_hdx('gallery', self.data['id'], 'id', GalleryItem.actions()['list'])
            if success:
                self.data['gallery'] = result
                self.old_data['gallery'] = self._copy_hdxobjects(self.gallery, GalleryItem)
                self.separate_gallery()

    def _dataset_load_from_hdx(self, id_or_name: str) -> bool:
        """Loads the dataset given by either id or name from HDX
@@ -267,15 +281,7 @@ def _dataset_load_from_hdx(self, id_or_name: str) -> bool:

        if not self._load_from_hdx('dataset', id_or_name):
            return False
        if 'resources' in self.data:
            self.old_data['resources'] = self._copy_hdxobjects(self.resources, Resource)
            self.separate_resources()
        if self.include_gallery:
            success, result = self._read_from_hdx('gallery', self.data['id'], GalleryItem.actions()['list'])
            if success:
                self.data['gallery'] = result
                self.old_data['gallery'] = self._copy_hdxobjects(self.gallery, GalleryItem)
                self.separate_gallery()
        self._dataset_create_resources_gallery()
        return True

def check_required_fields(self, ignore_fields: List[str] = list()) -> None:
@@ -422,3 +428,29 @@ def delete_from_hdx(self) -> None:
            None
        """
        self._delete_from_hdx('dataset', 'id')

    @staticmethod
    def search_in_hdx(configuration: Configuration, query: str) -> List['Dataset']:
        """Searches for datasets in HDX
        Args:
            configuration (Configuration): HDX Configuration
            query (str): Query
        Returns:
            List[Dataset]: List of datasets resulting from query
        """

        datasets = []
        dataset = Dataset(configuration)
        success, result = dataset._read_from_hdx('dataset', query, 'q')
        if result:
            count = result.get('count', None)
            if count:
                for datasetdict in result['results']:
                    dataset = Dataset(configuration)
                    dataset.old_data = dict()
                    dataset.data = datasetdict
                    dataset._dataset_create_resources_gallery()
                    datasets.append(dataset)
        return datasets
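
The result handling in `search_in_hdx` can be sketched in isolation. This is a minimal sketch, assuming the `(success, result)` contract documented for `_read_from_hdx`, where `result` is a dict with `count` and `results` keys on success and an error message string otherwise. The function name `extract_results` is hypothetical. It branches on `success` rather than on the truthiness of `result`, so an error string never reaches `.get`.

```python
from typing import List, Union


def extract_results(success: bool, result: Union[dict, str]) -> List[dict]:
    """Return the list of result dicts from a search-style response.

    Hypothetical helper: returns an empty list when the call failed,
    when the payload is not a dict, or when the count is zero or missing.
    """
    if not success or not isinstance(result, dict):
        return []
    if not result.get('count'):
        return []
    return list(result.get('results', []))


# A successful search with one hit, and a failed lookup with an error message.
print(extract_results(True, {'count': 1, 'results': [{'name': 'acled'}]}))
print(extract_results(False, 'q=ACLED: not found!'))
```

Each dict returned this way would correspond to one `datasetdict` in the loop above.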
25 changes: 15 additions & 10 deletions hdx/data/hdxobject.py
@@ -82,29 +82,34 @@ def update_json(self, path: str):
"""
        self.data = load_json_into_existing_dict(self.data, path)

    def _read_from_hdx(self, object_type: str, id_field: str, action: Optional[str] = None) -> Tuple[bool, dict]:
        """Checks if the hdx object exists in HDX.
    def _read_from_hdx(self, object_type: str, value: str, fieldname: Optional[str] = 'id',
                       action: Optional[str] = None) -> Tuple[bool, dict]:
        """Makes a read call to HDX passing in given parameter.
        Args:
            object_type (str): Description of HDX object type (for messages)
            id_field (str): HDX object identifier
            action (Optional[str]): Replacement CKAN url to use. Defaults to None.
            value (str): Value of HDX field
            fieldname (Optional[str]): HDX field name. Defaults to id.
            action (Optional[str]): Replacement CKAN action url to use. Defaults to None.
        Returns:
            (bool, dict): (True/False, HDX object metadata/Error)
        """
        if not id_field:
            raise HDXError("Empty %s identifier!" % object_type)
        if not value:
            raise HDXError("Empty %s value!" % object_type)
        if action is None:
            action = self.actions()['show']
            if fieldname == 'query' or fieldname == 'q':
                action = self.actions()['search']
            else:
                action = self.actions()['show']
        try:
            result = self.hdxpostsite.call_action(action, {'id': id_field},
            result = self.hdxpostsite.call_action(action, {fieldname: value},
                                                  requests_kwargs={'auth': self.configuration._get_credentials()})
            return True, result
        except NotFound as e:
            return False, "%s not found!" % id_field
            return False, "%s=%s: not found!" % (fieldname, value)
        except Exception as e:
            raise HDXError('HTTP Get failed when trying to read %s' % id_field) from e
            raise HDXError('HTTP Get failed when trying to read: %s=%s' % (fieldname, value)) from e
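
The action selection introduced in `_read_from_hdx` can be read as a small pure function. The sketch below is an illustration only: `choose_action` is a hypothetical name, and the `actions` dictionary mirrors the shape returned by `Dataset.actions()` in this commit. An explicit `action` wins; otherwise a `q` or `query` field name selects the search action and anything else selects the show action.

```python
from typing import Dict, Optional


def choose_action(actions: Dict[str, str], fieldname: str = 'id',
                  action: Optional[str] = None) -> str:
    """Pick the CKAN action name for a read call (hypothetical helper)."""
    if action is not None:
        # The caller supplied a replacement action, e.g. a gallery listing.
        return action
    if fieldname in ('query', 'q'):
        return actions['search']
    return actions['show']


dataset_actions = {'show': 'package_show', 'search': 'package_search'}
print(choose_action(dataset_actions))                   # default field name 'id'
print(choose_action(dataset_actions, fieldname='q'))    # dataset search
```

Keeping the field name as the switch means callers such as `search_in_hdx` only pass `'q'` or `'query'` and never need to know the action names themselves.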

def _load_from_hdx(self, object_type: str, id_field: str) -> bool:
"""Helper method to load the HDX object given by identifier from HDX
27 changes: 25 additions & 2 deletions hdx/data/resource.py
@@ -3,7 +3,6 @@
"""Resource class containing all logic for creating, checking, and updating resources."""
import logging
from os.path import join

from typing import Optional, List

from hdx.configuration import Configuration
@@ -35,7 +34,8 @@ def actions() -> dict:
'show': 'resource_show',
'update': 'resource_update',
'create': 'resource_create',
'delete': 'resource_delete'
'delete': 'resource_delete',
'search': 'resource_search'
}

def update_yaml(self, path: str = join('config', 'hdx_resource_static.yml')) -> None:
@@ -113,6 +113,29 @@ def delete_from_hdx(self) -> None:
"""
        self._delete_from_hdx('resource', 'id')

    @staticmethod
    def search_in_hdx(configuration: Configuration, query: str) -> List['Resource']:
        """Searches for resources in HDX
        Args:
            configuration (Configuration): HDX Configuration
            query (str): Query
        Returns:
            List[Resource]: List of resources resulting from query
        """

        resources = []
        resource = Resource(configuration)
        success, result = resource._read_from_hdx('resource', query, 'query')
        if result:
            count = result.get('count', None)
            if count:
                for resourcedict in result['results']:
                    resource = Resource(configuration, resourcedict)
                    resources.append(resource)
        return resources

    def create_datastore(self) -> None:
        """TODO"""
        pass
2 changes: 1 addition & 1 deletion setup.py
@@ -13,7 +13,7 @@

setup(
name='hdx-python-api',
version='0.41',
version='0.42',
packages=find_packages(exclude=['ez_setup', 'tests', 'tests.*']),
url='http://data.humdata.org/',
license='PSF',
41 changes: 41 additions & 0 deletions tests/hdx/data/test_dataset.py
@@ -12,6 +12,7 @@
from hdx.data.dataset import Dataset
from hdx.data.hdxobject import HDXError
from hdx.utilities.dictionary import merge_two_dictionaries
from hdx.utilities.loader import load_yaml


class MockResponse:
@@ -83,6 +84,7 @@ def json(self):
'solr_additions': '{"countries": ["Algeria", "Zimbabwe"]}',
'dataset_date': '06/04/2016'}

searchdict = load_yaml(join('fixtures', 'search_results.yml'))

def mockshow(url, datadict):
if 'show' not in url and 'related_list' not in url:
@@ -113,6 +115,28 @@ def mockshow(url, datadict):
'{"success": false, "error": {"message": "Not found", "__type": "Not Found Error"}, "help": "http://test-data.humdata.org/api/3/action/help_show?name=dataset_show"}')


def mocksearch(url, datadict):
    if 'search' not in url and 'related_list' not in url:
        return MockResponse(404,
                            '{"success": false, "error": {"message": "TEST ERROR: Not search", "__type": "TEST ERROR: Not Search Error"}, "help": "http://test-data.humdata.org/api/3/action/help_show?name=package_search"}')
    if 'related_list' in url:
        result = json.dumps(TestDataset.gallery_data)
        return MockResponse(200,
                            '{"success": true, "result": %s, "help": "http://test-data.humdata.org/api/3/action/help_show?name=related_list"}' % result)
    result = json.dumps(searchdict)
    if datadict['q'] == 'ACLED':
        return MockResponse(200,
                            '{"success": true, "result": %s, "help": "http://test-data.humdata.org/api/3/action/help_show?name=package_search"}' % result)
    if datadict['q'] == '"':
        return MockResponse(404,
                            '{"success": false, "error": {"message": "Validation Error", "__type": "Validation Error"}, "help": "http://test-data.humdata.org/api/3/action/help_show?name=package_search"}')
    if datadict['q'] == 'ajyhgr':
        return MockResponse(200,
                            '{"success": true, "result": {"count": 0, "results": []}, "help": "http://test-data.humdata.org/api/3/action/help_show?name=package_search"}')
    return MockResponse(404,
                        '{"success": false, "error": {"message": "Not found", "__type": "Not Found Error"}, "help": "http://test-data.humdata.org/api/3/action/help_show?name=package_search"}')
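
The dispatch pattern used by `mocksearch` (inspect the URL, then return a canned payload keyed on the query) can be sketched with the standard library alone. Everything here is illustrative: `FakeResponse` is a hypothetical stand-in for the test module's `MockResponse`, whose definition is not shown in this diff, and the canned data and URLs are made up.

```python
import json


class FakeResponse:
    """Hypothetical minimal response object with a status code and a JSON body."""

    def __init__(self, status_code: int, text: str) -> None:
        self.status_code = status_code
        self.text = text

    def json(self) -> dict:
        return json.loads(self.text)


def fake_search(url: str, datadict: dict) -> FakeResponse:
    """Return a canned search response based on the URL and the 'q' value."""
    if 'search' not in url:
        body = {'success': False, 'error': {'message': 'Not search'}}
        return FakeResponse(404, json.dumps(body))
    # Made-up canned results; unknown queries fall back to an empty result set.
    canned = {'ACLED': {'count': 1, 'results': [{'name': 'acled-example'}]}}
    result = canned.get(datadict.get('q'), {'count': 0, 'results': []})
    return FakeResponse(200, json.dumps({'success': True, 'result': result}))


response = fake_search('http://example.org/api/3/action/package_search', {'q': 'ACLED'})
print(response.status_code, response.json()['result']['count'])
```

In the real tests a function like this is installed in place of `requests.post` through the `monkeypatch` fixture, so no network call is made.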


class TestDataset():
dataset_data = {
'name': 'MyDataset1',
@@ -290,6 +314,15 @@ def mockreturn(url, data, headers, files, allow_redirects, auth):

        monkeypatch.setattr(requests, 'post', mockreturn)

    @pytest.fixture(scope='function')
    def search(self, monkeypatch):
        def mockreturn(url, data, headers, files, allow_redirects, auth):
            datadict = json.loads(data.decode('utf-8'))
            return mocksearch(url, datadict)

        monkeypatch.setattr(requests, 'post', mockreturn)

    @pytest.fixture(scope='class')
    def configuration(self):
        hdx_key_file = join('fixtures', '.hdxkey')
@@ -460,3 +493,11 @@ def test_add_update_delete_gallery(self, configuration, post_delete):
        dataset.delete_galleryitem('NOTEXIST')
        dataset.delete_galleryitem('d59a01d8-e52b-4337-bcda-fceb1d059bef')
        assert len(dataset.gallery) == 0

    def test_search_in_hdx(self, configuration, search):
        datasets = Dataset.search_in_hdx(configuration, 'ACLED')
        assert len(datasets) == 10
        datasets = Dataset.search_in_hdx(configuration, 'ajyhgr')
        assert len(datasets) == 0
        with pytest.raises(HDXError):
            Dataset.search_in_hdx(configuration, '"')