Commit

V0.3.0 (#86)
* add use_chardet extra to solve requests import error

* bump requests lower bound for compatibility w/chardet extra

* pull rectypes into VariableDescriptions

* add hierarchical extract test files

* parse extract rectypes from ddi

* added tests for rectype at extract and var level

* grab rectype id var and rectype key var from ddi

* _read_hierarchical_microdata method

* retain only records of the relevant type in _read_hierarchical_microdata

* return dict-style hierarchical data from read_microdata()

* make read_hierarchical_microdata() its own method

* both types of hierarchical returns work, but they are both slow

* maintain original data types when creating a single hierarchical data frame

* check data types for rectangular extracts

* add tests for reading hierarchical extracts

* test subset functionality

* raise an error if RECTYPE isn't included in subset list for hierarchical extracts

* use read_microdata_chunked to read hierarchical extracts

* remove as_dict arg from read_microdata

* default api_version = v2

* oops v2 -> 2

* snake_case -> camelCase

* re-generated expired test extract - needed b/c extract request schema changes required re-recording vcrpy cassettes

* extract request schema updates

* re-record vcrpy cassettes; all tests passing

* tests for API v2 extract definition specifications

* support beta and v2 schemas in build() method

* fix key error in test extract jsons

* test cli with v2 extract schema

* paint it black

* warn around old api versions; stop supporting beta extract schema

* test NotImplementedError for unsupported api versions

* run readers tests against v2 extract definition files

* api version travels with extract and is assigned at submit time if not specified in extract definition

* api_version moves with CpsExtract objects too

* Sample and Variable data classes

* grab collection id and api version from the extract response

* black

* api version needs to be specified as integer

* helps if you add the test data

* Attach characteristics (#83)

* attach characteristics

* attach characteristics

* remove Variable and Sample from_* classmethods

* enforce acceptable case for Variable and Sample

* docstrings for attach_characteristics

* pull variable feature updating logic into private method and add test

* enable add_data_quality_flags and select_cases

* fix extract_api_version bug

* New data formats (#84)

support hierarchical extract requests

* More api v2 support (#85)

* IpumsiExtract class

* disallow duplicate samples and variables

* retrieve_previous_extracts -> get_previous_extracts

* get_extract_by_id() convenience method

* raise warnings when API returns a modified extract definition

* warn when returned extract definition has been modified and support for extract-level data quality flag

* include variable notes in VariableDescription objects

* limit->pageSize param

* remove define_extract_from_ddi()

* retrieve sample ids in the api/core module instead of utilities

* ipums microdata extract api -> ipums api, remove beta links

* add IPUMS I to table

* remove obsolete import

* undocument removed methods

* more docstrings

* add Variables and Samples to API reference

* allow a list for add_data_quality_flags

* more docs

* default to dict return from read_hierarchical_microdata()

* add hierarchical reader to the toctree

* still more docs stuff

* docs about hierarchical files

* purged -> expired

* there is no extract resubmission, only extract submission

* switch positions of args in get_extract_by_id() to be in line with others

* docs update for resubmit removal

* better error messaging for expired extract object passed to download_extract()

* page generator

* make get_pages private and add extract_history iterator

* type hints

* correctly handle nested dicts in extract definitions with variable-level features

* docs for extract histories

* print statement clean up

* case_select_who example

* version bump and change log updates

* black
renae-r authored Apr 9, 2023
1 parent 3ffe299 commit f124086
Showing 54 changed files with 477,105 additions and 3,918 deletions.
4 changes: 2 additions & 2 deletions CONTRIBUTING.md
@@ -2,7 +2,7 @@
Thank you for considering improving this project! By participating, you
agree to abide by the [code of conduct](https://github.com/ipums/ipumspy/blob/master/CONDUCT.md).

# Issues (Reporting a problem or suggestion)
## Issues (Reporting a problem or suggestion)
If you've experienced a problem with the package, or have a suggestion for it,
please post it on the [issues tab](https://github.com/ipums/ipumspy/issues).
This space is meant for questions directly related to the python package, so questions
@@ -15,7 +15,7 @@ much detail about your problem as possible including the code and error message,
the project the extract is from, the variables you have selected, file type, etc.
We'll do our best to answer your question.

# Pull Requests (Making changes to the package)
## Pull Requests (Making changes to the package)
We appreciate pull requests that follow these guidelines:
1) Make sure that tests pass (and add new ones if possible).

2 changes: 1 addition & 1 deletion conftest.py
@@ -52,6 +52,6 @@ def vcr_config():
return {
"filter_headers": ["authorization"],
"ignore_localhost": True,
"record_mode": "none",
"record_mode": "once",
"match_on": ["uri", "method"],
}
35 changes: 34 additions & 1 deletion docs/source/change-log.rst
@@ -10,6 +10,39 @@ This project adheres to `Semantic Versioning`_.

.. _Semantic Versioning: http://semver.org/

0.3.0
-----
2023-04-08

* Breaking Changes

* This release marks the beginning of support for IPUMS API version 2 and ipumspy no longer supports requests to version 1 or version beta of the IPUMS API. This means that extract definitions created and saved to files using previous versions of ipumspy can no longer be submitted as-is to the IPUMS API using this library! These definitions can be modified for use with v0.3.0 of ipumspy and IPUMS API version 2 by changing the ``data_format`` key to ``dataFormat`` and the ``data_structure`` key to ``dataStructure``. More information on `versioning of the IPUMS API <https://developer.ipums.org/docs/apiprogram/versioning/>`_ and `breaking changes in version 2 <https://developer.ipums.org/docs/apiprogram/changelog/>`_ can be found at the IPUMS developer portal.
* The ``resubmit_purged_extract()`` method has been removed; use :py:meth:`~ipumspy.api.IpumsApiClient.submit_extract()` instead.
* The ``extract_was_purged()`` method has been renamed to :py:meth:`~ipumspy.api.IpumsApiClient.extract_is_expired()`.
* The ``CollectionInformation`` class has been removed. To retrieve information about available samples in a collection, use :py:meth:`~ipumspy.api.IpumsApiClient.get_all_sample_info()`.
* The ``define_extract_from_ddi()`` method has been removed.
* The ``retrieve_previous_extracts()`` method has been renamed to :py:meth:`~ipumspy.api.IpumsApiClient.get_previous_extracts()`.
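The key renames described in the first breaking change can be sketched in a few lines of plain Python. This is only an illustration, not part of ipumspy: ``migrate_definition`` is a hypothetical helper, the sample dictionary is made up, and only the two renames named above are applied. Saved definitions may need further changes that are not covered here.

```python
import json

# Renames taken from the breaking-change note above; nothing else is assumed.
KEY_RENAMES = {
    "data_format": "dataFormat",
    "data_structure": "dataStructure",
}


def migrate_definition(definition: dict) -> dict:
    """Return a copy of a saved definition with legacy top-level keys renamed.

    Hypothetical helper for illustration; it is not an ipumspy function.
    """
    return {KEY_RENAMES.get(key, key): value for key, value in definition.items()}


# Illustrative (made-up) fragment of a definition saved with an older ipumspy.
legacy = {
    "description": "My first IPUMS USA extract!",
    "data_format": "csv",
    "data_structure": {"rectangular": {"on": "P"}},
}

migrated = migrate_definition(legacy)
print(json.dumps(migrated, indent=2))
```

Keys that are not listed in ``KEY_RENAMES`` pass through untouched, so the helper can be run safely on a definition that has already been converted.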

* New Features

* Support for IPUMS API version 2 features!

* Added :py:meth:`~ipumspy.api.BaseExtract.attach_characteristics()`
* Added :py:meth:`~ipumspy.api.BaseExtract.select_cases()`
* Added :py:meth:`~ipumspy.api.BaseExtract.add_data_quality_flags()`
* Added optional ``data_quality_flags`` keyword argument to IPUMS extract classes to include all available data quality flags for variables in the extract
* Added optional ``case_select_who`` keyword argument to IPUMS extract classes to specify that the extract should include all individuals in households that contain a person with the specified :py:meth:`~ipumspy.api.BaseExtract.select_cases()` characteristics.
* Added support for requesting hierarchical extracts: ``{"hierarchical": {}}`` is now an acceptable value for ``data_structure``
* Added :py:class:`~ipumspy.api.extract.IpumsiExtract` class to support IPUMS International extract requests
* Added :py:meth:`~ipumspy.api.IpumsApiClient.get_extract_history()` generator to allow for perusal of extract histories

* Added :py:meth:`~ipumspy.api.IpumsApiClient.get_extract_by_id()` which creates a new (unsubmitted) extract object from an IPUMS collection and a previously submitted extract ID number
* Added support for reading hierarchical extract files in :py:meth:`~ipumspy.readers.read_hierarchical_microdata()`

* Bug Fixes

* The ``subset`` argument for :py:meth:`~ipumspy.readers.read_microdata()` now functions correctly.

0.2.2-alpha.1
-------------
2023-03-06
@@ -44,7 +77,7 @@ This project adheres to `Semantic Versioning`_.
* Added :py:meth:`~ipumspy.api.exceptions.IpumsExtractNotSubmitted` exception. This will be raised when attempting to retrieve an extract id or download link from an extract that has not been submitted to the IPUMS extract engine.
* Added :py:meth:`~ipumspy.ddi.Codebook.get_all_types()` method to access all types of ddi codebook variables in an easy way.
* Added parameter `string_pyarrow` to :py:meth:`~ipumspy.ddi.Codebook.get_all_types()` method. If this parameter is set to True and used in conjunction
with parameter `type_format="pandas_type"` or `type_format`="pandas_type_efficient"`, then the string column dtype (pandas.StringDtype()) is overriden with pandas.StringDtype(storage="pyarrow"). Useful for
with parameter `type_format="pandas_type"` or `type_format="pandas_type_efficient"`, then the string column dtype (pandas.StringDtype()) is overridden with pandas.StringDtype(storage="pyarrow"). Useful for
users who want to convert an IPUMS extract in csv format to parquet format.
The dictionary returned by this method can then be used in the dtype argument of :py:meth:`~ipumspy.readers.read_microdata()` or :py:meth:`~ipumspy.readers.read_microdata_chunked()`.
* Added :py:meth:`~ipumspy.ddi.VariableDescription.pandas_type_efficient`. This type format is more efficient than `pandas_type`
194 changes: 181 additions & 13 deletions docs/source/extracts.rst
@@ -16,12 +16,11 @@ An extract is defined by:
2. A list of IPUMS sample IDs from that collection
3. A list of IPUMS variable names from that collection

IPUMS metadata is not currently accessible via API.
Sample IDs and IPUMS variable names can be browsed via the data collection's website.
See the table below for data collection abreviations and links to sample IDs and variable browsing.
Note that not all IPUMS data collections are currently available via API. The table below will be
IPUMS metadata is not currently accessible via API. Sample IDs and IPUMS variable names can be browsed via the data collection's website. See the table below for data collection abbreviations and links to sample IDs and variable browsing. Note that not all IPUMS data collections are currently available via API. The table below will be
filled in as new IPUMS data collections become accessible via API.

.. _collection availability table:

.. list-table:: IPUMS data collections metadata resources
:widths: 25 25 25 25
:header-rows: 1
@@ -39,12 +38,16 @@ filled in as new IPUMS data collections become accessible via API.
- cps
- `cps samples <https://cps.ipums.org/cps-action/samples/sample_ids>`__
- `cps variables <https://cps.ipums.org/cps-action/variables/group>`__
* - IPUMS International
- ipumsi
- `ipumsi samples <https://international.ipums.org/international-action/samples/sample_ids>`__
- `ipumsi variables <https://international.ipums.org/international-action/variables/group>`__


Extract Objects
---------------

Each IPUMS data collection that is accessible via API (currently just IPUMS USA and IPUMS CPS) has its own extract class.
Using this class to create your extract object obviates the need to specify a data collection.
Each IPUMS data collection that is accessible via API has its own extract class. Using this class to create your extract object obviates the need to specify a data collection.

For example:

@@ -57,6 +60,16 @@ For example:
instantiates a UsaExtract object for the IPUMS USA data collection that includes the us2012b (2012 PRCS) sample, and the variables AGE and SEX.

IPUMS extracts can be requested as rectangular or hierarchical files. The ``data_structure`` argument defaults to ``{"rectangular": {"on": "P"}}`` to request a rectangular, person-level extract. The code snippet below requests a hierarchical USA extract.

.. code:: python

    extract = UsaExtract(
        ["us2012b"],
        ["AGE", "SEX"],
        data_structure={"hierarchical": {}}
    )

Users also have the option to specify a data format and an extract description when creating an extract object.

.. code:: python
@@ -68,6 +81,8 @@ Users also have the option to specify a data format and an extract description w
description="My first IPUMS USA extract!"
)
Once an extract object has been created, the extract must be submitted to the API.

.. code:: python
@@ -118,11 +133,11 @@ returns:
'started'
While IPUMS retains all of a user's extract definitions, after a certain period, the extract data and syntax files are purged from the IPUMS cache. Importantly, if an extract's data and syntax files have been purged, the extract is still considered to have been completed, and :meth:`.extract_status()` will return "completed."
While IPUMS retains all of a user's extract definitions, after a certain period, the extract data and syntax files are purged from the IPUMS cache; these extracts are said to be "expired". Importantly, if an extract's data and syntax files have been removed, the extract is still considered to have been completed, and :meth:`.extract_status()` will return "completed."

.. code:: python
# extract number 1 has been purged
# extract number 1 has expired
ipums.extract_status(collection="usa", extract="1")
returns:
@@ -131,11 +146,11 @@ returns:
'completed'
If an extract has been purged:
If an extract has expired:

.. code:: python
ipums.extract_was_purged(collection="usa", extract="1")
ipums.extract_is_expired(collection="usa", extract="1")
returns:
@@ -144,11 +159,15 @@ returns:
True
For extracts that have had their files purged, the data collection name and extract ID number can be used to resubmit the old extract. Note that resubmitting a purged extract results in a new extract with its own unique ID number!
For extracts that have expired, the data collection name and extract ID number can be used to re-create and re-submit the old extract. **Note that re-creating and re-submitting an expired extract results in a new extract with its own unique ID number!**

.. code:: python
resubmitted_extract = ipums.resubmit_purged_extract(collection="usa", extract="1")
# create a UsaExtract object from the expired extract definition
renewed_extract = ipums.get_extract_by_id(collection="usa", extract_id=1)
# submit the renewed extract to re-generate the data and syntax files
resubmitted_extract = ipums.submit_extract(renewed_extract)
resubmitted_extract.extract_id
@@ -158,12 +177,161 @@ returns:
2
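The check-then-renew flow above can be wrapped in a small helper. The sketch below is an illustration under stated assumptions: ``renew_if_expired`` is a hypothetical function, and ``StubClient`` is a made-up offline stand-in that only mimics the three client methods shown in this section (``extract_is_expired()``, ``get_extract_by_id()``, ``submit_extract()``) with the keyword arguments used in the examples above. With a real ``IpumsApiClient`` the returned object would be an extract object rather than a dictionary.

```python
def renew_if_expired(client, collection, extract_id):
    """Re-create and re-submit an extract only when its files have expired.

    Hypothetical helper; returns the newly submitted extract, or None when
    the original extract has not expired.
    """
    if not client.extract_is_expired(collection=collection, extract=str(extract_id)):
        return None
    renewed = client.get_extract_by_id(collection=collection, extract_id=extract_id)
    return client.submit_extract(renewed)


class StubClient:
    """Made-up offline stand-in for the API client, for demonstration only."""

    def __init__(self, expired_ids, last_id):
        self.expired_ids = set(expired_ids)
        self.last_id = last_id

    def extract_is_expired(self, collection, extract):
        return int(extract) in self.expired_ids

    def get_extract_by_id(self, collection, extract_id):
        # A real client would build an extract object from the stored definition.
        return {"collection": collection, "copied_from": extract_id}

    def submit_extract(self, extract):
        # Every submission receives a fresh ID, as the note above stresses.
        self.last_id += 1
        return {**extract, "extract_id": self.last_id}


client = StubClient(expired_ids={1}, last_id=1)
resubmitted = renew_if_expired(client, "usa", 1)
untouched = renew_if_expired(client, "usa", 2)
print(resubmitted, untouched)
```

Returning ``None`` for extracts that are still available keeps the caller from submitting duplicates by accident.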
Extract Features
----------------

IPUMS Extract features can be added or updated before an extract request is submitted. This section demonstrates adding features to the following IPUMS CPS extract.

.. code:: python

    extract = CpsExtract(
        ["cps2022_03s"],
        ["AGE", "SEX", "RACE"],
    )

Attach Characteristics
~~~~~~~~~~~~~~~~~~~~~~

IPUMS allows users to create variables that reflect the characteristics of other household members. The example below uses the :meth:`.attach_characteristics()` method to attach the spouse's SEX value, creating a new variable called SEX_SP in the extract that will contain the sex of a person's spouse if they have one and be 0 otherwise. The :meth:`.attach_characteristics()` method takes the name of the variable to attach and a list of the household members whose values the new variables will hold. Valid household members include "spouse", "mother", "father", and "head".

.. code:: python

    extract.attach_characteristics("SEX", ["spouse"])

The following would add variables for the RACE value of both parents:

.. code:: python

    extract.attach_characteristics("RACE", ["mother", "father"])

Select Cases
~~~~~~~~~~~~

IPUMS allows users to limit their extract based on values of included variables. The code below uses the :meth:`.select_cases()` method to select only the female records in the example IPUMS CPS extract. This method takes a variable name and a list of values for that variable for which to include records in the extract. Note that the variable must be included in the IPUMS extract object in order to use this feature, and that this feature is only available for categorical variables.

.. code:: python

    extract.select_cases("SEX", ["2"])

The :meth:`.select_cases()` method defaults to using "general" codes to select cases. Some variables also have detailed codes that can be used to select cases. Consider the following example extract of the 2021 ACS data from IPUMS USA:

.. code:: python

    extract = UsaExtract(
        ["us2021a"],
        ["AGE", "SEX", "RACE"]
    )

In IPUMS USA, the `RACE <https://usa.ipums.org/usa-action/variables/race#codes_section>`_ variable has both general and detailed codes. A user interested in respondents who identify themselves with two major race groups can use general codes:

.. code:: python

    extract.select_cases("RACE", ["8"])

A user interested in respondents who identify as both White and Asian can use detailed case selection to include only those who chose White and an available Asian category. To do this, in addition to specifying the correct detailed codes, set the `general` flag to `False`:

.. code:: python

    extract.select_cases("RACE",
                         ["810", "811", "812", "813", "814", "815", "816", "818"],
                         general=False)

By default, case selection includes only individuals with the specified values for the specified variables. In the previous example, only persons who identified as both White and Asian are included in the extract. To make an extract that contains all individuals in households that include an individual who identifies as both White and Asian, set the ``case_select_who`` argument to ``"households"`` when instantiating the extract object. The code snippet below creates such an extract. Note that whether to select individuals or households must be specified at the extract level, while what values to select on and whether these values are general or detailed codes is specified at the variable level.

.. code:: python

    extract = UsaExtract(
        ["us2021a"],
        ["AGE", "SEX", "RACE"],
        case_select_who="households"
    )
    extract.select_cases("RACE",
                         ["810", "811", "812", "813", "814", "815", "816", "818"],
                         general=False)

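The difference between selecting individuals and selecting households can be illustrated offline with a few toy records. This is a conceptual sketch only: the selection itself happens on the IPUMS side when the extract is built, the records and household IDs below are invented, and both function names are hypothetical.

```python
# Toy person records: "810" is one of the detailed codes used above;
# "100" stands in for any code outside the selection list.
records = [
    {"household": 1, "person": 1, "RACE": "810"},
    {"household": 1, "person": 2, "RACE": "100"},
    {"household": 2, "person": 1, "RACE": "100"},
]

codes = {"810", "811", "812", "813", "814", "815", "816", "818"}


def select_individuals(rows, variable, wanted):
    """Keep only the people whose own value is in the wanted codes."""
    return [row for row in rows if row[variable] in wanted]


def select_households(rows, variable, wanted):
    """Keep everyone living in a household with at least one matching person."""
    matching = {row["household"] for row in rows if row[variable] in wanted}
    return [row for row in rows if row["household"] in matching]


individuals = select_individuals(records, "RACE", codes)
households = select_households(records, "RACE", codes)
print(len(individuals), len(households))
```

Household-level selection returns the second person in household 1 even though that person's own code is not in the list, which is the behavior the ``case_select_who`` setting is described as controlling.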
Add Data Quality Flags
~~~~~~~~~~~~~~~~~~~~~~

Data quality flags can be added to an extract on a per-variable basis or for the entire extract. The CPS extract example above could be re-defined as follows in order to add all available data quality flags:

.. code:: python

    extract = CpsExtract(
        ["cps2022_03s"],
        ["AGE", "SEX", "RACE"],
        data_quality_flags=True
    )

This extract specification adds a data quality flag for every variable in the variable list that has one in the requested sample(s).

Data quality flags can also be selected for specific variables using the :meth:`.add_data_quality_flags()` method.

.. code:: python

    # add the data quality flag for AGE to the extract
    extract.add_data_quality_flags("AGE")
    # note that this method will also accept a list!
    extract.add_data_quality_flags(["AGE", "SEX"])

.. _Using Variable Objects to Include Extract Features:

Using Variable Objects to Include Extract Features
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

It is also possible to define all variable-level extract features when the IPUMS extract object is first defined using :class:`ipumspy.api.extract.Variable` objects. The example below defines an IPUMS CPS extract that includes a variable for the age of the spouse (``attached_characteristics``), limits the sample to women (``case_selections``), and includes the data quality flag for RACE (``data_quality_flags``).

.. code:: python

    fancy_extract = CpsExtract(
        ["cps2022_03s"],
        [
            Variable(name="AGE",
                     attached_characteristics=["spouse"]),
            Variable(name="SEX",
                     case_selections={"general": ["2"]}),
            Variable(name="RACE",
                     data_quality_flags=True)
        ]
    )

Unsupported Features
--------------------

Not all features available through the IPUMS extract web UI are currently supported for extracts made via API.
For a list of currently unsupported features, see `the developer documentation <https://beta.developer.ipums.org/docs/apiprogram/apis/usa/>`__.
For a list of currently unsupported features, see `the developer documentation <https://beta.developer.ipums.org/docs/apiprogram/apis/>`__.
This list will be updated as more features become available.

Extract Histories
-----------------
``ipumspy`` offers several ways to peruse your extract history for a given IPUMS data collection.

:meth:`.get_previous_extracts()` can be used to retrieve your 10 most recent extracts for a given collection. Use the ``limit`` argument to retrieve a different number of most-recent extracts.

.. code:: python

    from ipumspy import IpumsApiClient

    ipums = IpumsApiClient("YOUR_API_KEY")

    # get my 10 most-recent USA extracts
    recent_extracts = ipums.get_previous_extracts("usa")
    # get my 20 most-recent CPS extracts
    more_recent_extracts = ipums.get_previous_extracts("cps", limit=20)

The :meth:`.get_extract_history()` generator makes it easy to filter your extract history to pull out extracts with certain variables, samples, features, file formats, etc. By default, this generator returns pages of extract definitions of the maximum possible size, 2500. Page size can be set to a lower number using the ``page_size`` argument.

.. code:: python

    # make a list of all of my extracts from IPUMS CPS that include the variable STATEFIP
    extracts_with_state = []
    # get pages with 100 CPS extracts per page
    for page in ipums.get_extract_history("cps", page_size=100):
        for ext in page["data"]:
            extract_obj = CpsExtract(**ext["extractDefinition"])
            if "STATEFIP" in [var.name for var in extract_obj.variables]:
                extracts_with_state.append(extract_obj)

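The same filter can be written against the raw pages, without building extract objects, which makes the paging logic easy to test offline. The sketch below is an assumption-laden illustration: ``fake_history`` is a made-up generator standing in for ``get_extract_history()``, the page shape (``page["data"]`` holding items with an ``"extractDefinition"`` key) follows the example above, and the layout of ``"variables"`` as a mapping keyed by variable name is assumed rather than confirmed by this document.

```python
def fake_history():
    """Made-up stand-in for get_extract_history(); yields two small pages."""
    yield {
        "data": [
            {"extractDefinition": {"variables": {"AGE": {}, "STATEFIP": {}}}},
            {"extractDefinition": {"variables": {"AGE": {}, "SEX": {}}}},
        ]
    }
    yield {
        "data": [
            {"extractDefinition": {"variables": {"STATEFIP": {}, "RACE": {}}}},
        ]
    }


def definitions_with_variable(pages, name):
    """Collect every extract definition that requests the named variable.

    Hypothetical helper; assumes "variables" is keyed by variable name.
    """
    return [
        item["extractDefinition"]
        for page in pages
        for item in page["data"]
        if name in item["extractDefinition"].get("variables", {})
    ]


with_state = definitions_with_variable(fake_history(), "STATEFIP")
print(len(with_state))
```

Because the helper accepts any iterable of pages, the stub generator can be swapped for the real client call once the page layout has been checked against an actual response.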