David Megginson edited this page Jan 17, 2015 · 28 revisions

These recipes are part of the HXL cookbook. Each recipe describes a general problem related to data privacy, then shows how to use the HXL standard and the command-line tools to solve the problem.

Normally, you will put these recipes in a script or batch file to automate or semi-automate privacy tasks.

Recipes

P.1. Remove personally-identifiable information

Problem: a humanitarian cluster is managing a 3W (who-what-where) dataset that includes contact information for internal use. They want to share that data publicly, but without the personally-identifiable information (PII).

Recipe: automatically remove the names, email addresses, and phone numbers from an internal 3W dataset every time it is released to the public.

The first approach is to use hxlcut (command) to remove any columns with the #name, #email, or #phone HXL hashtags:

hxlcut --exclude name,email,phone <3W-in.csv >3W-out.csv

This is called a blacklist approach: the command will remove those three hashtags, but will leave any others in the original dataset. There is, however, the risk that you might add new hashtags with personally-identifiable information (PII) in the future, and that the script won't remove those.

An alternative is the whitelist approach of listing only the hashtags you want included, and removing everything else by default:

hxlcut --include org,sector,subsector,country,adm1,period_date \
  <3W-in.csv >3W-out.csv

This command whitelists the hashtags #org (who), #sector and #subsector (what), #country and #adm1 (where), and #period_date (when), and deletes all other columns in your source dataset.
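As an extra safeguard with either approach, a release script could inspect the hashtag row of the output before publishing it. The following is a minimal sketch, not part of the HXL tools: the helper name has_pii_tags is hypothetical, and it assumes that the hashtag row is the first line beginning with "#" and that its cells are not quoted.

```shell
# Hypothetical helper: succeed (exit status 0) if a CSV file still carries
# a #name, #email, or #phone hashtag in its HXL hashtag row.
# Assumption: the hashtag row is the first line that begins with "#",
# and its cells are unquoted.
has_pii_tags() {
    # $1 = CSV file to inspect
    sed -n '/^#/{p;q;}' "$1" |          # print the first line starting with "#"
        tr ',' '\n' |                   # put each hashtag on its own line
        grep -Eq '^ *#(name|email|phone)([^a-z_0-9]|$)'
}

# Example use in a release script (the file name is a placeholder):
#   if has_pii_tags 3W-out.csv; then
#       echo "PII hashtag found in 3W-out.csv" >&2
#       exit 1
#   fi
```

The check only looks at hashtags, so it will not notice personal data stored under an unrelated hashtag.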

P.2. Aggregate data for anonymity

Problem: an NGO is tracking age and gender data down to the village level (ADM5), and is concerned that data with that level of granularity could help identify individual families or people and put them at risk.

Recipe: before sharing publicly, use hxlcount (command) to aggregate the data up to a coarser level of detail, so that individuals and families are no longer identifiable. For example, the following command aggregates population data from the ADM5 up to the ADM3 level:

hxlcount --tags country,adm1,adm2,adm3,agesex --aggregate people_num \
  <input-data.csv >output-data.csv

The --tags option (-t in short form) specifies all of the HXL tags that should appear in the output data. The --aggregate option (-a) specifies which number to aggregate. The resulting dataset (output-data.csv) will list the total number of people for each combination of ADM3 and age/sex group.

P.3. Detect privacy leaks

Problem: a UN agency is exporting public HXL data from its internal data systems, and wants to add an additional layer of security to make sure that private data does not begin leaking as internal data structures and code change.

Recipe: create an HXL schema that strictly excludes personally-identifiable fields, and validate the output against that schema nightly using a [[job scheduler|HXL cookbook#23-automating-actions-with-job-schedulers]]. If the validation fails, generate an automatic email to the system administrator.

The schema should set the #x_max_occurs field to 0 for every hashtag that represents personally-identifiable information in the context (e.g. #name, #email, #phone, #adm5). The script can validate data directly from a RESTful API using hxlvalidate (command):

wget -q -O - 'http://example.org/data?format=hxl' | \
  hxlvalidate --schema privacy-schema.csv \
  >/tmp/validation-output.txt

The script can then check the exit status of the hxlvalidate command (e.g. using the $? variable in the Bourne/Bash shell), and send an email (including the contents of /tmp/validation-output.txt) if there was an error.
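The exit-status check described above could be sketched as the following shell function. The function name check_privacy, the administrator address, and the use of the mail command are assumptions for illustration; the wget and hxlvalidate invocations are the ones from the recipe.

```shell
#!/bin/sh
# Sketch of a nightly privacy check. The URL, schema path, report path, and
# admin address are placeholders; `mail` is assumed to exist on the host.

DATA_URL="${DATA_URL:-http://example.org/data?format=hxl}"
SCHEMA="${SCHEMA:-privacy-schema.csv}"
REPORT="${REPORT:-/tmp/validation-output.txt}"
ADMIN="${ADMIN:-sysadmin@example.org}"

check_privacy() {
    # Download the public export and validate it against the privacy schema.
    wget -q -O - "$DATA_URL" | hxlvalidate --schema "$SCHEMA" >"$REPORT"
    status=$?   # exit status of hxlvalidate, the last command in the pipeline

    if [ "$status" -ne 0 ]; then
        # Validation failed: send the saved report to the administrator.
        mail -s "HXL privacy validation failed" "$ADMIN" <"$REPORT"
    fi
    return "$status"
}
```

A script run by the job scheduler would end with a call to check_privacy. Because $? reflects only the last command in a pipeline, a failed download is not detected separately in this sketch.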

P.4. Restrict e-mail address domains

Problem: an organisation has an internal contact list that includes work addresses for both its own employees and members of other, partner organisations. The organisation wants to share only the contact information for its own employees.

Recipe: use hxlselect (command) with a regular-expression filter to remove any rows of data that contain email addresses from outside the organisation's domain.

For example, assume that all email addresses for the organisation belong to the "un.org" domain. The regular expression matching that domain is "@un\.org$" (the backslash makes the dot match literally, and "$" anchors the match to the end of the value), and the following command will create a new copy of the dataset containing only the rows with those addresses in the #email column:

hxlselect --query 'email~@un\.org$' \
  <contacts-in.csv >contacts-out.csv

Note the use of the "~" operator instead of "=". The "=" operator means that the values match exactly, while the "~" operator means that the values match a pattern.
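To see why the dot is escaped in the pattern, here is a small sketch that uses grep -E as a stand-in for pattern matching. This is an assumption for illustration only: hxlselect's own regular-expression engine may differ in details, and the sample addresses are invented.

```shell
# Invented sample addresses; only the first belongs to the un.org domain.
addresses='alice@un.org
bob@unxorg
carol@un.org.example.com'

# Unescaped dot: "." matches any character, so "bob@unxorg" slips through.
loose=$(printf '%s\n' "$addresses" | grep -E '@un.org$')

# Escaped dot: only a literal "." is accepted before "org".
strict=$(printf '%s\n' "$addresses" | grep -E '@un\.org$')

printf 'loose:\n%s\n\nstrict:\n%s\n' "$loose" "$strict"
```

In both cases the "$" anchor rejects the third address, where "un.org" appears in the middle of a longer domain.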