UQ Library 2024-09-16
OpenRefine is an open source program self-described as “a power tool for dealing with messy data”.
It can help you:
- Get an overview of a data set, and explore subsets of it
- Resolve inconsistencies in a data set, for example standardising date formatting or fixing typos
- Split data up into more granular parts, for example splitting up cells with multiple authors into separate cells
- Match local data up to other data sets, for example in matching local subjects against the Library of Congress Subject Headings
- Enhance a data set populating it with data from other sources
Some common scenarios might be:
- Where you want to know how many times a particular value (name, publisher, subject) appears in a column in your data
- Where you want to know how values are distributed across your whole data set
- Where you have a list of dates which are formatted in different ways, and want to change all the dates in the list to a single common date format
- Where you have a list of names or terms that differ from each other but refer to the same people, places or concepts
- Where you have several bits of data combined together in a single column, and you want to separate them out into new columns.
- Import and export data
- What “facets” and “filters” are
- Edit cells
- Edit columns
- Group and transform data
- Reconcile a column with Wikidata
This lesson was designed using OpenRefine 3.8 but should be compatible with other supported versions.
OpenRefine requires to have a Java Runtime Environment (JRE) installed. Depending on your operating system, you might need to install one from the Java website, or use the Windows kit with embedded Java.
To install and run OpenRefine on your operating system, please follow the instructions on the official website.
OpenRefine can deal with formats like TSV, CSV, ODS, XLS and XLSX, JSON, RDF, XML and more.
Exercise 1 – Create a project and import data
- In the “Create project” screen that opens when starting OpenRefine,
select
Web Addresses (URL)
- Enter the URL to our data: https://raw.githubusercontent.com/LibraryCarpentry/lc-open-refine/gh-pages/data/doaj-article-sample.csv
- Click “Next”
- Click in the
Character encoding
box and selectUTF-8
- This screen is also where you can ignore rows, name the project, assign tags…
- Click on
Create Project
You might not have to change the character encoding, but if you see inaccuracies in how special characters are displayed, you will have to find the right one for your data.
How is data organised? How do I navigate it?
OpenRefine displays data in a tabular format: rows usually are records, and columns are a type of information.
You can decide how many rows you want to show at once, and navigate the pages with the controls above the table.
Drop-down menus next to the column headers allow you to do operations on the whole column.
You can reorder and remove columns by clicking the drop-down menu at
the top of the first column (labelled ‘All’), and choosing
Edit columns > Re-order / remove columns…
.
You can then drag and drop column names to reorder the columns, or
remove columns completely if they are not required. For example, move
the Publisher
column closer to the beginning, and remove the ISSNs
column.
You can also sort the rows by clicking on the drop-down menu for the
column you want to sort on, and choosing Sort
.The sorting is not
permanent, and a new Sort
drop-down menu appears at the top, which
allows you to modify the current sort, remove it, or make the sort
permanent.
What are filters and facets in OpenRefine? How can they help me navigate data? How can I correct common data issues?
Facets and filters will allow you to gain insights on your data, zero-in onto issues you want to fix by hand, or only apply automated edits to one subset of your dataset.
Facet and filter information always appears in the left-hand panel in the OpenRefine interface, and they can be combined for more complex searches.
You can apply a text filter to your data to do a search and restrict the displayed items to it.
In the column drop-down menu, you can use the Text filter
item and
specify any string you want to filter for.
For example, try filtering the Subjects
column with the term physic
.
The number of matching rows is displayed above the table.
A facet groups all the values that appear in a column, and allows you to subset the data and edit values across many records at the same time.
The simplest type of Facet is called a ‘Text facet’. It groups all the text values in a column and lists each value with the number of records it appears in.
Let’s create a Text Facet for a column:
- Click on the drop-down menu at the top of the
Publisher
column and chooseFacet > Text Facet
- Filter the data by clicking on a value
- Add more values with
include
- You can
invert
your selection - Click
reset
to deselect all values.
Exercise 2 – Explore licenses
Which licences are used for articles in this dataset? Which one is the most common? How many articles don’t have a licence?
- In the
License
column drop-down menu, chooseFacet > Text facet
CC BY
has the highest count- 6 articles don’t have a
License
value, marked(blank)
- Notice that filtering the data with this facet will update the
values in the
Publisher
facet. This makes it easy to see which publishers use which licences. - Close both facets with the “Remove All” button
There are more facets available, like numeric and timeline facets that produce graphs, or even custom facets for more advanced operations, like facets that return a logical value (“true” or “false”).
Exercise 3 – Facet by blank
Find all the publications without a DOI (digital object identifier) with
the Facet by blank
. How many are there?
Facet > Customized facet > Facet by blank
This is more efficient than a text facet, as it will only give two categories: “true” for blank cells, “false” for cells with a value. You could then focus on the blank cells to populate the missing data by hand (an “Edit” button appears when you hover over a cell).
When you filter your data with facets or text filters, remember that any operation that you carry out will only apply to the matches.
Using a text facet, you can edit values for a whole subset in one action
with the edit
option. This is very useful for fixing small typos or
variations in spelling.
Exercise 4 – Correct values
Use a text facet on the Language
column to replace English
values
with the language code EN
.
Facet > Text facet
- Hover over the term
English
and clickedit
- Replace it with
EN
and clickApply
All the occurrences of English
are replaced by EN
, and the facet
auto-updates to only show the remaining values.
How do I deal with cells that contain multiple values?
Author names are currently separated with the pipe symbol |
. We can
use the Split multi-valued cells
function to put them in their own
cells.
Exercise 5 – Split multi-valued cells
- Open the drop-down menu for the
Authors
column - Select
Edit cells
>Split multi-valued cells
- Use the pipe character
|
as a separator - Click OK
Note that we now have a lot more rows, and that they are still numbered sequentially.
How would we join data from several cells together, for example after
cleaning the data up? We can do the opposite operation, with
Edit cells
> Join multi-valued cells
, and deciding which character
we want to use to separate values.
Make sure you choose the right delimiter for the kind of data you deal with!
Exercise 6 – Practise splitting and joining
How would you split and then join again the data in the Subjects
column?
OpenRefine has two view modes for viewing data: Rows
and
Records
. In Records
mode, OpenRefine can link together multiple rows
as belonging to the same record.
Switch to the records
mode. Note how the numbering has changed:
several rows are related to the same record, because of the splitting we
just did.
If you now filter for an author name, notice how the “rows” and “records” view will change how much is displayed? We can stick to “records” to have all the associated information.
What can I use to identify and replace varying forms of the same data?
Fixing variations by hand like we did for the languages works fine for small numbers of values, but if there are hundreds or more, it becomes very impractical.
The Cluster function uses an algorithm to group together similar, but inconsistent values in a given column and lets you merge these inconsistent values into a single value you choose.
This is very effective where you have data with minor variations in values, e.g. names of people, organisations, places, classification terms…
Exercise 7 – clean up the author names
- Make sure your authors are already split into different rows
- Choose
Edit cells > Cluster and edit
from theAuthors
column - Using the defaults (
key collision
Method andfingerprint
Keying Function), click “Cluster” and work through the clusters of values, merging them into a single value where appropriate
By default, OpenRefine uses the most common value to merge the data, but we can choose which value we prefer by clicking on it, or we can write our own replacement value.
- You can also try changing the clustering method being used – which ones work well? Do they identify extra clusters?
The best clustering algorithm will depend on the kind of data your are processing. For more information on available algorithms, see the documentation page on clustering methods.
Facets, filters and clustering already allow us to explore and clean up our data.
To do more advanced operations, like splitting data into several columns, standardising a format without losing the original values, or extracting a data type from a string, we need transformations.
Transformations are ways of manipulating data in columns. They are normally written in a special language called GREL (General Refine Expression Language).
Some common transformations are accessible directly in the menus, for example to change the case of the values, or to remove bounding whitespace characters.
Exercises 8 – Correct publisher data
- Create a text facet on the
Publisher
column - Note that in the values there are two that look identical – why does this value appear twice?
- On the
Publisher
column, use the drop-down menu to selectEdit cells > Common transforms > Collapse consecutive whitespace
- Look at the publisher facet now – has it changed? (if it hasn’t
changed, try clicking the
Refresh
option to make sure it updates)
The transformation screen is available through
Edit cells > Transform...
, in which you can write your custom GREL
command with various functions and preview its effect.
GREL supports two types of syntax:
value.function(options)
function(value, options)
Either is valid, and which is used is completely down to personal preference. In these notes the first syntax is used, as it is easier to read when writing long commands.
The most simple “transformation” is value
, which will keep the data as
it is. We need to add extra functions to the command to actually modify
the original value.
Exercise 9 – Put titles into Title Case
Use Facets and the GREL function toTitlecase()
to put some titles in
Title Case:
- Facet by
Publisher
- Select “Akshantala Enterprises” and “Society of Pharmaceutical Technocrats” (To select multiple values in the facet, use the “include” link that appears to the right of the facet.)
- See that the Titles for these are all in uppercase
- Click the drop-down menu on the
Title column
- Choose
Edit cells > Transform...
- In the Expression box type
value.toTitlecase()
- In the
Preview
tab, see the effect of running this will be - Click “OK”
Note that this transformation is also available in the
Common transforms
.
Back in the Transform dialog, you can now see a history of transformations, and save your favourite ones by clicking on the star next to them. The
Help
tab is also a helpful reference.
In the left-hand panel, you can open the Undo / Redo
tab to access all
the steps taken so far, and undo steps by clicking the last step you
want to preserve.
The Extract...
button allows you to save a set of steps as a JSON
(JavaScript Object Notation) script, to re-apply them later on this or
any other dataset. To execute an extracted set of steps, click the
Apply
button.
How do I transform different data types?
Data types and regular expressions will allow you to write more complex GREL transformations.
All data in OpenRefine has a “type”. The most common is “string”, which is a piece of text. Data types supported are:
- String
- Number
- Date-time
- Boolean (logical)
- Array
Date and numbers: we currently have a Date
column where the data
is represented as a string. If we wanted to sort according to this data,
it would not end up chronological. We therefore need to convert the
values to a date data type for OpenRefine to interpret them properly.
Exercise 10 – Reformat the date
- Make sure you remove all Facets and Filters
- On the
Date
column, use the drop-down menu to selectEdit cells > Common transforms > To date
- Note how the values are now displayed in green and follow a standard convention for their display format (ISO8601) - this indicates they are now stored as date-time data types in OpenRefine. We can now carry out functions that are specific to Dates.
- In the
Date
column drop-down, selectEdit column > Add column based on this column
. Using this dialog, you can create a new column, while preserving the old column. - In the ‘New column name’ type “Formatted Date”
- In the ‘Expression’ box, type the GREL expression
value.toString("dd MMMM yyyy")
Having a column with the proper data type will also allow us to use features like the timeline facet.
Booleans: a boolean is a binary value that can either be “true” or “false”. They are often used in GREL expressions, for example:
value.contains("test")
… generates a boolean value which depends on whether the cell value contains the string “test” anywhere.
That allows use to build more complex transformations, for example by carrying out a transformation only if a test is successful:
if(value.contains("test"),"replacement text",value)
… replaces a cell value with the string “replacement text” only if the value in the cell contains the string “test” anywhere.
There is another inconsistency in our “Authors” column: a few names are
using the order <last name>, <first name>
, and all others use the
<first name> <last name>
order. We’d like to stick to the same order
for all, but to do that, we need to talk about arrays.
How do I use arrays in a GREL expression?
An array is a list of values. Arrays can be sorted, de-duplicated and manipulated in other ways in GREL expressions. They usually are the result of a transformation, and cells can’t contain them.
For example, if a cell contains the following string:
“Monday,Tuesday,Wednesday,Thursday,Friday,Saturday,Sunday”
… it can be transformed into an array with the split
function:
value.split(",")
… which would result in the following array:
[“Monday”,”Tuesday”,”Wednesday”,”Thursday”,”Friday”,”Saturday”,”Sunday”]
Combining the split
function with a sort
function like so:
value.split(",").sort()
… would result in an array of days of the week sorted alphabetically:
[“Friday”,”Monday”,”Saturday”,”Sunday”,”Thursday”,”Tuesday”,”Wednesday”]
You can output a value from an array by specifying its position in the
list. To extract the first value of the array created by the split
function:
value.split(",")[0]
Many programming languages, like GREL, start indexing at 0, which is why we have to use the 0th position to return the first element of the array.
If you want to go back to one single string, in order to store the
result in cells, you can use the join
function:
value.split(",").sort().join(",")
Exercise 11 – Reverse author names
We can focus on the rows that have a comma in their author name:
- Using the “Authors” column drop-down menu, use a
Text filter
- Filter for the comma character: “,”.
Now that we have narrowed down to the lines with a comma in the author
name, we can use the split()
function.
- On the Authors column use the drop-down menu and select
Edit cells > Transform
- In the Expression box type
value.split(", ")
. This will create an array of separate last name and first name, removing the comma and space.
To get the author name in the natural order, you can reverse the array and join it back together with a space to create the string you need:
- In the Expression box, add to the existing expression until it reads
value.split(", ").reverse().join(" ")
- In the Preview section, you should be able see this has reversed the array, and joined it back into a string, without any comma
- Click
OK
to apply your transformation, and notice how your filter shows 0 results, because all commas have been removed in the “Authors” column.
For more elaborate array creation, we can use the
match()
function along with Regular Expressions. Regular Expressions, or “Regex”, use a syntax to match pretty much any pattern of characters.
We can associate one of our columns to an external database by using OpenRefine’s reconciliation feature:
Exercise 12 – Reconcile publisher data against Wikidata
- On the “Publisher” column, select
Reconcile > Start reconciling...
. The default service listed is Wikidata. - The tool will automatically try to match the column to a Wikidata type, but we can specify exactly the one we want to use.
- After checking the existing matches, and completing by hand the
missing ones, we can add extra data to our project by using
Edit column > Add columns for reconciled values
. For example, try adding the headquarters location of the publishers into a new column.
How do I export transformed data?
OpenRefine can export a transformed dataset into a variety of
formats. You can access the menu with the top-right Export
button. The
Custom Tabular Exporter
offers extra options, like selecting a subset
of columns, reordering them, removing headers, selecting a format and
uploading as a Google Spreadsheet.
Your project is automatically saved every 15 minutes and can be
reopened with Open Project
when you next start OpenRefine. However, if
you need to share or save a snapshot of your project, you can use
Export > Export project
to download an archive.
When you quit OpenRefine, make sure you interrupt the OpenRefine process properly with Ctrl + C (for both Windows and Linux) or Command + Q (for MacOS) in the terminal, to guarantee that your project changes are saved. See the official documentation on starting and exiting OpenRefine.
OpenRefine learning:
- Full lesson this one is based on:
https://librarycarpentry.github.io/lc-open-refine/
- “Advanced functions” chapter: https://librarycarpentry.github.io/lc-open-refine/13-looking-up-data/index.html
- Video walk throughs: http://openrefine.org/
- Video tutorial on reconciliation and Wikidata contribution: https://www.youtube.com/watch?v=wfS1qTKFQoI
- Getting started with OpenRefine by Thomas Padilla: http://thomaspadilla.org/dataprep/
- Cleaning Data with OpenRefine by Seth van Hooland, Ruben Verborgh and Max De Wilde: http://programminghistorian.org/lessons/cleaning-data-with-openrefine
- Free your Metadata website: http://freeyourmetadata.org/
- OpenRefine Blog: http://openrefine.org/category/blog.html
- Official OpenRefine documentation:
- GREL documentation: https://docs.openrefine.org/manual/expressions#grel-general-refine-expression-language
- Clustering in Depth: https://github.com/OpenRefine/OpenRefine/wiki/Clustering-In-Depth
- A compilation of “OpenRefine recipes”: https://github.com/OpenRefine/OpenRefine/wiki/Recipes
Regular Expressions:
- Understanding Regular Expressions: https://github.com/OpenRefine/OpenRefine/wiki/Understanding-Regular-Expressions
- RegEx + GREL cheatsheet: https://code4libtoronto.github.io/2018-10-12-access/GoogleRefineCheatSheets.pdf
Other data sources for reconciliation:
Examples of OpenRefine uses:
- Identifying potential headings for Authority work using III Sierra, MS Excel and OpenRefine: http://epublications.marquette.edu/lib_fac/81/
- Data Munging Tools in Preparation for RDF: Catmandu and LODRefine by Christina Harlow: http://journal.code4lib.org/articles/11013
- Blog posts on using OpenRefine from Owen Stephens: http://www.meanboyfriend.com/overdue_ideas/tag/openrefine/?orderby=date&order=ASC
Another dataset to play around with:
This lesson is based on a Library Carpentry lesson, available here: https://librarycarpentry.github.io/lc-open-refine/ It is also released under a Creative Commons Attribution license. The following is a human-readable summary of (and not a substitute for) the full legal text of the CC BY 4.0 license.
You are free:
- to Share – copy and redistribute the material in any medium or format
- to Adapt – remix, transform, and build upon the material
for any purpose, even commercially.
Under the following terms:
- Attribution – You must give appropriate credit (mentioning that your work is derived from work that is Copyright © Stéphane Guillou and, where practical, linking to https://gitlab.com/stragu/DSH), provide a link to the license, and indicate if changes were made. You may do so in any reasonable manner, but not in any way that suggests the licensor endorses you or your use.