-
Notifications
You must be signed in to change notification settings - Fork 16
/
openrefine.qmd
441 lines (271 loc) · 22.7 KB
/
openrefine.qmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
387
388
389
390
391
392
393
394
395
396
397
398
399
400
401
402
403
404
405
406
407
408
409
410
411
412
413
414
415
416
417
418
419
420
421
422
423
424
425
426
427
428
429
430
431
432
433
434
435
436
437
438
439
440
441
---
title: "OpenRefine: clean up and augment data"
format: gfm
editor: source
date: today
author: UQ Library
---
## OpenRefine?
[OpenRefine](https://openrefine.org/) is an open source program self-described as "a power tool for dealing with messy data".
It can help you:
* Get an **overview** of a data set, and explore subsets of it
* **Resolve inconsistencies** in a data set, for example standardising date formatting or fixing typos
* **Split data up** into more granular parts, for example splitting up cells with multiple authors into separate cells
* **Match** local data up to other data sets, for example in matching local subjects against the Library of Congress Subject Headings
* **Enhance** a data set populating it with data from other sources
Some common scenarios might be:
* Where you want to know how many times a particular value (name, publisher, subject) appears in a column in your data
* Where you want to know how values are distributed across your whole data set
* Where you have a list of dates which are formatted in different ways, and want to change all the dates in the list to a single common date format
* Where you have a list of names or terms that differ from each other but refer to the same people, places or concepts
* Where you have several bits of data combined together in a single column, and you want to separate them out into new columns.
## What you will learn today
1. Import and export data
1. What "facets" and "filters" are
1. Edit cells
1. Edit columns
1. Group and transform data
1. Reconcile a column with Wikidata
## Installation
This lesson was designed using OpenRefine 3.8 but should be compatible with other supported versions.
OpenRefine requires to have a Java Runtime Environment (JRE) installed. Depending on your operating system, you might need to install one from the [Java website](https://java.com/en/), or use the Windows kit with embedded Java.
To install and run OpenRefine on your operating system, please follow the [instructions on the official website](https://openrefine.org/download).
## Create a project
OpenRefine can deal with formats like TSV, CSV, ODS, XLS and XLSX, JSON, RDF, XML and more.
**Exercise 1** – Create a project and import data
1. In the "Create project" screen that opens when starting OpenRefine, select `Web Addresses (URL)`
1. Enter the URL to our data: https://raw.githubusercontent.com/LibraryCarpentry/lc-open-refine/gh-pages/data/doaj-article-sample.csv
1. Click "Next"
1. Click in the `Character encoding` box and select `UTF-8`
1. This screen is also where you can ignore rows, name the project, assign tags...
1. Click on `Create Project`
You might not have to change the character encoding, but if you see inaccuracies in how special characters are displayed, you will have to find the right one for your data.
## Layout of OpenRefine
> How is data organised? How do I navigate it?
OpenRefine displays data in a tabular format: rows usually are records, and columns are a type of information.
You can decide how many rows you want to show at once, and navigate the pages with the controls above the table.
Drop-down menus next to the column headers allow you to do operations on the whole column.
## Reorganising the data
You can **reorder and remove columns** by clicking the drop-down menu at the top of the first column (labelled ‘All’), and choosing `Edit columns > Re-order / remove columns…`.
You can then drag and drop column names to reorder the columns, or remove columns completely if they are not required. For example, move the `Publisher` column closer to the beginning, and remove the `ISSNs` column.
You can also **sort the rows** by clicking on the drop-down menu for the column you want to sort on, and choosing `Sort`.The sorting is not permanent, and a new `Sort` drop-down menu appears at the top, which allows you to modify the current sort, remove it, or make the sort permanent.
## Facetting and filtering
> What are filters and facets in OpenRefine? How can they help me navigate data?
> How can I correct common data issues?
Facets and filters will allow you to gain insights on your data, zero-in onto issues you want to fix by hand, or only apply automated edits to one subset of your dataset.
Facet and filter information always appears in the left-hand panel in the OpenRefine interface, and they can be combined for more complex searches.
### Filters
You can apply a **text filter** to your data to do a search and restrict the displayed items to it.
In the column drop-down menu, you can use the `Text filter` item and specify any string you want to filter for.
For example, try filtering the `Subjects` column with the term `physic`. The number of matching rows is displayed above the table.
### Facets
A **facet** groups all the values that appear in a column, and allows you to subset the data and edit values across many records at the same time.
The simplest type of Facet is called a ‘Text facet’. It groups all the text values in a column and lists each value with the number of records it appears in.
Let's create a Text Facet for a column:
1. Click on the drop-down menu at the top of the `Publisher` column and choose `Facet > Text Facet`
1. Filter the data by clicking on a value
1. Add more values with `include`
1. You can `invert` your selection
1. Click `reset` to deselect all values.
**Exercise 2** – Explore licenses
> Which licences are used for articles in this dataset? Which one is the most common? How many articles don't have a licence?
1. In the `License` column drop-down menu, choose `Facet > Text facet`
1. `CC BY` has the highest count
1. 6 articles don't have a `License` value, marked `(blank)`
1. Notice that filtering the data with this facet will update the values in the `Publisher` facet. This makes it easy to see which publishers use which licences.
1. Close both facets with the "Remove All" button
There are more facets available, like numeric and timeline facets that produce graphs, or even custom facets for more advanced operations, like facets that return a logical value ("true" or "false").
**Exercise 3** – Facet by blank
Find all the publications without a DOI (digital object identifier) with the `Facet by blank`. How many are there?
1. `Facet > Customized facet > Facet by blank`
This is more efficient than a text facet, as it will only give two categories: "true" for blank cells, "false" for cells with a value. You could then focus on the blank cells to populate the missing data by hand (an "Edit" button appears when you hover over a cell).
> When you filter your data with facets or text filters, remember that any operation that you carry out will only apply to the matches.
## Amending data
Using a text facet, you can edit values for a whole subset in one action with the `edit` option. This is very useful for fixing small typos or variations in spelling.
**Exercise 4** – Correct values
Use a text facet on the `Language` column to replace `English` values with the language code `EN`.
1. `Facet > Text facet`
1. Hover over the term `English` and click `edit`
1. Replace it with `EN` and click `Apply`
All the occurrences of `English` are replaced by `EN`, and the facet auto-updates to only show the remaining values.
## Split multi-valued cells
> How do I deal with cells that contain multiple values?
Author names are currently separated with the pipe symbol `|`. We can use the `Split multi-valued cells` function to put them in their own cells.
**Exercise 5** – Split multi-valued cells
1. Open the drop-down menu for the `Authors` column
1. Select `Edit cells` > `Split multi-valued cells`
1. Use the pipe character `|` as a separator
1. Click OK
Note that we now have a lot more rows, and that they are still numbered sequentially.
How would we join data from several cells together, for example after cleaning the data up? We can do the opposite operation, with `Edit cells` > `Join multi-valued cells`, and deciding which character we want to use to separate values.
> Make sure you choose the right delimiter for the kind of data you deal with!
**Exercise 6** – Practise splitting and joining
How would you split and then join again the data in the `Subjects` column?
## View modes
OpenRefine has two **view modes** for viewing data: `Rows` and `Records`. In `Records` mode, OpenRefine can link together multiple rows as belonging to the same record.
Switch to the `records` mode. Note how the numbering has changed: several rows are related to the same record, because of the splitting we just did.
If you now filter for an author name, notice how the "rows" and "records" view will change how much is displayed? We can stick to "records" to have all the associated information.
## Clustering
> What can I use to identify and replace varying forms of the same data?
Fixing variations by hand like we did for the languages works fine for small numbers of values, but if there are hundreds or more, it becomes very impractical.
The **Cluster** function uses an algorithm to group together similar, but inconsistent values in a given column and lets you merge these inconsistent values into a single value you choose.
This is very effective where you have data with minor variations in values, e.g. names of people, organisations, places, classification terms...
**Exercise 7** – clean up the author names
1. Make sure your authors are already split into different rows
1. Choose `Edit cells > Cluster and edit` from the `Authors` column
1. Using the defaults (`key collision` Method and `fingerprint` Keying Function), click "Cluster" and work through the clusters of values, merging them into a single value where appropriate
By default, OpenRefine uses the most common value to merge the data, but we can choose which value we prefer by clicking on it, or we can write our own replacement value.
4. You can also try changing the clustering method being used – which ones work well? Do they identify extra clusters?
> The best clustering algorithm will depend on the kind of data your are processing. For more information on available algorithms, see the [documentation page on clustering methods](https://docs.openrefine.org/manual/cellediting#clustering-methods).
## Transformations
Facets, filters and clustering already allow us to explore and clean up our data.
To do more advanced operations, like splitting data into several columns, standardising a format without losing the original values, or extracting a data type from a string, we need **transformations**.
Transformations are ways of manipulating data in columns. They are normally written in a special language called **GREL** (General Refine Expression Language).
Some common transformations are accessible directly in the menus, for example to change the case of the values, or to remove bounding whitespace characters.
**Exercises 8** – Correct publisher data
1. Create a text facet on the `Publisher` column
2. Note that in the values there are two that look identical – why does this value appear twice?
3. On the `Publisher` column, use the drop-down menu to select `Edit cells > Common transforms > Collapse consecutive whitespace`
4. Look at the publisher facet now – has it changed? (if it hasn’t changed, try clicking the `Refresh` option to make sure it updates)
### Writing transformations
The transformation screen is available through `Edit cells > Transform...`, in which you can write your **custom GREL command** with various **functions** and preview its effect.
GREL supports two types of syntax:
* `value.function(options)`
* `function(value, options)`
Either is valid, and which is used is completely down to personal preference. In these notes the first syntax is used, as it is easier to read when writing long commands.
The most simple "transformation" is `value`, which will keep the data as it is. We need to add extra functions to the command to actually modify the original value.
**Exercise 9** – Put titles into Title Case
Use Facets and the GREL function `toTitlecase()` to put some titles in Title Case:
1. Facet by `Publisher`
1. Select “Akshantala Enterprises” and “Society of Pharmaceutical Technocrats” (To select multiple values in the facet, use the "include" link that appears to the right of the facet.)
1. See that the Titles for these are all in uppercase
1. Click the drop-down menu on the `Title column`
1. Choose `Edit cells > Transform...`
1. In the Expression box type `value.toTitlecase()`
1. In the `Preview` tab, see the effect of running this will be
1. Click "OK"
Note that this transformation is also available in the `Common transforms`.
> Back in the Transform dialog, you can now see a history of transformations, and save your favourite ones by clicking on the star next to them. The `Help` tab is also a helpful reference.
### Undo and redo actions
In the left-hand panel, you can open the `Undo / Redo` tab to access all the steps taken so far, and **undo steps** by clicking the last step you want to preserve.
The `Extract...` button allows you to **save a set of steps** as a JSON (JavaScript Object Notation) script, to re-apply them later on this or any other dataset. To execute an extracted set of steps, click the `Apply` button.
### Transforming strings, numbers, dates and boolean
> How do I transform different data types?
**Data types** and **regular expressions** will allow you to write more complex GREL transformations.
All data in OpenRefine has a "type". The most common is "string", which is a piece of text. Data types supported are:
* String
* Number
* Date-time
* Boolean (logical)
* Array
**Date and numbers**: we currently have a `Date` column where the data is represented as a string. If we wanted to sort according to this data, it would not end up chronological. We therefore need to convert the values to a date data type for OpenRefine to interpret them properly.
**Exercise 10** – Reformat the date
1. Make sure you remove all Facets and Filters
1. On the `Date` column, use the drop-down menu to select `Edit cells > Common transforms > To date`
1. Note how the values are now displayed in green and follow a standard convention for their display format (ISO8601) - this indicates they are now stored as date-time data types in OpenRefine. We can now carry out functions that are specific to Dates.
1. In the `Date` column drop-down, select `Edit column > Add column based on this column`. Using this dialog, you can create a new column, while preserving the old column.
1. In the ‘New column name’ type “Formatted Date”
1. In the ‘Expression’ box, type the GREL expression `value.toString("dd MMMM yyyy")`
Having a column with the proper data type will also allow us to use features like the timeline facet.
**Booleans**: a boolean is a binary value that can either be "true" or "false". They are often used in GREL expressions, for example:
```
value.contains("test")
```
... generates a boolean value which depends on whether the cell value contains the string "test" anywhere.
That allows use to build more complex transformations, for example by carrying out a transformation _only if a test is successful_:
```
if(value.contains("test"),"replacement text",value)
```
... replaces a cell value with the string "replacement text" only if the value in the cell contains the string "test" anywhere.
### Handling arrays
There is another inconsistency in our "Authors" column: a few names are using the order `<last name>, <first name>`, and all others use the `<first name> <last name>` order. We'd like to stick to the same order for all, but to do that, we need to talk about arrays.
> How do I use arrays in a GREL expression?
An **array** is a list of values. Arrays can be sorted, de-duplicated and manipulated in other ways in GREL expressions. They usually are the _result_ of a transformation, and cells can't contain them.
For example, if a cell contains the following string:
```
“Monday,Tuesday,Wednesday,Thursday,Friday,Saturday,Sunday”
```
... it can be transformed into an array with the `split` function:
```
value.split(",")
```
... which would result in the following array:
```
[“Monday”,”Tuesday”,”Wednesday”,”Thursday”,”Friday”,”Saturday”,”Sunday”]
```
Combining the `split` function with a `sort` function like so:
```
value.split(",").sort()
```
... would result in an array of days of the week sorted alphabetically:
```
[“Friday”,”Monday”,”Saturday”,”Sunday”,”Thursday”,”Tuesday”,”Wednesday”]
```
You can output a value from an array by specifying its position in the list. To extract the first value of the array created by the `split` function:
```
value.split(",")[0]
```
> Many programming languages, like GREL, start indexing at 0, which is why we have to use the 0th position to return the first element of the array.
If you want to go back to one single string, in order to store the result in cells, you can use the `join` function:
```
value.split(",").sort().join(",")
```
**Exercise 11** – Reverse author names
We can focus on the rows that have a comma in their author name:
1. Using the "Authors" column drop-down menu, use a `Text filter`
1. Filter for the comma character: ",".
Now that we have narrowed down to the lines with a comma in the author name, we can use the `split()` function.
3. On the Authors column use the drop-down menu and select `Edit cells > Transform`
4. In the Expression box type `value.split(", ")`. This will create an array of separate last name and first name, removing the comma and space.
To get the author name in the natural order, you can reverse the array and join it back together with a space to create the string you need:
5. In the Expression box, add to the existing expression until it reads `value.split(", ").reverse().join(" ")`
6. In the Preview section, you should be able see this has reversed the array, and joined it back into a string, without any comma
7. Click `OK` to apply your transformation, and notice how your filter shows 0 results, because all commas have been removed in the "Authors" column.
> For more elaborate array creation, we can use the `match()` function along with Regular Expressions. Regular Expressions, or "Regex", use a syntax to match pretty much any pattern of characters.
## Reconciliation
We can associate one of our columns to an external database by using OpenRefine's reconciliation feature:
**Exercise 12** – Reconcile publisher data against Wikidata
* On the "Publisher" column, select `Reconcile > Start reconciling...`. The default service listed is Wikidata.
* The tool will automatically try to match the column to a Wikidata type, but we can specify exactly the one we want to use.
* After checking the existing matches, and completing by hand the missing ones, we can add extra data to our project by using `Edit column > Add columns for reconciled values`. For example, try adding the headquarters location of the publishers into a new column.
## Exporting the data and saving a project
> How do I export transformed data?
OpenRefine can **export a transformed dataset** into a variety of formats. You can access the menu with the top-right `Export` button. The `Custom Tabular Exporter` offers extra options, like selecting a subset of columns, reordering them, removing headers, selecting a format and uploading as a Google Spreadsheet.
Your project is **automatically saved every 15 minutes** and can be reopened with `Open Project` when you next start OpenRefine. However, if you need to share or save a snapshot of your project, you can use `Export > Export project` to download an archive.
> When you quit OpenRefine, make sure you interrupt the OpenRefine process properly with <kbd>Ctrl</kbd> + <kbd>C</kbd> (for both Windows and Linux) or <kbd>Command</kbd> + <kbd>Q</kbd> (for MacOS) in the terminal, to guarantee that your project changes are saved.
> See the official [documentation on starting and exiting OpenRefine](https://docs.openrefine.org/manual/running#starting-and-exiting).
## Useful links
OpenRefine learning:
* Full lesson this one is based on: https://librarycarpentry.github.io/lc-open-refine/
* "Advanced functions" chapter: https://librarycarpentry.github.io/lc-open-refine/13-looking-up-data/index.html
* Video walk throughs: http://openrefine.org/
* Video tutorial on reconciliation and Wikidata contribution: https://www.youtube.com/watch?v=wfS1qTKFQoI
* Getting started with OpenRefine by Thomas Padilla: http://thomaspadilla.org/dataprep/
* Cleaning Data with OpenRefine by Seth van Hooland, Ruben Verborgh and Max De Wilde: http://programminghistorian.org/lessons/cleaning-data-with-openrefine
* Free your Metadata website: http://freeyourmetadata.org/
* OpenRefine Blog: http://openrefine.org/category/blog.html
* Official OpenRefine documentation:
* GREL documentation: https://docs.openrefine.org/manual/expressions#grel-general-refine-expression-language
* Clustering in Depth: https://github.com/OpenRefine/OpenRefine/wiki/Clustering-In-Depth
* A compilation of "OpenRefine recipes": https://github.com/OpenRefine/OpenRefine/wiki/Recipes
Regular Expressions:
* Understanding Regular Expressions: https://github.com/OpenRefine/OpenRefine/wiki/Understanding-Regular-Expressions
* RegEx + GREL cheatsheet: https://code4libtoronto.github.io/2018-10-12-access/GoogleRefineCheatSheets.pdf
Other data sources for reconciliation:
* https://github.com/OpenRefine/OpenRefine/wiki/Reconcilable-Data-Sources
Examples of OpenRefine uses:
* Identifying potential headings for Authority work using III Sierra, MS Excel and OpenRefine: http://epublications.marquette.edu/lib_fac/81/
* Data Munging Tools in Preparation for RDF: Catmandu and LODRefine by Christina Harlow: http://journal.code4lib.org/articles/11013
* Blog posts on using OpenRefine from Owen Stephens: http://www.meanboyfriend.com/overdue_ideas/tag/openrefine/?orderby=date&order=ASC
Another dataset to play around with:
* http://www.thomaspadilla.org/data/dataprep/authors-people.csv
## Licence
This lesson is based on a Library Carpentry lesson, available here: https://librarycarpentry.github.io/lc-open-refine/
It is also released under a [Creative Commons Attribution license](https://creativecommons.org/licenses/by/4.0/). The following is a human-readable summary of
(and not a substitute for) the [full legal text of the CC BY 4.0 license](https://creativecommons.org/licenses/by/4.0/legalcode).
You are free:
* to **Share** – copy and redistribute the material in any medium or format
* to **Adapt** – remix, transform, and build upon the material
for any purpose, even commercially.
Under the following terms:
* **Attribution** – You must give appropriate credit (mentioning that your work is derived from work that is Copyright © Stéphane Guillou and, where practical, linking to https://gitlab.com/stragu/DSH), provide a [link to the license](https://creativecommons.org/licenses/by/4.0/), and indicate if changes were made. You may do so in any reasonable manner, but not in any way that suggests the licensor endorses you or your use.