-
-
Notifications
You must be signed in to change notification settings - Fork 11
/
Copy pathdatatools.qmd
136 lines (88 loc) · 5.61 KB
/
datatools.qmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
---
title: "Introductory exercise on Using ASReview Datatools"
author: "The ASReview Academy Team"
---
## Introduction
The **goal** of this exercise is to learn how to use ASReview Datatools for processing a dataset.
When performing a systematic review, it is vital to proparly prepare your dataset. A dataset needs to contain at least a 'title' and 'abstract' column. Another column that comes in handy is for example 'doi'. After you created such a dataset, it can be the case that some preprocessing or - after screening the literature - postprocessing is needed. It can for example be that the dataset contains a lot of duplicates.
In this exercise, you will learn how to describe, compose and convert datasets. If writing scripts is new to you, don’t worry, you will be guided through the process in the command-line interface.
Enjoy!
### Step 1: Create a folder structure
We recommend you to create a designated folder to run a simulation in,
to keep your project organised. Create a folder and give it a name, for
example “Data_preprocessing_PTSD”. Within this folder, create two
sub-folders called ‘data’ and ‘output’. Also using NotePad (on Windows)
or Text Edit (on Mac/OS) to create an empty text file called `jobs.txt`
in the main folder. This is where you will save the scripts you run.
### Step 2: Choose a dataset
Before we get started, we need a dataset that is already labeled. You can use a dataset from the SYNERGY dataset via the synergy-dataset Python package.
1. *Open the command prompt, by typing ‘cmd’ (for Windows) or
‘terminal’ (for MacOS) in your computer’s search bar;*
2. *Navigate to the folder structure;*
You can navigate to your folder structure with:
``` bash
cd [file_path]
```
For MacOS:
1. Right-click on created folder.
2. Click “New Terminal at Folder”.
Then the data set will be automatically be put into the designated data folder.
3. *Install the synergy-dataset Python package with:*
``` bash
pip install synergy-dataset
```
4. *Build the dataset.*
To download and build the SYNERGY dataset, run the following command in the command line:
``` bash
python -m synergy_dataset get -o data
```
You can also choose to use your own dataset(s). For the purpose of this exercise, do make sure that the
dataset consists of at least a 'title' and 'abstract' and 'doi' column.
### Step 3: Write and run the script
For the next steps, you first need to install ASReview datatools. You can do so with:
``` bash
pip install asreview-datatools
```
Note: if you installed ASReview datatools before, make sure you install the latest version. You can do so with:
``` bash
pip install --upgrade asreview-datatools
```
Before you are going to write your script, here is a general explanation of how you can get started with ASReview datatools. The parts between the brackets need to be filled out by you when using ASReview Datatools.
```bash
asreview data [name_of_tool]
```
where `[name_of_tool]` is the name of the tool you want to use (`describe`, `convert`, `dedup`, `vstack`, or `compose`)
followed by positional arguments and optional arguments.
Each tool has its own help description which is available with:
```bash
asreview data [name_of_tool] -h
```
### Tool 1: Describe
With the describe tool you get information about the properties of your dataset. For example, it shows you how many total, relevant, irrelevant and unlabeled records your dataset contains.
You can describe the content of a dataset with:
``` bash
asreview data describe data/[name_of_your_data.csv]
```
Note: 'data/' is in front of the name of your dataset, since the datasets are stored in the 'data' folder you created in Step 1 and not in the directory you are currently working from.
### Tool 2: Compose
Next, you are going to compose a dataset out of two datasets; a labeled and an unlabeled dataset. The code starts similarly to the code of the Describe tool (`asreview data compose`), but now the output path should always be specified (`output/[name_of_your_output_file.csv]`) and you need to assign a corresponding property to the datasets you want to merge. To the labeled dataset you need to assign the argument `-l`, and to the unlabeled dataset you need to assign a `-u`.
Furthermore, in case of conflict, you need to determine which label to keep (`-c`). Let's say you want to keep one of the labels, you can do so with the `keep_one` command. By default the hierarchy in determining which label to keep is: relevant, irrelevant, unlabeled.
See the [Datatools README](https://github.com/asreview/asreview-datatools#data-compose-experimental){target="_blank"} for other possible arguments.
You can compose a dataset with:
``` bash
asreview data compose output/[name_of_your_output_file.csv] -l data/[name_of_your_data.csv] -u data/[name_of_your_data.csv] -c keep_one
```
### Tool 3: Convert
Now you are going to convert the format of your dataset. You are going to convert a RIS dataset into a CSV dataset.
Find a RIS file on you laptop, put it in the data folder you created in Step 1 and try the following command:
``` bash
asreview data convert data/[name_of_your_data].ris output/[name_of_your_output_file].csv
```
If you don't have a RIS file, you can practice with the code by converting a CSV file from the Synergy dataset into an Excel file:
``` bash
asreview data convert data/[name_of_your_data].csv output/[name_of_your_output_file].xlsx
```
## What else?
We introduced some of the datatools, but there are more tools available in the [ASReview Datatools](https://github.com/asreview/asreview-datatools){target="_blank"} package.
**If you like the functionality of Datatools, don’t forget to give it a
star on GitHub!**