Commit

Add dataset train-test split.
dmacjam committed Nov 20, 2023
1 parent 0362d91 commit a109b8f
Showing 7 changed files with 30,755 additions and 5,724 deletions.
4 changes: 3 additions & 1 deletion README.md
@@ -23,7 +23,9 @@ The dataset is released publicly.

# Dataset
The dataset is available in the data folder. It contains all 2861 conversations.
-For full dataset see `data/mathdial.tsv` (tsv format) or `data/mathdial.jsonl` (jsonl format) and for a small sample see `data/example.jsonl`
+The dataset is split into train and test - see `data/train.csv` and `data/test.csv` (for csv format).
+For jsonl format, see `data/train.jsonl` and `data/test.jsonl`. To see a small sample of the dataset, look at `data/example.jsonl`
+Please note that each row in the file consists of full conversations between a teacher and a student delimited with special `|EOM|` notation.

## Data Structure
- `qid` - unique identifier of the problem
2,861 changes: 0 additions & 2,861 deletions data/mathdial.jsonl

This file was deleted.

2,862 changes: 0 additions & 2,862 deletions data/mathdial.tsv

This file was deleted.

5,894 changes: 5,894 additions & 0 deletions data/test.csv

Large diffs are not rendered by default.

599 changes: 599 additions & 0 deletions data/test.jsonl

Large diffs are not rendered by default.

21,997 changes: 21,997 additions & 0 deletions data/train.csv

Large diffs are not rendered by default.

2,262 changes: 2,262 additions & 0 deletions data/train.jsonl

Large diffs are not rendered by default.
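
The README change in this commit says that each row holds a full teacher-student conversation delimited with `|EOM|`, and that the jsonl split lives in `data/train.jsonl` and `data/test.jsonl`. A minimal sketch of reading one split and breaking a conversation into turns might look like the following. The field name `conversation` and the helper names `load_jsonl` and `split_turns` are assumptions for illustration; only `qid` is visible in this diff, so check the README's Data Structure section for the actual field names.

```python
import json
from pathlib import Path

EOM = "|EOM|"  # turn delimiter named in the README


def load_jsonl(path):
    """Read a JSON Lines file: one JSON object per non-empty line."""
    with open(path, encoding="utf-8") as f:
        return [json.loads(line) for line in f if line.strip()]


def split_turns(conversation, delimiter=EOM):
    """Split a full conversation string into stripped, non-empty turns."""
    return [turn.strip() for turn in conversation.split(delimiter) if turn.strip()]


if __name__ == "__main__":
    train_path = Path("data/train.jsonl")  # path taken from the README
    if train_path.exists():
        rows = load_jsonl(train_path)
        print(len(rows), "training conversations")
        # ASSUMPTION: the conversation text is stored under "conversation".
        first = rows[0].get("conversation", "")
        for turn in split_turns(first):
            print(turn)
```

If the file counts in this commit correspond to one conversation per line, `load_jsonl` should return 2,262 rows for the train split and 599 for the test split, which together match the 2861 conversations stated in the README.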
