Prepare packaging (#1)
* Add type hints

* Add poetry setup

* Fix bad character in tmdb data

* Add dev dependencies

* Make more packagable

* Fix formatting error in tmdb

* Further improve data handling

* use logger for printing

* Use click

* Adapt README

* Remove wrongful call of create_graph_data

* Adapt CI

* Fix problem with creating data before loading

* Install package in nox

* Use utf8 encoding everywhere

* Avoid double ci runs on PRs
dobraczka authored Jul 18, 2022
1 parent c99586e commit 074a82d
Showing 17 changed files with 1,556 additions and 564 deletions.
8 changes: 4 additions & 4 deletions .github/workflows/main.yml
@@ -5,9 +5,9 @@ name: Tests
# Controls when the action will run.
on:
push:
branches: [ master ]
branches:
- "master"
pull_request:
branches: [ master ]

# Allows you to run this workflow manually from the Actions tab
workflow_dispatch:
@@ -20,8 +20,8 @@ jobs:
strategy:
fail-fast: false
matrix:
python-version: [3.7, 3.8, 3.9]
os: [ubuntu-latest, macos-latest]
python-version: [3.7, '3.10']
os: [ubuntu-latest, macos-latest, windows-latest]

steps:
- uses: actions/checkout@v2
44 changes: 27 additions & 17 deletions README.md
@@ -2,30 +2,40 @@
Due to licensing we are not allowed to distribute the IMDB datasets (more info on their license can be found [here](https://help.imdb.com/article/imdb/general-information/can-i-use-imdb-data-in-my-software/G5JTRESSHJBBHTGX?pf_rd_m=A2FGELUUNOQJNL&pf_rd_p=3aefe545-f8d3-4562-976a-e5eb47d1bb18&pf_rd_r=2TNAA9FRS3TJWM3AEQ2X&pf_rd_s=center-1&pf_rd_t=60601&pf_rd_i=interfaces&ref_=fea_mn_lk1#))
What we can do is let you build the IMDB side of the entity resolution datasets yourself. Please be aware that the mentioned license applies to the IMDB data you produce.

# Dependencies
The only dependency is `requests`, although with `tqdm` you will have nice progress bars (this is optional).

# Getting the data
The TMDB and TVDB datasets are already provided in this repo and were created from the public APIs of [TheMovieDB](https://www.themoviedb.org/documentation/api) and [TVDB](https://www.thetvdb.com/api-information). What you have to do is create the IMDB data.
# Usage
You can simply install the package via pip:
```bash
pip install moviegraphbenchmark
```
and then run
```bash
moviegraphbenchmark
```
which will create the data in the default data path `~/.data/moviegraphbenchmark/data`.

If you love one-liners and trust random people on the internet (that promise to be nice) you can simply run:
You can also specify a different folder with:
```bash
curl -sSL https://raw.githubusercontent.com/ScaDS/MovieGraphBenchmark/master/src/main.py | python3 -
moviegraphbenchmark --data-path anotherpath
```
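
As a rough illustration of the default data path and the `--data-path` option described above, the sketch below resolves the target folder with `pathlib`. The helper name `resolve_data_path` is hypothetical and is not taken from the package; the only details drawn from this README are the default location `~/.data/moviegraphbenchmark/data` and the idea of an optional override.

```python
from pathlib import Path
from typing import Optional, Union


def resolve_data_path(data_path: Optional[Union[str, Path]] = None) -> Path:
    """Return the folder the data should live in.

    Hypothetical helper for illustration only: falls back to the default
    location named in the README when no override is given.
    """
    if data_path is None:
        # Default location stated in the README: ~/.data/moviegraphbenchmark/data
        return Path.home() / ".data" / "moviegraphbenchmark" / "data"
    # Expand a leading "~" so user-supplied paths behave like the default
    return Path(data_path).expanduser()


if __name__ == "__main__":
    print(resolve_data_path())
    print(resolve_data_path("anotherpath"))
```

A relative path such as `anotherpath` stays relative to the current working directory in this sketch; how the package itself handles this is not confirmed here.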

This will download this repo, execute `python src/create_graph.py`, which downloads the IMDB data and creates the missing datasets. Furthermore it cleans up and only leaves a `ScaDSMovieGraphBenchmark` in your current directory with the datasets.
For ease of use in your project, you can also load the data through this library (the data is created first if it is not present):

You can also specify a specific directory where data should go:
```python
from moviegraphbenchmark import load_data
ds = load_data()
# by default this will load `imdb-tmdb`
print(ds.attr_triples_1)

```bash
curl -sSL https://raw.githubusercontent.com/ScaDS/MovieGraphBenchmark/master/src/main.py | python3 - mypath/benchmarkfolder
```
# specify other pair and specific data path
ds = load_data(pair="imdb-tmdb",data_path="anotherpath")

If you don't like piping scripts from the internet (or you use windows) you can do the steps by yourself:
```
git clone https://github.com/ScaDS/MovieGraphBenchmark.git
cd MovieGraphBenchmark
python3 src/create_graph.py
# the dataclass contains all the files loaded as pandas dataframes
print(ds.attr_triples_2)
print(ds.rel_triples_1)
print(ds.rel_triples_2)
print(ds.ent_links)
for fold in ds.folds:
    print(fold)
```

# Dataset structure
2 changes: 1 addition & 1 deletion data/imdb-tmdb/attr_triples_2
Expand Up @@ -10925,7 +10925,7 @@ https://www.scads.de/movieBenchmark/resource/TMDB/person1336352 http://xmlns.com
https://www.scads.de/movieBenchmark/resource/TMDB/episode1282716 http://dbpedia.org/ontology/releaseDate 2017-03-14^^<http://www.w3.org/2001/XMLSchema#string>
https://www.scads.de/movieBenchmark/resource/TMDB/episode311214 http://dbpedia.org/ontology/episodeNumber 3^^<http://www.w3.org/2001/XMLSchema#integer>
https://www.scads.de/movieBenchmark/resource/TMDB/tvSeries75891 http://dbpedia.org/ontology/abstract Orphan Remi (13) goes on a incredible journey to find his family.^^<http://www.w3.org/2001/XMLSchema#string>
https://www.scads.de/movieBenchmark/resource/TMDB/movie167366 http://dbpedia.org/ontology/abstract After airing more than 100 MTV Unplugged specials, MTV wanted to bring back the series, in order to expose them to a younger generation. The channel recruited various mainstream and popular artists to perform as part of the series, including Perry, who particularly expressed interest in the idea as it would allow her to showcase herself as an artist and share the stories behind her songs. The extended play includes rearrangements of five songs from Perry's album One of the Boys (2008), a previously unreleased original song and a cover version of a song by Fountains of Wayne. Alongside the audio disc, the album includes a DVD with the video recording of her performance and an exclusive interview.^^<http://www.w3.org/2001/XMLSchema#string>
https://www.scads.de/movieBenchmark/resource/TMDB/movie167366 http://dbpedia.org/ontology/abstract After airing more than 100 MTV Unplugged specials, MTV wanted to bring back the series, in order to expose them to a younger generation. The channel recruited various mainstream and popular artists to perform as part of the series, including Perry, who particularly expressed interest in the idea as it would allow her to showcase herself as an artist and share the stories behind her songs. The extended play includes rearrangements of five songs from Perry's album One of the Boys (2008), a previously unreleased original song and a cover version of a song by Fountains of Wayne. Alongside the audio disc, the album includes a DVD with the video recording of her performance and an exclusive interview.^^<http://www.w3.org/2001/XMLSchema#string>
https://www.scads.de/movieBenchmark/resource/TMDB/tvSeries70115 http://dbpedia.org/ontology/releaseDate 2016-11-18^^<http://www.w3.org/2001/XMLSchema#date>
https://www.scads.de/movieBenchmark/resource/TMDB/episode513361 http://dbpedia.org/ontology/episodeNumber 13^^<http://www.w3.org/2001/XMLSchema#integer>
https://www.scads.de/movieBenchmark/resource/TMDB/episode182364 http://dbpedia.org/ontology/abstract While the Monarch schemes to dispose of Dr. Venture once and for all, Henchman 21 seeks revenge on Brock for the death of 24.^^<http://www.w3.org/2001/XMLSchema#string>
2 changes: 1 addition & 1 deletion data/tmdb-tvdb/attr_triples_1
Expand Up @@ -10925,7 +10925,7 @@ https://www.scads.de/movieBenchmark/resource/TMDB/person1336352 http://xmlns.com
https://www.scads.de/movieBenchmark/resource/TMDB/episode311214 http://dbpedia.org/ontology/episodeNumber 3^^<http://www.w3.org/2001/XMLSchema#integer>
https://www.scads.de/movieBenchmark/resource/TMDB/episode1282716 http://dbpedia.org/ontology/releaseDate 2017-03-14^^<http://www.w3.org/2001/XMLSchema#string>
https://www.scads.de/movieBenchmark/resource/TMDB/tvSeries75891 http://dbpedia.org/ontology/abstract Orphan Remi (13) goes on a incredible journey to find his family.^^<http://www.w3.org/2001/XMLSchema#string>
https://www.scads.de/movieBenchmark/resource/TMDB/movie167366 http://dbpedia.org/ontology/abstract After airing more than 100 MTV Unplugged specials, MTV wanted to bring back the series, in order to expose them to a younger generation. The channel recruited various mainstream and popular artists to perform as part of the series, including Perry, who particularly expressed interest in the idea as it would allow her to showcase herself as an artist and share the stories behind her songs. The extended play includes rearrangements of five songs from Perry's album One of the Boys (2008), a previously unreleased original song and a cover version of a song by Fountains of Wayne. Alongside the audio disc, the album includes a DVD with the video recording of her performance and an exclusive interview.^^<http://www.w3.org/2001/XMLSchema#string>
https://www.scads.de/movieBenchmark/resource/TMDB/movie167366 http://dbpedia.org/ontology/abstract After airing more than 100 MTV Unplugged specials, MTV wanted to bring back the series, in order to expose them to a younger generation. The channel recruited various mainstream and popular artists to perform as part of the series, including Perry, who particularly expressed interest in the idea as it would allow her to showcase herself as an artist and share the stories behind her songs. The extended play includes rearrangements of five songs from Perry's album One of the Boys (2008), a previously unreleased original song and a cover version of a song by Fountains of Wayne. Alongside the audio disc, the album includes a DVD with the video recording of her performance and an exclusive interview.^^<http://www.w3.org/2001/XMLSchema#string>
https://www.scads.de/movieBenchmark/resource/TMDB/tvSeries70115 http://dbpedia.org/ontology/releaseDate 2016-11-18^^<http://www.w3.org/2001/XMLSchema#date>
https://www.scads.de/movieBenchmark/resource/TMDB/episode513361 http://dbpedia.org/ontology/episodeNumber 13^^<http://www.w3.org/2001/XMLSchema#integer>
https://www.scads.de/movieBenchmark/resource/TMDB/episode182364 http://dbpedia.org/ontology/abstract While the Monarch schemes to dispose of Dr. Venture once and for all, Henchman 21 seeks revenge on Brock for the death of 24.^^<http://www.w3.org/2001/XMLSchema#string>
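
The attribute triple lines shown in these hunks follow a `subject predicate value^^<datatype>` layout. A minimal sketch of splitting such a line into its parts is given below. The names `AttrTriple` and `parse_attr_triple` are hypothetical and not part of the package, and the sketch assumes that subject and predicate contain no whitespace, which holds for the lines visible here.

```python
from typing import NamedTuple, Optional


class AttrTriple(NamedTuple):
    subject: str
    predicate: str
    value: str
    datatype: Optional[str]  # None when the literal carries no ^^<...> suffix


def parse_attr_triple(line: str) -> AttrTriple:
    """Split one attribute triple line into subject, predicate, value, datatype."""
    # The first two whitespace-separated fields are URIs; the rest is the literal,
    # which may itself contain spaces (e.g. an abstract).
    subject, predicate, literal = line.strip().split(None, 2)
    # The datatype, if present, is appended as ^^<uri> at the very end.
    value, sep, rest = literal.rpartition("^^<")
    if sep and rest.endswith(">"):
        return AttrTriple(subject, predicate, value, rest[:-1])
    return AttrTriple(subject, predicate, literal, None)


if __name__ == "__main__":
    example = (
        "https://www.scads.de/movieBenchmark/resource/TMDB/episode311214 "
        "http://dbpedia.org/ontology/episodeNumber "
        "3^^<http://www.w3.org/2001/XMLSchema#integer>"
    )
    print(parse_attr_triple(example))
```

Using `rpartition` rather than `partition` keeps a literal intact if the value text itself happens to contain `^^<`.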
11 changes: 3 additions & 8 deletions noxfile.py
@@ -1,16 +1,11 @@
import nox


@nox.session
def tests_without_tqdm(session):
    session.install("pytest")
    session.install("requests")
    session.run("pytest")


@nox.session
def tests(session):
    session.install("pytest")
    session.install(".")
    session.install("requests")
    session.install("tqdm")
    session.install("pandas")
    session.install("pystow")
    session.run("pytest")