Ružička: Authorship Verification in Python

Introduction

The code in this repository offers an implementation of a number of routines in authorship studies, with a focus on authorship verification. It is named after the inventor of the "minmax" measure (M. Ružička). The repository offers a generic implementation of two commonly used verification systems. The first system is an intrinsic verifier, depending on a first-order metric (O1), close to the one described in:

Potha, N. and E. Stamatatos. A Profile-based Method for Authorship Verification
In Proc. of the 8th Hellenic Conference on Artificial Intelligence
(SETN), LNCS, 8445, pp. 313-326, 2014.

The second system is an extrinsic verifier with second-order metrics (O2), based the General Imposters framework as described in:

M. Koppel and Y. Winter (2014), Determining if Two Documents are by the Same
Author, JASIST, 65(1): 178-187.

The package additionally offers a number of useful implementations of common vector space models and evaluation metrics. The code in this repository was used to produce the results in a paper which is currently under submission.

Quickstart

While the code in this repository was tailored towards our needs for a specific paper, the code folder includes an IPython notebook, which will guide you through some of the main functionality offered. In the code itself, we try to offer comprehensive documentation in the form of docstrings. All experiments in our paper can be repeated using the following scripts under code:

01pan_experiments.py
02latin_dev_o1.py
03latin_dev_o2.py
04latin_test_o2.py
05latin_testviz.py

Data sets

This repository includes 6 multilingual benchmark datasets for authorship verification (under data/), which were used as the official competition data in the 2014 track on authorship verification of the annual PAN evaluation lab on uncovering plagiarism, authorship, and social software misuse. The survey paper by Stamatatos et al. provides detailed information on the provenance, structure and nature of these corpora (together with baselines figures etc.). The competition data for this competition covered the following text varieties:

Dutch essays
Dutch reviews
English essays
English novels
Spanish articles
Greek articles

Additionally, this repository includes a novel benchmark dataset for Latin authors from Antiquity (under data/latin/), which were mainly selected from the Latin Library. This data set has a similar structure as the PAN corpora.

Dependencies

This code requires Python 2.7+ (Python 3 has not been tested). The repository is dependent on a number of well-known third-party Python libraries, including:

numpy
scipy
scikit-learn
matplotlib
seaborn
numba

and preferably (for GPU acceleration and/or JIT-compilation):

theano
numbapro

We recommend installing Continuum's excellent Anaconda Python framework, which comes bundled with most of these dependencies. Additionally, this code integrates a number of scripts by Vincent van Asch to statistically compare the output of different classifiers, using Approximate Randomization Testing (under ruzicka/: art.py, combinations.py and confusionmatrix.py).

Name		Name	Last commit message	Last commit date
Latest commit History 58 Commits
code		code
data		data
.gitignore		.gitignore
LICENSE		LICENSE
README.md		README.md

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

Ružička: Authorship Verification in Python

Introduction

Quickstart

Data sets

Dependencies

About

Releases

Packages

Contributors 2

Languages

License

mikekestemont/ruzicka

Folders and files

Latest commit

History

Repository files navigation

Ružička: Authorship Verification in Python

Introduction

Quickstart

Data sets

Dependencies

About

Resources

License

Stars

Watchers

Forks

Releases

Packages 0

Contributors 2

Languages

Packages