Stylometry

Let's find out who wrote that!

This repository aims to distinguish different styles of writing from a corpus of texts of varying authorship.

The main program consists of two classes: Text and Corpus. A Text holds a single text by a single author (or, more generally, a single text labelled with whatever feature is being sought), and a Corpus is built from a list of Text objects.

The Corpus class has several methods that help find the best fit between a text and its author. Texts can be loaded with the script read_data.py.
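As a rough illustration of how the two classes relate, here is a minimal sketch. It is not the code in src: the attribute names (content, author) and the methods (tokens, most_common_words, authors) are hypothetical and were chosen only for this example, so the real constructors and method names may differ.

```python
from collections import Counter
from dataclasses import dataclass
import re


@dataclass
class Text:
    """One text with a single label (here, the author). Hypothetical layout."""
    content: str
    author: str

    def tokens(self):
        # Lowercased word tokens; \w+ also matches accented letters such as "á".
        return re.findall(r"\w+", self.content.lower())


class Corpus:
    """A collection of Text objects. Hypothetical layout."""

    def __init__(self, texts):
        self.texts = list(texts)

    def most_common_words(self, n):
        # Count every token across all texts and keep the n most frequent.
        counts = Counter()
        for text in self.texts:
            counts.update(text.tokens())
        return [word for word, _ in counts.most_common(n)]

    def authors(self):
        # Distinct labels present in the corpus.
        return {text.author for text in self.texts}


# Toy data invented for the example (not taken from the 21 novels).
corpus = Corpus([
    Text("la casa de la playa", "author_a"),
    Text("la noche y la luna", "author_b"),
])
top_words = corpus.most_common_words(1)
```

Here top_words contains only "la", the most frequent token across both toy texts.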

To test the project, 21 novels by three 19th-century Spanish authors have been used: Juan Valera, Emilia Pardo Bazán and Benito Pérez Galdós.

The repository has the following folders:

app: Contains a Streamlit application with a simplified model. It uses the five most common words in the corpus to estimate the author of a text from simple input data. It is not yet stable.
docs: Contains the PowerPoint files for the presentation of the project.
notebooks: Contains the Jupyter notebooks used to develop the project:
- The notebooks presentation, data_extraction1, data_extraction2 and test_SVM are intermediate notebooks of little relevance.
- The notebook dataframes_generator generates the relevant dataframes and saves them in the folder ./data/processed.
- The notebook models_generator generates the models and saves them in pickle format.
- The notebook plot generates the images and tables used in the presentation.
src: Python scripts with the classes and functions that perform the model calculations.
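
The idea behind the simplified model in app (guessing an author from the five most common words) can be sketched as follows. This is an assumption-laden illustration, not the app's actual model: it uses a nearest-profile rule with Euclidean distance over relative word frequencies as a stand-in, and the helper names (tokenize, frequency_vector, guess_author) and the toy training texts are invented for the example.

```python
from collections import Counter
import math
import re


def tokenize(text):
    # Lowercased word tokens.
    return re.findall(r"\w+", text.lower())


def frequency_vector(text, vocabulary):
    # Relative frequency of each vocabulary word in the text.
    tokens = tokenize(text)
    counts = Counter(tokens)
    total = len(tokens) or 1  # avoid division by zero on empty input
    return [counts[word] / total for word in vocabulary]


def guess_author(text, profiles, vocabulary):
    # Pick the author whose frequency profile is closest to the text's
    # (Euclidean distance; a stand-in for whatever model the app uses).
    vector = frequency_vector(text, vocabulary)
    return min(profiles, key=lambda author: math.dist(vector, profiles[author]))


# Toy training data, invented for illustration (not the project corpus).
training = {
    "author_a": "el mar y el cielo y el viento",
    "author_b": "la casa de la plaza de la villa",
}

# Vocabulary: the five most common words across the whole training set.
all_counts = Counter()
for sample in training.values():
    all_counts.update(tokenize(sample))
vocabulary = [word for word, _ in all_counts.most_common(5)]

# One frequency profile per author.
profiles = {
    author: frequency_vector(sample, vocabulary)
    for author, sample in training.items()
}

guess = guess_author("el viento y el mar", profiles, vocabulary)
```

With this toy data, guess is "author_a". With real novels, five function words give only a coarse signal, which fits the README's description of the app as a simplified model.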

The project is still under development and may have stability or consistency issues.
