Stylometry

Let's find out who wrote that!

This repository aims to distinguish different styles of writing from a corpus of texts of varying authorship.

The main program consists of two classes: Text and Corpus. The Text class takes a single text by a single author (or labelled with whichever feature is being sought), and the Corpus class takes a list of Texts.
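As a rough illustration of this two-class layout, here is a minimal sketch. The attribute names (`content`, `label`, `texts`) and the methods `tokens` and `labels` are assumptions made for this example and may not match the actual code in src.

```python
from dataclasses import dataclass, field
from typing import List, Set


@dataclass
class Text:
    """One text with one label (hypothetical attribute names)."""
    content: str
    label: str  # the author, or whichever feature is being sought

    def tokens(self) -> List[str]:
        # Naive tokenisation: lowercase and split on whitespace.
        return self.content.lower().split()


@dataclass
class Corpus:
    """A collection of Text objects (hypothetical attribute names)."""
    texts: List[Text] = field(default_factory=list)

    def labels(self) -> Set[str]:
        # Distinct labels present in the corpus.
        return {t.label for t in self.texts}


# Toy sentences invented for this example, not taken from the novels.
corpus = Corpus([
    Text("Pepita era muy hermosa", "Valera"),
    Text("La casa era grande", "Galdós"),
])
print(sorted(corpus.labels()))
```

Keeping the label on each Text means the Corpus can be regrouped by author, or by any other feature, without changing how texts are loaded.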

The Corpus class has several methods that help find the best fit between a text and its author. Texts can be loaded with the script in read_data.py.
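One simple way to picture a "best fit" is to compare how often a few common words appear in an unknown text and in a sample from each author, then pick the closest author. The sketch below does this with a Manhattan distance over relative frequencies. It is not necessarily what the Corpus methods do; the function names and the toy data are assumptions made for illustration.

```python
from collections import Counter


def relative_frequencies(text, vocabulary):
    """Share of the tokens in `text` taken up by each word in `vocabulary`."""
    tokens = text.lower().split()
    counts = Counter(tokens)
    total = len(tokens) or 1  # avoid division by zero on empty input
    return [counts[w] / total for w in vocabulary]


def closest_author(unknown, samples, vocabulary):
    """Return the author whose sample has the smallest Manhattan distance."""
    target = relative_frequencies(unknown, vocabulary)

    def distance(author):
        profile = relative_frequencies(samples[author], vocabulary)
        return sum(abs(a - b) for a, b in zip(target, profile))

    return min(samples, key=distance)


# Toy data, invented for this example only.
vocabulary = ["de", "la", "que"]
samples = {
    "author_a": "la casa de la madre de la niña",
    "author_b": "que dice que sabe que viene",
}
print(closest_author("la puerta de la calle", samples, vocabulary))
```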

To test the project, 21 novels by three 19th-century Spanish authors were used: Juan Valera, Emilia Pardo Bazán and Benito Pérez Galdós.

The repository has the following folders:

  • app: A Streamlit application with a simplified model. It uses the five most common words in the corpus to estimate the author of a text from simple input data. It is not yet stable.

  • docs: PowerPoint files for the presentation of the project.

  • notebooks: Jupyter notebooks used to develop the project:

    • The notebooks presentation, data_extraction1, data_extraction2 and test_SVM are intermediate notebooks of little relevance.
    • The notebook dataframes_generator generates and saves relevant dataframes in the folder ./data/processed.
    • The notebook models_generator generates the models and saves them in pickle format.
    • The notebook plot generates the images and tables used in the presentation.
  • src: Python scripts with the classes and functions that perform the model calculations.
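The app's approach of using the five most common words in the corpus can be sketched roughly as follows: pick the top words across all documents, then describe each text by how often it uses them. This is a minimal illustration under assumed names (`most_common_words`, `feature_vector`) and with whitespace tokenisation; the actual feature extraction in app or src may differ.

```python
from collections import Counter


def most_common_words(documents, n=5):
    """The n most frequent tokens across all documents."""
    counts = Counter()
    for doc in documents:
        counts.update(doc.lower().split())
    return [word for word, _ in counts.most_common(n)]


def feature_vector(text, words):
    """Relative frequency of each selected word in a single text."""
    tokens = text.lower().split()
    counts = Counter(tokens)
    total = len(tokens) or 1  # avoid division by zero on empty input
    return [counts[w] / total for w in words]


# Toy documents, invented for this example; n=2 keeps the output short.
documents = ["a b a c a", "b b a d"]
top_words = most_common_words(documents, n=2)
print(top_words)
print(feature_vector("a b b x", top_words))
```

Vectors like these could then be passed to any classifier to estimate the author.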

The project is still under development and may have stability or consistency issues.