Merge pull request #8 from LeidenUniversityLibrary/documentation
Add install and usage documentation
bencomp authored Sep 24, 2023
2 parents 9adab09 + 9d46f79 commit b4e39bd
Showing 6 changed files with 147 additions and 0 deletions.
28 changes: 28 additions & 0 deletions docs/docx-to-gfm.md
@@ -6,8 +6,36 @@ title: Convert docx to GitHub-flavoured Markdown

# Install Pandoc

Follow the [Pandoc installation instructions](https://pandoc.org/installing.html)
to install Pandoc.

# Run conversion

Our script accepts a directory of input files and a directory in which to
store the output files, and calls Pandoc for each input file.
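
A minimal sketch of what one such Pandoc call could look like from Python,
assuming the `pandoc` executable is on the `PATH` and is run with its `docx`
reader and `gfm` writer. The helper names `pandoc_command` and `convert_file`
are hypothetical and are not taken from the package.

```python
import subprocess
from pathlib import Path


def pandoc_command(source: Path, target: Path) -> list[str]:
    """Build the Pandoc command line that converts one docx file to GFM."""
    return [
        "pandoc",
        "--from", "docx",
        "--to", "gfm",
        "--output", str(target),
        str(source),
    ]


def convert_file(source: Path, target: Path) -> None:
    """Run Pandoc for one file; raises CalledProcessError if Pandoc fails."""
    subprocess.run(pandoc_command(source, target), check=True)
```

Keeping the command construction separate from the call makes it possible to
inspect the arguments without running Pandoc.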

The input files may be located in subdirectories.
The output files will all be in a single directory.
To make sure that filenames do not collide in the output directory,
their names are hashes of the input filenames.
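
The naming scheme can be sketched as follows. This assumes a SHA-256 hash of
the path relative to the input directory; the hash function and the exact
string that the conversion script hashes are not stated here, and
`hashed_output_name` is a hypothetical helper.

```python
import hashlib
from pathlib import Path


def hashed_output_name(source: Path, input_dir: Path) -> str:
    """Return a flat output filename derived from the file's relative path."""
    # Assumption: the relative path (not only the basename) is hashed.
    relative = source.relative_to(input_dir).as_posix()
    digest = hashlib.sha256(relative.encode("utf-8")).hexdigest()
    return f"{digest}.md"


input_dir = Path("downloads")
# Two files with the same name in different subdirectories...
first = hashed_output_name(input_dir / "batch1" / "results.docx", input_dir)
second = hashed_output_name(input_dir / "batch2" / "results.docx", input_dir)
# ...get different names in the single output directory.
print(first != second)
```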

!!! note
    The conversion script does not exclude overview files.

Run with [Hatch]:

```sh
hatch run nexis convert -i INPUT_DIRECTORY -o OUTPUT_DIRECTORY
```

If you installed the package, you can run:

```sh
nexis convert -i INPUT_DIRECTORY -o OUTPUT_DIRECTORY
```

[Hatch]: https://hatch.pypa.io/latest/

!!! note
    The package has not yet been published on PyPI, so to install it you need
    to build it locally and install the resulting package.
29 changes: 29 additions & 0 deletions docs/extract-metadata.md
@@ -0,0 +1,29 @@
---
title: Extract metadata from converted files
# SPDX-FileCopyrightText: 2023-present Leiden University Libraries <[email protected]>
# SPDX-License-Identifier: CC-BY-4.0
---

Nexis Uni includes fairly standardised metadata for each article, such as the
title, name of the publication, publication date, a byline and the number of
words in the article body.
This command creates a CSV file that includes these metadata for each input
file.

# Usage

The input directory must contain the Markdown files.
The output directory, which defaults to the input directory if not specified,
will contain a file named *analysis-results.csv*.

Run with Hatch:

```sh
hatch run nexis analyse -i INPUT_DIRECTORY -o OUTPUT_DIRECTORY
```

If you installed the package, you can run:

```sh
nexis analyse -i INPUT_DIRECTORY -o OUTPUT_DIRECTORY
```
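
Once the command has run, the CSV file can be loaded with Python's standard
library for further analysis. The sketch below assumes that the file is UTF-8
encoded and starts with a header row; it makes no assumptions about the column
names, which are not listed here. `load_results` is a hypothetical helper.

```python
import csv
from pathlib import Path


def load_results(csv_path: Path) -> list[dict[str, str]]:
    """Read a results CSV into a list of dictionaries keyed by the header row."""
    with csv_path.open(newline="", encoding="utf-8") as handle:
        return list(csv.DictReader(handle))
```

For example, `load_results(Path("OUTPUT_DIRECTORY") / "analysis-results.csv")`
returns one dictionary per row of the file.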
32 changes: 32 additions & 0 deletions docs/extract-terms.md
@@ -0,0 +1,32 @@
---
title: Extract search terms from converted files
# SPDX-FileCopyrightText: 2023-present Leiden University Libraries <[email protected]>
# SPDX-License-Identifier: CC-BY-4.0
---

In the files, Nexis Uni marks the phrases or terms that caused the article to
match the search.
This markup allows us to find which terms occur in which article.

The result of the command is a CSV file linking filenames and counts of terms.
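
As an illustration of the idea, the sketch below counts marked-up terms in one
Markdown string. It assumes that marked terms appear in bold (`**term**`) after
conversion; the actual markup in converted Nexis Uni files and the way the
command works internally may differ. `count_marked_terms` is a hypothetical
name.

```python
import re
from collections import Counter

# Assumption: a marked term is a run of text wrapped in double asterisks.
MARKED_TERM = re.compile(r"\*\*(.+?)\*\*")


def count_marked_terms(markdown: str) -> Counter:
    """Count how often each marked term occurs, ignoring case."""
    return Counter(term.lower() for term in MARKED_TERM.findall(markdown))


body = "The **climate** summit discussed **Climate** policy and **energy**."
print(count_marked_terms(body))
```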

# Usage

The input directory must contain the Markdown files.
The output directory, which defaults to the input directory if not specified,
will contain a file named *terms-results.csv*.

Run with Hatch:

```sh
hatch run nexis terms -i INPUT_DIRECTORY -o OUTPUT_DIRECTORY
```

If you installed the package, you can run:

```sh
nexis terms -i INPUT_DIRECTORY -o OUTPUT_DIRECTORY
```

!!! note
    Marked phrases and terms are only extracted from the body of the articles.
6 changes: 6 additions & 0 deletions docs/index.md
@@ -6,6 +6,12 @@ title: Analysing documents from Nexis Uni

# Introduction

The tools provided here aim to help with the analysis of (news) articles
retrieved from Nexis Uni.
Retrieving these articles is not a goal of this package.

# Overview

- Convert docx files to GitHub-flavoured Markdown
- Extract metadata from the files to a CSV file
- Extract search terms marked by Nexis Uni from the files to a CSV file
48 changes: 48 additions & 0 deletions docs/install.md
@@ -0,0 +1,48 @@
---
title: Installation
# SPDX-FileCopyrightText: 2023-present Leiden University Libraries <[email protected]>
# SPDX-License-Identifier: CC-BY-4.0
---

The scripts are written in Python and require Python 3.9 or newer to run.
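
A quick way to check whether the current interpreter meets this requirement is
sketched below. `python_is_supported` is a hypothetical helper, not part of the
package.

```python
import sys

# Minimum version stated above.
REQUIRED = (3, 9)


def python_is_supported(version_info=sys.version_info) -> bool:
    """Return True if the given interpreter version is 3.9 or newer."""
    return tuple(version_info[:2]) >= REQUIRED


print(python_is_supported())
```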

The scripts have not been published on PyPI.
To use them, either install the package from the git repository using pip or
run the scripts with Hatch.
We describe how to use Hatch below.

# Install Hatch

Follow the [Hatch installation instructions][Hatch] to install Hatch.

[Hatch]: https://hatch.pypa.io/latest/install/

# Clone the git repository

```sh
git clone https://github.com/LeidenUniversityLibrary/nexis-analysis.git
cd nexis-analysis
```

# Run a `nexis` command

After the previous steps, you should be in the `nexis-analysis` directory.
To check that the tool works, run:

```sh
hatch run nexis --help
```

This should show the available commands:

```output
Usage: nexis [OPTIONS] COMMAND [ARGS]...

Options:
  --help  Show this message and exit.

Commands:
  analyse  Extract information from GFM documents in a directory
  convert  Convert .docx files in a directory to GitHub-flavoured Markdown
  terms    Extract marked-up search terms or phrases from GFM documents...
```
4 changes: 4 additions & 0 deletions mkdocs.yml
@@ -26,3 +26,7 @@ markdown_extensions:

nav:
- 'index.md'
- 'install.md'
- 'docx-to-gfm.md'
- 'extract-metadata.md'
- 'extract-terms.md'
