Merge pull request #8 from LeidenUniversityLibrary/documentation

Add install and usage documentation
LeidenUniversityLibrary · Sep 24, 2023 · b4e39bd
2 parents 9adab09 + 9d46f79
commit b4e39bd

Showing 6 changed files with 147 additions and 0 deletions.
diff --git a/docs/docx-to-gfm.md b/docs/docx-to-gfm.md
@@ -6,8 +6,36 @@ title: Convert docx to GitHub-flavoured Markdown
 
 # Install Pandoc
 
+Follow the [Pandoc installation instructions](https://pandoc.org/installing.html)
+to install Pandoc.
 
 # Run conversion
 
 Our script accepts a directory of input files and a directory to store the output
 files, and calls Pandoc for each input file.
+
+The input files may be located in subdirectories.
+The output files will all be in a single directory.
+To make sure that filenames do not collide in the output directory,
+the output files are named with hashes of the input filenames.
+
+!!! note
+    The conversion script does not exclude overview files.
+
+Run with [Hatch]:
+
+```sh
+hatch run nexis convert -i INPUT_DIRECTORY -o OUTPUT_DIRECTORY
+```
+
+If you installed the package, you can run:
+
+```sh
+nexis convert -i INPUT_DIRECTORY -o OUTPUT_DIRECTORY
+```
+
+[Hatch]: https://hatch.pypa.io/latest/
+
+!!! note
+    The package has not yet been published on PyPI, so to install it you must
+    build it locally and install the resulting package.
diff --git a/docs/extract-metadata.md b/docs/extract-metadata.md
@@ -0,0 +1,29 @@
+---
+title: Extract metadata from converted files
+# SPDX-FileCopyrightText: 2023-present Leiden University Libraries <[email protected]>
+# SPDX-License-Identifier: CC-BY-4.0
+---
+
+Nexis Uni includes fairly standardised metadata for each article, such as the
+title, name of the publication, publication date, a byline and the number of
+words in the article body.
+The `nexis analyse` command creates a CSV file that includes these metadata for
+each input file.
+
+# Usage
+
+The input directory must contain the Markdown files.
+The output directory, which defaults to the input directory if not specified,
+will contain a file named *analysis-results.csv*.
+
+Run with Hatch:
+
+```sh
+hatch run nexis analyse -i INPUT_DIRECTORY -o OUTPUT_DIRECTORY
+```
+
+If you installed the package, you can run:
+
+```sh
+nexis analyse -i INPUT_DIRECTORY -o OUTPUT_DIRECTORY
+```
diff --git a/docs/extract-terms.md b/docs/extract-terms.md
@@ -0,0 +1,32 @@
+---
+title: Extract search terms from converted files
+# SPDX-FileCopyrightText: 2023-present Leiden University Libraries <[email protected]>
+# SPDX-License-Identifier: CC-BY-4.0
+---
+
+In the files, Nexis Uni marks up the phrases or terms that caused each article
+to match the search.
+This markup allows us to find which terms occur in which article.
+
+The `nexis terms` command produces a CSV file that links filenames to counts of terms.
+
+# Usage
+
+The input directory must contain the Markdown files.
+The output directory, which defaults to the input directory if not specified,
+will contain a file named *terms-results.csv*.
+
+Run with Hatch:
+
+```sh
+hatch run nexis terms -i INPUT_DIRECTORY -o OUTPUT_DIRECTORY
+```
+
+If you installed the package, you can run:
+
+```sh
+nexis terms -i INPUT_DIRECTORY -o OUTPUT_DIRECTORY
+```
+
+!!! note
+    Marked phrases and terms are only extracted from the body of the articles.
diff --git a/docs/index.md b/docs/index.md
@@ -6,6 +6,12 @@ title: Analysing documents from Nexis Uni
 
 # Introduction
 
+The tools provided here aim to help with the analysis of (news) articles
+retrieved from Nexis Uni.
+Retrieving these articles is outside the scope of this package.
+
 # Overview
 
 - Convert docx files to GitHub-flavoured Markdown
+- Extract metadata from the files to a CSV file
+- Extract search terms marked by Nexis Uni from the files to a CSV file
diff --git a/docs/install.md b/docs/install.md
@@ -0,0 +1,48 @@
+---
+title: Installation
+# SPDX-FileCopyrightText: 2023-present Leiden University Libraries <[email protected]>
+# SPDX-License-Identifier: CC-BY-4.0
+---
+
+The scripts are written in Python and require Python 3.9 or newer to run.
+
+The scripts have not been published to PyPI, so to install them you can either
+install the package from the git repository using pip, or run the scripts with
+Hatch.
+We describe how to use Hatch below.
+
+# Install Hatch
+
+Follow the [Hatch installation instructions][Hatch] to install Hatch.
+
+[Hatch]: https://hatch.pypa.io/latest/install/
+
+# Clone the git repository
+
+```sh
+git clone https://github.com/LeidenUniversityLibrary/nexis-analysis.git
+cd nexis-analysis
+```
+
+# Run a `nexis` command
+
+After the previous steps, you should be in the `nexis-analysis` directory.
+To check that the tool works, run:
+
+```sh
+hatch run nexis --help
+```
+
+This should show the available commands:
+
+```output
+Usage: nexis [OPTIONS] COMMAND [ARGS]...
+
+Options:
+  --help  Show this message and exit.
+
+Commands:
+  analyse  Extract information from GFM documents in a directory
+  convert  Convert .docx files in a directory to GitHub-flavoured Markdown
+  terms    Extract marked-up search terms or phrases from GFM documents...
+```
diff --git a/mkdocs.yml b/mkdocs.yml
@@ -26,3 +26,7 @@ markdown_extensions:
 
 nav:
   - 'index.md'
+  - 'install.md'
+  - 'docx-to-gfm.md'
+  - 'extract-metadata.md'
+  - 'extract-terms.md'
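
The hashed output names described in `docs/docx-to-gfm.md` can be illustrated with a minimal Python sketch. This is not the package's implementation: the function name `hashed_output_name`, the choice of SHA-256, the `.md` suffix, and hashing the path relative to the input directory are all assumptions made for illustration.

```python
import hashlib
from pathlib import PurePosixPath


def hashed_output_name(relative_input_path: str, suffix: str = ".md") -> str:
    """Return a flat output filename derived from a hash of the input path.

    Hypothetical helper: the real script may use a different hash function,
    a different suffix, or hash something other than the relative path.
    """
    # Normalise the path to one string form before hashing.
    normalised = PurePosixPath(relative_input_path).as_posix()
    digest = hashlib.sha256(normalised.encode("utf-8")).hexdigest()
    return digest + suffix


# Two inputs with the same basename in different subdirectories
# map to different names in the single output directory.
first = hashed_output_name("2021/results.docx")
second = hashed_output_name("2022/results.docx")
print(first != second)
```

Hashing the whole relative path rather than only the basename is what keeps equally named files in different subdirectories from overwriting each other, at the cost of output names that are no longer human-readable.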