diff --git a/CHANGELOG.md b/CHANGELOG.md
index 568035cd..412b86d7 100644
--- a/CHANGELOG.md
+++ b/CHANGELOG.md
@@ -6,162 +6,162 @@
 # [0.3.0] - 2024-01-19
-- Upgrade to net7 and C# 11 (#394)
-- Update TesseractLinuxLoaderFix.cs
-- Fix up borked RDMP plugin building
-- Remove Harmony library
+- Upgrade to net7 and C# 11 (#394)
+- Update TesseractLinuxLoaderFix.cs
+- Fix up borked RDMP plugin building
+- Remove Harmony library
 # [0.2.1] - 2024-01-08
 ## Changed
-- Check pixel data is valid when PII scan is skipped
+- Check pixel data is valid when PII scan is skipped
 # [0.2.0] - 2024-01-07
 ## Added
-- Add pre-commit and editorconfig from SmiServices
+- Add pre-commit and editorconfig from SmiServices
 ## Changed
-- Support skipping safe pixel data
-- Switch to embedded debug symbols
-- Tidy csprojs and switch to central package versions
-- Enable Nullable
-- Rename IsIdentifiableBaseOptions to match variable name and SmiServices
-- Change default failure highlight to red
-- Remove old class diagram and update docs
-- Move Failure into correct namespace
-- Rename IsIdentifiableRule -> RegexRule for clarity
-- Rename ICustomRule -> IAppliableRule for clarity
-- Rename ex-nuspec test, refine, add missing packages
+- Support skipping safe pixel data
+- Switch to embedded debug symbols
+- Tidy csprojs and switch to central package versions
+- Enable Nullable
+- Rename IsIdentifiableBaseOptions to match variable name and SmiServices
+- Change default failure highlight to red
+- Remove old class diagram and update docs
+- Move Failure into correct namespace
+- Rename IsIdentifiableRule -> RegexRule for clarity
+- Rename ICustomRule -> IAppliableRule for clarity
+- Rename ex-nuspec test, refine, add missing packages
 ## Removed
-- Remove versions from PACKAGES.md
-- Remove old LGTM config
+- Remove versions from PACKAGES.md
+- Remove old LGTM config
 # [0.1.0] - 2023-04-12
 ## Added
-- Add CodeQL workflow for GitHub code scanning #240 and #242
-- Add coverage #267
-- Add filesystem abstractions #290
-- Add ability to create `CsvDestination` with an existing `CsvConfiguration` #291
+- Add CodeQL workflow for GitHub code scanning #240 and #242
+- Add coverage #267
+- Add filesystem abstractions #290
+- Add ability to create `CsvDestination` with an existing `CsvConfiguration` #291
 ## Fixed
-- Explicitly pass `top` view to `Application.Run` #294
-- Cleanup warnings and CodeQL issues #292
-- Validate the date part of each possible CHI before complaining #247
+- Explicitly pass `top` view to `Application.Run` #294
+- Cleanup warnings and CodeQL issues #292
+- Validate the date part of each possible CHI before complaining #247
 # [0.0.9] - 2022-11-21
 ## Added
-- Bump Microsoft.Extensions.Caching.Memory from 6.0.1 to 7.0.0
-- Bump Microsoft.NET.Test.Sdk from 17.3.2 to 17.4.0
-- Bump Microsoft.Extensions.FileSystemGlobbing from 6.0.0 to 7.0.0
-- Bump CsvHelper from 30.0.0 to 30.0.1
-- Bump HIC.RDMP.Plugin from 8.0.5 to 8.0.6
-- Bump Magick.NET-Q16-AnyCPU from 12.2.0 to 12.2.1
-- Bump System.IO.Abstractions from 17.2.3 to 17.2.26
-- Bump NUnit3TestAdapter from 4.3.0 to 4.3.1
-- Fix dependabot config tagging Thomas
+- Bump Microsoft.Extensions.Caching.Memory from 6.0.1 to 7.0.0
+- Bump Microsoft.NET.Test.Sdk from 17.3.2 to 17.4.0
+- Bump Microsoft.Extensions.FileSystemGlobbing from 6.0.0 to 7.0.0
+- Bump CsvHelper from 30.0.0 to 30.0.1
+- Bump HIC.RDMP.Plugin from 8.0.5 to 8.0.6
+- Bump Magick.NET-Q16-AnyCPU from 12.2.0 to 12.2.1
+- Bump System.IO.Abstractions from 17.2.3 to 17.2.26
+- Bump NUnit3TestAdapter from 4.3.0 to 4.3.1
+- Fix dependabot config tagging Thomas
 # [0.0.8] - 2022-11-04
 ## Added
-- Support for running on 'non dicom' MongoDb databases. This is now the default. Pass `--isdicomfiles` if your MongoDb contains serialized dicom files.
-- New flag `--top` to only run on a subset of the data available (e.g. `top 1000`). Currently only supported by relational database and csv runners
-- Added ability to ignore whole columns in reviewer by pressing `Del` on the column and confirming
-- Support for naming servers in `Targets.yaml` for main `ii` binary instead of connection strings (e.g. `-d myserver` for running on relational dbs)
-- You can now pass a directory name to the `file` verb to process all csv files in that directory.
-- Added `-g` option to the `file` verb to process multiple csv files e.g. `**/*.csv`. This option is only valid when specifying a directory for `-f`
+- Support for running on 'non dicom' MongoDb databases. This is now the default. Pass `--isdicomfiles` if your MongoDb contains serialized dicom files.
+- New flag `--top` to only run on a subset of the data available (e.g. `top 1000`). Currently only supported by relational database and csv runners
+- Added ability to ignore whole columns in reviewer by pressing `Del` on the column and confirming
+- Support for naming servers in `Targets.yaml` for main `ii` binary instead of connection strings (e.g. `-d myserver` for running on relational dbs)
+- You can now pass a directory name to the `file` verb to process all csv files in that directory.
+- Added `-g` option to the `file` verb to process multiple csv files e.g. `**/*.csv`. This option is only valid when specifying a directory for `-f`
 ## Fixed
-- IsIdentifiable reviewer no longer complains when `Targets.yaml` is missing
+- IsIdentifiable reviewer no longer complains when `Targets.yaml` is missing
 # [0.0.7] - 2022-08-24
 ## Added
-- Made `--rulesfile` CLI argument default to `Rules.yaml`
-- Added TRACE progress logging to `ii` CLI tool
-- Stanford NLP daemon now runs self contained
-- Bump HIC.FAnsiSql from 2.0.4 to 2.0.5
+- Made `--rulesfile` CLI argument default to `Rules.yaml`
+- Added TRACE progress logging to `ii` CLI tool
+- Stanford NLP daemon now runs self contained
+- Bump HIC.FAnsiSql from 2.0.4 to 2.0.5
 ## Fixed
-- Fixed missing dlls for running Tesseract OCR on linux
+- Fixed missing dlls for running Tesseract OCR on linux
 # [0.0.6] - 2022-08-17
 ## Added
-- Added new command line flag `-y somefile.yaml` in `ii` CLI tool to specify a custom config file
-- Progress is now logged to Trace and enabled by default in `ii`. Library users can enable this feature by setting `LogProgressEvery` (defaults to null)
+- Added new command line flag `-y somefile.yaml` in `ii` CLI tool to specify a custom config file
+- Progress is now logged to Trace and enabled by default in `ii`. Library users can enable this feature by setting `LogProgressEvery` (defaults to null)
 ## Changed
-- `ii` startup errors are written to stderr instead of stdout
+- `ii` startup errors are written to stderr instead of stdout
 # [0.0.5] - 2022-07-20
 ## Dependencies
-- New dependency Equ 2.3.0
-- New dependency fo-dicom.Imaging.ImageSharp 5.0.3
-- Bump CommandLineParser from 2.8.0 to 2.9.1
-- Bump CsvHelper from 27.2.1 to 28.0.1
-- Bump HIC.DicomTypeTranslation from 3.0.0 to 4.0.1
-- Bump HIC.FAnsiSql from 2.0.3 to 2.0.4
-- Bump HIC.RDMP.Plugin from 7.0.7 to 7.0.14
-- Bump MSTest.TestAdapter from 2.2.8 to 2.2.10
-- Bump MSTest.TestFramework from 2.2.8 to 2.2.10
-- Bump Magick.NET-Q16-AnyCPU from 10.0.0 to 11.3.0
-- Bump Microsoft.NET.Test.Sdk from 17.1.0 to 17.2.0
-- Bump Moq from 4.17.1 to 4.18.1
-- Bump NLog from 4.7.14 to 5.0.1
-- Bump NUnit from 3.13.2 to 3.13.3
-- Bump System.IO.Abstractions from 16.1.15 to 17.0.23
-- Bump Terminal.Gui from 1.4.0 to 1.6.4
-- Removed dependency fo-dicom.Drawing 4.0.8
+- New dependency Equ 2.3.0
+- New dependency fo-dicom.Imaging.ImageSharp 5.0.3
+- Bump CommandLineParser from 2.8.0 to 2.9.1
+- Bump CsvHelper from 27.2.1 to 28.0.1
+- Bump HIC.DicomTypeTranslation from 3.0.0 to 4.0.1
+- Bump HIC.FAnsiSql from 2.0.3 to 2.0.4
+- Bump HIC.RDMP.Plugin from 7.0.7 to 7.0.14
+- Bump MSTest.TestAdapter from 2.2.8 to 2.2.10
+- Bump MSTest.TestFramework from 2.2.8 to 2.2.10
+- Bump Magick.NET-Q16-AnyCPU from 10.0.0 to 11.3.0
+- Bump Microsoft.NET.Test.Sdk from 17.1.0 to 17.2.0
+- Bump Moq from 4.17.1 to 4.18.1
+- Bump NLog from 4.7.14 to 5.0.1
+- Bump NUnit from 3.13.2 to 3.13.3
+- Bump System.IO.Abstractions from 16.1.15 to 17.0.23
+- Bump Terminal.Gui from 1.4.0 to 1.6.4
+- Removed dependency fo-dicom.Drawing 4.0.8
 # [0.0.4] - 2022-03-03
-- Added IsIdentifiable RDMP plugin
+- Added IsIdentifiable RDMP plugin
 # [0.0.3] - 2022-03-01
-- Added `UpdateStrategy.RedactionWord` to customise the substitution value for PII when updating the database
-- Moved redaction code to `IsIdentifiable.Redacting` namespace
-- Retargetted at dotnet standard 2.1
-- Removed dependency on Terminal.Gui from library (still part of the ii CLI)
+- Added `UpdateStrategy.RedactionWord` to customise the substitution value for PII when updating the database
+- Moved redaction code to `IsIdentifiable.Redacting` namespace
+- Retargetted at dotnet standard 2.1
+- Removed dependency on Terminal.Gui from library (still part of the ii CLI)
 # [0.0.2] - 2022-02-10
-- Made it easier to subclass `IsIdentifiableAbstractRunner` and add custom reports
+- Made it easier to subclass `IsIdentifiableAbstractRunner` and add custom reports
 # [0.0.1] - 2022-02-07
 Initial version
-[Unreleased]: https://github.com/SMI/IsIdentifiable/compare/v0.3.0..main
-[0.3.0]: https://github.com/SMI/IsIdentifiable/compare/v0.2.1..v0.3.0
-[0.2.1]: https://github.com/SMI/IsIdentifiable/compare/v0.2.0..v0.2.1
-[0.2.0]: https://github.com/SMI/IsIdentifiable/compare/v0.1.0..v0.2.0
-[0.1.0]: https://github.com/SMI/IsIdentifiable/compare/v0.0.9..v0.1.0
-[0.0.9]: https://github.com/SMI/IsIdentifiable/compare/v0.0.8..v0.0.9
-[0.0.8]: https://github.com/SMI/IsIdentifiable/compare/v0.0.7..v0.0.8
-[0.0.7]: https://github.com/SMI/IsIdentifiable/compare/v0.0.6..v0.0.7
-[0.0.6]: https://github.com/SMI/IsIdentifiable/compare/v0.0.5..v0.0.6
-[0.0.5]: https://github.com/SMI/IsIdentifiable/compare/v0.0.4..v0.0.5
-[0.0.4]: https://github.com/SMI/IsIdentifiable/compare/v0.0.3..v0.0.4
-[0.0.3]: https://github.com/SMI/IsIdentifiable/compare/v0.0.2..v0.0.3
-[0.0.2]: https://github.com/SMI/IsIdentifiable/releases/tag/v0.0.2
 [0.0.1]: https://github.com/SMI/IsIdentifiable/releases/tag/v0.0.1
+[0.0.2]: https://github.com/SMI/IsIdentifiable/releases/tag/v0.0.2
+[0.0.3]: https://github.com/SMI/IsIdentifiable/compare/v0.0.2..v0.0.3
+[0.0.4]: https://github.com/SMI/IsIdentifiable/compare/v0.0.3..v0.0.4
+[0.0.5]: https://github.com/SMI/IsIdentifiable/compare/v0.0.4..v0.0.5
+[0.0.6]: https://github.com/SMI/IsIdentifiable/compare/v0.0.5..v0.0.6
+[0.0.7]: https://github.com/SMI/IsIdentifiable/compare/v0.0.6..v0.0.7
+[0.0.8]: https://github.com/SMI/IsIdentifiable/compare/v0.0.7..v0.0.8
+[0.0.9]: https://github.com/SMI/IsIdentifiable/compare/v0.0.8..v0.0.9
+[0.1.0]: https://github.com/SMI/IsIdentifiable/compare/v0.0.9..v0.1.0
+[0.2.0]: https://github.com/SMI/IsIdentifiable/compare/v0.1.0..v0.2.0
+[0.2.1]: https://github.com/SMI/IsIdentifiable/compare/v0.2.0..v0.2.1
+[0.3.0]: https://github.com/SMI/IsIdentifiable/compare/v0.2.1..v0.3.0
+[unreleased]: https://github.com/SMI/IsIdentifiable/compare/v0.3.0..main
diff --git a/IsIdentifiable/README.md b/IsIdentifiable/README.md
index 0df596c3..d7e7e402 100644
--- a/IsIdentifiable/README.md
+++ b/IsIdentifiable/README.md
@@ -2,20 +2,20 @@
 ## Contents
-1. [Overview](#overview)
-1. [Setup](#setup)
-1. [Optional Downloads](#optional-downloads)
-1. [NLP](#nlp)
- 1. [SpaCy Classifier](#spacy-classifier)
- 2. [Stanford Classifier](#stanford-classifier)
-1. [Invocation](#invocation)
-1. [Examples](#examples)
-1. [Rules](#rules)
- 1. [Basic Rules](#basic-rules)
- 2. [Socket Rules](#socket-rules)
- 3. [Consensus Rules](#consensus-rules)
- 4. [Allow List Rules](#allow-list-rules)
-1. [Class Diagram](#class-diagram)
+1. [Overview](#overview)
+1. [Setup](#setup)
+1. [Optional Downloads](#optional-downloads)
+1. [NLP](#nlp)
+ 1. [SpaCy Classifier](#spacy-classifier)
+ 1. [Stanford Classifier](#stanford-classifier)
+1. [Invocation](#invocation)
+1. [Examples](#examples)
+1. [Rules](#rules)
+ 1. [Basic Rules](#basic-rules)
+ 1. [Socket Rules](#socket-rules)
+ 1. [Consensus Rules](#consensus-rules)
+ 1. [Allow List Rules](#allow-list-rules)
+1. [Class Diagram](#class-diagram)
 ## Overview
@@ -101,8 +101,8 @@ This classifier listens on port `1881`
 IsIdentifiable can be run from the [ii] command line tool:
-- To process a DICOM file or a directory of DICOM files
-- To process a every row of every column in a database table
+- To process a DICOM file or a directory of DICOM files
+- To process a every row of every column in a database table
 You can also link your code to the [nuget package](https://www.nuget.org/packages/IsIdentifiable/). For example to add a new input type or operate as a service that evaluates data on demand.
diff --git a/PACKAGES.md b/PACKAGES.md
index d3aea94c..d824ed36 100644
--- a/PACKAGES.md
+++ b/PACKAGES.md
@@ -3,8 +3,8 @@
 ### Risk Assessment common to all:
 1. Packages on NuGet are virus scanned by the NuGet site.
-2. This package is widely used and is actively maintained.
-3. It is open source.
+1. This package is widely used and is actively maintained.
+1. It is open source.
 | Package | Source Code | License | Purpose |
 | --------------------------------------- | -------------------------------------------------------------------------- | ---------------------------------------------------------------------------------------------- | --------------------------------------------------------------------------------------------------- |
diff --git a/README.md b/README.md
index 10fb125f..c0295a9f 100644
--- a/README.md
+++ b/README.md
@@ -9,17 +9,17 @@ A tool for detecting identifiable information in data sources.
 Out of the box supports:
-- CSV
-- DICOM
-- Relational Database Tables (Sql Server, MySql, Postgres, Oracle)
-- MongoDb
+- CSV
+- DICOM
+- Relational Database Tables (Sql Server, MySql, Postgres, Oracle)
+- MongoDb
 ![Demo Video](/isidentifiable.gif)
 Rules base is driven by regular expressions and plugin services (e.g. Natural Language Processing). Also includes a reviewer/redactor tool for processing false positives and updating the rules base.
-- [Detector Documentation](./IsIdentifiable/README.md)
-- [Reviewer Documentation](./Reviewer/README.md)
+- [Detector Documentation](./IsIdentifiable/README.md)
+- [Reviewer Documentation](./Reviewer/README.md)
 There is a [standalone command line tool called ii](./ii/README.md) for running directly or you can use the [nuget package](https://www.nuget.org/packages/IsIdentifiable/) in your own code to evaluate data.
diff --git a/ii/README.md b/ii/README.md
index abd7115c..87170c76 100644
--- a/ii/README.md
+++ b/ii/README.md
@@ -64,12 +64,12 @@ Primary Author: [Thomas](https://github.com/tznind)
 ## Contents
-1. [Overview](#1-overview)
-2. [Setup / Installation](#2-setup--installation)
-3. [Usage](#3-usage)
- 1. [Reviewing the output of IsIdentifiable]
- 2. [Redacting the database]
- 3. [Managing the rulebase]
+1. [Overview](#1-overview)
+1. [Setup / Installation](#2-setup--installation)
+1. [Usage](#3-usage)
+ 1. [Reviewing the output of IsIdentifiable]
+ 1. [Redacting the database]
+ 1. [Managing the rulebase]
 ## 1. Overview
@@ -81,9 +81,9 @@ _The review process of potentially PII_
 There are 3 activities that can be undertaken using the reviewer:
-- [Reviewing the output of IsIdentifiable]
-- [Redacting the database]
-- [Managing the rulebase]
+- [Reviewing the output of IsIdentifiable]
+- [Redacting the database]
+- [Managing the rulebase]
 ## 2. Setup / Installation
@@ -127,12 +127,12 @@ The menu `Options | Custom Patterns` menu, when ticked, will provide the opportu
 The Custom Patterns window provides several options to edit the pattern:
-- `x` - clears currently typed pattern
-- `F` - creates a regex pattern that matches the full input value
-- `G` - creates a regex pattern that matches only the failing part(s)
-- `\d` - replaces all digits with regex wildcards
-- `\c` - replaces all characters with regex wildcards
-- `\d\c` - replaces all digits and characters with regex wildcards
+- `x` - clears currently typed pattern
+- `F` - creates a regex pattern that matches the full input value
+- `G` - creates a regex pattern that matches only the failing part(s)
+- `\d` - replaces all digits with regex wildcards
+- `\c` - replaces all characters with regex wildcards
+- `\d\c` - replaces all digits and characters with regex wildcards
 ### Redacting the database
@@ -150,11 +150,11 @@ _Example targets file_
 The following flags should be combined to successfully redact the database:
-| Flag | Example | Purpose |
-| ---- | ----------------------- | ----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
-| -f | -f ./ExampleReport.csv | Indicates which [IsIdentifiable] output report to redact. You must have completed the [review process] for this report |
-| -u | -u ./misses.csv | Indicates that you want to update the database. The file value must be included and is where reports that are not covered by rules generated in the [review process] are output. If you have completed the [review process] correctly this file should be empty after execution completes |
-| -t | -t z:\temp\targets.yaml | Path to a file containing the connection string (and DMBS type) of the relational database server that has the table requiring redaction |
+| Flag | Example | Purpose |
+| ---- | ------------------------- | ----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
+| -f | -f ./ExampleReport.csv | Indicates which [IsIdentifiable] output report to redact. You must have completed the [review process] for this report |
+| -u | -u ./misses.csv | Indicates that you want to update the database. The file value must be included and is where reports that are not covered by rules generated in the [review process] are output. If you have completed the [review process] correctly this file should be empty after execution completes |
+| -t | -t z:\\temp\\targets.yaml | Path to a file containing the connection string (and DMBS type) of the relational database server that has the table requiring redaction |
 ```bash
 ii.exe review -f ./ExampleReport.csv -u ./misses.csv -t z:\temp\targets.yaml
@@ -176,10 +176,10 @@ _Rules Manager View_
 | `` | Removes a rule from the rulesbase |
 | `` | Opens menu (if any) for interacting with rule(s) highlighted |
-[IsIdentifiable]: ../IsIdentifiable/README.md
-[PII]: https://en.wikipedia.org/wiki/Personal_data
-[SmiRunner]: ../Applications.SmiRunner/
-[Managing the rulebase]: #managing-the-rulebase
+[isidentifiable]: ../IsIdentifiable/README.md
+[managing the rulebase]: #managing-the-rulebase
+[pii]: https://en.wikipedia.org/wiki/Personal_data
+[redacting the database]: #redacting-the-database
 [review process]: #reviewing-the-output-of-IsIdentifiable
-[Reviewing the output of IsIdentifiable]: #reviewing-the-output-of-isidentifiable
-[Redacting the database]: #redacting-the-database
+[reviewing the output of isidentifiable]: #reviewing-the-output-of-isidentifiable
+[smirunner]: ../Applications.SmiRunner/
diff --git a/nlp/uk.ac.dundee.hic.nerd/README.md b/nlp/uk.ac.dundee.hic.nerd/README.md
index 7b4d5321..743101ba 100644
--- a/nlp/uk.ac.dundee.hic.nerd/README.md
+++ b/nlp/uk.ac.dundee.hic.nerd/README.md
@@ -21,7 +21,7 @@ The Python version recognises a different set of entities, and with different la
 ## Setup
-No setup is required for the Java version, just run the jar file as documented below. The "<&- &" will cause it to disconnect from the terminal once initialised and run as a daemon. (For development use, you can also skip that and terminate it with ctrl-C on the console when finished.)
+No setup is required for the Java version, just run the jar file as documented below. The "\<&- &" will cause it to disconnect from the terminal once initialised and run as a daemon. (For development use, you can also skip that and terminate it with ctrl-C on the console when finished.)
 The Python program now requires a YAML configuration file to determine the log file location. It also requires that the SpaCy and optionally the SciSpaCy packages have been installed, if not globally then into a virtual environment. The same environment must also have the required SpaCy language model installed. Note that SpaCy version 2 (eg. 2.2.1) and SciSpacy version 0.2.4 must be used (as of Feb 2021) because SpaCy v3 uses a new architecture and SciSpaCy has not caught up yet (at least, not for NER).