docs(pi-cli): Add the anonymization documentation #3

Merged
merged 14 commits into from
Jun 11, 2024
134 changes: 134 additions & 0 deletions modules/cli/pages/actions-anonymization.adoc
@@ -0,0 +1,134 @@
= Action types for anonymization
:description: Describes all the action types available for anonymization

== Overview
Action types are mechanisms used to anonymize specific values in the exported data. A value can have one or multiple actions associated with it. If no actions are explicitly specified in the configuration for a particular value, it will default to a predefined action as described in the page xref:configuration-for-anonymization.adoc[Configure the anonymization].

Additionally, you can define a fallback that is applied when the primary action does not work or does not find a match (for example, when a `regex_replace` pattern does not match the value).

== Hash - Hash a value

Replaces the value with its hash. This removes the readable value but keeps the cardinality, because identical values produce identical hashes. Use a secure hash function (SHA-256, for instance).

*Usage*: Values that contain sensitive information but on which you still want accurate statistics and the ability to see their impact on process execution.

*Parameters*:

*value*: The algorithm to use for hashing. Optional; defaults to SHA-512.

Sample configuration:
[source,yaml]
----
arch_process_instance:
  stringindex1:
    actions:
    - action: hash
      value: SHA-256
----
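
If the `value` parameter is left out, the algorithm should fall back to the default described above. A minimal sketch, assuming that omitting `value` is all that is needed to get SHA-512 (the column `stringindex2` is chosen only for illustration):

[source,yaml]
----
arch_process_instance:
  stringindex2:
    actions:
    # no "value": the default algorithm (SHA-512) is assumed to apply
    - action: hash
----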

== Replace - Replace a value

Replaces the value with a specified value (which can be empty).

*Usage*: Values that contain sensitive information you want to hide.

*Parameters*:

*value*: The value that replaces the current field content. Optional; defaults to an empty string.

Sample configuration:
[source,yaml]
----
arch_flownode_instance:
  displayname:
    actions:
    - action: REPLACE
      value: 'hidden'
----
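
Because `value` defaults to an empty string, leaving it out should simply clear the field. A minimal sketch, assuming an omitted `value` is handled the same way as an explicit empty one:

[source,yaml]
----
arch_flownode_instance:
  displayname:
    actions:
    # no "value": the field is assumed to be replaced with an empty string
    - action: REPLACE
----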

== Replace_with_other - Replace a value with another column's value from the same table

Replaces a value with the value of another column in the same table and row.

*Usage*: Values that contain sensitive information you want to replace with another meaningful, non-sensitive value from the same table.

*Parameters*:

*value*: Name of the column whose value replaces the current one. Mandatory; no default.

Sample configuration:
[source,yaml]
----
arch_flownode_instance:
  displayname:
    actions:
    - action: REPLACE_WITH_OTHER
      value: name
----

== Regex_replace - Replace a value matching a regular expression

Replaces the value if the regular expression matches, using the captured groups to build the new value.

If several parts of the value match, all of them are replaced (behavior similar to `Matcher.replaceAll`).

If none of the regular expressions match, a fallback can be configured; it can be any of the other actions.

*Usage*: Values that mix sensitive and non-sensitive information, when you want to keep the non-sensitive part.

*Parameters*:

*pattern*: The regular expression to match. Mandatory; no default.

*value*: The replacement pattern (which can reference captured groups) used to build the new value. Mandatory; no default.

*fallback*: The action to apply if no regular expression matches.

Sample configuration:
[source,yaml]
----
arch_process_comment:
  content:
    actions:
    - action: regex_replace
      pattern: contract (\d+) is ready for user (\S+)\.(\S+)
      value: contract XXXX is ready for $2
    - action: regex_replace
      pattern: The task Allocate repair agent on car (\S+) (is now assigned to .*)
      value: The task Allocate repair agent on car *** $2
    fallback:
    - action: replace
      value: hidden comment
----
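
To make the first rule above concrete, here it is again with its capture groups annotated and a worked example. The input comment is hypothetical, and the result assumes Java regular-expression semantics, which the reference to `Matcher.replaceAll` suggests:

[source,yaml]
----
arch_process_comment:
  content:
    actions:
    - action: regex_replace
      # group 1 = contract number
      # group 2 = part of the user name before its last dot
      # group 3 = part of the user name after its last dot
      pattern: contract (\d+) is ready for user (\S+)\.(\S+)
      # only group 2 is reused; the contract number and group 3 are dropped
      value: contract XXXX is ready for $2
# Hypothetical input : contract 1234 is ready for user walter.bates
# Resulting value    : contract XXXX is ready for walter
----

Any text around the matched part of the comment is left unchanged, since only the matching portion is replaced.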

== Keep - Keep a value

Keeps the value as is; no anonymization is applied.

*Parameters*: none

Sample configuration:
[source,yaml]
----
arch_flownode_instance:
  displayname:
    actions:
    - action: KEEP
----

== Remove_line - Remove a full line

Removes the whole data line (only possible for contract data and comments).

*Parameters*:

*where*: Optional clause, expressed as a regex that is matched against the value of the configured column.
Member

thought: the where clause applies to all actions, so we could document it and give examples in all documented actions
Its configuration also changed recently in PR 352, and it can now be used like this:

      arch_contract_data:
        val:
          actions:
          - action: REMOVE_LINE
            where:
            - column: name
              regex: PurchasedLicenseInput\.bypassSysDate
            - column: name
              regex: PurchasedLicenseInput\.caseCounterStartDate
            - column: name
              regex: PurchasedLicenseInput\.description
            - column: name
              regex: PurchasedLicenseInput\.endDate
            - column: name
              regex: PurchasedLicenseInput\.name
            - column: name
              regex: PurchasedLicenseInput\.numberCases
            - etc. ...

Contributor Author

I've changed the configuration sample.
For the rest, how do we properly document the where clause? How can it be used outside of arch_contract_data? It's mostly useful for that table because of the keys.

Contributor

@mbource I think a few words are missing about `maxSize`, which is a global option set to 512 by default. The max size makes it possible to truncate a long text. `maxSize` can also be defined on each action, but it is really only useful for KEEP and REPLACE.


Sample configuration:
[source,yaml]
----
arch_contract_data:
  val:
    actions:
    - action: remove_line
      where:
        name: 'input\.list\.*'
----
114 changes: 113 additions & 1 deletion modules/cli/pages/configuration-for-anonymization.adoc
@@ -1,4 +1,116 @@
= How to configure anonymization in the CLI
:description: Learn how to fine-tune anonymization in the CLI

IMPORTANT: this is a work in progress!!
== Default anonymization
Running the `export` command of the export tool automatically activates anonymization.

[NOTE]
====
If you do not want to anonymize the data during your export, add the argument `-da=true` at the end of the `export` command.
====

Exported data is anonymized either with the default rules, which are designed to avoid leaking sensitive data, or with a configuration of your own.

=== List of tables anonymized by default

This section lists the tables being exported and identified fields that may contain sensitive information. These fields are handled with specific rules to ensure data security.

==== arch_process_instance

The `arch_process_instance` table contains information related to process instances. While it primarily includes technical data necessary for BPI, certain fields within this table may contain sensitive information and require special handling:

* **description**:
** **Default Handling**: KEEP
** **Details**: This field contains the description of the process instance itself that is added by developers during the development phase. Although it might contain sensitive data (e.g., company names, departments, addresses), this is usually not the case.

* **stringindex1, stringindex2, stringindex3, stringindex4, stringindex5**:
** **Default Handling**: HASH
** **Details**: These fields hold the search indexes defined during the development phase, which can be used to search cases through the APIs and the case list. They are usually filled by specific Groovy code, which may expose sensitive data. Therefore, they are hashed by default to maintain data security.

==== arch_flownode_instance

The `arch_flownode_instance` table contains information related to the execution of flow node instances (human and automatic tasks, gateways and events). While it primarily includes technical data necessary for BPI, certain fields within this table may contain sensitive information and require special handling:

* **displayname**:
** **Default Handling**: REPLACE("")
** **Details**: This field contains the displayed name of the flow node instance that is added by developers during the development phase, which may include sensitive data. By default, this field will be cleared.

* **description**:
** **Default Handling**: KEEP
** **Details**: This field contains the description of the flow node itself that is added by developers during the development phase. Although it might contain sensitive data (e.g., company names, departments, addresses), this is usually not the case.

* **hitbys, loopdatainputref, loopdataoutputref, datainputitemref, dataoutputitemref**:
** **Default Handling**: KEEP
** **Details**: These fields hold technical data of the flow node instance. This data is not sensitive and can be used for statistical purposes.

==== user_

The `user_` table contains information related to the users. One field requires specific handling:

* **username**:
** **Default Handling**: HASH
** **Details**: This field contains the username of a user, which is sensitive data. By default, this field is hashed.

==== actor

The `actor` table contains information related to the actors of a process. It defines who can perform a task or start a process. While it primarily includes technical data necessary for BPI, certain fields within this table may contain sensitive information and require special handling:

* **name, displayname, description**:
** **Default Handling**: KEEP
** **Details**: Normally, an actor shouldn't hold sensitive data because it represents a department, a team or a job within a company. This data is not sensitive and can be used for statistical purposes.


==== arch_process_comment

The `arch_process_comment` table contains information about users who have interacted with specific flow nodes, along with other sensitive details. These comments can include the names of the users.

* **content**:
** **Default Handling**: REMOVE_LINE
** **Details**: Without a specific configuration, the default anonymization never exports the content of this table because of the sensitive data it contains. You need your own anonymization configuration to handle this data.

==== arch_contract_data
The `arch_contract_data` table contains information about contracts, including inputs and constraints. Due to the flexibility in specifying various types of inputs, this table often contains sensitive data.

* **name, val**:
** **Default Handling**: REMOVE_LINE
** **Details**: Without a specific configuration, the default anonymization never exports the content of this table because of the sensitive data it contains. You need your own anonymization configuration to handle this data.
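
For reference, the default handling listed above could be written out as an explicit configuration. This is only a sketch that reuses the syntax of the samples in xref:actions-anonymization.adoc[Action types for anonymization]; it is not the file produced by the tool, and it assumes that an empty `value` is how `REPLACE("")` is expressed:

[source,yaml]
----
arch_process_instance:
  description:
    actions:
    - action: KEEP
  stringindex1:        # same rule for stringindex2 to stringindex5
    actions:
    - action: HASH
arch_flownode_instance:
  displayname:
    actions:
    - action: REPLACE
      value: ''
user_:
  username:
    actions:
    - action: HASH
arch_process_comment:
  content:
    actions:
    - action: REMOVE_LINE
arch_contract_data:
  val:
    actions:
    - action: REMOVE_LINE
----

Fields whose default is KEEP are shown only once here (`description`) to keep the sketch short.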


== Advanced anonymization

=== Generate a sample configuration for data anonymization

Before performing a full export, you can configure the anonymization of specific fields. To assist with this, a command is available in the tool to generate a sample configuration file based on a default setup, allowing you to choose which columns and tables to anonymize.

The command `gen_default_anon_conf` has been added to the export tool to streamline this process. If needed, you can use the `--output` argument to specify the location for the generated file.

[NOTE]
====
The generated file contains only the anonymization section of the configuration. You'll need to copy that section into the configuration file used by your export tool.
====

The generated configuration also contains all your data contract keys, giving you a convenient way to anonymize them.

=== Handling contract data anonymization
Process data can include contract data used within your processes, which may contain sensitive information.

[WARNING]
====
By default, if you do not specify how to handle this contract data, the anonymization process will exclude it from export.
====

During the export, contract data will be transformed into CSV lines in the `arch_contract_data.csv` file within the export zip file. Each line represents a key-value pair of contract data. The concept of the key is crucial as it allows you to specify the exact type of anonymization you want for each contract data field.

To specify which inputs of your contract data to anonymize, use the `where` clause in the configuration.

For example, suppose you have a contract named `loanRequestInput` with a field `loanAmount`. If you want to keep this value because it is not sensitive and could be useful in BPI dashboards, you need to override the default removal setting. Specify a `KEEP` action using the `where` clause to retain `loanAmount`. Here is an example configuration extract:

[source,yaml]
----
arch_contract_data:
  val:
    actions:
    - action: KEEP
      where:
        name: loanRequestInput\.loanAmount
----
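
The reviewer's `REMOVE_LINE` sample in this PR expresses the `where` clause as a list of `column`/`regex` entries. Assuming that form applies to `KEEP` as well (the reviewer notes that the `where` clause applies to all actions), the same rule might be written as follows. This is a sketch, not verified against a specific version of the tool:

[source,yaml]
----
arch_contract_data:
  val:
    actions:
    - action: KEEP
      where:
      # keep only the lines whose "name" column matches this regex
      - column: name
        regex: loanRequestInput\.loanAmount
----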
51 changes: 50 additions & 1 deletion modules/cli/pages/index.adoc
@@ -1,4 +1,53 @@
= Command Line tool for Export
:description: Explain how to use and configure the CLI to export data from a Bonita database

IMPORTANT: this is a work in progress!!
== Overview
The Command Line tool is used to provision data extracted from Bonita instances into a Bonita Process Insights environment for deeper process analysis.

The command line tool is delivered as a package that contains basic documentation about the commands and a sample configuration file.

== Export data
=== Configuration
Before exporting, you need to configure the connection to the Bonita Instance from which you want to export the data.

The sample configuration should look like this:
[source,yaml]
----
bonita:
  database:
    host: localhost
    port: 5432
    name: bonita
    username: bonita
    password: bpm
    jdbc-url: myJdbcUrl
Member

question: can we avoid duplicating the documentation here and instead mention that it is available in the documentation embedded in the distribution?
Duplicating the content risks becoming a maintenance issue.

Contributor Author

I removed the configuration sample part to avoid duplication and maintenance issues. I kept the JDBC URL explanation for the different databases; should I remove it too?

----

The export tool supports two types of databases: PostgreSQL and Oracle. Adapt the JDBC URL to the type of database you use.

* **Oracle**:
[source,yaml]
----
jdbc-url: jdbc:oracle:thin:@${bonita.database.host}:${bonita.database.port}/${bonita.database.name}?oracle.net.disableOob=true
----
* **PostgreSQL**:
[source,yaml]
----
jdbc-url: jdbc:postgresql://${bonita.database.host}:${bonita.database.port}/${bonita.database.name}
----
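
With the sample values shown earlier (`localhost`, `5432`, `bonita`), the PostgreSQL placeholders would resolve to the URL below. This only illustrates the substitution, assuming each `${...}` placeholder is replaced by the value of the matching configuration key:

[source,yaml]
----
# ${bonita.database.host} -> localhost
# ${bonita.database.port} -> 5432
# ${bonita.database.name} -> bonita
jdbc-url: jdbc:postgresql://localhost:5432/bonita
----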

[NOTE]
====
Once your configuration file is ready, place it next to the executable jar or in a subdirectory named `config`.
====

=== Exporting data and Importing to BPI
To export your data, use the following command line:
`pi-cli bonita export`

You can add some arguments like `-output` to specify the exact path of the exported zip file.

=== Anonymize exported data
By default, your exported data is anonymized. You can deactivate the anonymization or provide your own configuration.

For more details, see xref:configuration-for-anonymization.adoc[Configure the anonymization]
1 change: 1 addition & 0 deletions modules/cli/taxonomy.adoc
@@ -1,2 +1,3 @@
* xref:index.adoc[Command line for Exporting from Bonita]
** xref:configuration-for-anonymization.adoc[Configure the anonymization]
*** xref:actions-anonymization.adoc[Action types for anonymization]