The PERICLES Extraction Tool

PET (the PERICLES Extraction Tool) was created in the PERICLES EU project (http://www.pericles-project.eu/) to capture Significant Environment Information from live environments, in order to support better use and reuse of digital objects in the scope of long-term data preservation. The tool is developed by Fabio Corubolo (ULIV) and Anna Eggers (UGOE) and is an entirely new development for the PERICLES project.

We want to collect important environment information from the context in which the data is used, information that could be lost if it is not gathered at the right time. PET implements various information extraction techniques as plug-in Extraction Modules, either as complete implementations or, where possible, by re-using existing external tools and libraries. Environment monitoring is supported by specialized monitoring daemons and by continuous extraction of relevant information, triggered by environment events related to the creation and alteration of digital objects, such as the alteration of an observed file or directory, the opening or closing of a specific file, and other system calls.

The tool can be used in a sheer curation scenario, running in the system background under the full control of the user, but without disrupting the user's work. Furthermore, a snapshot extraction mode exists for capturing the current state of the environment; it is mainly designed to extract information that doesn't change frequently, such as system resource specifications.

A general scenario for PET capture is described here. The overall objectives that we want to accomplish, each of which depends on the previous one, are:

Use the PET tool to collect environment information when the digital objects (DOs) are used, based on specific profiles
Analyse the information collected to infer new relationships between digital objects
Assign values to the dependencies based on the purpose and significance (significance weights)

The current implementation of the PET tool covers the first objective and starts to address the second.

PET features

The following list outlines the main features of the PET tool:

Extracts information that is usually ignored by current metadata extractors.
Extracts environment information from outside the digital object to support its re-use.
Extracts information at the right time and place: within the production environment.
Supports continuous extraction in a sheer curation scenario.
Visualizes information change over time.
Information snapshot extractions allow getting a quick overview of extractable information.
Platform independent (needs Java 7).
Modular and extendable architecture that supports specialized needs.
Usage profiles allow the tool to be used in parallel for different scenarios.
Provides a graphical user interface, but can also run without graphics in console mode.
Provides an exchangeable storage backend.
Saves results in a standardized format: JSON or XML.

Quick start guide

Installation

Requirements

Java 7 or higher. Standard installation: https://www.java.com/it/download

Download the PET tool from: https://drive.google.com/file/d/0B3LPJpLR6-TwR0pnTk1VZXBGaUE/edit?usp=sharing

and extract the ZIP archive. No further installation is needed.

Start the tool

Execution of the tool

Windows: start the PET tool by clicking on PET-extractor.jar
Mac: start the PET tool by right-clicking (or clicking while pressing the ctrl key) on PET-extractor.jar and selecting Open
Command line: java -jar PET-extractor.jar
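
For scripted starts, the command line can also be assembled programmatically. The following Python sketch only builds the argument list and does not launch Java itself. The helper name build_pet_command is hypothetical; the jar name and the --headless flag are the ones mentioned in this guide.

```python
def build_pet_command(jar_path="PET-extractor.jar", headless=False):
    """Return the argument list for starting PET (hypothetical helper)."""
    command = ["java", "-jar", jar_path]
    if headless:
        # --headless starts the tool without the GUI
        command.append("--headless")
    return command


if __name__ == "__main__":
    # Show the resulting command line without executing it.
    print(" ".join(build_pet_command(headless=True)))
    # To actually launch the tool (assumes Java is installed and the jar
    # is in the working directory):
    #   import subprocess
    #   subprocess.run(build_pet_command(headless=True), check=True)
```

Keeping the command as a list rather than a single string avoids quoting problems if the jar is stored in a path that contains spaces.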

First steps

Meet the graphical user interface

When the tool is started for the first time, PET opens its graphical user interface (GUI). (To run PET without the GUI, use the command line flag “--headless”.) The GUI is organized in tabs, as displayed in the screenshot:

The “Monitored files”-tab lists the files that should be investigated
The “Modules”-tab is for configuring which information should be extracted
The “Environment information”-tab shows extracted environment information
The “Events”-tab lists all environment events that were detected by the tool
The “Help”-tab provides some basic information about the tool
Extracted environment information that is file-dependent can also be browsed in the “Monitored files”-tab by selecting the corresponding file
Profiles allow the organization of Extraction Modules and files for different scenarios
At the bottom are the buttons to control the extraction modes

Add a file

Open the “Monitored files”-tab and click the “Add Files”-button. A dialogue will let you select the files to be added to the tool. Environment information extraction is based on these files: PET extracts the information that is important for them. The tool does not modify the files.

Select an Extraction Module

Open the “Modules”-tab. It provides a list of Extraction Modules, which define what information to extract. Selecting a module opens its configuration area. Click the “Add new modules”-button to add further Extraction Modules to the tool.

Start a snapshot extraction

Click the “Snapshot”-button at the bottom left of the tool. A single extraction run will be started.

Browse the results

You can browse the extracted environment information by opening the “Environment information”-tab. The file-dependent information can be found in the “Monitored files”-tab.

Understanding PET

We will now take a deeper look at PET's functionality and usage. The graphical user interface is completely optional. If you prefer to use the command line interface, the command “help” prints a list of all existing commands. We recommend using the GUI the first time you start the tool, for a better understanding.

Extraction Profiles and templates

As mentioned, PET organizes Extraction Modules and files in Profiles, so that different configurations can be loaded for different scenarios. At tool start a “default Profile” is loaded. You can create a new empty profile by clicking the “New profile”-button. The existing profiles can be browsed in the “Select profile:”-combo box at the top of the GUI. In the CLI, the command “profiles” shows a list of all profiles, and “profile [PROFILE_UUID]” shows all information about a specific profile.

Clicking the “New profile (from template)”-button in the GUI creates a preconfigured profile for a specific scenario. You can export your own configured profiles as templates and send them to other PET users for the creation of similar profiles. A profile template contains only the Extraction Module selection and the configuration of the modules, but not the added files, as they are probably not present on other computers.

The saved templates can be found in the directory: .../PET_data/config/profile_templates/ . The profile configurations can be edited manually by changing the JSON configuration files in this directory. All templates located in the directory can be loaded as profiles from within the application. To browse the list of templates in the CLI, enter the command “templates”. The command “template2profile [TEMPLATE_NAME]” creates a profile from a template.
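
The relationship between a profile and a template can be illustrated with a small Python sketch. It is not PET's implementation, and the real layout of PET's JSON files is not described here: the keys "name", "modules" and "files" as well as all values are hypothetical. It only shows the idea that a template keeps the module selection and configuration but drops the added files.

```python
import copy
import json


def profile_to_template(profile):
    """Return a copy of the profile without its added files."""
    template = copy.deepcopy(profile)
    # Files are left out because they are probably absent on other computers.
    template.pop("files", None)
    return template


# Hypothetical profile; none of these names come from PET itself.
profile = {
    "name": "office-documents",
    "modules": [{"moduleName": "ExampleModule", "enabled": True}],
    "files": ["/home/user/report.pdf"],
}

template = profile_to_template(profile)
print(json.dumps(template, indent=2))
```

The deep copy keeps the original profile untouched, so exporting a template never changes the profile it was created from.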

The two different extraction modes

PET provides two extraction modes:

Snapshot extraction: This mode executes the Extraction Modules once to capture a snapshot of the current environment information. The environment monitoring daemons won’t be started in this mode. It can be invoked by clicking the “Snapshot”-button at the bottom of the GUI, or by typing the command “extract” into the CLI.
Continuous extraction: This mode can be used in a sheer curation scenario. It starts the environment monitoring daemons and continuously extracts information from the environment based on occurring events. It can be started and stopped by clicking the “Start/Stop monitor”-button at the bottom of the GUI. The CLI command “status” shows whether the continuous extraction mode is enabled. It can also be enabled or disabled with the CLI commands “start” and “stop”.

File dependent information and file independent information

Environment information to be extracted can either be file dependent, meaning it differs for each observed file, or file independent, meaning it is the same for all files in the environment. Extracted file independent information can be browsed in the “Environment information”-tab, whereas file dependent information is accessible in the “Monitored files”-tab by selecting one of the files.

Extraction modules and environment monitoring daemons

There are three types of Extraction Modules: modules that extract file dependent information, modules that extract file independent information, and environment monitoring daemons, which don’t extract any information themselves but observe the environment for events that trigger the extraction by other modules. The available modules can be browsed in the “Modules”-tab by clicking the “Add new modules”-button, or by entering the command “modules” into the CLI. Each module provides a detailed description of the information that it extracts and of its configuration possibilities.
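
How the three module types interact can be sketched as follows. This is a conceptual illustration in Python, not PET's actual (Java) class design: all class and function names are hypothetical, and the extracted values are simple stand-ins.

```python
import os
import platform


class FileIndependentModule:
    """Extracts information that is the same for all files."""

    def extract(self):
        return {"operating_system": platform.system()}


class FileDependentModule:
    """Extracts information that differs for each observed file."""

    def extract(self, path):
        return {"file": path, "extension": os.path.splitext(path)[1]}


class MonitoringDaemon:
    """Extracts nothing itself; it only reports events to a callback."""

    def __init__(self, on_event):
        self.on_event = on_event

    def report(self, event, path):
        self.on_event(event, path)


def run_extraction(event, path, file_modules, environment_modules):
    """Run all modules once in response to a single event."""
    results = [module.extract(path) for module in file_modules]
    results += [module.extract() for module in environment_modules]
    return {"event": event, "results": results}


collected = []
daemon = MonitoringDaemon(
    lambda event, path: collected.append(
        run_extraction(event, path, [FileDependentModule()], [FileIndependentModule()])
    )
)
# Simulate an event that a daemon might detect.
daemon.report("file modified", "report.pdf")
print(collected)
```

The daemon holds only a callback, which mirrors the description above: it observes and triggers, while the other two module types do the extraction.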

Configuration of a module or a daemon

The Extraction Modules, including the daemons, can be configured in the “Modules”-tab by editing their JSON configuration files. The configuration must strictly follow JSON syntax. Clicking the “Save and use configuration”-button verifies the syntax and shows a message in case of an error. You can use the “Reload config file”-button to restore the old configuration.
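
If you edit configuration files outside the GUI, it can help to check the JSON syntax beforehand. The following Python sketch uses the standard json module; it is independent of PET's own verification, and the helper name check_json_syntax is hypothetical.

```python
import json


def check_json_syntax(text):
    """Return None if text is valid JSON, otherwise an error description."""
    try:
        json.loads(text)
    except json.JSONDecodeError as error:
        return f"line {error.lineno}, column {error.colno}: {error.msg}"
    return None


# A valid fragment and one with a trailing comma, which JSON does not allow.
print(check_json_syntax('{"enabled": true}'))
print(check_json_syntax('{"enabled": true,}'))
```

Trailing commas and single-quoted strings are typical mistakes that strict JSON parsers reject.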

The “class” option shows the Java class associated with the displayed configuration and shouldn’t be changed. It is needed for the deserialization of the configuration.
“moduleName” is the module type name and shouldn’t be changed either. If you want to change the displayed name of the module, you can edit “moduleDisplayName” instead.
“moduleVersion” displays the development version of the module and shouldn’t be changed manually.
The “enabled” option shows whether a module is currently used for extraction. You can also use the radio button at the top right of the configuration area to change this option quickly. Some modules might disable themselves if they need further configuration before they can be executed.

“recordEvents” TODO: Is it to write occurring events also to a file?

“supportedSystems” lists the operating systems on which the module will run. “SYSTEM_INDEPENDENT” is the default option. If you configure a module in a way that it can only be executed on a specific operating system, change this variable to one or more of these: "OS_X", "SOLARIS", "BSD", "LINUX", and / or "WINDOWS".
“fileFilter” lets you restrict the file types on which a file-dependent module is executed, or which a daemon module is looking for. It can also be important for some file-independent modules that operate on system files. The five options are:
- “fileExtensionFilter”: Define a regex to filter the file extensions, using the Java regex syntax.
- “inclusiveMimeType”: Consider only files of a specific mime type. For example “application/pdf”
- “inclusiveMediaType”: Consider only files of the specific media type. For example “text”
- “exclusiveMediaType”: Consider files of all media types, except those that are added here.
- “exclusiveMimeType”: Consider files of all mime types, except those that are added here.
“eventAddToProfile” TODO: is this the option to write events to a file?

These are the default options. All further options are module-specific and can vary depending on the module in question. The descriptions of the modules should also help with the configuration.
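
To make the default options more concrete, the following Python sketch builds an example configuration and applies its file filter. It is an illustration under several assumptions, not PET's implementation: all values are hypothetical, the nesting and value types (for example lists for the mime and media type options) are guesses, and the way the five filter options are combined is assumed. The options “recordEvents” and “eventAddToProfile” are left out because their meaning is still open above. The regex is evaluated with Python's re module; the simple pattern used here is valid in both Python and Java regex syntax.

```python
import json
import re

# Hypothetical configuration; only the option names come from this guide.
module_config = {
    "class": "example.ExampleModule",
    "moduleName": "Example module",
    "moduleDisplayName": "My example module",
    "moduleVersion": "1.0",
    "enabled": True,
    "supportedSystems": ["SYSTEM_INDEPENDENT"],
    "fileFilter": {
        "fileExtensionFilter": "pdf|txt",
        "inclusiveMimeType": ["application/pdf"],
        "inclusiveMediaType": [],
        "exclusiveMediaType": ["video"],
        "exclusiveMimeType": [],
    },
}


def passes_file_filter(file_filter, extension, mime_type):
    """Assumed semantics: an empty option means 'no restriction'."""
    media_type = mime_type.split("/")[0]
    pattern = file_filter.get("fileExtensionFilter")
    if pattern and not re.fullmatch(pattern, extension):
        return False
    inclusive_mime = file_filter.get("inclusiveMimeType", [])
    if inclusive_mime and mime_type not in inclusive_mime:
        return False
    inclusive_media = file_filter.get("inclusiveMediaType", [])
    if inclusive_media and media_type not in inclusive_media:
        return False
    if media_type in file_filter.get("exclusiveMediaType", []):
        return False
    if mime_type in file_filter.get("exclusiveMimeType", []):
        return False
    return True


# The dictionary serializes to syntactically valid JSON.
print(json.dumps(module_config, indent=2))
print(passes_file_filter(module_config["fileFilter"], "pdf", "application/pdf"))
```

With this example filter, a PDF file passes, while a plain text file is rejected by the assumed “inclusiveMimeType” restriction even though its extension matches.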

Browsing extraction results

The “Environment information”-tab provides the extracted file-independent information, whereas the extracted file-dependent information can be found in the “Monitored files”-tab by selecting one of the files in the list. By default an “information tree” is displayed, which shows the extracted information in a tree structure sorted by extraction date and extraction module. A double click on a result opens it in JSON format in a new window.

A display mode intended for the visualisation of changes in environment information can be accessed by clicking the “Visualise information change”-button. For this, an extraction module has to be selected in the combo box above. This mode shows the results of the selected module for all past extraction runs in a table, in which the first row shows the corresponding date of extraction. The table can be opened in fullscreen mode and exported for further analysis. If the “Visualise information change”-button is clicked while no extraction module is selected, the last extraction result is compared to the previous result, with the differences highlighted. The “Show information in JSON”-button displays the same information as the information change display, but uses the JSON format instead of a table.
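
The comparison of the last extraction result with the previous one can be pictured with a short Python sketch. It is not PET's code: the helper name diff_results is hypothetical, and the results are assumed to be flat key/value structures with made-up values.

```python
def diff_results(previous, current):
    """Return {key: (old_value, new_value)} for every key that changed."""
    changes = {}
    for key in sorted(set(previous) | set(current)):
        old_value = previous.get(key)
        new_value = current.get(key)
        if old_value != new_value:
            changes[key] = (old_value, new_value)
    return changes


# Hypothetical results of two consecutive extraction runs.
previous = {"free_memory_mb": 2048, "os": "Linux"}
current = {"free_memory_mb": 1536, "os": "Linux", "java_version": "1.7"}

print(diff_results(previous, current))
```

Keys that exist in only one of the two results are reported with None for the missing side, so added and removed entries show up as changes too.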

Which of the display modes is appropriate depends heavily on the actual use case.

Startup parameters

The following commands can either be specified in the configuration file .../PET_data/config/extractionPreferences in the format “option=value” with one option per line, or added as flags at tool start in the format “java -jar PET.jar --option1 --option2 …”.

help: Prints all possible start commands. The tool won’t be executed if this command is entered.
phelp: Prints the description of this tool. The tool won’t be executed if this command is entered.
version: Prints the version of this tool. The tool won’t be executed if this command is entered.
headless: Disables the GUI and the system tray icon to start the tool without any graphics. The command line interface will be started.
storage: Defines the storage to be used for saving the extraction results. By default the results are saved to an Elasticsearch database.
destination: Defines the directory where the “PET_data” directory will be created.
once: Starts a single extraction at start. This prevents the tool from starting in the continuous extraction mode.
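
The “option=value” format of the preferences file is simple enough to read with a few lines of Python. The sketch below is an assumption-laden illustration: the function name parse_preferences is hypothetical, and the example values (“true” for headless, the destination path) are invented, since the accepted values are not listed here.

```python
def parse_preferences(text):
    """Parse 'option=value' lines (one option per line) into a dict."""
    options = {}
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue  # skip blank lines
        key, _, value = line.partition("=")
        options[key.strip()] = value.strip()
    return options


# Hypothetical file content; the values are assumptions for illustration.
example = "headless=true\ndestination=/tmp/pet\n"
print(parse_preferences(example))
```

Values are kept as plain strings, because how PET interprets them is not specified in this guide.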

This project has received funding from the European Union’s Seventh Framework Programme for research, technological development and demonstration under grant agreement no FP7-601138 PERICLES.
