Anna Eggers edited this page May 5, 2015 · 24 revisions

The PERICLES Extraction Tool

The PERICLES Extraction Tool (PET) was created in the PERICLES EU project (http://www.pericles-project.eu/) to capture Significant Environment Information from live environments. Its purpose is to support better use and reuse of digital objects in the context of long-term data preservation.

The tool collects important environment information from the context in which the data is used, information that could be lost if it is not gathered at the right time. PET implements various information extraction techniques as plug-in Extraction Modules, either as complete implementations or, where possible, by re-using existing external tools and libraries. Environment monitoring is supported by specialized monitoring daemons and by continuous extraction of relevant information, triggered by environment events related to the creation and alteration of digital objects, such as the alteration of an observed file or directory, the opening or closing of a specific file, and other system calls.

The tool can be used in a sheer curation scenario, running in the system background under the full control of - but without disrupting - the user. Furthermore, a snapshot extraction mode exists for capturing the current state of the environment. It is mainly designed to extract information that doesn't change frequently, such as system resource specifications.

A general scenario for PET capture is described here: The overall objectives that we want to accomplish, each of which depends on the previous one, are:

  1. Use the PET tool to collect environment information when the digital objects (DOs) are used, based on specific profiles

  2. Analyse the information collected to infer new relationships between digital objects

  3. Assign values to the dependencies based on the purpose and significance (significance weights)

The current implementation of the PET tool covers the first objective and starts to address the second.

PET features

The following list outlines the main features of the PET tool:

  • Extracts information that is usually ignored by current metadata extractors.
  • Extracts environment information from outside the digital object to support re-use.
  • Extracts information at the right time and place: within the production environment
  • Supports continuous extraction in a sheer curation scenario.
  • Visualizes information change over time.
  • Information snapshot extractions allow getting a quick overview of extractable information.
  • Platform independent (needs Java 7).
  • Modular and extendable architecture that supports specialized needs.
  • Use profiles allow the parallel usage for different scenarios.
  • Provides graphical user interface, but can also run without graphics in console mode.
  • Provides exchangeable storage backend.
  • Saves results in standardized format: JSON or XML.

Implemented Extraction Modules

This section describes the different techniques we are using to extract environment information. The types of extraction module implementations can generally be divided into three groups:

(a) implementations that use the Java SE libraries provided with the programming language,

(b) implementations by calling operating system APIs, and

(c) implementations by integrating external programs or libraries.

The sigar library is useful for capturing a wide range of environment information independently of the operating system, for example system specifications and system resource usage. Based on this library, we developed extraction modules for capturing an information snapshot of the CPU specification, the available network interfaces, file storage information, file system information, network information and other system resources. Furthermore, modules for the continuous extraction of volatile information were developed. Examples of such information are, among others, the memory usage, CPU usage, process statistics, process parameters, swap usage, the FQDN, TCP calls, and information that can be extracted by system commands emulated by sigar, such as uptime and who.

Another type of module (daemon modules) monitors the environment for specific events and reports them by triggering new extractions or adding new files to the list of monitored files. The following two modules were created to monitor the resources used by running software.

The ‘lsof’ module repeatedly runs the lsof command, which is available for, and often preinstalled on, most Unix variants and lists open files and sockets. We are aware that there are other powerful commands for monitoring open files and sockets on Unix variants (such as dtruss and fs_usage on OS X, and strace on Linux); we opted for lsof for a number of reasons:

  • lsof is cross-platform, so one command works on Linux, OS X and other Unix variants, while the other commands are platform-specific;
  • lsof does not require administrative privileges, which the other commands do;
  • lsof is in practice quite reliable at reporting such events.

In the module configuration, the user can define a set of commands to watch. When an event satisfying the monitoring conditions occurs, the module generates a new event that is reported to the event system (which can react by storing the event or by adding a file to the list of monitored files). Examples of such events are the opening or closing of files or sockets by a specific application.
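The polling approach described above can be illustrated with a minimal sketch. It is not PET's actual implementation: the class and method names are hypothetical, and it assumes lsof is invoked with field output (for example `lsof -F pcn`), where each line starts with a one-character field identifier (`p` for the process ID, `c` for the command name, `n` for the file name). Two consecutive snapshots for a watched command are compared, and the differences are turned into open and close events.

```java
import java.util.ArrayList;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

// Hypothetical sketch, not part of PET: turns two lsof snapshots into events.
class OpenFileDiffSketch {

    /**
     * Parses lsof field output (assumed to come from `lsof -F pcn`) and returns
     * the file names held open by processes of the watched command.
     */
    static Set<String> parseFieldOutput(String output, String watchedCommand) {
        Set<String> open = new LinkedHashSet<>();
        String command = null;
        for (String line : output.split("\\r?\\n")) {
            if (line.isEmpty()) {
                continue;
            }
            char field = line.charAt(0);
            String value = line.substring(1);
            if (field == 'p') {
                command = null;            // a new process set starts here
            } else if (field == 'c') {
                command = value;           // command name of the current process
            } else if (field == 'n' && watchedCommand.equals(command)) {
                open.add(value);           // file or socket name
            }
            // all other fields (e.g. 'f' for the file descriptor) are ignored
        }
        return open;
    }

    /** Compares two snapshots and reports what was opened or closed in between. */
    static List<String> diff(Set<String> previous, Set<String> current) {
        List<String> events = new ArrayList<>();
        for (String name : current) {
            if (!previous.contains(name)) {
                events.add("OPENED " + name);
            }
        }
        for (String name : previous) {
            if (!current.contains(name)) {
                events.add("CLOSED " + name);
            }
        }
        return events;
    }

    public static void main(String[] args) {
        // Hand-written sample snapshots; a real module would capture lsof output.
        String before = "p100\ncvim\nfcwd\nn/home/u\nf3\nn/tmp/a.txt\n";
        String after = "p100\ncvim\nfcwd\nn/home/u\nf4\nn/tmp/b.txt\n";
        System.out.println(diff(parseFieldOutput(before, "vim"),
                                parseFieldOutput(after, "vim")));
    }
}
```

Because the snapshots are taken at intervals, a file that is opened and closed between two runs goes unnoticed; this is an inherent limit of the polling design rather than of this sketch.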

A similar but less effective solution was implemented for Windows, where the lsof command is not available, as a module for the ‘handle’ command. Like the lsof module, it reports open files by repeatedly executing the handle command. Although free, the handle command has a restrictive license that does not allow redistribution; users who want to use the module therefore have to download the handle command manually and copy it into the native command directory.

A module for capturing a snapshot of the currently installed software was developed by calling the operating system-dependent software management component. It is an example of a module that uses customized operating system calls. Another module of this kind extracts information about the graphics card currently in use. Several extraction modules were developed using only the Java SE API, and work independently of external libraries and commands.

The directory monitor module monitors directories for the creation, modification or deletion of files using the Java Watch Service. General file storage information is extracted by calling Files.getFileStore(file), while POSIX (Portable Operating System Interface for Unix) file information is extracted using Files.readAttributes(file, PosixFileAttributes.class).
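As an illustration of these three Java SE calls, here is a minimal sketch. The class name, method names and the selection of reported attributes are assumptions made for the example and are not taken from PET's source code.

```java
import java.io.IOException;
import java.nio.file.FileStore;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardWatchEventKinds;
import java.nio.file.WatchEvent;
import java.nio.file.WatchKey;
import java.nio.file.WatchService;
import java.nio.file.attribute.PosixFileAttributes;
import java.nio.file.attribute.PosixFilePermissions;
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.concurrent.TimeUnit;

// Hypothetical sketch, not part of PET.
class FileInfoSketch {

    /** Registers a directory for create, modify and delete events. */
    static WatchService watch(Path dir) throws IOException {
        WatchService watcher = dir.getFileSystem().newWatchService();
        dir.register(watcher,
                StandardWatchEventKinds.ENTRY_CREATE,
                StandardWatchEventKinds.ENTRY_MODIFY,
                StandardWatchEventKinds.ENTRY_DELETE);
        return watcher;
    }

    /** Waits up to the timeout for one batch of events; empty list if none arrive. */
    static List<String> pollEvents(WatchService watcher, long timeoutMillis)
            throws InterruptedException {
        List<String> events = new ArrayList<>();
        WatchKey key = watcher.poll(timeoutMillis, TimeUnit.MILLISECONDS);
        if (key == null) {
            return events;
        }
        for (WatchEvent<?> event : key.pollEvents()) {
            events.add(event.kind().name() + " " + event.context());
        }
        key.reset(); // re-arm the key so later events are delivered
        return events;
    }

    /** General information about the file store that holds the given file. */
    static Map<String, Object> fileStoreInfo(Path file) throws IOException {
        FileStore store = Files.getFileStore(file);
        Map<String, Object> info = new LinkedHashMap<>();
        info.put("name", store.name());
        info.put("type", store.type());
        info.put("totalSpace", store.getTotalSpace());
        info.put("usableSpace", store.getUsableSpace());
        return info;
    }

    /** POSIX attributes; returns an empty map where the file system has no POSIX view. */
    static Map<String, Object> posixInfo(Path file) throws IOException {
        Map<String, Object> info = new LinkedHashMap<>();
        if (!file.getFileSystem().supportedFileAttributeViews().contains("posix")) {
            return info;
        }
        PosixFileAttributes attrs = Files.readAttributes(file, PosixFileAttributes.class);
        info.put("owner", attrs.owner().getName());
        info.put("group", attrs.group().getName());
        info.put("permissions", PosixFilePermissions.toString(attrs.permissions()));
        return info;
    }

    public static void main(String[] args) throws Exception {
        Path dir = Files.createTempDirectory("pet-sketch");
        WatchService watcher = watch(dir);
        Path file = Files.createFile(dir.resolve("example.txt"));
        System.out.println(pollEvents(watcher, 2000));
        System.out.println(fileStoreInfo(file));
        System.out.println(posixInfo(file));
        watcher.close();
    }
}
```

How quickly the Watch Service delivers events depends on the platform: some implementations rely on native notifications, others fall back to polling, so a short timeout may return an empty list.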

Information about the Java installation and more general operating system properties is extracted by calling System.getProperties(). The javax.xml packages were used to develop a module that extracts information from XML files with XPath expressions. java.util.regex was used for a module that extracts information from plain text files with a regular expression, as well as for a module that extracts information from log files via regular expressions. Two modules were developed using java.awt: a module that captures screenshots to save graphical elements, and a module that extracts the system's graphics properties. A module that calculates the checksum of a file uses java.security.
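The following sketch shows how such Java SE-only extractions might look. It is illustrative rather than PET's actual code: the class and method names are hypothetical, and the checksum algorithm is passed in as a parameter because the text above does not say which one PET uses.

```java
import java.io.ByteArrayInputStream;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Properties;
import java.util.regex.Matcher;
import java.util.regex.Pattern;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;

// Hypothetical sketch, not part of PET.
class JavaSeExtractionSketch {

    /** Picks selected entries out of System.getProperties(). */
    static Map<String, String> selectedProperties(String... keys) {
        Properties all = System.getProperties();
        Map<String, String> selected = new LinkedHashMap<>();
        for (String key : keys) {
            selected.put(key, all.getProperty(key));
        }
        return selected;
    }

    /** Evaluates an XPath expression against an XML string and returns the text result. */
    static String xpathValue(String xml, String expression) throws Exception {
        Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                .parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));
        return XPathFactory.newInstance().newXPath().evaluate(expression, doc);
    }

    /** Returns every match of a regular expression in the given text. */
    static List<String> regexMatches(String text, String regex) {
        List<String> matches = new ArrayList<>();
        Matcher matcher = Pattern.compile(regex).matcher(text);
        while (matcher.find()) {
            matches.add(matcher.group());
        }
        return matches;
    }

    /** Computes a hex-encoded checksum; the algorithm name (e.g. "SHA-256") is an assumption. */
    static String checksum(InputStream in, String algorithm) throws Exception {
        MessageDigest digest = MessageDigest.getInstance(algorithm);
        byte[] buffer = new byte[8192];
        int read;
        while ((read = in.read(buffer)) != -1) {
            digest.update(buffer, 0, read);
        }
        StringBuilder hex = new StringBuilder();
        for (byte b : digest.digest()) {
            hex.append(String.format("%02x", b));
        }
        return hex.toString();
    }

    public static void main(String[] args) throws Exception {
        System.out.println(selectedProperties("java.version", "os.name", "os.arch"));
        System.out.println(xpathValue("<doc><title>PET</title></doc>", "/doc/title"));
        System.out.println(regexMatches("ERROR 12 WARN 7", "\\d+"));
        InputStream data = new ByteArrayInputStream("abc".getBytes(StandardCharsets.UTF_8));
        System.out.println(checksum(data, "SHA-256"));
    }
}
```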

Additionally, the following modules rely on external tools for information extraction or environment monitoring: an Apache Tika file identification module, a MediaInfo module, an OS X Spotlight module, a module for extracting MS Office document dependencies using the Office Dependency Discovery Tool, and a chrome-cli module for monitoring opened tabs in the Chrome browser. The PDFBox library is used to extract the font dependencies of PDF files.

In addition, a generic module was created to execute any OS command with a specific configuration, allowing users to create new modules just by defining the command configuration, with no programming necessary.
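The idea behind such a generic command module can be sketched as follows. This is an assumption-laden illustration, not PET's configuration format or code: the command is given as a plain argument list, and the class name, method name and result keys are hypothetical.

```java
import java.io.BufferedReader;
import java.io.File;
import java.io.IOException;
import java.io.InputStreamReader;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch, not part of PET.
class CommandModuleSketch {

    /** Runs the configured command and returns its exit code and combined output. */
    static Map<String, Object> run(List<String> command)
            throws IOException, InterruptedException {
        ProcessBuilder builder = new ProcessBuilder(command);
        builder.redirectErrorStream(true); // merge stderr into stdout
        Process process = builder.start();

        StringBuilder output = new StringBuilder();
        try (BufferedReader reader = new BufferedReader(
                new InputStreamReader(process.getInputStream()))) {
            String line;
            while ((line = reader.readLine()) != null) {
                output.append(line).append('\n');
            }
        }
        int exitCode = process.waitFor();

        Map<String, Object> result = new LinkedHashMap<>();
        result.put("command", command);
        result.put("exitCode", exitCode);
        result.put("output", output.toString());
        return result;
    }

    public static void main(String[] args) throws Exception {
        // Example configuration: ask the running Java installation for its version.
        String java = System.getProperty("java.home")
                + File.separator + "bin" + File.separator + "java";
        System.out.println(run(Arrays.asList(java, "-version")));
    }
}
```

Passing the command as a list of arguments rather than a single shell string avoids quoting problems and keeps the sketch independent of any particular shell.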