
Validation of basicFusion products

H. Joe Lee edited this page Oct 31, 2017 · 10 revisions

Goal

Give data users 99.999% confidence that all input files are processed properly into output files.

Introduction

Data validation is a very important step in data production. How can you tell whether the final data product was processed correctly, without missing data? There are several ways to ensure that the contents of input files are migrated successfully to output files.

Problem

The main problem is the typical big data problem: volume, variety, and velocity.

Solution

HDF4 to HDF5 migration validation

  • Compare the sum of the input file sizes with the output file size: they should be linearly proportional.
  • Process error logs: any error message indicates that something went wrong.
  • Perform image analysis: data values may differ because of scale/offset processing, but patterns (like a hurricane) in the imagery will be the same.
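The size comparison above can be sketched as a small check. This is a minimal sketch, not part of the basicFusion code; the helper names and the [0.5, 2.0] ratio bounds are illustrative assumptions that should be tuned to the conversion ratio actually observed for this product.

```python
import os

def total_size(paths):
    """Sum the on-disk sizes of a list of input granules."""
    return sum(os.path.getsize(p) for p in paths)

def sizes_look_proportional(total_input_bytes, output_bytes, lo=0.5, hi=2.0):
    """Flag outputs whose size is wildly out of proportion to the combined
    input size. The [lo, hi] bounds are placeholders, not measured values."""
    ratio = output_bytes / float(total_input_bytes)
    return lo <= ratio <= hi
```

A batch check would then be something like `sizes_look_proportional(total_size(input_files), os.path.getsize(output_file))` for each output granule.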

HDF5 / netCDF-4 interoperability

The final product should be interoperable with netCDF. Thus, the following validation will also be helpful to ensure data interoperability.

  • Comparison of ncdump and h5dump: both tools should report similar output.
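One way to automate that comparison is to diff the two header dumps after normalizing whitespace. The helper below is a hypothetical sketch (`dump_diff` is not an existing tool): in practice you would feed it the captured output of `ncdump -h file.nc` and `h5dump -H file.h5`, and since the two formats differ, a real check would compare extracted object names and dimensions rather than raw text.

```python
import difflib

def dump_diff(dump_a, dump_b):
    """Return unified-diff lines between two textual dumps, with runs of
    whitespace collapsed so indentation differences between the two tools
    do not count as mismatches. An empty result means the dumps agree."""
    def norm(text):
        return [" ".join(line.split()) for line in text.splitlines()]
    return list(difflib.unified_diff(norm(dump_a), norm(dump_b), lineterm=""))
```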

Experiment

To set up a Python environment with matplotlib on Roger, run module load python/2.7.10 first, and then run configureEnv.sh in the util/ directory.

This will create a virtual environment in externLib/BFpyEnv with all the required dependencies. Source externLib/BFpyEnv/bin/activate, then run the inquireSize.py script.

To install python-hdf4 (needed to generate images from HDF4 files), build and install HDF4 first with -fPIC:

export CFLAGS=-fPIC && ./configure --disable-netcdf --prefix=/home/username && make && make install

Then, export the library and include paths and install the package:

export INCLUDE_DIRS=/home/username/include && export LIBRARY_DIRS=/home/username/lib && pip install python-hdf4
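Once python-hdf4 is installed, image generation typically reads an SDS, applies its calibration, and plots it. Below is a minimal sketch of the calibration step, assuming the HDF4 SDS convention physical = cal × (raw − offset); the pyhdf usage is shown only in comments, and the dataset name 'Radiance' is a placeholder, not a basicFusion field name.

```python
import numpy as np

def calibrate(raw, cal=1.0, offset=0.0):
    """Apply the HDF4 SDS calibration convention: physical = cal * (raw - offset)."""
    return cal * (np.asarray(raw, dtype=float) - offset)

# With python-hdf4 installed, a file can be read and imaged roughly like this:
#
#   from pyhdf.SD import SD, SDC
#   import matplotlib; matplotlib.use('Agg')
#   import matplotlib.pyplot as plt
#   sd = SD('input.hdf', SDC.READ)
#   sds = sd.select('Radiance')            # placeholder dataset name
#   cal, cal_err, offset, offset_err, _ = sds.getcal()
#   plt.imshow(calibrate(sds.get(), cal, offset))
#   plt.savefig('radiance.png')
```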