Synthetic Data Generator

Ever found yourself scrambling to find test data, only to discover that there isn't enough of it? Or found that you can't generate the data across multiple threads, so it takes too long to produce?

Look no further: this data generator fakes classic human resources data about employees. The records include the kinds of complex, nested data structures that make computation expensive or difficult, so you can truly test your platform.

This repo provides the code to generate as many Employee records as you want, split over as many Avro files as you desire. You can also optionally define the number of parallel threads used to generate your data.

An Employee object contains the following fields:

class Employee {
    UserId uid;
    String name;
    String dateOfBirth;
    PhoneNumber[] contactNumbers;
    EmergencyContact[] emergencyContacts;
    Address address;
    BankDetails bankDetails;
    String taxCode;
    Nationality nationality;
    Manager[] manager;
    String hireDate;
    Grade grade;
    Department department;
    int salaryAmount;
    int salaryBonus;
    WorkLocation workLocation;
    Sex sex;
}

The manager field is an array of Manager objects, each of which can itself contain managers, so the structure can be nested several layers deep. In the generated data, manager is nested 3-5 layers deep.
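
To make the nesting concrete, here is a minimal sketch of how a manager chain of a chosen depth could be built and measured. The Manager class below is a simplified, hypothetical stand-in (only a name and a manager array), and generateChain and depthOf are illustrative helpers; none of this is taken from the repo's actual implementation.

```java
import java.util.Random;

public class ManagerNestingSketch {
    // Hypothetical, simplified stand-in for the generator's Manager type.
    static class Manager {
        String name;
        Manager[] manager; // empty array at the top of the chain
    }

    // Build a chain of managers exactly `depth` levels deep.
    static Manager[] generateChain(int depth, Random random) {
        if (depth <= 0) {
            return new Manager[0];
        }
        Manager m = new Manager();
        m.name = "Manager-" + random.nextInt(1000);
        m.manager = generateChain(depth - 1, random);
        return new Manager[] { m };
    }

    // Count how many levels of managers are nested below this array.
    static int depthOf(Manager[] managers) {
        int max = 0;
        for (Manager m : managers) {
            max = Math.max(max, 1 + depthOf(m.manager));
        }
        return max;
    }

    public static void main(String[] args) {
        Random random = new Random();
        int depth = 3 + random.nextInt(3); // 3, 4 or 5 levels
        Manager[] chain = generateChain(depth, random);
        System.out.println("Nesting depth: " + depthOf(chain));
    }
}
```

Recursive structures like this are what make the data a useful stress test: a consumer cannot flatten the record without walking every level.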

To use the generator you will need Git, Maven and JDK 11 installed.

To get started, first clone this repo locally:

git clone https://github.com/gchq/synthetic-data-generator.git

Then cd into the synthetic-data-generator directory and build the codebase:

mvn clean install

Then start the generator:

./createHRData.sh PATH EMPLOYEES FILES [THREADS]

where:

PATH is the relative path of the directory where the files are generated
EMPLOYEES is the number of employee records to create
FILES is the number of files to spread them over
THREADS (optional) is the number of threads to use

For example, to generate 1,000,000 employee records spread over 15 files, using 4 threads, and write the output files to data/employee:

./createHRData.sh data/employee 1000000 15 4
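
As a rough illustration of what these parameters imply, the sketch below divides a record count across files and runs one task per file on a fixed-size thread pool. This is an assumption about how such a split could work, not a description of the repo's code: recordsPerFile and generateAll are hypothetical names, and the task body is a placeholder that only reports its record count instead of writing an Avro file.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

public class SplitSketch {
    // Records per file; the first (employees % files) files get one extra record.
    static long[] recordsPerFile(long employees, int files) {
        long[] counts = new long[files];
        long base = employees / files;
        long remainder = employees % files;
        for (int i = 0; i < files; i++) {
            counts[i] = base + (i < remainder ? 1 : 0);
        }
        return counts;
    }

    // Run one task per file on a pool of `threads` workers and total the results.
    static long generateAll(long employees, int files, int threads) {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        try {
            List<Future<Long>> results = new ArrayList<>();
            for (long count : recordsPerFile(employees, files)) {
                // Placeholder task: a real generator would build and write `count` records here.
                results.add(pool.submit(() -> count));
            }
            long total = 0;
            for (Future<Long> result : results) {
                total += result.get();
            }
            return total;
        } catch (InterruptedException | ExecutionException e) {
            throw new IllegalStateException("Generation failed", e);
        } finally {
            pool.shutdown();
        }
    }

    public static void main(String[] args) {
        long[] counts = recordsPerFile(1_000_000, 15);
        // Prints "First file: 66667, last file: 66666"
        System.out.println("First file: " + counts[0] + ", last file: " + counts[14]);
        System.out.println("Total generated: " + generateAll(1_000_000, 15, 4));
    }
}
```

Under this scheme, 1,000,000 records over 15 files gives ten files of 66,667 records and five of 66,666. With 4 threads, at most four files are in progress at any moment.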
