Goscript - Data Generation solely with Go #732

Open
wants to merge 3 commits into
base: main
2 changes: 1 addition & 1 deletion README.md
@@ -377,7 +377,7 @@ Execute the following steps to run the challenge:
This will take a few minutes.
**Attention:** the generated file has a size of approx. **12 GB**, so make sure to have enough disk space.

If you're running the challenge with a non-Java language, there's a non-authoritative Python script to generate the measurements file at `src/main/python/create_measurements.py`. The authoritative method for generating the measurements is the Java program `dev.morling.onebrc.CreateMeasurements`.
If you're running the challenge with a non-Java language, there's a non-authoritative Python script to generate the measurements file at `src/main/python/create_measurements.py`, as well as a non-authoritative Go script at `src/main/go/sdb/create_measurements.go`. The authoritative method for generating the measurements is the Java program `dev.morling.onebrc.CreateMeasurements`.

3. Calculate the average measurement values:

111 changes: 111 additions & 0 deletions src/main/go/sdb/README.md
@@ -0,0 +1,111 @@

Golang Weather Data Generator

======================



Overview

--------



If you're reading this, I'm assuming that you are generally familiar with the 1 Billion Rows Challenge and have likely already cloned this repo locally. If not, please review the readme in the root folder for much-needed context.

My particular program is designed to generate test data for the 1 Billion Rows Challenge using Go and only Go. Please note that this is **NOT** a solution to the Challenge, but merely a tool to generate a large amount of test data.

This program originated as a necessity; when I started looking at this challenge, I found that I had no easy way to generate 1 Billion rows of test data. The default instructions for data generation require Java, which I did not feel like installing just for this project, and while there are a couple of solutions in Go merged into this repo, there was no tool simply for creating test data using Go.

This program generates simulated weather data for a predefined list of weather stations. It creates a file (`measurements.txt`) containing measurements for each station, including the station name and a randomly generated temperature value, conforming to the format of the 1 Billion Row Challenge:

Sokyriany;66.8
Araranguá;-63.2
New Ulm;90.2

This program is designed to be flexible; you can specify however many rows you want to generate, although generation time grows with the number of rows.

Features

--------



- Customizable Data Generation: Users can specify the exact number of data rows they wish to generate.

- Simulated Weather Data: For each row, the program selects a weather station from a predefined list and assigns it a randomly generated temperature.

- Testability: The design incorporates dependency injection and interfaces, enhancing testability and maintainability.
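
As a rough illustration of the testability point, the sketch below injects a deterministic stand-in for the `Random` interface so that the generated row is predictable. `fixedRandom` and `formatRow` are hypothetical names introduced here for illustration; only the `Random` interface and the row format come from the program itself.

```go
package main

import "fmt"

// Random mirrors the interface declared in create_measurements.go.
type Random interface {
	Float64() float64
	Intn(n int) int
}

// fixedRandom is a hypothetical test double that always returns the same values.
type fixedRandom struct {
	f float64
	i int
}

func (r fixedRandom) Float64() float64 { return r.f }
func (r fixedRandom) Intn(n int) int   { return r.i % n }

// formatRow is a hypothetical helper that builds one "station;temperature"
// row the same way the generator's inner loop does.
func formatRow(stations []string, rnd Random) string {
	const coldest, hottest = -99.9, 99.9
	name := stations[rnd.Intn(len(stations))]
	temp := rnd.Float64()*(hottest-coldest) + coldest
	return fmt.Sprintf("%s;%.1f\n", name, temp)
}

func main() {
	// With f = 0 the temperature is the coldest value; i = 1 picks the second station.
	rnd := fixedRandom{f: 0, i: 1}
	fmt.Print(formatRow([]string{"Sokyriany", "New Ulm"}, rnd)) // prints "New Ulm;-99.9"
}
```

Because the random source is passed in as an interface, a test can assert on the exact output rather than on a statistical property.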



Prerequisites

-------------



- Go (version 1.15 or newer recommended)



Ensure Go is installed and properly configured on your system. You can verify this by running `go version` in your terminal.


Usage

-----



To run the program, navigate to the project directory (`src/main/go/sdb`) and use the following command:






`go run . [numRows]`



Replace `[numRows]` with the number of data rows you want to generate. For example:






`go run . 10000`



This command generates a file with 10,000 rows of simulated weather data.

This utility is **NOT** yet optimized, and as such it currently takes 10-12 minutes to write a file with a full billion rows. As time permits, I'll try to come back and optimize, but at present, I consider this an acceptable tradeoff for a utility you'll probably run only once.
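
One possible direction for a later optimization, shown only as an unmeasured sketch: build each row in a reusable byte slice with `strconv.AppendFloat` instead of `fmt.Sprintf`, and write through a larger `bufio` buffer. `appendRow`, the 1 MiB buffer size, and the three-station list are assumptions made for this example and are not part of the program.

```go
package main

import (
	"bufio"
	"math/rand"
	"os"
	"strconv"
)

// appendRow appends "name;temp\n" to buf, with temp rendered to one decimal
// place, and returns the extended slice. Hypothetical helper for illustration.
func appendRow(buf []byte, name string, temp float64) []byte {
	buf = append(buf, name...)
	buf = append(buf, ';')
	buf = strconv.AppendFloat(buf, temp, 'f', 1, 64)
	return append(buf, '\n')
}

func main() {
	stations := []string{"Sokyriany", "Araranguá", "New Ulm"} // stand-in list
	w := bufio.NewWriterSize(os.Stdout, 1<<20)                 // 1 MiB buffer, an arbitrary choice
	defer w.Flush()

	buf := make([]byte, 0, 64) // reused for every row to avoid per-row allocations
	for i := 0; i < 5; i++ {
		temp := rand.Float64()*199.8 - 99.9
		buf = appendRow(buf[:0], stations[rand.Intn(len(stations))], temp)
		if _, err := w.Write(buf); err != nil {
			panic(err)
		}
	}
}
```

Whether this actually shortens the run depends on the machine and disk, so it would need to be benchmarked before replacing the current loop.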



Configuration

-------------



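- Data File Location: The output file is written to the `data` directory at the root of the repo as `measurements.txt` (the program uses the relative path `../../../../data/measurements.txt`, so run it from `src/main/go/sdb`). Customize this path in the source code if necessary.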

- Weather Station List: The list of weather stations is read from the `weather_stations.csv` file in the repo's `data` directory. This file is included with the repo; please make sure it has not been modified.



Testing

-------



The program includes unit tests for its core functionality, although test coverage is not complete. Run these tests to ensure the program operates as expected:



`go test`
201 changes: 201 additions & 0 deletions src/main/go/sdb/create_measurements.go
@@ -0,0 +1,201 @@
//
// Copyright 2023 The original authors
//
// Licensed under the Apache License, Version 2.0 (the "License");
// you may not use this file except in compliance with the License.
// You may obtain a copy of the License at
//
// http://www.apache.org/licenses/LICENSE-2.0
//
// Unless required by applicable law or agreed to in writing, software
// distributed under the License is distributed on an "AS IS" BASIS,
// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
// See the License for the specific language governing permissions and
// limitations under the License.
//

// # Based on https://github.com/gunnarmorling/1brc/blob/main/src/main/java/dev/morling/onebrc/CreateMeasurements.java and https://github.com/gunnarmorling/1brc/blob/main/src/main/python/create_measurements.py

package main

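import (
	"bufio"
	"fmt"
	"math/rand"
	"os"
	"strconv"
	"strings"
	"time"
)

// FileOpener abstracts opening the station list so tests can inject a fake.
type FileOpener interface {
	Open(name string) (*os.File, error)
}

// RealFileOpener opens files from the local filesystem.
type RealFileOpener struct{}

func (r RealFileOpener) Open(name string) (*os.File, error) {
	return os.Open(name)
}

// FileWriter abstracts creating the output file and its buffered writer.
type FileWriter interface {
	Create(name string) (*os.File, error)
	NewWriter(file *os.File) *bufio.Writer
}

// RealFileWriter creates files on the local filesystem.
type RealFileWriter struct{}

func (RealFileWriter) Create(name string) (*os.File, error) {
	return os.Create(name)
}

func (RealFileWriter) NewWriter(file *os.File) *bufio.Writer {
	return bufio.NewWriter(file)
}

// Random abstracts the random source so tests can supply fixed values.
type Random interface {
	Float64() float64
	Intn(n int) int
}

// StdRandom delegates to the package-level math/rand functions.
type StdRandom struct{}

func (StdRandom) Float64() float64 {
	return rand.Float64()
}

func (StdRandom) Intn(n int) int {
	return rand.Intn(n)
}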

// checkArgs validates the command line and returns the requested row count.
func checkArgs(args []string) (int, error) {
	if len(args) != 2 {
		return 0, fmt.Errorf("incorrect number of arguments - see example usage, go run create_measurements.go 1000")
	}
	numRows, err := strconv.Atoi(args[1])
	if err != nil || numRows <= 0 {
		return 0, fmt.Errorf("argument must be a positive integer - see example usage, go run create_measurements.go 1000")
	}
	return numRows, nil
}

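// buildWeatherStationNameList reads the station names (the first ";"-separated
// field of each line) from the CSV file, skipping lines that contain "#".
func buildWeatherStationNameList(opener FileOpener) ([]string, error) {
	var stationNames []string

	file, err := opener.Open("../../../../data/weather_stations.csv")
	if err != nil {
		// Return the error instead of continuing with an unusable file.
		return nil, fmt.Errorf("error opening file: %w", err)
	}
	defer file.Close()

	scanner := bufio.NewScanner(file)
	for scanner.Scan() {
		line := scanner.Text()
		if strings.Contains(line, "#") {
			continue
		}
		station := strings.Split(line, ";")[0]
		stationNames = append(stationNames, station)
	}
	if err := scanner.Err(); err != nil {
		return nil, fmt.Errorf("error reading file: %w", err)
	}
	// An empty list would cause a division by zero in estimateFileSize
	// and a panic in Intn(0) during generation.
	if len(stationNames) == 0 {
		return nil, fmt.Errorf("no weather stations found")
	}
	return stationNames, nil
}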

func estimateFileSize(weatherStationNames []string, numRowsToCreate int) string {
	totalNameBytes := 0
	for _, name := range weatherStationNames {
		totalNameBytes += len(name)
	}
	avgNameBytes := totalNameBytes / len(weatherStationNames)
	avgTempBytes := 4.400200100050025
	avgLineLength := avgNameBytes + int(avgTempBytes) + 2
	fileSize := numRowsToCreate * avgLineLength
	return fmt.Sprintf("Estimated max file size is: %s.", convertBytes(fileSize))
}

func convertBytes(num int) string {
	units := []string{"bytes", "KiB", "MiB", "GiB"}
	var i int
	for num >= 1024 && i < len(units)-1 {
		num /= 1024
		i++
	}
	return fmt.Sprintf("%d %s", num, units[i])
}

// buildTestData writes numRowsToCreate rows of "station;temperature" to
// measurements.txt, choosing a random station and temperature for each row.
func buildTestData(weatherStationNames []string, numRowsToCreate int, fileWriter FileWriter, random Random) error {
	startTime := time.Now()

	coldestTemp := -99.9
	hottestTemp := 99.9

	// Adjust the batchSize based on numRowsToCreate if less than 10,000
	batchSize := 10000
	if numRowsToCreate < batchSize {
		batchSize = numRowsToCreate
	}

	fmt.Println("Building test data...")

	// Initialize the file and writer
	file, err := fileWriter.Create("../../../../data/measurements.txt")
	if err != nil {
		return fmt.Errorf("error creating file: %w", err)
	}
	defer file.Close()

	writer := fileWriter.NewWriter(file)

	// Generate and write data in batches
	for i := 0; i < numRowsToCreate; i += batchSize {
		end := i + batchSize
		if end > numRowsToCreate {
			end = numRowsToCreate
		}

		for j := i; j < end; j++ {
			stationName := weatherStationNames[random.Intn(len(weatherStationNames))]
			temp := random.Float64()*(hottestTemp-coldestTemp) + coldestTemp
			line := fmt.Sprintf("%s;%.1f\n", stationName, temp)
			if _, err := writer.WriteString(line); err != nil {
				return fmt.Errorf("error writing string: %w", err)
			}
		}
	}

	// Flush explicitly so a failed final write is reported rather than ignored.
	if err := writer.Flush(); err != nil {
		return fmt.Errorf("error flushing writer: %w", err)
	}

	fmt.Println("\nTest data successfully written.")
	fmt.Printf("Elapsed time: %s\n", time.Since(startTime))
	return nil
}

func main() {
	args := os.Args
	numRowsToCreate, err := checkArgs(args)

	if err != nil {
		fmt.Println(err)
		os.Exit(1)
	}

	opener := RealFileOpener{}
	weatherStationNames, err := buildWeatherStationNameList(opener)

	if err != nil {
		fmt.Println(err)
		os.Exit(1)
	}

	fmt.Println(estimateFileSize(weatherStationNames, numRowsToCreate))

	fileWriter := RealFileWriter{}
	random := StdRandom{}

	// Call buildTestData with the concrete implementations.
	err = buildTestData(weatherStationNames, numRowsToCreate, fileWriter, random)
	if err != nil {
		fmt.Printf("Failed to build test data: %v\n", err)
		// Exit non-zero so callers and scripts can detect the failure.
		os.Exit(1)
	}

	fmt.Println("Test data build complete.")
}