Process Workflow
This document describes what tasks bamboo performs for each of the requests listed.
Description: Load a dataset into bamboo. This request returns the id of the new dataset, loads the data into the database, and calculates the summary statistics for the data.
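For orientation, here is a minimal sketch of issuing this request from Python. The host and port are hypothetical; the /datasets endpoint and the csv_file multipart field follow bamboo's REST API.

```python
import requests

# Hypothetical local bamboo instance; the /datasets endpoint and the
# csv_file multipart field follow bamboo's REST API.
with open('data.csv', 'rb') as f:
    response = requests.post('http://localhost:8080/datasets',
                             files={'csv_file': f})

# The id comes back immediately; the import and summary calculation
# continue asynchronously on the server, as the steps below describe.
print(response.json()['id'])
```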
- [controllers/datasets.py:create] request is handled by Datasets controller create method
- [controllers/datasets.py:create] if data is present, create a new Dataset object and save it (just initial metadata)
- [lib/io.py:import_from_csv] run corresponding import_from_* method (e.g. import_from_csv)
- [lib/io.py:import_from_csv] import_from_csv writes the csv file out to a tmp file (pandas read_csv requires a closed file; see the import sketch after this list)
- [lib/io.py:import_from_csv] call_async import_dataset, return dataset
- [controllers/datasets.py:create] returns id in response
- [lib/io.py:import_dataset] file_reader/read_csv into BambooFrame, delete tmp file
- [lib/io.py:import_dataset] calls save_observations which calls Observation.save()
- [models/observation.py:save] calls build_schema
- [models/dataset.py:build_schema] calls the schema property of the dataset object, which reads the schema from dataset.record and returns Schema.safe_init on it
- [lib/schema_builder.py:safe_init] returns empty Schema
- [lib/schema_builder.py:schema_from_dframe] gets the dtypes from the dframe as a dict
- [lib/schema_builder.py:schema_from_dframe] loops through the names in dtypes, adds non-reserved names to column_names, and slugs the columns into encoded_names
- [lib/schema_builder.py:schema_from_dframe] makes a blank Schema obj, loops through (name, dtype) in dtypes, sets the column schema (label, olap_type, simpletype, cardinality) for each, adds each column schema to the main schema under the encoded column name, and returns new_schema (see the schema sketch after this list)
- [models/dataset.py:build_schema] self.set_schema to new_schema, updates the dataset record in database
- [models/observation.py:save] batch_save the result of dataset.encode_dframe_columns
- [models/dataset.py:encode_dframe_columns] adds a dataset_id column to the dframe, returning a new BambooFrame (see the encoding sketch after this list)
- [models/abstract_model.py:batch_save] calls through to _batch_command
- [models/abstract_model.py:_batch_command] constructs records from the dframe like so (see the batch-insert sketch after this list):
  records = [row.to_dict() for (_, row) in dframe[start:end].iterrows()]
- [models/abstract_model.py:_batch_command] runs the command (insert) on the records
- [models/observation.py:save] update dataset in db with rows/cols and state = ready
- [models/observation.py:save] calls dataset.summarize()
- [models/dataset.py:summarize] calls self.reload() (is this necessary?)
- [models/dataset.py:reload] gets the dataset record from the db and sets it to self.dataset (also clears _dframe cache, do we need to?)
- [models/dataset.py:summarize] calls summarize on dframe
- [core/summary.py:summarize] checks cached stats and calls through to summarize_df (since no groups)
- [core/summary.py:summarize_df] uses series_to_json_dict on summarize_series to set the summary for each column in the dframe (see the summary sketch after this list)
- [core/summary.py:summarize] save summary in dataset record using dict_for_mongo
- return back up through the call stack and end the task
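The sketches below illustrate a few of the steps above. Every *_sketch name is hypothetical and the bodies are approximations, not bamboo's actual code. First, the import sketch, showing the tmp-file dance in import_from_csv: the upload is written to a closed tmp file because read_csv needs a complete, closed file.

```python
import os
import tempfile

import pandas as pd


def import_from_csv_sketch(file_obj):
    # Write the upload out to a closed tmp file first, since pandas'
    # read_csv requires a complete, closed file.
    fd, tmp_path = tempfile.mkstemp(suffix='.csv')
    with os.fdopen(fd, 'wb') as tmp:
        tmp.write(file_obj.read())
    try:
        # The file_reader/read_csv step in import_dataset.
        dframe = pd.read_csv(tmp_path)
    finally:
        # Delete the tmp file once the dframe is in memory.
        os.unlink(tmp_path)
    return dframe
```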
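The schema sketch approximates schema_from_dframe. The dtype-to-simpletype mapping, the slug regex, and the measure/dimension rule are assumptions about what lib/schema_builder.py actually does, not copied from it.

```python
import re

# Assumed mapping from numpy dtype kinds to bamboo simpletypes.
DTYPE_TO_SIMPLETYPE = {'i': 'integer', 'f': 'float', 'b': 'boolean', 'O': 'string'}
RESERVED_NAMES = {'_id', 'dataset_id'}


def slugify(label):
    # Slug a column label into a safe encoded name (encoded_names).
    return re.sub(r'\W+', '_', label.strip().lower())


def schema_from_dframe_sketch(dframe):
    new_schema = {}
    for name, dtype in dframe.dtypes.items():
        if name in RESERVED_NAMES:
            continue
        simpletype = DTYPE_TO_SIMPLETYPE.get(dtype.kind, 'string')
        new_schema[slugify(name)] = {
            'label': name,
            'simpletype': simpletype,
            # Assumed rule: numeric columns are measures, the rest dimensions.
            'olap_type': 'measure' if simpletype in ('integer', 'float') else 'dimension',
            'cardinality': int(dframe[name].nunique()),
        }
    return new_schema
```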
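The encoding sketch: encode_dframe_columns boils down to adding the dataset_id column. The rationale offered in the comment (letting observations from many datasets share one collection) is an inference, not something stated in the walkthrough above.

```python
def encode_dframe_columns_sketch(dframe, dataset_id):
    # Tag every row with the owning dataset's id so observations from
    # many datasets can share one collection (assumed rationale).
    dframe = dframe.copy()
    dframe['dataset_id'] = dataset_id
    return dframe
```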
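The batch-insert sketch is built around the records line quoted in the list above. The batch size is arbitrary here, and insert_many is modern pymongo; bamboo's 2013-era code would have used a different insert call.

```python
def batch_command_sketch(collection, dframe, batch_size=1000):
    # Insert the dframe in chunks: slice it, convert each row to a dict,
    # and hand the whole batch to Mongo in one call.
    for start in range(0, len(dframe), batch_size):
        end = start + batch_size
        records = [row.to_dict() for (_, row) in dframe[start:end].iterrows()]
        collection.insert_many(records)
```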
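Finally, the summary sketch. The numeric-vs-categorical split in summarize_series_sketch is an assumption about core/summary.py; series_to_json_dict's job of producing a JSON/Mongo-safe dict is as described above.

```python
import pandas as pd


def summarize_series_sketch(series):
    # Assumed split: descriptive stats for numeric columns, frequency
    # counts for everything else.
    if pd.api.types.is_numeric_dtype(series):
        return series.describe()
    return series.value_counts()


def series_to_json_dict_sketch(stats):
    # Flatten the stats Series into a JSON-safe dict for storage.
    return {str(key): float(value) for key, value in stats.items()}


def summarize_df_sketch(dframe):
    # summarize_df applies the above to every column in the dframe.
    return {name: series_to_json_dict_sketch(summarize_series_sketch(dframe[name]))
            for name in dframe.columns}
```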