Process Workflow
This document describes what tasks bamboo performs for each of the requests listed.
Description: Load a dataset into bamboo. This request returns the id of the new dataset, loads the data into the database, and calculates the summary statistics for the data.
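For orientation, here is a minimal sketch of issuing this request from Python. The host and port are hypothetical; the /datasets endpoint and the csv_file multipart field follow bamboo's REST API.

```python
import requests

# Hypothetical local bamboo instance; the /datasets endpoint and the
# csv_file multipart field follow bamboo's REST API.
with open('data.csv', 'rb') as f:
    response = requests.post('http://localhost:8080/datasets',
                             files={'csv_file': f})

# The id comes back immediately; the import and summary calculation
# continue asynchronously on the server, as the steps below describe.
print(response.json()['id'])
```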
- [controllers/datasets.py:create] request is handled by Datasets controller create method
- [controllers/datasets.py:create] if data is present, create a new Dataset object and save it (just initial metadata)
- [lib/io.py:import_from_csv] run corresponding import_from_* method (e.g. import_from_csv)
- [lib/io.py:import_from_csv] import_from_csv writes the csv file out to a tmp file (pandas read_csv requires a closed file; see the import sketch after this list)
- [lib/io.py:import_from_csv] call_async import_dataset, return dataset
- [controllers/datasets.py:create] returns id in response
- [lib/io.py:import_dataset] file_reader/read_csv into BambooFrame, delete tmp file
- [lib/io.py:import_dataset] calls save_observations which calls Observation.save()
- [models/observation.py:save] calls build_schema
- [models/dataset.py:build_schema] calls the schema property of the dataset object, which reads the schema from dataset.record and returns Schema.safe_init on it
- [lib/schema_builder.py:safe_init] returns empty Schema
- [lib/schema_builder.py:schema_from_dframe] gets the dtypes from the dframe as a dict
- [lib/schema_builder.py:schema_from_dframe] loops through the names in dtypes, adds non-reserved names to column_names, and slugs the columns into encoded_names
- [lib/schema_builder.py:schema_from_dframe] makes a blank Schema obj, loops through (name, dtype) in dtypes, sets the column schema (label, olap_type, simpletype, cardinality) for each, adds each column schema to the main schema under the encoded column name, and returns new_schema (see the schema sketch after this list)
- [models/dataset.py:build_schema] self.set_schema to new_schema, updates the dataset record in database
- [models/observation.py:save] batch_save the result of dataset.encode_dframe_columns
- [models/dataset.py:encode_dframe_columns] adds a dataset_id column to the dframe, returning a new BambooFrame (see the encoding sketch after this list)
- [models/abstract_model.py:batch_save] calls through to _batch_command
- [models/abstract_model.py:_batch_command] constructs records from the dframe like so (see the batch-insert sketch after this list):
  records = [row.to_dict() for (_, row) in dframe[start:end].iterrows()]
- [models/abstract_model.py:_batch_command] runs the command (insert) on the records
- [models/observation.py:save] update dataset in db with rows/cols and state = ready
- [models/observation.py:save] calls dataset.summarize()
- [models/dataset.py:summarize] calls self.reload() (is this necessary?)
- [models/dataset.py:reload] gets the dataset record from the db and sets it to self.dataset (also clears _dframe cache, do we need to?)
- [models/dataset.py:summarize] calls summarize on dframe
- [core/summary.py:summarize] checks cached stats and calls through to summarize_df (since no groups)
- [core/summary.py:summarize_df] uses series_to_json_dict on summarize_series to set the summary for each column in the dframe (see the summary sketch after this list)
- [core/summary.py:summarize] save summary in dataset record using dict_for_mongo
- return back up through the call stack and end the task
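The sketches below illustrate a few of the steps above. Every *_sketch name is hypothetical and the bodies are approximations, not bamboo's actual code. First, the import sketch, showing the tmp-file dance in import_from_csv: the upload is written to a closed tmp file because read_csv needs a complete, closed file.

```python
import os
import tempfile

import pandas as pd


def import_from_csv_sketch(file_obj):
    # Write the upload out to a closed tmp file first, since pandas'
    # read_csv requires a complete, closed file.
    fd, tmp_path = tempfile.mkstemp(suffix='.csv')
    with os.fdopen(fd, 'wb') as tmp:
        tmp.write(file_obj.read())
    try:
        # The file_reader/read_csv step in import_dataset.
        dframe = pd.read_csv(tmp_path)
    finally:
        # Delete the tmp file once the dframe is in memory.
        os.unlink(tmp_path)
    return dframe
```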
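The schema sketch approximates schema_from_dframe. The dtype-to-simpletype mapping, the slug regex, and the measure/dimension rule are assumptions about what lib/schema_builder.py actually does, not copied from it.

```python
import re

# Assumed mapping from numpy dtype kinds to bamboo simpletypes.
DTYPE_TO_SIMPLETYPE = {'i': 'integer', 'f': 'float', 'b': 'boolean', 'O': 'string'}
RESERVED_NAMES = {'_id', 'dataset_id'}


def slugify(label):
    # Slug a column label into a safe encoded name (encoded_names).
    return re.sub(r'\W+', '_', label.strip().lower())


def schema_from_dframe_sketch(dframe):
    new_schema = {}
    for name, dtype in dframe.dtypes.items():
        if name in RESERVED_NAMES:
            continue
        simpletype = DTYPE_TO_SIMPLETYPE.get(dtype.kind, 'string')
        new_schema[slugify(name)] = {
            'label': name,
            'simpletype': simpletype,
            # Assumed rule: numeric columns are measures, the rest dimensions.
            'olap_type': 'measure' if simpletype in ('integer', 'float') else 'dimension',
            'cardinality': int(dframe[name].nunique()),
        }
    return new_schema
```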
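The encoding sketch: encode_dframe_columns boils down to adding the dataset_id column. The rationale offered in the comment (letting observations from many datasets share one collection) is an inference, not something stated in the walkthrough above.

```python
def encode_dframe_columns_sketch(dframe, dataset_id):
    # Tag every row with the owning dataset's id so observations from
    # many datasets can share one collection (assumed rationale).
    dframe = dframe.copy()
    dframe['dataset_id'] = dataset_id
    return dframe
```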
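The batch-insert sketch is built around the records line quoted in the list above. The batch size is arbitrary here, and insert_many is modern pymongo; bamboo's 2013-era code would have used a different insert call.

```python
def batch_command_sketch(collection, dframe, batch_size=1000):
    # Insert the dframe in chunks: slice it, convert each row to a dict,
    # and hand the whole batch to Mongo in one call.
    for start in range(0, len(dframe), batch_size):
        end = start + batch_size
        records = [row.to_dict() for (_, row) in dframe[start:end].iterrows()]
        collection.insert_many(records)
```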
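Finally, the summary sketch. The numeric-vs-categorical split in summarize_series_sketch is an assumption about core/summary.py; series_to_json_dict's job of producing a JSON/Mongo-safe dict is as described above.

```python
import pandas as pd


def summarize_series_sketch(series):
    # Assumed split: descriptive stats for numeric columns, frequency
    # counts for everything else.
    if pd.api.types.is_numeric_dtype(series):
        return series.describe()
    return series.value_counts()


def series_to_json_dict_sketch(stats):
    # Flatten the stats Series into a JSON-safe dict for storage.
    return {str(key): float(value) for key, value in stats.items()}


def summarize_df_sketch(dframe):
    # summarize_df applies the above to every column in the dframe.
    return {name: series_to_json_dict_sketch(summarize_series_sketch(dframe[name]))
            for name in dframe.columns}
```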