API Documentation
This document describes the basic functionality in the library and provides simple usage examples for the APIs. More detailed options are documented in the source code.
The APIs are divided into three groups:
- Data Ingestion: Provided through static methods in the tsv package.
- Data Transformation: Available in the tsv.TSV class.
- Visualization and Advanced Functionality: Provided through separate extension packages.
Existing libraries for data analysis were designed at least 5-10 years ago, when datasets usually had 10-20 columns. Modern datasets have columns in the 100s or even 1000s. This becomes even more evident when datasets are joined for enrichment and new features are created, adding more and more columns.
This library provides simple APIs to work with such wide-column datasets. Methods like select(), drop(), and sort() accept regular expressions for specifying the list of columns.
For example, selecting all columns that start with the prefix petal from the iris dataset:
>>> x.select("petal.*")
A single column, a set of columns, and/or regular expressions can be specified wherever applicable. Another example, mixing and matching column names and regular expressions:
>>> x.select(["petal_length", "sepal.*"])
>>> x.drop("petal.*")
All advanced APIs like aggregate() follow a simple naming convention, creating new column names with appropriate prefixes or suffixes to help with selecting them in groups.
>>> x.aggregate("class", ["petal_length"], [min]).show()
class petal_length:min
Iris-setosa 1.000
Iris-versicolor 3.000
Iris-virginica 4.500
- tsv.read(paths): Reads the data present in the given list of file paths and URLs.
- tsv.write(tsv_obj, path): Writes the tsv object to the specified path.
- tsv.merge(tsv_objs): Merges the list of tsv objects into one.
- tsv.exists(path): Checks whether the specified path exists. Useful for implementing fast-forwarding.
- tsv.from_df(df): Converts a pandas dataframe to TSV.
- tsv.from_maps(mps): Converts the given array of maps to TSV.
Supported File Formats: tsv, simple csv and gzip/zip compressed versions.
Examples
>>> x = tsv.read("data/iris.tsv")
>>> x = tsv.read("s3://bucket-name/path/iris.tsv")
>>> x = tsv.read(["data/file1.tsv.gz", "data/file2.tsv.gz"]
>>> x = tsv.read("https://github.com/CrowdStrike/tsv-data-analytics/raw/main/data/iris.tsv")
Saving TSV object to file
>>> tsv.write(x, "data/output_file.tsv.gz")
Merging multiple TSV objects
>>> ytsv = tsv.merge([x1, x2])
Checking for existence of file path
>>> flag = tsv.exists("data/iris.tsv")
>>> flag = tsv.exists("s3://bucket-name/path/iris.tsv")
Converting pandas dataframe to TSV
>>> import pandas as pd
>>> df = pd.read_csv("data/iris.csv")
>>> x = tsv.from_df(df)
>>> x.show(3)
sepal_length sepal_width petal_length petal_width class
5.1 3.5 1.4 0.2 Iris-setosa
4.9 3.0 1.4 0.2 Iris-setosa
4.7 3.2 1.3 0.2 Iris-setosa
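Converting an array of maps to TSV. A minimal sketch of tsv.from_maps(), assuming each map's keys become column names (the values here are illustrative):
>>> mps = [{"sepal_length": "5.1", "class": "Iris-setosa"}, {"sepal_length": "4.9", "class": "Iris-setosa"}]
>>> x = tsv.from_maps(mps)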
These APIs are part of the TSV class. Once the data is loaded as a TSV object, all of these methods can be used.
Note: The optional arguments are shown in italics.
- num_cols(): Returns the number of columns in the tsv object.
- num_rows(): Returns the number of rows in the tsv object.
- columns(): Returns the list of columns in the tsv object.
- size_in_bytes(): Returns the size of the tsv object in bytes.
- show(n, max_col_width, title): Pretty prints the first n rows, with each column up to max_col_width characters wide and an optional title displayed at the top.
- show_transpose(n, title): Transposes the tsv object and pretty prints it.
Examples
>>> x.show(3)
sepal_length sepal_width petal_length petal_width class
5.1 3.5 1.4 0.2 Iris-setosa
4.9 3.0 1.4 0.2 Iris-setosa
4.7 3.2 1.3 0.2 Iris-setosa
>>> x.show_transpose(3)
col_name row:1 row:2 row:3
sepal_length 5.1 4.9 4.7
sepal_width 3.5 3.0 3.2
petal_length 1.4 1.4 1.3
petal_width 0.2 0.2 0.2
class Iris-setosa Iris-setosa Iris-setosa
- select(cols): Selects the given columns, which can be a single column name, a regex pattern, or an array of names and/or patterns.
Examples
>>> x.select(["class", "petal.*"]).show(3)
class petal_length petal_width
Iris-setosa 1.4 0.2
Iris-setosa 1.4 0.2
Iris-setosa 1.3 0.2
- skip(n): Skips the first n rows.
- last(n): Takes the last n rows.
- take(n): Takes the first n rows.
Examples
>>> x.take(2).show()
sepal_length sepal_width petal_length petal_width class
5.1 3.5 1.4 0.2 Iris-setosa
4.9 3.0 1.4 0.2 Iris-setosa
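skip() and last() follow the same call pattern as take(); for example:
>>> x.skip(10).show(2)
>>> x.last(5).show(2)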
- eq_int(col, value): Returns all rows where the int value of col is equal to value.
- ge_int(col, value): Returns all rows where the int value of col is greater than or equal to value.
- gt_int(col, value): Returns all rows where the int value of col is greater than value.
- le_int(col, value): Returns all rows where the int value of col is less than or equal to value.
- lt_int(col, value): Returns all rows where the int value of col is less than value.
- eq_float(col, value): Returns all rows where the float value of col is equal to value.
- ge_float(col, value): Returns all rows where the float value of col is greater than or equal to value.
- gt_float(col, value): Returns all rows where the float value of col is greater than value.
- le_float(col, value): Returns all rows where the float value of col is less than or equal to value.
- lt_float(col, value): Returns all rows where the float value of col is less than value.
- is_nonzero(col): Returns all rows where the float value of col is not zero.
Examples
>>> x.gt_float("petal_length", 1.4)
>>> x.eq_float("sepal_length", 5.1)
- eq_str(col, value): Returns rows where string value of col is equal to value.
- ge_str(col, value): Returns rows where string value of col is greater than or equal to value.
- gt_str(col, value): Returns rows where string value of col is greater than value.
- le_str(col, value): Returns rows where string value of col is less than or equal to value.
- lt_str(col, value): Returns rows where string value of col is less than value.
- startswith(col, value): Returns rows where string value of col starts with value.
- endswith(col, value): Returns rows where string value of col ends with value.
- regex_match(col, value): Returns rows where string value of col matches the regular expression in value.
- not_eq_str(col, value): Returns rows where string value of col is not equal to value.
- not_startswith(col, value): Returns rows where string value of col does not start with value.
- not_endswith(col, value): Returns rows where string value of col does not end with value.
- not_regex_match(col, value): Returns rows where string value of col does not match regular expression in value.
- replace_str_inline(cols, old_str, new_str): Replaces occurrences of the substring old_str with new_str in each of the given cols.
Examples
>>> x.eq_str("class", "Iris-setosa")
>>> x.regex_match("class", ".*setosa")
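The remaining string filters follow the same call pattern, for example:
>>> x.startswith("class", "Iris-")
>>> x.not_eq_str("class", "Iris-setosa")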
- values_in(col, values): Returns rows where the value of col is one of the provided values.
- filter(cols, lambda_func): Returns rows that satisfy the lambda function on the values of the given columns.
- transform(cols, lambda_func, output_cols): Applies the lambda function to the given cols. The lambda function can return single or multiple values. The output_cols should match the list of values returned.
- transform_inline(cols, func): Applies the lambda function to each of the given cols and stores the new values under the same column names.
- values_not_in(col, values): The negation of the values_in() API.
- exclude_filter(cols, lambda_func): The negation of the filter() API.
- ratio(col1, col2, new_col, default = 0): Returns the ratio col1 / col2 as new_col. If the denominator is 0, the default value is returned (example below).
Examples
values_in() to take rows with specific values for column class.
>>> x.values_in("class", ["Iris-setosa", "Iris-versicolor"])
filter() with single and multiple input columns.
>>> x.filter(["sepal_length"], lambda x: float(x) != 0)
>>> x.filter(["petal_length", "petal_width"], lambda x, y: float(x) > 1.4 and float(y) > 2.0)
transform() with multiple input columns.
>>> x.transform(["petal_length", "petal_width"], lambda x,y: float(x) * float(y), "petal_length_and_width").show(3)
sepal_length sepal_width petal_length petal_width class petal_length_and_width
5.1 3.5 1.4000 0.2000 Iris-setosa 0.2800
4.9 3.0 1.4000 0.2000 Iris-setosa 0.2800
4.7 3.2 1.3000 0.2000 Iris-setosa 0.2600
transform() with multiple output columns.
>>> x.transform("petal_length", lambda x: (float(x)*2, float(x)*3), ["petal_length_2x", "petal_length_3x"]).show(3)
sepal_length sepal_width petal_length petal_width class petal_length_2x petal_length_3x
5.1 3.5 1.4000 0.2000 Iris-setosa 2.8000 4.2000
4.9 3.0 1.4000 0.2000 Iris-setosa 2.8000 4.2000
4.7 3.2 1.3000 0.2000 Iris-setosa 2.6000 3.9000
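ratio() from the list above is a convenience wrapper for the common divide-two-columns transform; a minimal sketch using the signature above:
>>> x.ratio("petal_length", "petal_width", "petal_ratio")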
- explode(cols, lambda_func, prefix): This method creates a map of the cols in each row and passes it to the given lambda_func. The output of the lambda_func should be an array of maps, where each output map is added as a new row with its key-values as column values. As the name suggests, this API can explode the number of rows and columns and should be used carefully.
Examples
>>> def my_explode_func(mp):
...     mps = []
...     mps.append({"dim_name": "petal_length", "dim_value": str(mp["petal_length"])})
...     mps.append({"dim_name": "sepal_length", "dim_value": str(mp["sepal_length"])})
...     return mps
>>> x.explode(["petal_length", "sepal_length"], my_explode_func, "len_col").show(3)
sepal_width petal_width class len_col:dim_name len_col:dim_value
3.5 0.2 Iris-setosa petal_length 1.4
3.5 0.2 Iris-setosa sepal_length 5.1
3.0 0.2 Iris-setosa petal_length 1.4
- url_encode(col, new_col): URL-encodes the values in col and creates a new column with the name new_col (examples below).
- url_decode(col, new_col): URL-decodes the values in col and creates a new column with the name new_col.
- sample(sampling_ratio, seed): Randomly samples a sampling_ratio fraction of rows from the data using the given seed.
- sample_n(n, seed): Randomly samples n rows from the data using the given seed.
Examples
>>> x.sample_n(3).show()
sepal_length sepal_width petal_length petal_width class
5.1 2.5 3.0 1.1 Iris-versicolor
7.3 2.9 6.3 1.8 Iris-virginica
5.4 3.7 1.5 0.2 Iris-setosa
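url_encode() and url_decode() from the list above take a source column and a new column name (the new column names here are illustrative):
>>> x.url_encode("class", "class_enc")
>>> x.url_decode("class_enc", "class_dec")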
- sample_class(col, col_value, sampling_ratio, seed): Randomly samples a sampling_ratio fraction of rows, but only from rows where col has the value col_value. Useful for downsampling a specific class only.
- sample_group_by_col_value(grouping_cols, col, col_value, sampling_ratio, seed): Groups data using the grouping_cols, and then samples the given column and value within each group. Useful for downsampling data where a few col values are heavily skewed within specific groups.
- sample_group_by_key(grouping_cols, sampling_ratio): Randomly samples data within each group (example below). Useful for cases where only a few groups have highly skewed data and need to be downsampled.
- sample_group_by_max_uniq_values(grouping_cols, col, max_uniq_values): Samples data for a specific col, but instead of sampling rows at random, samples from the unique values of that column. Useful for scenarios where all the rows matching a sampled column value need to be present.
- sample_group_by_max_uniq_values_per_class(grouping_cols, class_col, col, max_uniq_values_map, def_max_uniq_values, seed): Samples different values of class_col differently, based on the max_uniq_values_map map. This is explained in detail in the [sampling documentation].
Examples
>>> x.sample_class("class", "Iris-setosa", 0.1)
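For the group-wise variants, an illustrative call using the parameters as described above:
>>> x.sample_group_by_key("class", 0.5)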
- aggregate(grouping_cols, agg_cols, agg_funcs, collapse): One of the most useful APIs, aggregating data based on a set of grouping_cols and applying multiple aggregation functions. The agg_cols are the list of columns to which the agg_funcs are applied in a pairwise manner.
- window_aggregate(win_col, agg_cols, agg_funcs, winsize, sliding, collapse): An extension of aggregate() where data slices are created using windows of size winsize over win_col. For each window, agg_funcs are applied to agg_cols. If sliding is true, a sliding-window logic is used (sketch below). Mostly useful for time-series data where win_col is a date or timestamp and moving averages are needed.
- distinct(): Removes all duplicate rows.
Examples
Compute the total sum of petal_length and petal_width in the iris data. Notice the naming convention in the output columns.
>>> x.aggregate("class", ["petal_length", "petal_width"], [sum, sum]).show()
class petal_length:sum petal_width:sum
Iris-setosa 73.20 12.20
Iris-versicolor 213.00 66.30
Iris-virginica 277.60 101.30
Use collapse = False to get all the original rows. Useful for debugging, or chaining multiple aggregate() functions together.
>>> x.aggregate("class", ["petal_length", "petal_width"], [sum, sum], collapse = False).show(3)
sepal_length sepal_width petal_length petal_width class petal_length:sum petal_width:sum
5.10 3.50 1.40 0.20 Iris-setosa 73.20 12.20
4.90 3.00 1.40 0.20 Iris-setosa 73.20 12.20
4.70 3.20 1.30 0.20 Iris-setosa 73.20 12.20
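window_aggregate() follows the same pattern over windows; an illustrative sketch using the parameters listed above, assuming a hypothetical time-series TSV y with date and sales columns:
>>> y.window_aggregate("date", ["sales"], [sum], winsize = 7, sliding = True)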
- explode_json(url_encoded_col, prefix): Provides out-of-the-box support for reading simple JSON blobs and converting them to tabular format for data analysis. If there are lists in different sections of the JSON, the default merging strategy is similar to a cogroup. A more correct way is to use merge_list_method = join, where a Cartesian product will be created. Useful for parsing JSON responses from web services, which are mostly simple in nature, so that the default parser can at least help in looking at the raw data in a simpler way.
A detailed example is provided in the example-notebooks/json-parsing.ipynb notebook.
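A minimal sketch, assuming a hypothetical TSV z with a URL-encoded JSON column named api_response:
>>> z.explode_json("api_response", "resp")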
- join(that, lkeys, rkeys, join_type, lsuffix, rsuffix, default_val, def_val_map): The primary API for joining two TSV objects. lkeys is the list of columns on the left side to join on. If the join columns on the right side have different names, specify them in rkeys. join_type is inner, left, or right. For any outer join, the missing values can either be specified per column in def_val_map or fall back to the global default_val.
- inner_join(that, lkeys, rkeys, lsuffix, rsuffix, default_val, def_val_map): Wrapper over the join() API with join_type = inner.
- outer_join(that, lkeys, rkeys, lsuffix, rsuffix, default_val, def_val_map): Wrapper over the join() API with join_type = outer.
- left_join(that, lkeys, rkeys, lsuffix, rsuffix, default_val, def_val_map): Wrapper over the join() API with join_type = left.
- right_join(that, lkeys, rkeys, lsuffix, rsuffix, default_val, def_val_map): Wrapper over the join() API with join_type = right.
- union(tsv_list): Appends all the TSVs from tsv_list to the current TSV object. The tsv_list can be a single tsv or an array (example below).
Examples
>>> low_size = x.le_float("petal_length", 3)
>>> high_size = x.gt_float("petal_length", 3)
>>> low_size.inner_join(high_size, lkeys = "class", lsuffix = "low", rsuffix = "high") \
.select(["class", "petal_length:.*"]) \
.show(3)
class petal_length:low petal_length:high
Iris-versicolor 3.0 4.7
Iris-versicolor 3.0 4.5
Iris-versicolor 3.0 4.9
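union() simply appends rows; for example, recombining the two subsets from the join example above:
>>> y = low_size.union(high_size)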
- drop(cols): Deletes the given columns from the TSV object. Throws an error if any column or pattern is missing.
- drop_if_exists(cols): Deletes the given columns from the TSV object. Doesn't throw an error if any of the columns or patterns are missing.
- rename(col, new_col): Renames col to new_col.
Examples
Drop columns
>>> x.drop("petal.*").show(3)
sepal_length sepal_width class
5.1 3.5 Iris-setosa
4.9 3.0 Iris-setosa
4.7 3.2 Iris-setosa
Rename column
>>> x.rename("class", "class_label").show(3)
sepal_length sepal_width petal_length petal_width class_label
5.1 3.5 1.4 0.2 Iris-setosa
4.9 3.0 1.4 0.2 Iris-setosa
4.7 3.2 1.3 0.2 Iris-setosa
- rename_suffix(old_suffix, new_suffix): Renames all columns that have the suffix old_suffix to use new_suffix instead.
- rename_prefix(old_prefix, new_prefix): Renames all columns that have the prefix old_prefix to use new_prefix instead (example below).
- remove_suffix(suffix): Renames the columns having the given suffix by removing that suffix from their names.
- remove_prefix(prefix): Renames the columns having the given prefix by removing that prefix from their names.
- add_prefix(prefix, cols): Adds the prefix to all the given cols. If cols = None, the prefix is added to all columns.
- add_suffix(suffix, cols): Adds the suffix to all the given cols. If cols = None, the suffix is added to all columns.
Examples
>>> x.transform_inline("petal.*", lambda x: float(x)*0.4).add_prefix("approx_inches", "petal.*").show(3)
sepal_length sepal_width approx_inches:petal_length approx_inches:petal_width class
5.1 3.5 0.56 0.08 Iris-setosa
4.9 3.0 0.56 0.08 Iris-setosa
4.7 3.2 0.52 0.08 Iris-setosa
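The rename variants follow the same pattern; for example, renaming the prefix added above (the new prefix is illustrative):
>>> y = x.add_prefix("approx_inches", "petal.*")
>>> y.rename_prefix("approx_inches", "inches")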
- sort(cols): Sorts the data using the given columns.
- reverse_sort(cols): A wrapper over sort() for sorting in reverse order (example below).
Examples
>>> x.sort("petal_length").show(3)
sepal_length sepal_width petal_length petal_width class
4.6 3.6 1.0 0.2 Iris-setosa
4.3 3.0 1.1 0.1 Iris-setosa
5.8 4.0 1.2 0.2 Iris-setosa
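reverse_sort() takes the same arguments:
>>> x.reverse_sort("petal_length").show(3)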
- reorder(cols): Reorders the given columns in the TSV object for ease of use in Jupyter notebooks. In case of multiple columns in cols, the original relative ordering is preserved.
Examples
>>> x.reorder("class").show(3)
class sepal_length sepal_width petal_length petal_width
Iris-setosa 5.1 3.5 1.4 0.2
Iris-setosa 4.9 3.0 1.4 0.2
Iris-setosa 4.7 3.2 1.3 0.2
- transpose(n): Transposes the rows and columns. Useful for looking at data with a lot of columns that don't fit into the width of the screen.
Examples
>>> x.transpose(3).show(10)
col_name row:1 row:2 row:3
sepal_length 5.1 4.9 4.7
sepal_width 3.5 3.0 3.2
petal_length 1.4 1.4 1.3
petal_width 0.2 0.2 0.2
class Iris-setosa Iris-setosa Iris-setosa
- extend_class(derived_class, *args, **kwargs): An advanced function to plug in extensions and other 3rd-party modules. For more details, see the example-notebooks/extend-class example.
- to_json_records(): Converts each row into a JSON object (map). Each row in the output is a JSON string.
- to_csv(): Converts the TSV into simple CSV format, which means that commas (,) and double quotes are treated as special characters and are not supported within fields.
- to_df(): Converts the TSV object into a pandas dataframe (example below).
- to_maps(): Converts each row into a map and returns the list of those maps.
Examples
Convert each record into json
>>> x.to_json_records().show(3, max_col_width = 200)
json
{"sepal_length": "5.1", "sepal_width": "3.5", "petal_length": "1.4", "petal_width": "0.2", "class": "Iris-setosa"}
{"sepal_length": "4.9", "sepal_width": "3.0", "petal_length": "1.4", "petal_width": "0.2", "class": "Iris-setosa"}
{"sepal_length": "4.7", "sepal_width": "3.2", "petal_length": "1.3", "petal_width": "0.2", "class": "Iris-setosa"}
- col_as_array(col): Returns all the values of the given col as a string array.
- col_as_array_uniq(col): Returns all the unique values of the given col as a string array.
- col_as_float_array(col): Returns all the values of the given col as a float array.
- col_as_int_array(col): Returns all the values of the given col as an int array.
Examples
>>> x.col_as_array_uniq("class")
['Iris-setosa', 'Iris-versicolor', 'Iris-virginica']
>>> x.col_as_float_array("petal_length")[0:4]
[1.4, 1.4, 1.3, 1.5]
- add_row(row_fields): Adds all the values in row_fields as column values, in the given order, to the current TSV.
- add_map_as_row(mp, default_val): Adds the keys and values of map mp as column names and values to the current TSV. For any missing column, default_val is used if defined; otherwise an error is thrown.
- add_const(col, value): Adds a new column col with the given value (example below).
- add_const_if_missing(col, value): Adds a new column col with the given value, but only if the column is not already present.
- add_seq_num(col): Assigns a unique sequence number to each row under the name col.
- replicate_rows(col): Reads the value of column col and replicates each row according to that value.
Examples
>>> x.add_seq_num("sno").show(3)
sno sepal_length sepal_width petal_length petal_width class
1 5.1 3.5 1.4 0.2 Iris-setosa
2 4.9 3.0 1.4 0.2 Iris-setosa
3 4.7 3.2 1.3 0.2 Iris-setosa
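add_const() is handy for tagging rows; for example (the column name and value are illustrative):
>>> x.add_const("dataset", "iris").show(3)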
- assign_value(col, value): Assigns a constant value to an existing column.
- copy(col, new_col): Copies the column col to new_col (example below).
- set_missing_values(cols, default_val): Sets the value of each column specified in cols to default_val wherever its value is empty.
Examples
>>> x.assign_value("petal_length", "1.0").show(3)
sepal_length sepal_width petal_length petal_width class
5.1 3.5 1.0 0.2 Iris-setosa
4.9 3.0 1.0 0.2 Iris-setosa
4.7 3.2 1.0 0.2 Iris-setosa
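copy() and set_missing_values() chain the same way (the new column name and default value are illustrative):
>>> x.copy("class", "class_copy").set_missing_values(["class_copy"], "unknown")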
Any functionality that needs extensive 3rd-party libraries like matplotlib, seaborn, or scikit-learn is provided as extension packages. Not all extension packages may be relevant for all users; please refer to the documentation of the individual packages for further details. Some basic visualization APIs are listed below.
These plots are best described in the [example-notebooks/graphs].
- linechart
- scatterplot
- histogram
- density
- barchart
- boxplot
- corr_heatmap
- pairplot
This is work in progress.
There are certain flags that enable or disable internal log messages. These are meant for advanced users only.
- tsv.enable_debug_mode(): Enables debug mode.
- tsv.disable_debug_mode(): Disables debug mode.
- tsv.set_report_progress_perc(perc): Displays progress messages each time an operation completes a multiple of perc percent of the rows.
- tsv.set_report_progress_min_thresh(thresh): Enables progress reporting only for operations where the number of rows exceeds thresh.
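For example, to turn on verbose progress reporting for large operations:
>>> tsv.enable_debug_mode()
>>> tsv.set_report_progress_perc(10)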