API Documentation

Amit Jaiswal edited this page Jan 22, 2022 · 18 revisions

Overview

This document describes basic functionality in the library, and provides simple usage of the APIs. More detailed options are documented in the source code.

The APIs are divided into three groups:

  1. Data Ingestion: This is provided through static methods available in the tsv package.

  2. Data Transformation: These are available in the tsv.TSV class.

  3. Visualization and Advanced Functionalities: These are part of different extension packages.

Working with Data Having 100s of Columns

Existing libraries for data analysis were designed at least 5-10 years ago, when datasets usually had 10-20 columns. Modern datasets have columns in the 100s or even 1000s. This becomes more evident when datasets are joined for enrichment and new features are created, adding more and more columns.

This library provides simple APIs to work with such wide-column datasets. Methods like select(), drop(), sort() take regular expressions for specifying the list of columns.

For example, selecting all columns that start with the prefix petal from the iris dataset:

>>> x.select("petal.*") 

Wherever applicable, a single column, a set of columns, regular expressions, or any mix of these can be specified. Another example mixing column names and regular expressions:

>>> x.select(["petal_length", "sepal.*"])
>>> x.drop("petal.*")

All advanced APIs like aggregate() follow a simple naming convention, creating new column names with appropriate prefixes or suffixes to help with selecting them in groups.

>>> x.aggregate("class", ["petal_length"], [min]).show()
class          	petal_length:min
Iris-setosa    	           1.000
Iris-versicolor	           3.000
Iris-virginica 	           4.500 

API Documentation

A. Data Ingestion

Read and Write from Local FileSystem, S3 or Web

  • tsv.read(paths): Reads the data present in the list of file paths and urls.
  • tsv.write(tsv_obj, path): Writes the tsv object to the specified path.
  • tsv.merge(tsv_objs): Merges the list of tsv objects into one.
  • tsv.exists(path): Checks whether the specified path exists. Useful for implementing fast-forwarding.
  • tsv.from_df(df): Converts a pandas dataframe to TSV.
  • tsv.from_maps(mps): Converts the given array of maps to TSV.

Supported File Formats: tsv, simple csv and gzip/zip compressed versions.

Examples

>>> x = tsv.read("data/iris.tsv")
>>> x = tsv.read("s3://bucket-name/path/iris.tsv")
>>> x = tsv.read(["data/file1.tsv.gz", "data/file2.tsv.gz"])
>>> x = tsv.read("https://github.com/CrowdStrike/tsv-data-analytics/raw/main/data/iris.tsv")

Saving TSV object to file

>>> tsv.write(x, "data/output_file.tsv.gz")

Merging multiple TSV objects

>>> ytsv = tsv.merge([x1, x2])

Checking for existence of file path

>>> flag = tsv.exists("data/iris.tsv")
>>> flag = tsv.exists("s3://bucket-name/path/iris.tsv")

Converting pandas dataframe to TSV

>>> import pandas as pd
>>> df = pd.read_csv("data/iris.csv")
>>> x = tsv.from_df(df)
>>> x.show(3)

sepal_length	sepal_width	petal_length	petal_width	class      
5.1         	        3.5	         1.4	        0.2	Iris-setosa
4.9         	        3.0	         1.4	        0.2	Iris-setosa
4.7         	        3.2	         1.3	        0.2	Iris-setosa

B. Data Transformation and Analysis

These APIs are part of the TSV class. Once the data is loaded as a TSV, all these methods can be used.

Note: The optional arguments are shown in italics.

1. Basic Summary

  • num_cols(): Returns the number of columns in the tsv object.
  • num_rows(): Returns the number of rows in the tsv object.
  • columns(): Returns the list of columns in the tsv object.
  • size_in_bytes(): Returns the size of the tsv object in bytes.

2. Pretty Print

  • show(n, max_col_width, title): Pretty prints the first 'n' rows, with each column up to max_col_width characters wide, and an optional title displayed at the top.
  • show_transpose(n, title): Transposes the tsv object and pretty prints it.

Examples

>>> x.show(3)
>>>
sepal_length	sepal_width	petal_length	petal_width	class      
5.1         	        3.5	         1.4	        0.2	Iris-setosa
4.9         	        3.0	         1.4	        0.2	Iris-setosa
4.7         	        3.2	         1.3	        0.2	Iris-setosa
>>> x.show_transpose(3)
>>>
col_name    	row:1      	row:2      	row:3      
sepal_length	5.1        	4.9        	4.7        
sepal_width 	3.5        	3.0        	3.2        
petal_length	1.4        	1.4        	1.3        
petal_width 	0.2        	0.2        	0.2        
class       	Iris-setosa	Iris-setosa	Iris-setosa

3. Select Columns

  • select(cols): Selects the given columns, which can be a single column, a pattern, or an array of columns.

Examples

>>> x.select(["class", "petal.*"]).show(3)

class      	petal_length	petal_width
Iris-setosa	         1.4	        0.2
Iris-setosa	         1.4	        0.2
Iris-setosa	         1.3	        0.2

4. Select Rows Slice

  • skip(n): Skips the first n rows.
  • last(n): Takes the last n rows.
  • take(n): Takes the first n rows.

Examples

>>> x.take(2).show()

sepal_length	sepal_width	petal_length	petal_width	class      
5.1         	        3.5	         1.4	        0.2	Iris-setosa
4.9         	        3.0	         1.4	        0.2	Iris-setosa
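
skip() and last() follow the same pattern as take(), and can be chained, e.g. to page through rows (a minimal sketch based on the signatures above):

>>> x.skip(10).take(2).show()
>>> x.last(5).show()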

5. Arithmetic Comparison

  • eq_int(col, value): Returns all rows where the int value of col is equal to value.
  • ge_int(col, value): Returns all rows where the int value of col is greater than or equal to value.
  • gt_int(col, value): Returns all rows where the int value of col is greater than value.
  • le_int(col, value): Returns all rows where the int value of col is less than or equal to value.
  • lt_int(col, value): Returns all rows where the int value of col is less than value.
  • eq_float(col, value): Returns all rows where the float value of col is equal to value.
  • ge_float(col, value): Returns all rows where the float value of col is greater than or equal to value.
  • gt_float(col, value): Returns all rows where the float value of col is greater than value.
  • le_float(col, value): Returns all rows where the float value of col is less than or equal to value.
  • lt_float(col, value): Returns all rows where the float value of col is less than value.
  • is_nonzero(col): Returns all rows where the float value of col is not zero.

Examples

>>> x.gt_float("petal_length", 1.4)
>>> x.eq_float("sepal_length", 5.1)

6. String Comparison

  • eq_str(col, value): Returns rows where string value of col is equal to value.
  • ge_str(col, value): Returns rows where string value of col is greater than or equal to value.
  • gt_str(col, value): Returns rows where string value of col is greater than value.
  • le_str(col, value): Returns rows where string value of col is less than or equal to value.
  • lt_str(col, value): Returns rows where string value of col is less than value.
  • startswith(col, value): Returns rows where string value of col starts with value.
  • endswith(col, value): Returns rows where string value of col ends with value.
  • regex_match(col, value): Returns rows where string value of col matches the regular expression in value.
  • not_eq_str(col, value): Returns rows where string value of col is not equal to value.
  • not_startswith(col, value): Returns rows where string value of col does not start with value.
  • not_endswith(col, value): Returns rows where string value of col does not end with value.
  • not_regex_match(col, value): Returns rows where string value of col does not match the regular expression in value.
  • replace_str_inline(cols, old_str, new_str): Replaces occurrences of the substring old_str with new_str in each of the given cols.

Examples

>>> x.eq_str("class", "Iris-setosa")
>>> x.regex_match("class", ".*setosa")

7. Basic Filtering and Transformation

  • values_in(col, values): Returns rows where the value of the col is one of the provided values.
  • filter(cols, lambda_func): Returns rows that satisfy lambda function on the values of the given columns.
  • transform(cols, lambda_func, output_cols): Applies lambda function to the given cols. The lambda function can return single or multiple values. The output_cols should match the list of values in the output.
  • transform_inline(cols, func): This applies the lambda function on each of the given cols and returns new values under the same column names.
  • values_not_in(col, values): This is the negation of the values_in() API.
  • exclude_filter(cols, lambda_func): This is the negation of the filter() API.
  • ratio(col1, col2, new_col, default = 0): Returns the ratio col1 / col2 as new_col. If the denominator is 0, then the default value is returned.

Examples

values_in() to select rows with specific values in the column class.

>>> x.values_in("class", ["Iris-setosa", "Iris-versicolor"])

filter() with single and multiple input columns.

>>> x.filter(["sepal_length"], lambda x: float(x) != 0)
>>> x.filter(["petal_length", "petal_width"], lambda x, y: float(x) > 1.4 and float(y) > 2.0)

transform() with multiple input columns.

>>> x.transform(["petal_length", "petal_width"], lambda x,y: float(x) * float(y), "petal_length_and_width").show(3)

sepal_length	sepal_width	petal_length	petal_width	class      	petal_length_and_width
5.1         	        3.5	      1.4000	     0.2000	Iris-setosa	                0.2800
4.9         	        3.0	      1.4000	     0.2000	Iris-setosa	                0.2800
4.7         	        3.2	      1.3000	     0.2000	Iris-setosa	                0.2600

transform() with multiple output columns.

>>> x.transform("petal_length", lambda x: (float(x)*2, float(x)*3), ["petal_length_2x", "petal_length_3x"]).show(3)

sepal_length	sepal_width	petal_length	petal_width	class      	petal_length_2x	petal_length_3x
5.1         	        3.5	      1.4000	     0.2000	Iris-setosa	         2.8000	         4.2000
4.9         	        3.0	      1.4000	     0.2000	Iris-setosa	         2.8000	         4.2000
4.7         	        3.2	      1.3000	     0.2000	Iris-setosa	         2.6000	         3.9000
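
ratio() is not shown above; here is a minimal sketch based on its signature, deriving a new column from two existing ones (the output column name petal_ratio is illustrative):

>>> x.ratio("petal_length", "petal_width", "petal_ratio", default = 0).show(3)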

8. Advanced Filtering and Transformation

  • explode(cols, lambda_func, prefix): This method creates a map of the cols in each row and passes it to the given lambda_func. The output of the lambda_func should be an array of maps, where each output map is added as a new row with its key-values as column values. As the name suggests, this API can explode the number of rows and columns, and should be used carefully.

Examples

>>> def my_explode_func(mp):
...     mps = []
...     mps.append({"dim_name": "petal_length", "dim_value": str(mp["petal_length"])})
...     mps.append({"dim_name": "sepal_length", "dim_value": str(mp["sepal_length"])})
...     return mps
>>>
>>> x.explode(["petal_length", "sepal_length"], my_explode_func, "len_col").show(3)

sepal_width	petal_width	class      	len_col:dim_name	len_col:dim_value
3.5        	        0.2	Iris-setosa	petal_length    	              1.4
3.5        	        0.2	Iris-setosa	sepal_length    	              5.1
3.0        	        0.2	Iris-setosa	petal_length    	              1.4

9. URL Encoding and Decoding

  • url_encode(col, new_col): URL encodes the values in col and creates a new column named new_col.
  • url_decode(col, new_col): URL decodes the values in col and creates a new column named new_col.
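
A minimal sketch based on the signatures above (for the iris columns the values contain no special characters, so encoding mostly passes them through unchanged):

>>> y = x.url_encode("class", "class_enc")
>>> y = y.url_decode("class_enc", "class_dec")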

10. Sampling Rows

  • sample(sampling_ratio, seed): Randomly samples a sampling_ratio fraction of rows from the data using the given seed.
  • sample_n(n, seed): Randomly samples n rows from the data using the given seed.

Examples

>>> x.sample_n(3).show()

sepal_length	sepal_width	petal_length	petal_width	class          
5.1         	        2.5	         3.0	        1.1	Iris-versicolor
7.3         	        2.9	         6.3	        1.8	Iris-virginica 
5.4         	        3.7	         1.5	        0.2	Iris-setosa 

11. Sampling Groups

  • sample_class(col, col_value, sampling_ratio, seed): This API randomly samples a sampling_ratio fraction of the rows that have the value col_value in column col. Useful for downsampling a specific class only.
  • sample_group_by_col_value(grouping_cols, col, col_value, sampling_ratio, seed): This API groups data using the grouping_cols, and then samples the given column and its value within each group. Useful for downsampling data where a few col values are heavily skewed within specific groups.
  • sample_group_by_key(grouping_cols, sampling_ratio): This API randomly samples data within each group. Useful for cases where only a few groups have highly skewed data and need to be downsampled.
  • sample_group_by_max_uniq_values(grouping_cols, col, max_uniq_values): This API samples data for a specific col, but instead of sampling rows randomly, it samples from the set of unique values, keeping all rows that match a sampled value. Useful for scenarios where all the rows matching a specific column value need to be present.
  • sample_group_by_max_uniq_values_per_class(grouping_cols, class_col, col, max_uniq_values_map, def_max_uniq_values, seed): This API samples different values of class_col differently based on the max_uniq_values_map map. This is explained in detail in the [sampling documentation].

Examples

>>> x.sample_class("class", "Iris-setosa", 0.1)

12. Grouping and Aggregation

  • aggregate(grouping_cols, agg_cols, agg_funcs, collapse): This is one of the most useful APIs for aggregating data based on a set of grouping_cols and applying multiple aggregation functions. agg_cols is the list of columns to which agg_funcs are applied in a pairwise manner.
  • window_aggregate(win_col, agg_cols, agg_funcs, winsize, sliding, collapse): This API is an extension of aggregate() where data slices are created using windows of size winsize. For each window, agg_funcs are applied to agg_cols. If sliding is true, then sliding-window logic is used. Mostly useful for time series data where win_col is a date or timestamp and moving averages are needed.
  • distinct(): This API removes all duplicate rows.

Examples

Compute the total sum of petal_length and petal_width in the iris data. Note the naming convention in the output columns.

>>> x.aggregate("class", ["petal_length", "petal_width"], [sum, sum]).show()

class          	petal_length:sum	petal_width:sum
Iris-setosa    	           73.20	          12.20
Iris-versicolor	          213.00	          66.30
Iris-virginica 	          277.60	         101.30

Use collapse = False to keep all the original rows. Useful for debugging, or for chaining multiple aggregate() calls together.

>>> x.aggregate("class", ["petal_length", "petal_width"], [sum, sum], collapse = False).show(3)

sepal_length	sepal_width	petal_length	petal_width	class      	petal_length:sum	petal_width:sum
5.10        	       3.50	        1.40	       0.20	Iris-setosa	           73.20	          12.20
4.90        	       3.00	        1.40	       0.20	Iris-setosa	           73.20	          12.20
4.70        	       3.20	        1.30	       0.20	Iris-setosa	           73.20	          12.20
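
window_aggregate() is sketched below for a hypothetical time-series TSV xt with date and price columns (these names are illustrative); a 7-row sliding window computes min and max over each window:

>>> xt.window_aggregate("date", ["price"], [min, max], winsize = 7, sliding = True)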

13. Generic JSON Parsing

  • explode_json(url_encoded_col, prefix): This API provides out-of-the-box support for reading simple JSON blobs and converting them to tabular format for data analysis. If there are lists in different sections of the JSON, the default merging strategy is similar to a cogroup. A more correct way is to use merge_list_method = join, which creates a cartesian product. Useful for parsing JSON responses from web services, which are mostly simple in nature; the default parser can at least help in looking at the raw data in a simpler way.

A detailed example is provided in the example-notebooks/json-parsing.ipynb notebook.

14. Join and Union

  • join(that, lkeys, rkeys, join_type, lsuffix, rsuffix, default_val, def_val_map): This is the primary API for joining two TSV objects. lkeys is the list of columns on the left side to use for joining. If the join columns on the right side have different names, then specify them in rkeys. join_type is inner, left or right. For outer joins, missing values can either be specified per column in def_val_map, or fall back to the global default_val.
  • inner_join(that, lkeys, rkeys, lsuffix, rsuffix, default_val, def_val_map): Wrapper over the join() API with join_type = inner.
  • outer_join(that, lkeys, rkeys, lsuffix, rsuffix, default_val, def_val_map): Wrapper over the join() API with join_type = outer.
  • left_join(that, lkeys, rkeys, lsuffix, rsuffix, default_val, def_val_map): Wrapper over the join() API with join_type = left.
  • right_join(that, lkeys, rkeys, lsuffix, rsuffix, default_val, def_val_map): Wrapper over the join() API with join_type = right.
  • union(tsv_list): This API appends all the TSVs from tsv_list to the current TSV object. tsv_list can be a single TSV or an array.

Examples

>>> low_size = x.le_float("petal_length", 3)
>>> high_size = x.gt_float("petal_length", 3)
>>> low_size.inner_join(high_size, lkeys = "class", lsuffix = "low", rsuffix = "high") \
    .select(["class", "petal_length:.*"]) \
    .show(3)

class          	petal_length:low	petal_length:high
Iris-versicolor	             3.0	              4.7
Iris-versicolor	             3.0	              4.5
Iris-versicolor	             3.0	              4.9
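
union() can stitch the two halves back together; reusing low_size and high_size from the join example above (a sketch, assuming both have identical columns):

>>> combined = low_size.union(high_size)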

15. Drop and Rename Columns

  • drop(cols): This API deletes the given columns from the TSV object. Throws an error if any of the columns or patterns are missing.
  • drop_if_exists(cols): This API deletes the given columns from the TSV object. Doesn't throw an error if any of the columns or patterns are missing.
  • rename(col, new_col): This API renames the column col to new_col.

Examples

Drop columns

>>> x.drop("petal.*").show(3)

sepal_length	sepal_width	class      
5.1         	        3.5	Iris-setosa
4.9         	        3.0	Iris-setosa
4.7         	        3.2	Iris-setosa

Rename column

>>> x.rename("class", "class_label").show(3)

sepal_length	sepal_width	petal_length	petal_width	class_label
5.1         	        3.5	         1.4	        0.2	Iris-setosa
4.9         	        3.0	         1.4	        0.2	Iris-setosa
4.7         	        3.2	         1.3	        0.2	Iris-setosa

16. Add or Rename Column Prefix and Suffix

  • rename_suffix(old_suffix, new_suffix): This API renames all columns that have the suffix old_suffix to use new_suffix.
  • rename_prefix(old_prefix, new_prefix): This API renames all columns that have the prefix old_prefix to use new_prefix.
  • remove_suffix(suffix): This API renames the columns having the suffix suffix by removing the suffix from their names.
  • remove_prefix(prefix): This API renames the columns having the prefix prefix by removing the prefix from their names.
  • add_prefix(prefix, cols): This API adds prefix to all the given cols. If cols = None, then the prefix is added to all columns.
  • add_suffix(suffix, cols): This API adds suffix to all the given cols. If cols = None, then the suffix is added to all columns.

Examples

>>> x.transform_inline("petal.*", lambda x: float(x)*0.4).add_prefix("approx_inches", "petal.*").show(3)

sepal_length	sepal_width	approx_inches:petal_length	approx_inches:petal_width	class      
5.1         	        3.5	                      0.56	                     0.08	Iris-setosa
4.9         	        3.0	                      0.56	                     0.08	Iris-setosa
4.7         	        3.2	                      0.52	                     0.08	Iris-setosa

17. Sorting

  • sort(cols): Sorts the data using the given columns.
  • reverse_sort(cols): This is a wrapper API over sort() for sorting in reverse order.

Examples

>>> x.sort("petal_length").show(3)

sepal_length	sepal_width	petal_length	petal_width	class      
4.6         	        3.6	         1.0	        0.2	Iris-setosa
4.3         	        3.0	         1.1	        0.1	Iris-setosa
5.8         	        4.0	         1.2	        0.2	Iris-setosa

18. Reorder Columns

  • reorder(cols): This API reorders the columns in the TSV object for ease of use in jupyter notebooks. When cols contains multiple columns, their original relative ordering is preserved.

Examples

>>> x.reorder("class").show(3)

class      	sepal_length	sepal_width	petal_length	petal_width
Iris-setosa	         5.1	        3.5	         1.4	        0.2
Iris-setosa	         4.9	        3.0	         1.4	        0.2
Iris-setosa	         4.7	        3.2	         1.3	        0.2

19. Transpose from Row to Column Format

  • transpose(n): This API transposes the rows and columns. Useful for looking at data with lots of columns that don't fit into the width of the screen.

Examples

>>> x.transpose(3).show(10)
col_name    	row:1      	row:2      	row:3      
sepal_length	5.1        	4.9        	4.7        
sepal_width 	3.5        	3.0        	3.2        
petal_length	1.4        	1.4        	1.3        
petal_width 	0.2        	0.2        	0.2        
class       	Iris-setosa	Iris-setosa	Iris-setosa

20. Extending to Other Derived Classes

  • extend_class(derived_class, *args, **kwargs): This is an advanced function for plugging in extensions and other 3rd-party modules. For more details, see the example-notebooks/extend-class example.

21. Conversion to Other Data Formats

  • to_json_records(): This API converts each row into a JSON object. Each row in the output is a JSON string.
  • to_csv(): This API converts the TSV into simple CSV format, which means special characters like commas (,) and double quotes are not supported within the fields.
  • to_df(): This API converts the TSV object into a pandas dataframe.
  • to_maps(): This API converts each row into a map and returns the list of those maps.

Examples

Convert each record into json

>>> x.to_json_records().show(3, max_col_width = 200)

json
{"sepal_length": "5.1", "sepal_width": "3.5", "petal_length": "1.4", "petal_width": "0.2", "class": "Iris-setosa"}
{"sepal_length": "4.9", "sepal_width": "3.0", "petal_length": "1.4", "petal_width": "0.2", "class": "Iris-setosa"}
{"sepal_length": "4.7", "sepal_width": "3.2", "petal_length": "1.3", "petal_width": "0.2", "class": "Iris-setosa"}

22. Getting Column Values as Arrays

  • col_as_array(col): This API returns all the values of the given col as a string array.
  • col_as_array_uniq(col): This API returns all the unique values of the given col as a string array.
  • col_as_float_array(col): This API returns all the values of the given col as a float array.
  • col_as_int_array(col): This API returns all the values of the given col as an int array.

Examples

>>> x.col_as_array_uniq("class")

['Iris-setosa', 'Iris-versicolor', 'Iris-virginica']

>>> x.col_as_float_array("petal_length")[0:4]

[1.4, 1.4, 1.3, 1.5]

23. Appending Rows and Columns

  • add_row(row_fields): This API adds all the values in row_fields as column values, in the given order, to the current TSV.
  • add_map_as_row(mp, default_val): This API takes all the key-values in map mp as column names and values and adds them as a row to the current TSV. If any column is missing, then default_val is used as the default value if it is defined; otherwise an error is thrown.
  • add_const(col, value): This API adds a new column col with the given constant value.
  • add_const_if_missing(col, value): This API adds a new column col with the given value only if the column is not already present.
  • add_seq_num(col): This API assigns a unique sequence number to each row under the name col.
  • replicate_rows(col): This API reads the value of column col and replicates each row according to that value.

Examples

>>> x.add_seq_num("sno").show(3)

sno	sepal_length	sepal_width	petal_length	petal_width	class      
1  	         5.1	        3.5	         1.4	        0.2	Iris-setosa
2  	         4.9	        3.0	         1.4	        0.2	Iris-setosa
3  	         4.7	        3.2	         1.3	        0.2	Iris-setosa
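
add_const() attaches a constant column; the column name source and its value here are purely illustrative:

>>> x.add_const("source", "iris-v1").show(3)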

24. Static Data Transformations

  • assign_value(col, value): This api assigns a constant value to an existing column.
  • copy(col, new_col): This api copies the column col to new_col.
  • set_missing_values(cols, default_val): This API sets the value of each column specified in cols to default_val wherever its value is empty.

Examples

>>> x.assign_value("petal_length", "1.0").show(3)

sepal_length	sepal_width	petal_length	petal_width	class      
5.1         	        3.5	         1.0	        0.2	Iris-setosa
4.9         	        3.0	         1.0	        0.2	Iris-setosa
4.7         	        3.2	         1.0	        0.2	Iris-setosa

C. Visualization, Statistics and Machine Learning

Any functionality that needs extensive 3rd-party libraries like matplotlib, seaborn or scikit-learn is provided through extension packages. Not all extension packages may be relevant for all users; please refer to the documentation section of the individual packages for further details. Some basic plots for visualization are listed below.

1. Basic Plots

These plots are best described in the [example-notebooks/graphs].

  • linechart
  • scatterplot
  • histogram
  • density
  • barchart
  • boxplot

2. Advanced Plots

  • corr_heatmap
  • pairplot

3. Statistics and Machine Learning

This is work in progress.

D. Enable / Disable Debug Messages

There are certain flags that enable or disable internal log messages. These are meant for advanced users only.

  • tsv.enable_debug_mode(): Enables debug mode.
  • tsv.disable_debug_mode(): Disables debug mode.
  • tsv.set_report_progress_perc(perc): Displays progress as operations complete, in multiples of perc percent.
  • tsv.set_report_progress_min_thresh(thresh): Enables progress reporting only for operations where the number of rows is more than thresh.
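
For example, a sketch of enabling verbose messages around an expensive operation and turning them off afterwards:

>>> tsv.enable_debug_mode()
>>> y = x.sort("petal_length")
>>> tsv.disable_debug_mode()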