- pandas-like data-frame library
- Column built on typed arrays
- tries to be memory efficient
- extensions to arrays
- great for tabular data
- reads data in various formats: CSV, JSON, array of rows, array of columns, JS object, JS Map
- work in progress
See JSDoc-generated API docs see docs.
For more human-friendly docs keep reading.
npm install --save dataf
Run the node REPL.
node
Import the library (make sure it's installed).
const DF = require('dataf')
DF.dataSets
[ 'alcohol.csv', // alcohol consumption math students
'countries.csv', // geopolitical data for all countries
'diabetes.csv',
'food.csv', // food choices
'got.csv', // game of thrones deaths
'happiness.csv', // world happiness 2017
'iris.csv',
'mushrooms.csv',
'pokemon.csv', // stats for all from all generations
'superheros.csv'
... ]
All have been placed in the public domain.
let iris = new DF('iris') // use `let`, you will be re-assigning a lot
NOTE
the lookup of datasets happens by recursive search of each directory in DF.opts.DATASETS
. You can use this and simply df.opts.DATASETS.push(yourDir)
and your dataset will be discoverable. You don't need to specify the extension. .csv
and .json
extensions are appended if not provided (e.g. iris is actually stored in iris.csv
). Dataset files must be in either CSV or JSON formats.
iris.head().print()
// .head(20) for the first 20 rows
// .tail() for last rows
# u8 Id f32 SepalLe... f32 SepalWi... f32 PetalLe... f32 PetalWi... s Species
- ----- -------------- -------------- -------------- -------------- -------------
0 1 5.09 3.50 1.39 0.20 Iris-setos...
1 2 4.90 3.00 1.39 0.20 Iris-setos...
2 3 4.69 3.20 1.29 0.20 Iris-setos...
3 4 4.59 3.09 1.50 0.20 Iris-setos...
4 5 5.00 3.59 1.39 0.20 Iris-setos...
- ----- -------------- -------------- -------------- -------------- -------------
5B 20B 20B 20B 20B NaN
NOTE the data types next to column names and memory indicators for every column.
iris.slice(10, 20).print() // can be .slice(5) for .slice(5, end)
# u8 Id f32 SepalLe... f32 SepalWi... f32 PetalLe... f32 PetalWi... s Species
- ----- -------------- -------------- -------------- -------------- -------------
0 11 5.40 3.70 1.50 0.20 Iris-setos...
1 12 4.80 3.40 1.60 0.20 Iris-setos...
- ----- -------------- -------------- -------------- -------------- -------------
2B 8B 8B 8B 8B NaN
NOTE the library will try to compute the width of each column
We know that there are 6 columns (try running iris.nCols
). To get all column names run:
iris.colNames.print(100) // make sure it prints all
Column s [Id, SepalLengthCm, SepalWidthCm, PetalLengthCm, PetalWidthCm, Species]
If you want to extract a column (Column, see the Column API below) from a data frame try:
iris.Species.print(5) // last column
Column s[Iris-setosa, Iris-setosa, Iris-setosa, Iris-setosa, Iris-setosa, ... 145 more]
Here s
stands for STRING. You may also see: f64
, f32
, i32
, i16
, i8
, u32
, u16
and u8
.
NOTE some column names will have spaces or/and will clash with the API and you will have to use iris.col(2)
OR iris.col('SepalWidthCm')
.
Columns can always be referred to by their index OR name.
Suppose you only want the first couple of columns:
iris.select(0, 1, -2).print(5) // the 1st, 2nd and the 2nd to last
This show the first 5 rows of the new data frame with only: Id
, SepalLength
and PetalWidth
.
# u8 Id f32 SepalLe... f32 PetalWi...
--- ----- -------------- --------------
0 1 5.09 0.20
1 2 4.90 0.20
2 3 4.69 0.20
3 4 4.59 0.20
4 5 5.00 0.20
... ... (145 more) ...
--- ----- -------------- --------------
150B 600B 600B
If you want to select a range of column: e.g. from the 1st to the 3rd try:
iris.sliceCols(0, 2).print(3)
# u8 Id f32 SepalLe... f32 SepalWi...
--- ----- -------------- --------------
0 1 5.09 3.50
1 2 4.90 3.00
2 3 4.69 3.20
... ... (147 more) ...
--- ----- -------------- --------------
150B 600B 600B
This is the same as:
iris.sliceCols('Id', 'SepalWidthCm').print(3)
iris.numeric // all BUT the "Species" column (getter)
iris.nominal // just the "Species" column (getter)
If you want to remove the 2nd and the second to last columns:
iris.drop(1, -2).print(3)
# f32 SepalLe... f32 SepalWi... f32 PetalLe... s Species
--- -------------- -------------- -------------- -------------
0 5.09 3.50 1.39 Iris-setos...
1 4.90 3.00 1.39 Iris-setos...
2 4.69 3.20 1.29 Iris-setos...
... ... (147 more) ... ...
--- -------------- -------------- -------------- -------------
600B 600B 600B NaN
NOTE those operations are not in-place meaning dropping produces a new data frame without specified columns.
Signature: iris.where(val, colId, op)
. Where op is one of {"=" (default), ">", "<", ">=", "<="}
.
iris.Species[0]
'Iris-setosa'
iris.where('Iris-setosa', -1) // -1 for last col
// ... DataFrame with subset of rows with just Iris-setosa
Signature: iris.filter(rowTestFunc)
.
Signature: iris.filter(valTestFunc, colId)
.
iris.where('Iris-setosa', -1)
// OR
iris.filter(species => species === 'Iris-setosa', -1)
iris.val(10, 'Species') // val(rowIdx, colId)
'Iris-setosa'
Accessing a single row:
const row = iris.row(20) // 21st row
[ 21,
5.400000095367432,
3.4000000953674316,
1.7000000476837158,
0.20000000298023224,
'Iris-setosa' ]
const irow = iris.irow(10);
Array.from(irow)
[ 5.400000095367432,
3.700000047683716,
1.5,
0.20000000298023224,
'Iris-setosa' ]
If you want to iterate over all the rows (not efficient) try:
const rowIt = iris.slice(0, 3).rowsIter // (getter)
for (const r of rowIt) {
console.log(r)
}
// you may also iterate over the dataframe (equivalent method)
for (const r of iris) {
console.log(r)
}
[ 1,
5.099999904632568,
3.5,
1.399999976158142,
0.20000000298023224,
'Iris-setosa' ]
[ 2,
4.900000095367432,
3,
1.399999976158142,
0.20000000298023224,
'Iris-setosa' ]
[ 3,
4.699999809265137,
3.200000047683716,
1.2999999523162842,
0.20000000298023224,
'Iris-setosa' ]
Just assign:
// 2nd col
iris[1] = iris[1].map(s => s >= 5 ? 0 : 1)
// equivalent to:
iris.SepalLengthCm = iris.SepalLengthCm.map(s => s >= 5 ? 0 : 1)
NOTE this might have to be dataset[' Col With Spaces'] = newCol
.
Apply function to each element is selected column:
iris.map(-1, label => {
// there is an easier way to do this (see `DataFrame.labelEncode()`)
if (label === 'Iris-versi') {
return 0;
} else if (label === 'Iris-virgi') {
return 1;
} else {
return 2;
}
});
NOTE use iris.map(null, f)
to apply to all columns.
null
means it will be applied to all.
.trunc(colId | null)
.floor(colId | null)
.ceil(colId | null)
.round(colId | null)
.abs(colId | null)
.sqrt(colId | null)
.cbrt(colId | null)
.square(colId | null)
.cube(colId | null)
.add(colId | null, n)
.sub(colId | null, n)
.mul(colId | null, n)
.div(colId | null, n)
It's smart enough to know not to apply them to string columns if they don't
make sense (e.g. .abs()
). String columns are ignored.
iris.rename(0, 'First').rename(-2, 'Second to Last')
// or just
iris.rename(0, 'First', -2, 'Second to Last')
iris.concat(iris) // append all rows (axis 0)
iris.concat(iris, 1) // append all columns (axis 1)
NOTE this library will manage duplicate column names.
iris.concat(iris, 1).colNames
[ 'Id',
'SepalLengthCm',
'SepalWidthCm',
'PetalLengthCm',
'PetalWidthCm',
'Species',
'Id2',
'SepalLengthCm2',
'SepalWidthCm2',
'PetalLengthCm2',
'PetalWidthCm2',
'Species2' ]
iris.appendCol(iris.Id, 'Id2') // .appendCol(col, colName)
iris.shuffle()
iris.reverse()
Both are safe in that the won't modify in place.
Signature: iris.sort(colId, 'asc' (default) | 'des' )
.
iris.sort('SepalWidthCm') // default is iris.sort(colId, 'asc')
iris.sort('SepalWidthCm', 'des') // descending sort
NOTE
constants such as 'des'
are defined in the constants
module which you can import:
const { DataType, LoggingLevel, PrintingPreset, SortingOrder, What } = require('dataf/constants')
MATH
.add()
.sub()
.mul()
.div()
STATS
.min()
.max()
.range()
.mean()
.var()
variance.stdev()
standard deviation.median()
.Q3()
.Q1()
.IQR()
inter-quartile range.skewness()
.kurtosis()
.mad()
mean absolute deviation
E.g.:
iris.IQR()
# s column f32 IQR
- ------------- -------
0 Id 75.00
1 SepalLengt... 1.30
2 SepalWidth... 0.50
3 PetalLengt... 3.50
4 PetalWidth... 1.50
- ------------- -------
NaN 20B
Signatures:
Signature | Description |
---|---|
`.sample(0.15)` | for random 15% of the dataset |
`iris.sample(30)` | for random 30 sample of the dataset |
`iris.sample(0.5, true)` | with replacement (default) |
`iris.sample(100, false)` | **without** replacement |
iris.summary() // this will produce a summary data frame with info for every column
# s column s dtype f32 min f32 max f32 range f32 mean f32 stdev
- ------------- ------- ------- ------- --------- -------- ---------
0 Id u8 1.00 150.00 149.00 75.50 43.30
1 SepalLengt... f32 4.30 7.90 3.59 5.84 0.82
2 SepalWidth... f32 2.00 4.40 2.40 3.05 0.43
3 PetalLengt... f32 1.00 6.90 5.90 3.75 1.75
4 PetalWidth... f32 0.10 2.50 2.40 1.19 0.76
5 Species s NaN NaN NaN NaN NaN
- ------------- ------- ------- ------- --------- -------- ---------
NaN NaN 24B 24B 24B 24B 24B
This is particularly useful for nominal / discrete attributes that take on a
small amount of values. E.g. Gender
is one of {M, F}
or Salary
is one of {Low, Med, High}
.
iris.counts(-1) // for the last column
// iris.ps(-1) // for normalized values
# s Species u8 count
- ------------- --------
0 Iris-setos... 50
1 Iris-versi... 50
2 Iris-virgi... 50
- ------------- --------
NaN 3B
For a correlation of each column with each other column (matrix):
iris.corr(/* `false` to *not* print the first column */)
# s column f64 Id f64 SepalLe... f64 SepalWi... f64 PetalLe... f64 PetalWi...
- ------------- ------ -------------- -------------- -------------- --------------
0 Id 1.00 0.71 -0.39 0.88 0.89
1 SepalLengt... 0.71 1.00 -0.10 0.87 0.81
2 SepalWidth... -0.39 -0.10 1.00 -0.42 -0.35
3 PetalLengt... 0.88 0.87 -0.42 1.00 0.96
4 PetalWidth... 0.89 0.81 -0.35 0.96 1.00
- ------------- ------ -------------- -------------- -------------- --------------
NaN 40B 40B 40B 40B 40B
Other matrix operations:
iris.cov()
iris.dot()
iri.distance()
To remove all rows that have some value:
// from all cols i.e. remove all rows where any of the value is NaN
iris.removeAll(NaN)
// from 1th and 3rd cols and from col 'PetalLengthCm'
iris.removeAll(NaN, 0, 2, 'PetalLengthCm')
iris.kBins('SepalLengthCm'); // 5 bins for this column
iris.kBins(null); // 3 bins for all columns
iris.kBins(2) // 3rd (2 idx) col, 3 bins
.col(2) // select ONLY 3rd column (index is 2), which is of type Column
.print(10)
Column u8[2, 1, 2, 1, 2, 2, 2, 2, 1, 1, ... 40 more]
NOTE this is smart enough only to target numeric attributes so string columns will be ignored (no need to run .numeric
).
Feature selection (i.e. select best columns, by default uses "var"
-- variance):
Signature: iris.nBest(n, metric)
where metric is one of:
"var"
"stdev"
"mean"
"mad"
"IQR"
"median"
"Q1"
"Q3"
"skewness"
"min"
"range"
"max"
OR a function from Column (one column) to a number (Column -> Num
).
iris.drop('Id') // `Id` column is not useful
.numeric // select all numeric cols
.nBest(2) // best 2 features using variance as score
.print(3) // show first 3 rows
// try: iris.drop('Id').numeric.nBest(2, 'mad').print(3)
# f32 PetalLe... f32 SepalLe...
--- -------------- --------------
0 1.39 5.09
1 1.39 4.90
2 1.29 4.69
... (147 more) ...
--- -------------- --------------
600B 600B
Using .nBest()
in this way is naive and you might want to normalize (scale to the same range) the values:
iris.drop('Id') // `Id` column is not useful
.numeric // select all numeric cols
.normalize() // bring them to range [0, 1]
.nBest(2) // best 2 features using variance as score
.print(3)
As you can see you might get different results:
# f32 PetalWi... f32 PetalLe...
--- -------------- --------------
0 0.04 0.06
1 0.04 0.06
2 0.04 0.05
... (147 more) ...
--- -------------- --------------
600B 600B
It's a bit awkward to constantly have to drop the 'Species'
column because it's a string column...
You can easily convert it to a numeric column:
From:
iris.select(-2, -1).print(48, 52)
# f32 PetalWi... s Species
--- -------------- -------------
... (48 more) ...
48 0.20 Iris-setos...
49 0.20 Iris-setos...
50 1.39 Iris-versi...
51 1.50 Iris-versi...
... (98 more) ...
--- -------------- -------------
600B NaN
To:
iris.select(-2, -1).labelEncode().print(48, 52)
# f32 PetalWi... u8 Species
--- -------------- ----------
... (48 more) ...
48 0.20 0
49 0.20 0
50 1.39 1
51 1.50 1
... (98 more) ...
--- -------------- ----------
600B 150B
By default all string columns will be label encoded (numeric columns will be ignored). You may specify the colIds
e.g. df.labelEncode(0, -3, 'Target')
.
Signature: iris.oneHot(colId)
// expects the column to be unsigned int
iris.labelEncode('Species')
.oneHot('Species')
.print(48, 52)
# u8 0 u8 1 u8 2
--- ---- --------- ----
... ... (48 more) ...
48 1 0 0
49 1 0 0
50 0 1 0
51 0 1 0
... ... (98 more) ...
--- ---- --------- ----
150B 150B 150B
For demonstration let's make a 1-col data frame:
iris.select(1).print(3)
# f32 SepalLe...
- --------------
0 5.09
1 4.90
2 4.69
- --------------
12B
To clip:
iris.select(1)
.clip(null, 4.88, 5) // null == all cols
.print(3)
# f32 SepalLe...
--- --------------
0 5.00
1 4.90
2 4.88
... (147 more)
--- --------------
600B
Notice that 5.09
got clipped to 5.00
!
To remove outliers (outside of Q1 to Q3) run:
iris.dropOutliers() // consider all cols
iris.dropOutliers(0, -2) // consider just 1st and second to last cols
Data Type | String |
---|---|
string | s |
32-bit signed integer | i32 |
16-bit signed integer | i16 |
8-bit signed integer | i8 |
32-bit unsigned integer | u32 |
16-bit unsigned integer | u16 |
8-bit unsigned integer | u8 |
32-bit float (single precision) | f32 |
64-bit float (double precision) | f64 |
If you want to get the data type for all columns try:
iris.dtypes
[ 'u8', 'f32', 'f32', 'f32', 'f32', 's' ] // read-only
Or for a prettier output make a meta data frame with information about the previous data frame!
iris.dtype() // note difference between `iris.dtype()` (method) and `iris.dtypes` (getter)
SIDENOTE .dtype()
is an aggregate! This means it produces a data frame from applying a Column -> *
operation to all columns.
# s column s dtype
- ------------- -------
0 Id u8
1 SepalLengt... f32
2 SepalWidth... f32
3 PetalLengt... f32
4 PetalWidth... f32
5 Species s
- ------------- -------
NaN NaN
You can force-cast columns:
iris.cast(2, 'u8') // passing `null` instead of `2` would run cast on all cols
You can also run iris.downcast()
and let the library figure out the most efficient data type for each column so that data is not lost.
This is especially useful after truncating (floats are converted to integers).
Default:
# u8 Id f32 SepalLe... f32 SepalWi... f32 PetalLe... f32 PetalWi... s Species
- ----- -------------- -------------- -------------- -------------- -------------
0 1 5.09 3.50 1.39 0.20 Iris-setos...
1 2 4.90 3.00 1.39 0.20 Iris-setos...
2 3 4.69 3.20 1.29 0.20 Iris-setos...
- ----- -------------- -------------- -------------- -------------- -------------
3B 12B 12B 12B 12B NaN
Now see how much memory can be saved:
iris.trunc().downcast().head(3)
# u8 Id u8 SepalLe... u8 SepalWi... u8 PetalLe... u8 PetalWi... s Species
- ----- ------------- ------------- ------------- ------------- -------------
0 1 5 3 1 0 Iris-setos...
1 2 4 3 1 0 Iris-setos...
2 3 4 3 1 0 Iris-setos...
- ----- ------------- ------------- ------------- ------------- -------------
3B 3B 3B 3B 3B NaN
Although this information is by default printed, you may produce a data frame with information about memory consumption of each column.
iris.memory()
# s column u16 memory
- ------------- ----------
0 Id 150
1 SepalLengt... 600
2 SepalWidth... 600
3 PetalLengt... 600
4 PetalWidth... 600
- ------------- ----------
NaN 10B
NOTE it's not calculated for string columns (notice that "Species" is missing).
To figure out how much your data frame is taking in total try:
iris.memory()
.col(-1)
.add()
2550 // bytes
If for some reason you need a deep-copy try (expensive):
iris.clone()
Shallow copies are cheap:
iris.copy()
Sometimes you may want to get rows from 10th to 20th and e.g. 50th to 65th:
// [F, T],[F, T] // FROM - TO
iris.slice(9, 19, 49, 64)
The same applies to column slices:
iris.sliceCols(-3, -2, 0, 2)
# f32 PetalLe... f32 PetalWi... u8 Id f32 SepalLe... f32 SepalWi...
- -------------- -------------- ----- -------------- --------------
0 1.39 0.20 1 5.09 3.50
1 1.39 0.20 2 4.90 3.00
2 1.29 0.20 3 4.69 3.20
- -------------- -------------- ----- -------------- --------------
12B 12B 3B 12B 12B
iris.head(2).toHTML(/* optional file name */)
<table>
<tr>
<th>Id</th>
<th>SepalLengthCm</th>
<th>SepalWidthCm</th>
<th>PetalLengthCm</th>
<th>PetalWidthCm</th>
<th>Species</th>
</tr>
<tr>
<td>1</td>
<td>5.099999904632568</td>
<td>3.5</td>
<td>1.399999976158142</td>
<td>0.20000000298023224</td>
<td>Iris-setosa</td>
</tr>
<tr>
<td>2</td>
<td>4.900000095367432</td>
<td>3</td>
<td>1.399999976158142</td>
<td>0.20000000298023224</td>
<td>Iris-setosa</td>
</tr>
</table>
iris.head(2).toJSON(/* optional file name */)
{
"Id": [1, 2],
"SepalLengthCm": [5.099999904632568, 4.900000095367432],
"SepalWidthCm": [3.5, 3],
"PetalLengthCm": [1.399999976158142, 1.399999976158142],
"PetalWidthCm": [0.20000000298023224, 0.20000000298023224],
"Species": ["Iris-setosa", "Iris-setosa"]
}
iris.head(2).toCSV(/* optional file name */)
Id,SepalLengthCm,SepalWidthCm,PetalLengthCm,PetalWidthCm,Species
1,5.099999904632568,3.5,1.399999976158142,0.20000000298023224,Iris-setosa
2,4.900000095367432,3,1.399999976158142,0.20000000298023224,Iris-setosa
iris.head(2).toSQLTableDef('MyIrisTable', /* optional file name */)
CREATE TABLE IF NOT EXISTS MyIrisTable (
Id INT,
SepalLengthCm REAL,
SepalWidthCm REAL,
PetalLengthCm REAL,
PetalWidthCm REAL,
Species TEXT
)
iris.toSQLUpdates('MyIrisTable', /* optional file name */)
UPDATE MyIrisTable SET Id = 1, SepalLengthCm = 5.099999904632568, SepalWidthCm = 3.5, PetalLengthCm = 1.399999976158142, PetalWidthCm = 0.20000000298023224, Species = Iris-setosa;
UPDATE MyIrisTable SET Id = 2, SepalLengthCm = 4.900000095367432, SepalWidthCm = 3, PetalLengthCm = 1.399999976158142, PetalWidthCm = 0.20000000298023224, Species = Iris-setosa;
UPDATE MyIrisTable SET Id = 3, SepalLengthCm = 4.699999809265137, SepalWidthCm = 3.200000047683716, PetalLengthCm = 1.2999999523162842, PetalWidthCm = 0.20000000298023224, Species = Iris-setosa;
UPDATE MyIrisTable SET Id = 4, SepalLengthCm = 4.599999904632568, SepalWidthCm = 3.0999999046325684, PetalLengthCm = 1.5, PetalWidthCm = 0.20000000298023224, Species = Iris-setosa;
iris.head(4).toSQLInserts('MyIrisTable', /* optional file name */)
INSERT INTO MyIrisTable (Id, SepalLengthCm, SepalWidthCm, PetalLengthCm, PetalWidthCm, Species) VALUES (1, 5.099999904632568, 3.5, 1.399999976158142, 0.20000000298023224, Iris-setosa);
INSERT INTO MyIrisTable (Id, SepalLengthCm, SepalWidthCm, PetalLengthCm, PetalWidthCm, Species) VALUES (2, 4.900000095367432, 3, 1.399999976158142, 0.20000000298023224, Iris-setosa);
INSERT INTO MyIrisTable (Id, SepalLengthCm, SepalWidthCm, PetalLengthCm, PetalWidthCm, Species) VALUES (3, 4.699999809265137, 3.200000047683716, 1.2999999523162842, 0.20000000298023224, Iris-setosa);
INSERT INTO MyIrisTable (Id, SepalLengthCm, SepalWidthCm, PetalLengthCm, PetalWidthCm, Species) VALUES (4, 4.599999904632568, 3.0999999046325684, 1.5, 0.20000000298023224, Iris-setosa);
Option | Default | Sensible Alternatives | Description |
---|---|---|---|
`PRINT_PREC` | 2 | 3, 4, 5, 6, 7, 8 | how many float digits after the radix point to print |
`FLOAT_PREC` | 32 | 64 | - |
`MIN_COL_WIDTH` | 10 | 12, 15, 20 | constrain width of columns when printing |
`HEAD_LEN` | 5 | 7, 10, 20 | by default print this number of rows when running `.head()`, `.tail()` etc. |
To set:
DF.opts.OPTION = VALUE;
const args = df.colNames
// replace spaces with '_'
.map(c => [c, c.replace(/\s+/, '_')])
// flatten
.reduce((pair1, pair2) => pair1.concat(pair2), []);
df = df.rename(...args)
This would normally take a lot of code:
iris.normalize()
.matrix(
(col1, col2) => Math.abs(col1.mean() - col2.mean()),
true, // show cols
true, // halves the computation time when f(c2, c1) == f(c1, c2)
0) // saves computation on the diagonal, when f(c, c) == id
df = df.labelEncode() // string cols => unsigned int
.kBins(null) // f64, f32, ... => unsigned int
.downcast() // optimize
// see memory
df.memory()
// see dtypes
df.dtype()
// megabytes
B = df.memory() // mem for each col
.add() // add up
.val(0, 1) // get total
MB = B / 1e6
TODO
- I am not a statistician
- Unit tests for
DataFrame
are not done yet - Alpha-stage
- I would not use it in production (yet)
- This isn't supposed to be an exact copy of pandas
- In some places it's not efficient
- Date columns / mixed data types not supported. Every column must be either
numeric OR string. A single
DataFrame
may have a combination of numeric and string columns. - I am a student.
MIT