docs(docstring-readme): finish docstring and write readme #16

Merged
docs(README): add installation and quick example walk-through
ninopleno committed Dec 4, 2023
commit 76767e59743541f568637a6021694511545d8f71
151 changes: 147 additions & 4 deletions README.md
@@ -40,10 +40,153 @@
</a>
</p>

## Installation

```shell
# Install without openpyxl
$ pip3 install anomalytics

# Install with openpyxl
$ pip3 install "anomalytics[extra]"
```

## Use Case

`anomalytics` can be used to analyze anomalies in your dataset (either a `pandas.DataFrame` or a `pandas.Series`). To start, let's follow along with this minimal example, where we want to detect extremely high anomalies in a time series dataset.

1. Import `anomalytics` and initialise your time series:

```python
import anomalytics as atics

ts = atics.read_ts(
"my_dataset.csv",
"csv"
)
ts.head()
```
```shell
Date-Time
2008-11-03 08:00:00 -0.282
2008-11-03 09:00:00 -0.368
2008-11-03 10:00:00 -0.400
2008-11-03 11:00:00 -0.320
2008-11-03 12:00:00 -0.155
Name: Example Dataset, dtype: float64
```

2. Set the time windows t0, t1, and t2, which define the dynamically expanding period used to calculate the threshold via quantile:

```python
t0, t1, t2 = atics.set_time_window(ts.shape[0], "POT", "historical", t0_pct=0.7, t1_pct=0.2, t2_pct=0.1)
print(f"T0: {t0}")
print(f"T1: {t1}")
print(f"T2: {t2}")
```
```shell
T0: 70000
T1: 20000
T2: 10000
```
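Under the hood, these windows are presumably simple percentage splits of the series length (here 100,000 rows). A hypothetical re-implementation of that arithmetic, for illustration only (not the library's actual code):

```python
def split_time_window(total_rows: int, t0_pct: float = 0.7, t1_pct: float = 0.2):
    """Split a series length into three consecutive windows by percentage."""
    t0 = int(total_rows * t0_pct)
    t1 = int(total_rows * t1_pct)
    # Assign the remainder to t2 so the three windows always sum to the total.
    t2 = total_rows - t0 - t1
    return t0, t1, t2

print(split_time_window(100_000))  # (70000, 20000, 10000), matching the output above
```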

3. Extract the exceedances, indicating the anomaly type (`"high"`) and the quantile `q`:

```python
exceedance_ts = atics.get_exceedance_peaks_over_threshold(ts, t0, "high", 0.95)
exceedance_ts.tail()
```
```shell
Date-Time
2020-03-31 19:00:00 0.867
2020-03-31 20:00:00 0.867
2020-03-31 21:00:00 0.867
2020-03-31 22:00:00 0.867
2020-03-31 23:00:00 0.867
Name: Example Dataset, dtype: float64
```
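Conceptually, the peaks-over-threshold step keeps, for each point, only the amount by which it exceeds the quantile threshold, and zero elsewhere. A toy sketch in plain pandas, not the library's actual implementation, using a single static threshold instead of the dynamically expanding one:

```python
import pandas as pd

s = pd.Series([0.1, 0.9, 0.3, 1.2, 0.5], name="Toy Dataset")
threshold = s.quantile(0.95)                  # static threshold for the toy example
exceedance = (s - threshold).clip(lower=0.0)  # zero wherever the point stays below it
print(exceedance)
```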

4. Compute the anomaly score for each exceedance and initialize a `params` dictionary for further analysis and evaluation:

```python
params = {}
anomaly_score_ts = atics.get_anomaly_score(exceedance_ts, t0, params)
anomaly_score_ts.head()
```
```shell
Date-Time
2016-10-29 00:00:00 0.0
2016-10-29 01:00:00 0.0
2016-10-29 02:00:00 0.0
2016-10-29 03:00:00 0.0
2016-10-29 04:00:00 0.0
Name: Example Dataset, dtype: float64
...
```
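One common convention in EVT-based detection is to score each exceedance by the inverse of its survival probability under the fitted generalized Pareto distribution, so rarer exceedances get larger scores. A hedged sketch of that idea with SciPy (this may differ from `anomalytics`' exact scoring formula):

```python
import numpy as np
from scipy.stats import genpareto

exceedances = np.array([0.02, 0.05, 0.10, 0.30, 0.90])

# Fit a GPD to the exceedances; location is pinned at 0 because the
# threshold has already been subtracted out.
c, loc, scale = genpareto.fit(exceedances, floc=0.0)

sf = genpareto.sf(exceedances, c, loc=loc, scale=scale)  # survival probability
scores = 1.0 / sf  # rarer (larger) exceedances receive larger anomaly scores
```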

5. Inspect the fitted parameters (the result of the `genpareto` fitting):

```python
print(params)
```
```shell
{0: {'datetime': Timestamp('2016-10-29 03:00:00'),
'c': 0.0,
'loc': 0.0,
'scale': 0.0,
'p_value': 0.0,
'anomaly_score': 0.0},
1: {'datetime': Timestamp('2016-10-29 04:00:00'),
...
'loc': 0,
'scale': 0.19125308567629334,
'p_value': 0.19286132173263668,
'anomaly_score': 5.1850728337654886},
...}
```

6. Detect the extremely high anomalies:

```python
anomaly_ts = atics.get_anomaly(anomaly_score_ts, t1, 0.90)
anomaly_ts.head()
```
```shell
Date-Time
2019-02-09 08:00:00 False
2019-02-09 09:00:00 False
2019-02-09 10:00:00 False
2019-02-09 11:00:00 False
2019-02-09 12:00:00 False
Name: Example Dataset, dtype: bool
```
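Detection itself presumably reduces to thresholding: take a high quantile of the non-zero scores inside the evaluation window of length t1 and flag everything above it. A hypothetical sketch, where the names and details are illustrative rather than the library's API:

```python
import pandas as pd

scores = pd.Series([0.0, 0.2, 5.1, 0.4, 12.7, 0.0])
t1 = 4  # hypothetical evaluation-window length

window = scores.iloc[-t1:]                     # the most recent t1 scores
threshold = window[window > 0].quantile(0.90)  # quantile over non-zero scores only
is_anomaly = scores > threshold                # boolean series of detections
print(is_anomaly)
```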

7. Evaluate the analysis results with the Kolmogorov-Smirnov 1-sample test:

```python
ks_result = ks_1sample(ts=exceedance_ts, stats_method="POT", fit_params=params)
print(ks_result)
```
```shell
{'total_nonzero_exceedances': 5028, 'start_datetime': '2023-10-10 00:00:00', 'end_datetime': '2023-10-11 01:00:00', 'stats_distance': 0.0284, 'p_value': 0.8987, 'c': 0.003566, 'loc': 0, 'scale': 0.140657}
```
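The evaluation is presumably a standard one-sample Kolmogorov-Smirnov test of the exceedances against the fitted GPD. With SciPy it looks roughly like this (synthetic data stands in for the real exceedances):

```python
import numpy as np
from scipy.stats import genpareto, kstest

rng = np.random.default_rng(42)
# Synthetic exceedances drawn from a GPD, standing in for the real ones.
sample = genpareto.rvs(0.1, loc=0.0, scale=0.14, size=500, random_state=rng)

c, loc, scale = genpareto.fit(sample, floc=0.0)
result = kstest(sample, "genpareto", args=(c, loc, scale))
print(result.statistic, result.pvalue)  # small distance, large p-value: a good fit
```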

# Reference

* Nakamura, C. (2021, July 13). On Choice of Hyper-parameter in Extreme Value Theory Based on Machine Learning Techniques. arXiv:2107.06074 [cs.LG]. https://doi.org/10.48550/arXiv.2107.06074

* Davis, N., Raina, G., & Jagannathan, K. (2019). LSTM-Based Anomaly Detection: Detection Rules from Extreme Value Theory. In Proceedings of the EPIA Conference on Artificial Intelligence 2019. https://doi.org/10.48550/arXiv.1909.06041

* Arian, H., Poorvasei, H., Sharifi, A., & Zamani, S. (2020, November 13). The Uncertain Shape of Grey Swans: Extreme Value Theory with Uncertain Threshold. arXiv:2011.06693v1 [econ.GN]. https://doi.org/10.48550/arXiv.2011.06693

* Yiannis Kalliantzis. (n.d.). Detect Outliers: Expert Outlier Detection and Insights. Retrieved December 4, 2023, from https://detectoutliers.com/

# Wall of Fame

I am deeply grateful to have met, been guided by, or simply read the inspirational works of people who motivated me to publish this open-source package as part of my capstone project at CODE University of Applied Sciences in Berlin (2023):

* My lovely mother Sarbina Lindenberg
* Adam Roe
* Alessandro Dolci
* Christian Leschinski
* Johanna Kokocinski
* Peter Krauß
39 changes: 30 additions & 9 deletions src/anomalytics/stats/peaks_over_threshold.py
@@ -38,8 +38,9 @@ def get_threshold_peaks_over_threshold(

## Example
----------
>>> t0, t1, t2 = set_time_window(1000, "POT", "historical", t0_pct=0.7, t1_pct=0.2, t2_pct=0.1)
>>> pot_threshold_ts = get_threshold_peaks_over_threshold(ts, t0, "high", 0.95)
>>> pot_threshold_ts.tail()
Date-Time
2020-03-31 19:00:00 0.867
2020-03-31 20:00:00 0.867
@@ -104,8 +105,9 @@ def get_exceedance_peaks_over_threshold(

## Example
----------
>>> t0, t1, t2 = set_time_window(1000, "POT", "historical", t0_pct=0.7, t1_pct=0.2, t2_pct=0.1)
>>> exceedance_ts = get_exceedance_peaks_over_threshold(ts, t0, "high", 0.95)
>>> exceedance_ts.tail()
Date-Time
2020-03-31 19:00:00 0.867
2020-03-31 20:00:00 0.867
@@ -167,15 +169,32 @@ def get_anomaly_score(ts: pd.Series, t0: int, gpd_params: typing.Dict) -> pd.Ser

## Example
----------
>>> t0, t1, t2 = set_time_window(1000, "POT", "historical", t0_pct=0.7, t1_pct=0.2, t2_pct=0.1)
>>> params = {}
>>> anomaly_score_ts = get_anomaly_score(exceedance_ts, t0, params)
>>> anomaly_score_ts.head()
Date-Time
2016-10-29 00:00:00 0.0
2016-10-29 01:00:00 0.0
2016-10-29 02:00:00 0.0
2016-10-29 03:00:00 0.0
2016-10-29 04:00:00 0.0
Name: Example Dataset, dtype: float64
...
>>> params
{0: {'datetime': Timestamp('2016-10-29 03:00:00'),
'c': 0.0,
'loc': 0.0,
'scale': 0.0,
'p_value': 0.0,
'anomaly_score': 0.0},
1: {'datetime': Timestamp('2016-10-29 04:00:00'),
...
'loc': 0,
'scale': 0.19125308567629334,
'p_value': 0.19286132173263668,
'anomaly_score': 5.1850728337654886},
...}

## Raises
---------
@@ -261,7 +280,8 @@ def get_anomaly_threshold(ts: pd.Series, t1: int, q: float = 0.90) -> float:

## Example
----------
>>> t0, t1, t2 = set_time_window(1000, "POT", "historical", t0_pct=0.7, t1_pct=0.2, t2_pct=0.1)
>>> anomaly_threshold = get_anomaly_threshold(anomaly_score_ts, t1, 0.90)
>>> print(anomaly_threshold)
9.167442809714414

@@ -308,8 +328,9 @@ def get_anomaly(ts: pd.Series, t1: int, q: float = 0.90) -> pd.Series:

## Example
----------
>>> t0, t1, t2 = set_time_window(1000, "POT", "historical", t0_pct=0.7, t1_pct=0.2, t2_pct=0.1)
>>> anomaly_ts = get_anomaly(anomaly_score_ts, t1, 0.90)
>>> anomaly_ts.head()
Date-Time
2019-02-09 08:00:00 False
2019-02-09 09:00:00 False