docs(docstring-readme): finish docstring and write readme #16

Merged
docs(README): add installation and quick example walk-through
ninopleno committed Dec 4, 2023
commit 76767e59743541f568637a6021694511545d8f71
151 changes: 147 additions & 4 deletions README.md
@@ -40,10 +40,153 @@
</a>
</p>

## Installation

```shell
# Install without openpyxl
$ pip3 install anomalytics

# Install with openpyxl
$ pip3 install "anomalytics[extra]"
```

## Use Case

`anomalytics` can be used to analyze anomalies in your dataset (either a `pandas.DataFrame` or a `pandas.Series`). To start, let's follow along with this minimal example, where we want to detect extremely high anomalies in a time series dataset.

1. Import `anomalytics` and initialise your time series:

```python
import anomalytics as atics

ts = atics.read_ts(
"my_dataset.csv",
"csv"
)
ts.head()
```
```shell
Date-Time
2008-11-03 08:00:00 -0.282
2008-11-03 09:00:00 -0.368
2008-11-03 10:00:00 -0.400
2008-11-03 11:00:00 -0.320
2008-11-03 12:00:00 -0.155
Name: Example Dataset, dtype: float64
```

2. Set the time windows t0, t1, and t2, which define the dynamically expanding period used to calculate the threshold via quantile:

```python
t0, t1, t2 = atics.set_time_window(ts.shape[0], "POT", "historical", t0_pct=0.7, t1_pct=0.2, t2_pct=0.1)
print(f"T0: {t0}")
print(f"T1: {t1}")
print(f"T2: {t2}")
```
```shell
T0: 70000
T1: 20000
T2: 10000
```
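Under the hood, these windows are presumably simple percentage splits of the series length (here 100,000 rows). A hypothetical re-implementation of that arithmetic, for illustration only (not the library's actual code):

```python
def split_time_window(total_rows: int, t0_pct: float = 0.7, t1_pct: float = 0.2):
    """Split a series length into three consecutive windows by percentage."""
    t0 = int(total_rows * t0_pct)
    t1 = int(total_rows * t1_pct)
    # Assign the remainder to t2 so the three windows always sum to the total.
    t2 = total_rows - t0 - t1
    return t0, t1, t2

print(split_time_window(100_000))  # (70000, 20000, 10000), matching the output above
```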

3. Extract the exceedances, indicating the anomaly type (`"high"`) and the quantile `q`:

```python
exceedance_ts = atics.get_exceedance_peaks_over_threshold(ts, t0, "high", 0.95)
exceedance_ts.tail()
```
```shell
Date-Time
2020-03-31 19:00:00 0.867
2020-03-31 20:00:00 0.867
2020-03-31 21:00:00 0.867
2020-03-31 22:00:00 0.867
2020-03-31 23:00:00 0.867
Name: Example Dataset, dtype: float64
```
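Conceptually, the peaks-over-threshold step keeps, for each point, only the amount by which it exceeds the quantile threshold, and zero elsewhere. A toy sketch in plain pandas, not the library's actual implementation, using a single static threshold instead of the dynamically expanding one:

```python
import pandas as pd

s = pd.Series([0.1, 0.9, 0.3, 1.2, 0.5], name="Toy Dataset")
threshold = s.quantile(0.95)                  # static threshold for the toy example
exceedance = (s - threshold).clip(lower=0.0)  # zero wherever the point stays below it
print(exceedance)
```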

4. Compute the anomaly score for each exceedance and initialize a `params` dictionary for further analysis and evaluation:

```python
params = {}
anomaly_score_ts = atics.get_anomaly_score(exceedance_ts, t0, params)
anomaly_score_ts.head()
```
```shell
Date-Time
2016-10-29 00:00:00 0.0
2016-10-29 01:00:00 0.0
2016-10-29 02:00:00 0.0
2016-10-29 03:00:00 0.0
2016-10-29 04:00:00 0.0
Name: Example Dataset, dtype: float64
...
```
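One common convention in EVT-based detection is to score each exceedance by the inverse of its survival probability under the fitted generalized Pareto distribution, so rarer exceedances get larger scores. A hedged sketch of that idea with SciPy (this may differ from `anomalytics`' exact scoring formula):

```python
import numpy as np
from scipy.stats import genpareto

exceedances = np.array([0.02, 0.05, 0.10, 0.30, 0.90])

# Fit a GPD to the exceedances; location is pinned at 0 because the
# threshold has already been subtracted out.
c, loc, scale = genpareto.fit(exceedances, floc=0.0)

sf = genpareto.sf(exceedances, c, loc=loc, scale=scale)  # survival probability
scores = 1.0 / sf  # rarer (larger) exceedances receive larger anomaly scores
```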

5. Inspect the fitted parameters (the result of the `genpareto` fitting):

```python
print(params)
```
```shell
{0: {'datetime': Timestamp('2016-10-29 03:00:00'),
'c': 0.0,
'loc': 0.0,
'scale': 0.0,
'p_value': 0.0,
'anomaly_score': 0.0},
1: {'datetime': Timestamp('2016-10-29 04:00:00'),
...
'loc': 0,
'scale': 0.19125308567629334,
'p_value': 0.19286132173263668,
'anomaly_score': 5.1850728337654886},
...}
```

6. Detect the extremely high anomalies:

```python
anomaly_ts = atics.get_anomaly(anomaly_score_ts, t1, 0.90)
anomaly_ts.head()
```
```shell
Date-Time
2019-02-09 08:00:00 False
2019-02-09 09:00:00 False
2019-02-09 10:00:00 False
2019-02-09 11:00:00 False
2019-02-09 12:00:00 False
Name: Example Dataset, dtype: bool
```
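Detection itself presumably reduces to thresholding: take a high quantile of the non-zero scores inside the evaluation window of length t1 and flag everything above it. A hypothetical sketch, where the names and details are illustrative rather than the library's API:

```python
import pandas as pd

scores = pd.Series([0.0, 0.2, 5.1, 0.4, 12.7, 0.0])
t1 = 4  # hypothetical evaluation-window length

window = scores.iloc[-t1:]                     # the most recent t1 scores
threshold = window[window > 0].quantile(0.90)  # quantile over non-zero scores only
is_anomaly = scores > threshold                # boolean series of detections
print(is_anomaly)
```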

7. Evaluate the analysis results with the Kolmogorov-Smirnov 1-sample test:

```python
ks_result = ks_1sample(ts=exceedance_ts, stats_method="POT", fit_params=params)
print(ks_result)
```
```shell
{'total_nonzero_exceedances': 5028, 'start_datetime': '2023-10-10 00:00:00', 'end_datetime': '2023-10-11 01:00:00', 'stats_distance': 0.0284, 'p_value': 0.8987, 'c': 0.003566, 'loc': 0, 'scale': 0.140657}
```
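The evaluation is presumably a standard one-sample Kolmogorov-Smirnov test of the exceedances against the fitted GPD. With SciPy it looks roughly like this (synthetic data stands in for the real exceedances):

```python
import numpy as np
from scipy.stats import genpareto, kstest

rng = np.random.default_rng(42)
# Synthetic exceedances drawn from a GPD, standing in for the real ones.
sample = genpareto.rvs(0.1, loc=0.0, scale=0.14, size=500, random_state=rng)

c, loc, scale = genpareto.fit(sample, floc=0.0)
result = kstest(sample, "genpareto", args=(c, loc, scale))
print(result.statistic, result.pvalue)  # small distance, large p-value: a good fit
```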

# Reference

* Nakamura, C. (2021, July 13). On Choice of Hyper-parameter in Extreme Value Theory Based on Machine Learning Techniques. arXiv:2107.06074 [cs.LG]. https://doi.org/10.48550/arXiv.2107.06074

* Davis, N., Raina, G., & Jagannathan, K. (2019). LSTM-Based Anomaly Detection: Detection Rules from Extreme Value Theory. In Proceedings of the EPIA Conference on Artificial Intelligence 2019. https://doi.org/10.48550/arXiv.1909.06041

* Arian, H., Poorvasei, H., Sharifi, A., & Zamani, S. (2020, November 13). The Uncertain Shape of Grey Swans: Extreme Value Theory with Uncertain Threshold. arXiv:2011.06693v1 [econ.GN]. https://doi.org/10.48550/arXiv.2011.06693

* Yiannis Kalliantzis. (n.d.). Detect Outliers: Expert Outlier Detection and Insights. Retrieved December 4, 2023, from https://detectoutliers.com/

# Wall of Fame

I am deeply grateful to have met, been guided by, or simply read the inspirational works of people who motivated me to publish this open-source package as part of my capstone project at CODE University of Applied Sciences in Berlin (2023):

* My lovely mother Sarbina Lindenberg
* Adam Roe
* Alessandro Dolci
* Christian Leschinski
* Johanna Kokocinski
* Peter Krauß
39 changes: 30 additions & 9 deletions src/anomalytics/stats/peaks_over_threshold.py
@@ -38,8 +38,9 @@ def get_threshold_peaks_over_threshold(

## Example
----------
>>> t0, t1, t2 = set_time_window(1000, "POT", "historical", t0_pct=0.7, t1_pct=0.2, t2_pct=0.1)
>>> pot_threshold_ts = get_threshold_peaks_over_threshold(ts, t0, "high", 0.95)
>>> pot_threshold_ts.tail()
Date-Time
2020-03-31 19:00:00 0.867
2020-03-31 20:00:00 0.867
@@ -104,8 +105,9 @@ def get_exceedance_peaks_over_threshold(

## Example
----------
>>> t0, t1, t2 = set_time_window(1000, "POT", "historical", t0_pct=0.7, t1_pct=0.2, t2_pct=0.1)
>>> exceedance_ts = get_exceedance_peaks_over_threshold(ts, t0, "high", 0.95)
>>> exceedance_ts.tail()
Date-Time
2020-03-31 19:00:00 0.867
2020-03-31 20:00:00 0.867
@@ -167,15 +169,32 @@ def get_anomaly_score(ts: pd.Series, t0: int, gpd_params: typing.Dict) -> pd.Ser

## Example
----------
>>> t0, t1, t2 = set_time_window(1000, "POT", "historical", t0_pct=0.7, t1_pct=0.2, t2_pct=0.1)
>>> params = {}
>>> anomaly_score_ts = get_anomaly_score(exceedance_ts, t0, params)
>>> anomaly_score_ts.head()
Date-Time
2016-10-29 00:00:00 0.0
2016-10-29 01:00:00 0.0
2016-10-29 02:00:00 0.0
2016-10-29 03:00:00 0.0
2016-10-29 04:00:00 0.0
Name: Example Dataset, dtype: float64
...
>>> params
{0: {'datetime': Timestamp('2016-10-29 03:00:00'),
'c': 0.0,
'loc': 0.0,
'scale': 0.0,
'p_value': 0.0,
'anomaly_score': 0.0},
1: {'datetime': Timestamp('2016-10-29 04:00:00'),
...
'loc': 0,
'scale': 0.19125308567629334,
'p_value': 0.19286132173263668,
'anomaly_score': 5.1850728337654886},
...}

## Raises
---------
@@ -261,7 +280,8 @@ def get_anomaly_threshold(ts: pd.Series, t1: int, q: float = 0.90) -> float:

## Example
----------
>>> t0, t1, t2 = set_time_window(1000, "POT", "historical", t0_pct=0.7, t1_pct=0.2, t2_pct=0.1)
>>> anomaly_threshold = get_anomaly_threshold(anomaly_score_ts, t1, 0.90)
>>> print(anomaly_threshold)
9.167442809714414

@@ -308,8 +328,9 @@ def get_anomaly(ts: pd.Series, t1: int, q: float = 0.90) -> pd.Series:

## Example
----------
>>> t0, t1, t2 = set_time_window(1000, "POT", "historical", t0_pct=0.7, t1_pct=0.2, t2_pct=0.1)
>>> anomaly_ts = get_anomaly(anomaly_score_ts, t1, 0.90)
>>> anomaly_ts.head()
Date-Time
2019-02-09 08:00:00 False
2019-02-09 09:00:00 False