Merge pull request #1196 from msvinaykumar/logplotscounter
added timer for plots
dinogun authored May 17, 2024
2 parents 98dc2a1 + 073f156 commit 0b3ecdb
Showing 3 changed files with 127 additions and 68 deletions.
110 changes: 78 additions & 32 deletions design/KruizePromQL.md
# Custom Prometheus Queries for Kruize

These are the custom Prometheus queries that you can use while running Kruize. These queries provide valuable insights into the performance of Kruize APIs and KruizeDB methods.

## KruizeAPI Metrics

The following are the available Kruize APIs that you can monitor:

To monitor the performance of these APIs, you can use the following metrics:

- `kruizeAPI_count`: This metric provides the count of invocations for a specific API. It measures how many times the API has been called.
- `kruizeAPI_sum`: This metric provides the sum of the time taken by a specific API. It measures the total time consumed by the API across all invocations.
- `kruizeAPI_max`: This metric provides the maximum time taken by a specific API. It measures the highest execution time observed for the API.

Here are some sample metrics for the mentioned APIs which can run in Prometheus:

- `kruizeAPI_count{api="createExperiment", application="Kruize", method="POST", status="success"}`: Returns the count of successful invocations for the `createExperiment` API.
- `kruizeAPI_count{api="createExperiment", application="Kruize", method="POST", status="failure"}`: Returns the count of failed invocations for the `createExperiment` API.
- `kruizeAPI_sum{api="createExperiment", application="Kruize", method="POST", status="success"}`: Returns the sum of the time taken by the successful invocations of `createExperiment` API.
- `kruizeAPI_max{api="createExperiment", application="Kruize", method="POST", status="success"}`: Returns the maximum time taken by the successful invocation of `createExperiment` API.

By changing the value of the `api` and `method` labels, you can gather metrics for other Kruize APIs such as `listRecommendations`, `listExperiments`, and `updateResults`.
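
To illustrate swapping those labels, here is a small Python sketch; the `kruize_api_query` helper is hypothetical (not part of Kruize), and simply builds the PromQL selectors shown above:

```python
# Illustrative helper (not part of the Kruize codebase): build a PromQL
# selector for a Kruize API metric by filling in the label values.

def kruize_api_query(metric, api, method, status="success"):
    """Return a PromQL selector such as kruizeAPI_sum{api="...", ...}."""
    labels = f'api="{api}", application="Kruize", method="{method}", status="{status}"'
    return f"{metric}{{{labels}}}"

# The same metric for another API, obtained just by changing the labels:
print(kruize_api_query("kruizeAPI_sum", "listRecommendations", "GET"))
# kruizeAPI_sum{api="listRecommendations", application="Kruize", method="GET", status="success"}
```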

Here is a sample command to collect the metric through `curl`:

- `curl --silent -G -kH "Authorization: Bearer ${TOKEN}" --data-urlencode 'query=kruizeAPI_sum{api="listRecommendations", application="Kruize", method="GET", status="success"}' ${PROMETHEUS_URL} | jq` :
  Returns the sum of the time taken by the `listRecommendations` API.

Sample Output:

```
{
  "status": "success",
  ...
}
```

The following are the available Kruize DB methods that you can monitor:

- `addExperimentToDB`: Method for adding an experiment to the database.
- `addResultToDB`: Method for adding experiment results to the database.
- `addBulkResultsToDBAndFetchFailedResults`: Method for adding bulk experiment results to the database and fetching the failed results.
- `addRecommendationToDB`: Method for adding a recommendation to the database.
- `loadExperimentByName`: Method for loading an experiment by name.
- `loadResultsByExperimentName`: Method for loading experiment results by experiment name.
- `loadPerformanceProfileByName`: Method to load a specific performance profile.
- `loadAllPerformanceProfiles`: Method to load all performance profiles.

## KruizeMethod Metrics

The following are the available Kruize methods that you can monitor:

- `generatePlots`: Method to generate box plot metrics for all terms.

Sample Output:

```
KruizeMethod_max{application="Kruize",method="generatePlots",status="success",} 0.036112854
KruizeMethod_count{application="Kruize",method="generatePlots",status="success",} 2.0
KruizeMethod_sum{application="Kruize",method="generatePlots",status="success",} 0.050705769
```
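
The lines above are in the Prometheus text exposition format. As a sketch (an illustrative helper, not Kruize code), a sample line can be split into its metric name, labels, and value like this:

```python
import re

def parse_prom_line(line):
    """Parse one Prometheus text-format sample into (name, labels dict, value)."""
    m = re.match(r'^(\w+)\{(.*?)\}\s+(\S+)$', line.strip())
    if not m:
        raise ValueError(f"unrecognized sample line: {line!r}")
    name, label_body, value = m.groups()
    # Labels are key="value" pairs; a trailing comma (as Micrometer emits) is tolerated.
    labels = dict(re.findall(r'(\w+)="(.*?)"', label_body))
    return name, labels, float(value)

name, labels, value = parse_prom_line(
    'KruizeMethod_count{application="Kruize",method="generatePlots",status="success",} 2.0'
)
print(name, labels["method"], value)  # KruizeMethod_count generatePlots 2.0
```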

## Time taken for KruizeDB metrics

To monitor the performance of these methods, you can use the following metrics:

- `kruizeDB_count`: This metric provides the count of calls made to the specific DB method. It measures how many times the DB method has been called.
- `kruizeDB_sum`: This metric provides the sum of the time taken by a specific DB method. It measures the total time consumed by the DB method across all invocations.
- `kruizeDB_max`: This metric provides the maximum time taken by a specific DB method. It measures the highest execution time observed for the DB method.

Here are some sample metrics for the mentioned DB methods which can run in Prometheus:

- `kruizeDB_count{application="Kruize", method="addExperimentToDB", status="success"}`: Number of successful invocations of the `addExperimentToDB` method.
- `kruizeDB_count{application="Kruize", method="addExperimentToDB", status="failure"}`: Number of failed invocations of the `addExperimentToDB` method.
- `kruizeDB_sum{application="Kruize", method="addExperimentToDB", status="success"}`: Total time taken by successful invocations of the `addExperimentToDB` method.
- `kruizeDB_max{application="Kruize", method="addExperimentToDB", status="success"}`: Maximum time taken by a successful invocation of the `addExperimentToDB` method.

By changing the value of the `method` label, you can gather metrics for other KruizeDB metrics.
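
The `curl ... | jq` commands in this document return the standard Prometheus HTTP API response shape (`status`, `data.result[].value`). A hedged Python sketch for pulling the scalar value out of such a response; the inline payload is a made-up example, not real Kruize output:

```python
import json

def extract_first_value(response_text):
    """Pull the first sample value out of a Prometheus instant-query response."""
    payload = json.loads(response_text)
    if payload.get("status") != "success":
        raise RuntimeError(f"query failed: {payload}")
    result = payload["data"]["result"]
    if not result:
        return None
    # Instant-query values are [unix_timestamp, "value-as-string"] pairs.
    return float(result[0]["value"][1])

# Fabricated example response for illustration:
sample = '''{
  "status": "success",
  "data": {
    "resultType": "vector",
    "result": [
      {"metric": {"method": "loadRecommendationsByExperimentName"},
       "value": [1715933000.123, "0.254"]}
    ]
  }
}'''
print(extract_first_value(sample))  # 0.254
```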

Here is a sample command to collect the metric through `curl`:

- `curl --silent -G -kH "Authorization: Bearer ${TOKEN}" --data-urlencode 'query=kruizeDB_sum{application="Kruize", method="loadRecommendationsByExperimentName", status="success"}' ${PROMETHEUS_URL} | jq` :
Returns the sum of the time taken by `loadRecommendationsByExperimentName` method.

Sample Output:

```
{
  "status": "success",
  ...
}
```

# Kruize Metrics Collection and Analysis
# Kruize Metrics Collection and Analysis

To facilitate the performance analysis of the Kruize application, we provide a comprehensive script, [kruize_metrics.py](../scripts/kruize_metrics.py), which enables the collection of Kruize metrics in CSV format. This script generates two distinct output files: `increase_kruizemetrics.csv` and `total_kruizemetrics.csv`. Notably, the PostgresDB metrics maintain consistency across both files.

### Output Files and Format

- `increase_kruizemetrics.csv`: This file leverages `increase()` queries to ascertain the total incremental changes in Kruize metric values over time.
- `total_kruizemetrics.csv`: This file employs the original queries to compute cumulative metric values since the inception of the Kruize application.

Each column within the CSV files corresponds to specific API and DB metrics, capturing counts, sums, and maximum values for both successful and failed operations.
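
To illustrate the difference between the two files, here is a simplified sketch of what an `increase()`-style delta over cumulative counter samples looks like (real `increase()` additionally extrapolates over the range window; only the counter-reset handling is shown here):

```python
def naive_increase(samples):
    """Sum positive deltas between successive cumulative counter samples.

    Counters only go up; a drop means the counter was reset, in which case
    the post-reset value itself is the increase since the reset.
    """
    total = 0.0
    for prev, curr in zip(samples, samples[1:]):
        total += curr - prev if curr >= prev else curr
    return total

# Cumulative kruizeAPI_count-style samples, with one counter reset (12 -> 2).
print(naive_increase([5, 9, 12, 2, 7]))  # 14.0
```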

### Some key columns for insightful analysis:

| Column | Description |
|--------|-------------|
| kruizeDB_size | Current size of the Kruize database. |
| kruizeDB_results | Total count of results available in the database across all experiments. |
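
As a sketch of how the `kruizeDB_size` and `kruizeDB_results` columns could be correlated, a small Python example computing database growth per additional result between CSV snapshots; the inline rows and the MB unit are fabricated for illustration, real values come from `total_kruizemetrics.csv`:

```python
import csv
import io

# Fabricated sample rows; in practice these come from total_kruizemetrics.csv.
csv_text = """kruizeDB_results,kruizeDB_size
100,12.5
200,24.8
300,37.4
"""

rows = list(csv.DictReader(io.StringIO(csv_text)))

# Database growth per additional result, between successive snapshots.
for prev, curr in zip(rows, rows[1:]):
    d_results = int(curr["kruizeDB_results"]) - int(prev["kruizeDB_results"])
    d_size = float(curr["kruizeDB_size"]) - float(prev["kruizeDB_size"])
    print(f"{d_size / d_results:.4f} MB per result")
```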


# Initial Analysis Insights

Upon analyzing the collected metrics, several crucial insights emerge:

- `Database Growth`: As the number of experiments and associated results increases, there is a proportional growth in the size of the database.

- `Update Recommendations Time`: Currently, the time required for updating recommendations exhibits an increasing trend with the growth in results. This aspect necessitates closer attention and potential optimization efforts.

- `Stable Update Results Time`: The time taken for updating experiment results is expected to remain relatively stable. Any deviations from this expected pattern warrant further investigation for potential performance issues.

- `DB Method Aggregation`: While individual DB method metrics provide valuable insights, it is important to understand how they collectively contribute to the overall API metrics. A comprehensive analysis of both individual and aggregated DB metrics is essential for a holistic performance assessment.

- `Max Value Analysis`: Evaluating the maximum values allows for the identification of peak performance periods for each method, aiding in the identification of potential performance bottlenecks.

By conducting a thorough analysis based on these initial insights, users can effectively monitor and optimize the performance of the Kruize application, thereby ensuring a seamless and efficient user experience.
import com.autotune.operator.KruizeDeploymentInfo;
import com.autotune.utils.GenericRestApiClient;
import com.autotune.utils.KruizeConstants;
import com.autotune.utils.MetricsConfig;
import com.autotune.utils.Utils;
import com.google.gson.Gson;
import com.google.gson.JsonArray;
import com.google.gson.JsonElement;
import com.google.gson.JsonObject;
import io.micrometer.core.instrument.Timer;
import org.json.JSONObject;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;
private boolean generateRecommendationsBasedOnTerms(ContainerData containerData,
// generate plots when minimum data is available for the term
if (KruizeDeploymentInfo.plots) {
if (null != monitoringStartTime) {
Timer.Sample timerBoxPlots = null;
String status = "success"; // TODO avoid this constant at multiple places
try {
timerBoxPlots = Timer.start(MetricsConfig.meterRegistry());
mappedRecommendationForTerm.setPlots(new PlotManager(containerData.getResults(), terms, monitoringStartTime, monitoringEndTime).generatePlots());
} catch (Exception e) {
status = String.format("Box plots Failed due to - %s", e.getMessage());
} finally {
if (timerBoxPlots != null) {
MetricsConfig.timerBoxPlots = MetricsConfig.timerBBoxPlots.tag("status", status).register(MetricsConfig.meterRegistry());
timerBoxPlots.stop(MetricsConfig.timerBoxPlots);
}
}
}
}
}
* @param interval_start_time The start time of the interval for fetching metrics.
* @param dataSourceInfo The datasource object to fetch metrics from.
* @throws Exception if an error occurs during the fetching process.
* TODO: Need to add right abstractions for this
*/
public void fetchMetricsBasedOnDatasource(KruizeObject kruizeObject, Timestamp interval_end_time, Timestamp interval_start_time, DataSourceInfo dataSourceInfo) throws Exception {
try {