Note: These are not official definitions. They describe and explain the terms as used in the context of working together on https://github.com/mlcommons/mobile_app_open, so that all contributors share the same understanding of each term.
Term | Description |
---|---|
accelerator | Computer hardware that speeds up inference. |
throughput | Number of inference queries per second. |
latency | Duration of one inference query. |
accuracy | The accuracy of a benchmark. The unit of measurement depends on the type of benchmark; for example, mAP is used for object detection tasks and the F1 score for language processing tasks. |
backend | Vendor-specific code for running benchmarks. |
performance mode | In performance mode, the app measures only throughput. |
accuracy mode | In accuracy mode, the app measures both throughput and accuracy. |
LoadGen (Load Generator) | LoadGen is a library that provides a reliable way to measure performance. It allows the app to run tests using different scenarios: single-stream, multistream, server, and offline (not all of them are actually used in this app). It collects information for logging, debugging, and postprocessing the data. It records queries and responses from the system under test, and at the end of the run it reports statistics, summarizes the results, and determines whether the run was valid. You can read more about LoadGen in the paper in which it was originally introduced: https://arxiv.org/pdf/1911.02549.pdf or take a look at the code at https://github.com/mlcommons/inference/tree/master/loadgen |
single stream scenario | The single-stream scenario represents one inference-query stream with a query sample size of 1, reflecting the many client applications where responsiveness is critical. An example is offline voice transcription on Google’s Pixel 4 smartphone. To measure performance, we inject a single query into the inference system; when the query is complete, we record the completion time and inject the next query. The metric is the query stream’s 90th-percentile latency. Source: https://arxiv.org/pdf/1911.02549.pdf Note: In the MLPerf Mobile app we convert the latency to throughput to display it as the score. |
offline scenario | The offline scenario represents batch-processing applications where all data is immediately available and latency is unconstrained. An example is identifying the people and locations in a photo album. For this scenario, we send a single query that includes all sample-data IDs to be processed, and the system is free to process the input data in any order. Similar to the multistream scenario, neighboring samples in the query are contiguous in memory. The metric for the offline scenario is throughput measured in samples per second. Source: https://arxiv.org/pdf/1911.02549.pdf |
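The latency-to-throughput conversion mentioned for the single-stream scenario can be sketched as follows. This is a minimal illustration, not the app's actual code: the function names and the nearest-rank percentile method are assumptions made for the example.

```python
import math

def latency_percentile(latencies_ms, pct=90):
    """Return the pct-th percentile latency (ms) using the nearest-rank method."""
    ordered = sorted(latencies_ms)
    # Nearest-rank: ceil(pct/100 * n) gives the 1-based rank of the percentile.
    rank = math.ceil(pct / 100 * len(ordered))
    return ordered[rank - 1]

def throughput_score(latencies_ms, pct=90):
    """Convert the percentile latency (in ms) to a queries-per-second score."""
    return 1000.0 / latency_percentile(latencies_ms, pct)

# Hypothetical per-query latencies in milliseconds:
latencies = [10.0, 12.5, 11.0, 25.0, 9.5]
print(latency_percentile(latencies))  # 90th-percentile latency: 25.0 ms
print(throughput_score(latencies))    # score: 40.0 queries/second
```

Note that the score rewards the slowest queries least: a single long-tail query dominates the 90th-percentile latency and therefore the displayed throughput.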
Helpful links:
- Mobile Inference Results: https://mlcommons.org/benchmarks/inference-mobile/
- MLPerf Inference Rules: https://github.com/mlcommons/inference_policies/blob/master/inference_rules.adoc