This tool analyzes performance traces from Metal operations, providing insights into throughput, bottlenecks, and optimization opportunities.
This tool can be installed from PyPI:

```bash
pipx install tt-perf-report
```

Installing with pipx automatically creates an isolated virtual environment and makes the `tt-perf-report` command available on your PATH.
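To check that the install worked, print the tool's help text (assuming the standard `--help` flag is available):

```bash
tt-perf-report --help
```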
- Build Metal with performance tracing enabled:

  ```bash
  ./build_metal -p
  ```

- Run your test in TT-Metal with the tracy module to capture traces:

  ```bash
  python -m tracy -r -p -v -m pytest path/to/test.py
  ```

This generates a CSV file containing operation timing data.
Tracy signposts mark specific sections of code for analysis. Add signposts to your Python code:

```python
import tracy

# Mark different sections of your code
tracy.signpost("Compilation pass")
model(input_data)

tracy.signpost("Performance pass")
for _ in range(10):
    model(input_data)
```
The tool uses the last signpost by default, which is typically the most relevant section for a performance test (e.g., the final iterations after compilation/warmup).
Common signpost options:

- `--signpost name`: Analyze ops after the specified signpost
- `--ignore-signposts`: Analyze the entire trace
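For example, to analyze only the ops recorded after the `Performance pass` signpost from the snippet above, or to analyze the whole trace regardless of signposts:

```bash
# Only ops after the named signpost
tt-perf-report trace.csv --signpost "Performance pass"

# The entire trace, ignoring signposts
tt-perf-report trace.csv --ignore-signposts
```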
The output of the performance report is a table of operations. Each operation is assigned a unique ID starting from 1. You can re-run the tool with different IDs to focus on specific sections of the trace.
Use `--id-range` to analyze specific sections:
```bash
# Analyze ops 5 through 10
tt-perf-report trace.csv --id-range 5-10

# Analyze from op 31 onwards
tt-perf-report trace.csv --id-range 31-

# Analyze up to op 12
tt-perf-report trace.csv --id-range -12
```
This is particularly useful for:
- Isolating the decode pass in prefill+decode LLM inference
- Analyzing single transformer layers without embeddings/projections
- Focusing on specific model components
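For instance, if a single transformer layer spanned ops 120 through 180 (hypothetical IDs; read the real boundaries off the op table first), you could isolate it and export just that slice with the `--csv` option described below:

```bash
tt-perf-report trace.csv --id-range 120-180 --csv layer.csv
```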
Other useful options:

- `--min-percentage value`: Hide ops below the specified % of total time (default: 0.5)
- `--color` / `--no-color`: Force colored/plain output
- `--csv FILENAME`: Output the table in CSV format for further analysis or inclusion in automated reporting pipelines
- `--no-advice`: Show only the performance table, skipping optimization advice
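These flags compose. For example, a sketch of an invocation for an automated pipeline that hides ops below 1% of total time, forces plain output, skips the advice section, and writes the table to a CSV (the filename is illustrative):

```bash
tt-perf-report trace.csv --min-percentage 1.0 --no-color --no-advice --csv ci_report.csv
```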
The performance report provides several key metrics for analyzing operation performance:
- Device Time: Time spent executing the operation on device (in microseconds)
- Op-to-op Gap: Time between operations, including host overhead and kernel dispatch (in microseconds)
- Total %: Percentage of total execution time spent on this operation
- Cores: Number of cores used by the operation (max 64 on Wormhole)
- DRAM: Memory bandwidth achieved (in GB/s)
- DRAM %: Percentage of theoretical peak DRAM bandwidth (288 GB/s on Wormhole)
- FLOPs: Compute throughput achieved (in TFLOPs)
- FLOPs %: Percentage of theoretical peak compute for the given math fidelity
- Bound: Performance classification of the operation:
  - DRAM: Memory bandwidth bound (>65% of peak DRAM bandwidth)
  - FLOP: Compute bound (>65% of peak FLOPs)
  - BOTH: Both memory and compute bound
  - SLOW: Neither memory nor compute bound
  - HOST: Operation running on host CPU
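  For example, an op sustaining 200 GB/s on Wormhole reaches 200 / 288 ≈ 69% of peak DRAM bandwidth, crossing the 65% threshold and landing in the DRAM-bound class.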
- Math Fidelity: Precision configuration used for matrix operations:
  - HiFi4: Highest precision (74 TFLOPs/core)
  - HiFi2: Medium precision (148 TFLOPs/core)
  - LoFi: Lowest precision (262 TFLOPs/core)
The tool automatically highlights potential optimization opportunities:
- Red op-to-op times indicate high host or kernel launch overhead (>6.5μs)
- Red core counts indicate underutilization (<10 cores)
- Green metrics indicate good utilization of available resources
- Yellow metrics indicate room for optimization
Typical use:

```bash
tt-perf-report trace.csv
```

Build a table of all ops with no advice:

```bash
tt-perf-report trace.csv --no-advice
```

View ops 100-200 with advice:

```bash
tt-perf-report trace.csv --id-range 100-200
```

Export the table of ops and columns as a CSV file:

```bash
tt-perf-report trace.csv --csv my_report.csv
```