v0.6.0
Highlights 🌟
- Sandboxed Execution with Docker 🐳 LLM-generated code can now be executed within a safe Docker sandbox, including parallel evaluation of multiple models across multiple containers.
- Scaling Benchmarks with Kubernetes 📈 Docker evaluations can be scaled across Kubernetes clusters to support benchmarking many models in parallel on distributed hardware.
- New Task Types 📚
- Code Repair 🛠️ Prompts an LLM with compilation errors and asks it to fix them.
- Code Transpilation 🔀 Has an LLM transpile source code from one programming language into another.
- Static Code Repair Benchmark 🚑 LLMs commonly make small mistakes that are easily fixable with static analysis; this benchmark task showcases the potential of that technique.
- Automatically pull Ollama Models 🦙 Ollama models are now automatically pulled when specified for the evaluation.
- Improved Reporting 📑 Results are now written alongside the benchmark, meaning nothing is lost in case of an error. Plus a new tool `eval-dev-quality report` for combining multiple evaluation results into one.
See the full release notes below. 🤗
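The static-repair idea from the highlights can be illustrated with a minimal sketch. Everything here is hypothetical (a Python toy, not the project's actual Go implementation, and `remove_unused_imports` is an invented helper): an LLM-generated Go snippet imports a package it never uses, and a simple static check detects and removes the dead import.

```python
import re

def remove_unused_imports(go_source: str) -> str:
    """Drop single-line Go imports whose package name is never
    referenced in the rest of the source (illustrative only)."""
    lines = go_source.splitlines()
    kept = []
    for i, line in enumerate(lines):
        match = re.match(r'\s*import\s+"([\w/]+)"', line)
        if match:
            package = match.group(1).rsplit("/", 1)[-1]
            rest = "\n".join(lines[:i] + lines[i + 1:])
            # Keep the import only if the package is used somewhere else.
            if not re.search(rf"\b{package}\.", rest):
                continue
        kept.append(line)
    return "\n".join(kept)

# An LLM-generated snippet with one unused import ("os").
broken = 'package main\n\nimport "fmt"\nimport "os"\n\nfunc main() {\n\tfmt.Println("hi")\n}'
fixed = remove_unused_imports(broken)
```

Real tooling such as `symflower fix` handles far more cases, but the principle is the same: many LLM mistakes are mechanical enough that no model round-trip is needed to repair them.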
Merge Requests
- Development & Management 🛠️
- Demo script to run models sequentially in separate evaluations on the "light" repository by @ahumenberger #189
- Documentation 📚
- Document roadmaps and release schedule by @bauersimon #196
- Evaluation ⏱️
- Isolated Execution
- Docker Support
- Build Docker image for every release by @Munsio #199
- Docker evaluation runtime by @Munsio #211, #238, #234, #252
- Parallel execution of containerized evaluations by @Munsio #221
- Run docker image generation on each push by @Munsio #247
- fix, Use `main` revision docker tag by default by @Munsio #249
- fix, Add commit revision to docker and reports by @Munsio #255
- fix, IO error when multiple Containers use the same result path by @Munsio #274
- Test docker in GitHub Actions by @Munsio #260
- fix, Ignore CLI argument model, provider and testdata checks on host when using containerization by @Munsio #290
- fix, Pass environment tokens into container by @Munsio #250
- fix, Use a pinned Java 11 version by @Munsio #279
- Make paths absolute when copying Docker results because Docker gets confused by paths containing colons by @Munsio #308
- Kubernetes Support
- Timeouts for test execution and `symflower` test generation by @ruiAzevedo19 #277, #267, #188
- Clarify prompt that code responses must be in code fences by @ruiAzevedo19 #259
- fix, Use backoff for retrying LLMs because some LLMs need more time to recover by @zimmski #172
- Models 🤖
- Pull Ollama models if they are selected for evaluation by @Munsio #284
- Model Selection
- Exclude certain models (e.g. "openrouter/auto") because they just forward requests to another model automatically by @bauersimon #288
- Exclude the `perplexity` online models because they have a "per request" cost #288 (automatically excluded as online models)
- fix, Retry the openrouter models query because it sometimes just errors by @bauersimon #191
- fix, Default to all repositories if none are explicitly selected by @bauersimon #182
- fix, Do not start Ollama server if no Ollama model is selected by @ruiAzevedo19 #269
- fix, Always use forward slashes in prompts so they are unified across operating systems by @ruiAzevedo19 #268
- Reports & Metrics 🗒️
- Logging
- refactor, Structural logging by @ahumenberger #245
- Store model responses in separate files for easier lookup by @ahumenberger #278
- Store coverage objects by @ruiAzevedo19 #223
- Write out results right away so nothing is lost if the evaluation crashes by @ruiAzevedo19 #243
- refactor, Abstract the storage of assessments by @ahumenberger #178
- fix, Do not overwrite results but create a separate result directory by @bauersimon #179
- New `report` subcommand for postprocessing report data
- `report` subcommand to combine multiple evaluations into one by @ruiAzevedo19 #271
- Let the `report` command also combine markdown reports by @ruiAzevedo19 #258
- Report evaluation configuration (used models + repositories) as a JSON artifact for reproducibility
- Store models for the evaluation in JSON configuration report by @bauersimon #285
- Store repositories for the evaluation in JSON configuration report by @bauersimon #287
- Load models and repositories that were used from JSON configuration by @ruiAzevedo19 #291
- Report maximum of executable files by @ruiAzevedo19 #261
- Experiment with human-readable model names and costs to prepare for data visualization
- Generate the summed model files from the evaluation.csv by @ruiAzevedo19 #241
- Extract human-readable names of models by @ruiAzevedo19 #217
- Extract model costs by @ruiAzevedo19 #216
- Remove summed CSVs and human-readable names, to handle them later during visualization by @ruiAzevedo19 #256
- Operating Systems 🖥️
- More tests for Windows
- Explicitly test Java test path logic on Windows by @bauersimon #184
- Extend temporary repository tests to Windows by @bauersimon
- Tools 🧰
- `symflower fix` auto-repair of common LLM mistakes
- Integrate `symflower fix` into evaluation by @ruiAzevedo19, @bauersimon #229
- Do not run `symflower fix` when there is a timeout of the LLM by @ruiAzevedo19 #236
- Update `symflower` to latest version to benefit from improved Go test package repairs by @bauersimon, @Munsio #294, #303
- Tasks 🔢
- Infrastructure for different Task types
- Introduce the interface for doing "evaluation tasks" so we can easily add them by @ahumenberger #197, #166
- fix, CSV header missing the task identifier by @bauersimon #190
- Compile Go and Java so compilation errors can be used for code repair task by @ruiAzevedo19 #162
- refactor, Share logging setup between multiple tasks by @bauersimon #202
- fix, Missing return statements when checking model capabilities by @bauersimon #239
- Validate task repositories before evaluation by @ruiAzevedo19 #265, #306
- New task types
- Evaluation task for code repair by @ruiAzevedo19 #170, #192
- fix, Ignore git and Maven repositories when validating code-repair repositories by @ahumenberger, @ruiAzevedo19 #281
- fix, Correct test value for "variable unknown" code repair task by @ruiAzevedo19 #212
- Evaluation task for transpilation (Go->Java and Java->Go) by @ruiAzevedo19 #246, #226
- Early merger for transpilation task by @ruiAzevedo19 #264
- fix, Make Java Knapsack easier to solve by reducing Java specifics by @ruiAzevedo19 #262
- Internal management of Testdata repositories as temporary Git repositories
- fix, Create temporary repositories just once by @bauersimon #180
- fix, Fail tests immediately if outdated tools are installed by @bauersimon #171
- fix, Clarify Java build files to use proper version as required by Maven by @ruiAzevedo19 #275
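The code-fence requirement above (#259) implies the evaluation has to pull source code out of a model's markdown response. A minimal Python sketch of such an extraction (the function name and regex are illustrative assumptions, not the project's actual parser):

```python
import re

# Build the triple-backtick marker without embedding literal
# backticks in this example.
FENCE = chr(96) * 3

def extract_fenced_code(response: str):
    """Return the body of the first fenced code block in a model
    response, or None if the model ignored the instruction."""
    pattern = FENCE + r"[a-zA-Z]*\n(.*?)" + FENCE
    match = re.search(pattern, response, re.DOTALL)
    return match.group(1) if match else None

# A typical model reply wrapping its answer in a Go code fence.
reply = "Here is the fix:\n" + FENCE + "go\npackage main\n" + FENCE + "\nDone."
```

Enforcing fences in the prompt makes this kind of extraction reliable; without them, separating code from surrounding explanation is guesswork.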
Closed un-merged (contained in other PRs): #254, #253, #251, #248, #240, #222
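The retry-with-backoff fix above (#172) follows a common pattern: wait longer after each failed request before retrying. A generic Python sketch (`retry_with_backoff` is a hypothetical helper, not the project's Go implementation):

```python
import time

def retry_with_backoff(operation, attempts=3, base_delay=1.0, sleep=time.sleep):
    """Call `operation` until it succeeds, doubling the wait after
    each failure; re-raise once all attempts are used up."""
    for attempt in range(attempts):
        try:
            return operation()
        except Exception:
            if attempt == attempts - 1:
                raise
            sleep(base_delay * 2 ** attempt)

# Example: an operation that fails twice before succeeding.
calls = []
def flaky_query():
    calls.append(1)
    if len(calls) < 3:
        raise RuntimeError("LLM not ready")
    return "ok"

delays = []  # record waits instead of sleeping, for demonstration
result = retry_with_backoff(flaky_query, attempts=3, sleep=delays.append)
```

Passing `sleep` as a parameter keeps the helper testable; in production the default `time.sleep` applies the growing delays (1s, then 2s, ...).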
Issues
#17: Sandbox execution
#43: Infer if a model actually returned source code
#126: Exclude openrouter/auto since it is just a random model
#141: Follow-Up from using Git to reset the temporary directory
#152: The prompt uses different paths depending on the OS
#156: Running Ollama tests with the wrong Ollama binary should fail hard
#157: Logic for "Create temporary repositories for each language so the repository is copied only once per language." copies more than needed
#159: https://github.com/symflower/eval-dev-quality/pull/155/files missing a test
#160: New task to check for Go and Java compilation errors
#163: Automatic selection of repositories is broken
#165: Support multiple evaluation tasks
#167: Fixed timeouts for symflower unit-tests and symflower test
#168: Evaluation task: Code repair
#169: Improve maintainability of assessments
#176: If results folder already exists, add suffix but don't overwrite or error
#181: Log model responses directly to file and reuse them for debugging
#185: Add timeout to symflower test
#186: Openrouter returns 524 when querying models
#187: CSV report header is missing the task identifier
#198: Isolation of evaluations
#200: Follow-up "Code repairing task to enable models to fix code with compilation errors"
#201: Evaluation task: Transpile
#205: Tool/command to combine multiple evaluations into one
#206: Extract human-readable names for models
#207: Add the current commit revision to the binary, Docker image and reports
#210: Extract model costs into log and CSVs
#213: Apply symflower fix to a "write-test" result of a model
#215: Report the maximum theoretically reachable #files-executed
#219: unable to create temporary repository path: exec: WaitDelay expired before I/O complete
#224: Follow up - Isolated evaluations
#225: Do not start the ollama server if not needed
#230: Make the Knapsack.java case easier to solve for models
#232: Follow-up: Apply "symflower fix" to a "write-test" result of a model when it errors, so model responses can possibly be fixed
#237: Dump the assessments in the CSV files once they happen and not in the end of all executions
#242: Docker runtime is using the wrong container image
#257: Change all prompts to enforce code fences
#263: Check if all testdata repositories are well-formed just once, and not in every task run
#270: Malformed Maven version
#273: Docker containers may use the same result-path
#276: Flaky test when testing symflower unit-tests timeout
#282: Use a JSON configuration file to set up an evaluation run
#283: Pull ollama models
#302: Docker runtime broken on main