v0.6.0
Highlights 🌟
- Sandboxed Execution with Docker 🐳 LLM-generated code can now be executed within a safe Docker sandbox, including parallel evaluation of multiple models across multiple containers.
- Scaling Benchmarks with Kubernetes 📈 Docker evaluations can be scaled across Kubernetes clusters to support benchmarking many models in parallel on distributed hardware.
- New Task Types 📚
- Code Repair 🛠️ Prompts an LLM with compilation errors and asks it to fix them.
- Code Transpilation 🔀 Has an LLM transpile source code from one programming language into another.
- Static Code Repair Benchmark 🚑 LLMs commonly make small mistakes that are easily fixable with static analysis; this benchmark task showcases the potential of that technique.
- Automatically pull Ollama Models 🦙 Ollama models are now automatically pulled when specified for the evaluation.
- Improved Reporting 📑 Results are now written alongside the benchmark, meaning nothing is lost in case of an error. Plus a new tool `eval-dev-quality report` for combining multiple evaluation results into one.
See the full release notes below. 🤗
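The static-repair idea from the highlights can be illustrated with a minimal sketch. Everything here is hypothetical (a Python toy, not the project's actual Go implementation, and `remove_unused_imports` is an invented helper): an LLM-generated Go snippet imports a package it never uses, and a simple static check detects and removes the dead import.

```python
import re

def remove_unused_imports(go_source: str) -> str:
    """Drop single-line Go imports whose package name is never
    referenced in the rest of the source (illustrative only)."""
    lines = go_source.splitlines()
    kept = []
    for i, line in enumerate(lines):
        match = re.match(r'\s*import\s+"([\w/]+)"', line)
        if match:
            package = match.group(1).rsplit("/", 1)[-1]
            rest = "\n".join(lines[:i] + lines[i + 1:])
            # Keep the import only if the package is used somewhere else.
            if not re.search(rf"\b{package}\.", rest):
                continue
        kept.append(line)
    return "\n".join(kept)

# An LLM-generated snippet with one unused import ("os").
broken = 'package main\n\nimport "fmt"\nimport "os"\n\nfunc main() {\n\tfmt.Println("hi")\n}'
fixed = remove_unused_imports(broken)
```

Real tooling such as `symflower fix` handles far more cases, but the principle is the same: many LLM mistakes are mechanical enough that no model round-trip is needed to repair them.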
Merge Requests
- Development & Management 🛠️
- Demo script to run models sequentially in separate evaluations on the "light" repository by @ahumenberger #189
- Documentation 📚
- Document roadmaps and release schedule by @bauersimon #196
- Evaluation ⏱️
- Isolated Execution
- Docker Support
- Build Docker image for every release by @Munsio #199
- Docker evaluation runtime by @Munsio #211, #238, #234, #252
- Parallel execution of containerized evaluations by @Munsio #221
- Run docker image generation on each push by @Munsio #247
- fix, Use `main` revision docker tag by default by @Munsio #249
- fix, Add commit revision to docker and reports by @Munsio #255
- fix, IO error when multiple Containers use the same result path by @Munsio #274
- Test docker in GitHub Actions by @Munsio #260
- fix, Ignore CLI argument model, provider and testdata checks on host when using containerization by @Munsio #290
- fix, Pass environment tokens into container by @Munsio #250
- fix, Use a pinned Java 11 version by @Munsio #279
- Make paths absolute when copying Docker results because Docker gets confused by paths containing colons by @Munsio #308
- Kubernetes Support
- Timeouts for test execution and `symflower` test generation by @ruiAzevedo19 #277, #267, #188
- Clarify prompt that code responses must be in code fences by @ruiAzevedo19 #259
- fix, Use backoff for retrying LLMs because some LLMs need more time to recover by @zimmski #172
- Models 🤖
- Pull Ollama models if they are selected for evaluation by @Munsio #284
- Model Selection
- Exclude certain models (e.g. "openrouter/auto") because they just forward requests to another model automatically by @bauersimon #288
- Exclude the `perplexity` online models because they have a "per request" cost #288 (automatically excluded as online models)
- fix, Retry the openrouter models query because it sometimes just errors by @bauersimon #191
- fix, Default to all repositories if none are explicitly selected by @bauersimon #182
- fix, Do not start Ollama server if no Ollama model is selected by @ruiAzevedo19 #269
- fix, Always use forward slashes in prompts so they are unified across operating systems by @ruiAzevedo19 #268
- Reports & Metrics 🗒️
- Logging
- refactor, Structural logging by @ahumenberger #245
- Store model responses in separate files for easier lookup by @ahumenberger #278
- Store coverage objects by @ruiAzevedo19 #223
- Write out results right away so nothing is lost if the evaluation crashes by @ruiAzevedo19 #243
- refactor, Abstract the storage of assessments by @ahumenberger #178
- fix, Do not overwrite results but create a separate result directory by @bauersimon #179
- New `report` subcommand for postprocessing report data
- `report` subcommand to combine multiple evaluations into one by @ruiAzevedo19 #271
- Let the `report` command also combine markdown reports by @ruiAzevedo19 #258
- Report evaluation configuration (used models + repositories) as a JSON artifact for reproducibility
- Store models for the evaluation in JSON configuration report by @bauersimon #285
- Store repositories for the evaluation in JSON configuration report by @bauersimon #287
- Load models and repositories that were used from JSON configuration by @ruiAzevedo19 #291
- Report maximum of executable files by @ruiAzevedo19 #261
- Experiment with human-readable model names and costs to prepare for data visualization
- Generate the summed model files from the evaluation.csv by @ruiAzevedo19 #241
- Extract human-readable names of models by @ruiAzevedo19 #217
- Extract model costs by @ruiAzevedo19 #216
- Remove summed CSVs and human-readable names, to handle them later during visualization by @ruiAzevedo19 #256
- Operating Systems 🖥️
- More tests for Windows
- Explicitly test Java test path logic on Windows by @bauersimon #184
- Extend temporary repository tests to Windows by @bauersimon
- Tools 🧰
- `symflower fix` auto-repair of common LLM mistakes
- Integrate `symflower fix` into evaluation by @ruiAzevedo19, @bauersimon #229
- Do not run `symflower fix` when there is a timeout of the LLM by @ruiAzevedo19 #236
- Update `symflower` to latest version to benefit from improved Go test package repairs by @bauersimon, @Munsio #294, #303
- Tasks 🔢
- Infrastructure for different Task types
- Introduce the interface for doing "evaluation tasks" so we can easily add them by @ahumenberger #197, #166
- fix, CSV header missing the task identifier by @bauersimon #190
- Compile Go and Java so compilation errors can be used for code repair task by @ruiAzevedo19 #162
- refactor, Share logging setup between multiple tasks by @bauersimon #202
- fix, Missing return statements when checking model capabilities by @bauersimon #239
- Validate task repositories before evaluation by @ruiAzevedo19 #265, #306
- New task types
- Evaluation task for code repair by @ruiAzevedo19 #170, #192
- fix, Ignore git and Maven repositories when validating code-repair repositories by @ahumenberger, @ruiAzevedo19 #281
- fix, Correct test value for "variable unknown" code repair task by @ruiAzevedo19 #212
- Evaluation task for transpilation (Go->Java and Java->Go) by @ruiAzevedo19 #246, #226
- Early merger for transpilation task by @ruiAzevedo19 #264
- fix, Make Java Knapsack easier to solve by reducing Java specifics by @ruiAzevedo19 #262
- Internal management of Testdata repositories as temporary Git repositories
- fix, Create temporary repositories just once by @bauersimon #180
- fix, Fail tests immediately if outdated tools are installed by @bauersimon #171
- fix, Clarify Java build files to use proper version as required by Maven by @ruiAzevedo19 #275
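The code-fence requirement above (#259) implies the evaluation has to pull source code out of a model's markdown response. A minimal Python sketch of such an extraction (the function name and regex are illustrative assumptions, not the project's actual parser):

```python
import re

# Build the triple-backtick marker without embedding literal
# backticks in this example.
FENCE = chr(96) * 3

def extract_fenced_code(response: str):
    """Return the body of the first fenced code block in a model
    response, or None if the model ignored the instruction."""
    pattern = FENCE + r"[a-zA-Z]*\n(.*?)" + FENCE
    match = re.search(pattern, response, re.DOTALL)
    return match.group(1) if match else None

# A typical model reply wrapping its answer in a Go code fence.
reply = "Here is the fix:\n" + FENCE + "go\npackage main\n" + FENCE + "\nDone."
```

Enforcing fences in the prompt makes this kind of extraction reliable; without them, separating code from surrounding explanation is guesswork.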
Closed un-merged (contained in other PRs): #254, #253, #251, #248, #240, #222
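The retry-with-backoff fix above (#172) follows a common pattern: wait longer after each failed request before retrying. A generic Python sketch (`retry_with_backoff` is a hypothetical helper, not the project's Go implementation):

```python
import time

def retry_with_backoff(operation, attempts=3, base_delay=1.0, sleep=time.sleep):
    """Call `operation` until it succeeds, doubling the wait after
    each failure; re-raise once all attempts are used up."""
    for attempt in range(attempts):
        try:
            return operation()
        except Exception:
            if attempt == attempts - 1:
                raise
            sleep(base_delay * 2 ** attempt)

# Example: an operation that fails twice before succeeding.
calls = []
def flaky_query():
    calls.append(1)
    if len(calls) < 3:
        raise RuntimeError("LLM not ready")
    return "ok"

delays = []  # record waits instead of sleeping, for demonstration
result = retry_with_backoff(flaky_query, attempts=3, sleep=delays.append)
```

Passing `sleep` as a parameter keeps the helper testable; in production the default `time.sleep` applies the growing delays (1s, then 2s, ...).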
Issues
#17: Sandbox execution
#43: Infer if a model actually returned source code
#126: Exclude openrouter/auto since it is just a random model
#141: Follow-Up from using Git to reset the temporary directory
#152: The prompt uses different paths depending on the OS
#156: Running Ollama tests with the wrong Ollama binary should fail hard
#157: Logic for "Create temporary repositories for each language so the repository is copied only once per language." copies more than needed
#159: https://github.com/symflower/eval-dev-quality/pull/155/files missing a test
#160: New task to check for Go and Java compilation errors
#163: Automatic selection of repositories is broken
#165: Support multiple evaluation tasks
#167: Fixed timeouts for symflower unit-tests and symflower test
#168: Evaluation task: Code repair
#169: Improve maintainability of assessments
#176: If results folder already exists, add suffix but don't overwrite or error
#181: Log model responses directly to file and reuse them for debugging
#185: Add timeout to symflower test
#186: Openrouter returns 524 when querying models
#187: CSV report header is missing the task identifier
#198: Isolation of evaluations
#200: Follow-up "Code repairing task to enable models to fix code with compilation errors"
#201: Evaluation task: Transpile
#205: Tool/command to combine multiple evaluations into one
#206: Extract human-readable names for models
#207: Add the current commit revision to the binary, Docker image and reports
#210: Extract model costs into log and CSVs
#213: Apply symflower fix to a "write-test" result of a model
#215: Report the maximum theoretically reachable #files-executed
#219: unable to create temporary repository path: exec: WaitDelay expired before I/O complete
#224: Follow up - Isolated evaluations
#225: Do not start the ollama server if not needed
#230: Make the Knapsack.java case easier to solve for models
#232: Follow-up: Apply "symflower fix" to a "write-test" result of a model when it errors, so model responses can possibly be fixed
#237: Dump the assessments in the CSV files once they happen and not in the end of all executions
#242: Docker runtime is using the wrong container image
#257: Change all prompts to enforce code fences
#263: Check if all testdata repositories are well-formed just once, and not in every task run
#270: Malformed Maven version
#273: Docker containers may use the same result-path
#276: Flaky test when testing symflower unit-tests timeout
#282: Use a JSON configuration file to set up an evaluation run
#283: Pull ollama models
#302: Docker runtime broken on main