v0.6.0

Released by @bauersimon on 02 Aug 11:30 · 193 commits to main since this release · d3ba2cb

Highlights 🌟

  • Sandboxed Execution with Docker 🐳 LLM-generated code can now be executed within a safe Docker sandbox, including parallel evaluation of multiple models across multiple containers (see the usage sketch after these highlights).
  • Scaling Benchmarks with Kubernetes 📈 Docker evaluations can be scaled across Kubernetes clusters to support benchmarking many models in parallel on distributed hardware.
  • New Task Types 📚
    • Code Repair 🛠️ Prompts an LLM with compilation errors and asks it to fix them.
    • Code Transpilation 🔀 Has an LLM transpile source code from one programming language into another.
  • Static Code Repair Benchmark 🚑 LLMs commonly make small mistakes that are easily fixable using static analysis - this benchmark task showcases the potential of this technique.
  • Automatically pull Ollama Models 🦙 Ollama models are now automatically pulled when specified for the evaluation.
  • Improved Reporting 📑 Results are now written out as the benchmark runs, so nothing is lost in case of an error. Plus a new tool eval-dev-quality report for combining multiple evaluation results into one (see the sketch after these highlights).

See the full release notes below. 🤗
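
For orientation, here is a minimal, hypothetical usage sketch of the new Docker/Kubernetes runtimes and the report tool. The flag names below (--runtime, --parallel, --result-path, --evaluation-path) are assumptions for illustration only; check eval-dev-quality evaluate --help and eval-dev-quality report --help for the actual options.

```bash
# Minimal sketch; flag names are assumptions, check "--help" for the real ones.

# Run an evaluation inside the new Docker sandbox, with several containers in parallel.
eval-dev-quality evaluate \
  --runtime docker \
  --parallel 4 \
  --model openrouter/meta-llama/llama-3-70b-instruct \
  --result-path ./results/docker-run

# Scale the same evaluation across a Kubernetes cluster instead.
eval-dev-quality evaluate \
  --runtime kubernetes \
  --model openrouter/meta-llama/llama-3-70b-instruct \
  --result-path ./results/kubernetes-run

# Combine multiple evaluation results into a single report.
eval-dev-quality report \
  --evaluation-path ./results/docker-run \
  --evaluation-path ./results/kubernetes-run
```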

Merge Requests

  • Development & Management 🛠️
    • Demo script to run models sequentially in separate evaluations on the "light" repository by @ahumenberger #189
  • Documentation 📚
  • Evaluation ⏱️
    • Isolated Execution
      • Docker Support
        • Build Docker image for every release by @Munsio #199
        • Docker evaluation runtime by @Munsio #211, #238, #234, #252
        • Parallel execution of containerized evaluations by @Munsio #221
        • Run Docker image generation on each push by @Munsio #247
        • fix, Use the main revision Docker tag by default by @Munsio #249
        • fix, Add the commit revision to Docker images and reports by @Munsio #255
        • fix, IO error when multiple Containers use the same result path by @Munsio #274
        • Test docker in GitHub Actions by @Munsio #260
        • fix, Ignore CLI argument model, provider and testdata checks on host when using containerization by @Munsio #290
        • fix, Pass environment tokens into container by @Munsio #250
        • fix, Use a pinned Java 11 version by @Munsio #279
        • Make paths absolute when copying Docker results because Docker gets confused by paths containing colons by @Munsio #308
      • Kubernetes Support
        • Kubernetes evaluation runtime by @Munsio #231
        • Copy back results from the cluster to the initial host by @Munsio #272
        • fix, Only use valid characters in Kubernetes job names by @Munsio #292
    • Timeouts for test execution and symflower test generation by @ruiAzevedo19 #277, #267, #188
    • Clarify prompt that code responses must be in code fences by @ruiAzevedo19 #259
    • fix, Use backoff when retrying LLMs because some LLMs need more time to recover by @zimmski #172
  • Models 🤖
    • Pull Ollama models if they are selected for evaluation by @Munsio #284
    • Model Selection
      • Exclude certain models (e.g. "openrouter/auto") because they just forward to another model automatically by @bauersimon #288
      • Exclude the Perplexity online models because they have a "per request" cost #288 (automatically excluded as online models)
    • fix, Retry the openrouter models query because it sometimes errors by @bauersimon #191
    • fix, Default to all repositories if none are explicitly selected by @bauersimon #182
    • fix, Do not start Ollama server if no Ollama model is selected by @ruiAzevedo19 #269
    • fix, Always use forward slashes in prompts so paths are unified across operating systems by @ruiAzevedo19 #268
  • Reports & Metrics 🗒️
    • Logging
    • Write out results right away so we don't lose anything if the evaluation crashes by @ruiAzevedo19 #243
    • refactor, Abstract the storage of assessments by @ahumenberger #178
    • fix, Do not overwrite results but create a separate result directory by @bauersimon #179
    • New report subcommand for postprocessing report data
    • Report evaluation configuration (used models + repositories) as a JSON artifact for reproducibility
      • Store models for the evaluation in JSON configuration report by @bauersimon #285
      • Store repositories for the evaluation in JSON configuration report by @bauersimon #287
      • Load models and repositories that were used from JSON configuration by @ruiAzevedo19 #291
    • Report maximum of executable files by @ruiAzevedo19 #261
    • Experiment with human-readable model names and costs to prepare for data visualization
  • Operating Systems 🖥️
    • More tests for Windows
  • Tools 🧰
  • Tasks 🔢

Closed without merging (contained in other PRs): #254, #253, #251, #248, #240, #222

Issues

#17: Sandbox execution
#43: Infer if a model actually returned source code
#126: Exclude openrouter/auto since it is just a random model
#141: Follow-Up from using Git to reset the temporary directory
#152: The prompt uses different paths depending on the OS
#156: Running Ollama tests with the wrong Ollama binary should fail hard
#157: Logic for "Create temporary repositories for each language so the repository is copied only once per language." copies more than needed
#159: https://github.com/symflower/eval-dev-quality/pull/155/files missing a test
#160: New task to check for Go and Java compilation errors
#163: Automatic selection of repositories is broken
#165: Support multiple evaluation tasks
#167: Fixed timeouts for symflower unit-tests and symflower test
#168: Evaluation task: Code repair
#169: Improve maintainability of assessments
#176: If results folder already exists, add suffix but don't overwrite or error
#181: Log model responses directly to file and reuse them for debugging
#185: Add timeout to symflower test
#186: Openrouter returns 524 when querying models
#187: CSV report header is missing the task identifier
#198: Isolation of evaluations
#200: Follow-up "Code repairing task to enable models to fix code with compilation errors"
#201: Evaluation task: Transpile
#205: Tool/command to combine multiple evaluations into one
#206: Extract human-readable names for models
#207: Add the current commit revision to the binary, Docker image and reports
#210: Extract model costs into log and CSVs
#213: Apply symflower fix to a "write-test" result of a model
#215: Report the maximum theoretically reachable #files-executed
#219: unable to create temporary repository path: exec: WaitDelay expired before I/O complete
#224: Follow up - Isolated evaluations
#225: Do not start the ollama server if not needed
#230: Make the Knapsack.java case easier to solve for models
#232: Follow-up: Apply "symflower fix" to a "write-test" result of a model when it errors, so model responses can possibly be fixed
#237: Dump the assessments in the CSV files once they happen and not in the end of all executions
#242: Docker runtime is using the wrong container image
#257: Change all prompts to enforce code fences
#263: Check if all testdata repositories are well-formed just once, and not in every task run
#270: Malformed Maven version
#273: Docker containers may use the same result-path
#276: Flaky test when testing symflower unit-tests timeout
#282: Use a JSON configuration file to set up an evaluation run
#283: Pull ollama models
#302: Docker runtime broken on main