Fix tests/test_examples_run #436

vazirim · 2025-02-17T18:47:34Z

Describe the bug
The test test_examples_run is currently failing due to non-determinism (litellm is not passing temperature: 0 to Replicate).

We need to figure out whether moving away from Replicate (e.g. to Ollama) can help, and how to run the nightly test as a GitHub Action (perhaps with watsonx instead).

To Reproduce
Run that test.



jgchn · 2025-02-18T21:29:39Z

I ran most of the valid examples in test_examples_run and compared the results between Replicate (granite-3.1-8b-instruct) and Ollama (granite-code:8b). Ollama sometimes returns shorter answers than Replicate, but they seem mostly correct.

For example:

examples/fibonacci/fib.pdl

Replicate

(skipping over some details because the output is very long)

Now computing fibonacci(17)

def fibonacci(n: int) -> int:
    if n <= 0:
        return "Input should be a positive integer."
    elif n == 1:
        return 0
    elif n == 2:
        return 1
    else:
        return fibonacci(n - 1) + fibonacci(n - 2)
The result is: 987

Ollama

(skipping over some details because the output is very long)

Find a random number between 1 and 20
11
Now computing fibonacci(11)

def fibonacci(n):
    if n <= 1:
        return n
    else:
        return fibonacci(n-1) + fibonacci(n-2)
The result is: 89

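Note: the granite model used on Replicate actually produced incorrect code. Under the usual convention F(1) = F(2) = 1, fib(17) = 1597, not 987; the generated code starts the sequence at fibonacci(1) = 0, so its result is off by one index (it also returns a string for n <= 0 despite the int return annotation). The model on Ollama produced correct results. I wonder if the difference is because granite-code is optimized for coding tasks.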
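
A quick way to check the two results is to run both generated functions side by side. In the sketch below the functions are copied from the outputs above and renamed `fib_replicate` and `fib_ollama` (names chosen here for illustration only):

```python
def fib_replicate(n: int) -> int:
    # Copied from the Replicate output: the sequence starts at fibonacci(1) = 0.
    if n <= 0:
        return "Input should be a positive integer."
    elif n == 1:
        return 0
    elif n == 2:
        return 1
    else:
        return fib_replicate(n - 1) + fib_replicate(n - 2)


def fib_ollama(n):
    # Copied from the Ollama output: the usual definition with F(0) = 0, F(1) = 1.
    if n <= 1:
        return n
    else:
        return fib_ollama(n - 1) + fib_ollama(n - 2)


if __name__ == "__main__":
    print(fib_replicate(17))  # 987
    print(fib_ollama(17))     # 1597
    print(fib_ollama(11))     # 89
```

The Replicate version is shifted by one index relative to the usual definition, which is why it reports 987 for n = 17.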

examples/hello/hello-model-chaining.pdl

Replicate:

Hello
Hello
Did you say Hello?
Yes, I did

Ollama:

Hello
Hello! How can I help you today?

Did you say Hello! How can I help you today?
?

Note: Pretty similar in terms of output.

examples/talk/4-function.pdl

Replicate:

The sentence 'I love Paris!' translates to 'Je t'aime Paris!' in French.
The sentence 'I love Madrid!' translates to 'Amo Madrid!' in Spanish.

Ollama:

Je suis adepte de Paris!'

'Te quiero mucho Madrid!'

Note: Again, the outputs are similar.

Next, I looked at some of the files that were labeled as Non-Deterministic.

examples/react/demo.pdl:

Replicate: the output is long so I will not paste it here, but the model correctly used the ReAct pattern and arrived at the right answer.
Ollama: with temperature: 0 set in the program, this is what we get. The result is incorrect and the model does not make use of the tools provided.

How many years ago was the discoverer of the Hudson River born? Keep in mind we are in 2025.
The discoverer of the Hudson River was not born in 2025. The Hudson River was discovered in 1697 by a group of Dutch settlers led by John Smith. The river is located in present-day New York City and has been an important transportation route for centuries.
Action:
<tool_call>[{"name": "Finish", "arguments": {"topic": "1697"}}]

examples/code/code.pdl

Replicate

This Java method, `deserializeOffsetMap`, is part of the `OffsetUtil` class in the `streamsets/datacollector` repository. It's designed to convert a JSON string, `lastSourceOffset`, into a `Map<String, String>`.

Here's a breakdown of the code:

1. The method is annotated with `@SuppressWarnings("unchecked")`, which suppresses a compiler warning about unchecked or unsafe operations. This is because the `readValue` method of `JSON_MAPPER` can return a `Map` of any type, but we're casting it to `Map<String, String>`.

2. The method takes a single parameter, `lastSourceOffset`, which is expected to be a JSON string representing a map.

3. It initializes a `Map<String, String>` variable named `offsetMap`.

4. If `lastSourceOffset` is `null` or empty, it creates a new `HashMap` and assigns it to `offsetMap`.

5. If `lastSourceOffset` is not `null` or empty, it uses `JSON_MAPPER.readValue(lastSourceOffset, Map.class)` to deserialize the JSON string into a `Map`. The `Map.class` argument tells the `readValue` method to deserialize the JSON into a `Map` of any key and value types.

6. Finally, it returns the `offsetMap`.

In summary, this method converts a JSON string representing a map into a `Map<String, String>`. If the input string is `null` or empty, it returns an empty map.

Ollama

The code is a Java method that takes a string `lastSourceOffset` as input and returns a `Map<String, String>`. The method uses the Jackson library to deserialize the JSON-formatted string into a map. If the input string is empty or null, an empty HashMap is returned. Otherwise, the string is deserialized into a Map using the `JSON_MAPPER.readValue()` method.

Note: Replicate produced an elaborate explanation while Ollama was very concise.

examples/code/code-eval.pdl / examples/tutorial/data_block.pdl

Note: Replicate and Ollama yielded different similarity scores/metrics.

I was able to test this far before reaching the Replicate API limit. Do we care about correctness for test_examples_run or are we only concerned with ensuring that valid programs can be run? @vazirim @mandel
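
The two options in the question above can be framed as one check with an optional expected output. The sketch below is hypothetical: `check_example`, `run`, and `expected` are illustrative names, not part of the existing test suite, and the exact-match comparison is only one possible notion of correctness.

```python
from typing import Callable, Optional


def check_example(run: Callable[[], str], expected: Optional[str] = None) -> bool:
    """Return True if an example passes under the chosen policy.

    run:      hypothetical callable that executes one example and returns its output.
    expected: if None, only require that the example runs without raising;
              otherwise require an exact match (ignoring surrounding whitespace).
    """
    try:
        actual = run()
    except Exception:
        return False  # the program could not be run at all
    if expected is None:
        return True  # run-only policy: finishing without an error is enough
    return actual.strip() == expected.strip()  # exact-match policy
```

Under the run-only policy, the differing Replicate and Ollama outputs above would all pass; under the exact-match policy, any change of model or sampling would require regenerating the expected results.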

jgchn · 2025-02-18T21:30:30Z

Also linking this blog post that @starpit shared on integrating Ollama with GitHub Actions.

vazirim added the bug label Feb 17, 2025

jgchn self-assigned this Feb 18, 2025

jgchn mentioned this issue Feb 24, 2025

Migrate some examples from Replicate to Ollama #522

Open
