Agent that thinks more thoroughly about question and considers possible outcomes #47

Merged
merged 21 commits into from
Apr 11, 2024

Conversation

gabrielfior
Contributor

@gabrielfior gabrielfior commented Apr 3, 2024

Closes #40

Summary by CodeRabbit

  • New Features

    • Introduced benchmarking functionality for CrewAI agents in prediction markets.
    • Added a DeployableThinkThoroughlyAgent for selecting and betting on prediction markets.
    • New functionalities for creating outcomes, determining probabilities, and making decisions in prediction markets.
  • Refactor

    • Improved organization by moving the market_is_saturated function to a utility module for better reusability.

@gabrielfior gabrielfior linked an issue Apr 3, 2024 that may be closed by this pull request
Contributor

coderabbitai bot commented Apr 3, 2024

Walkthrough

The update enhances the prediction market agents by introducing benchmarking, subquestion handling, and deployment strategies. New tools and utilities improve decision-making by incorporating detailed subquestion analysis and outcome probability management.

Changes

File Path Change Summary
.../crewai_subsequential_agent/benchmark.py Introduces benchmarking for CrewAI agents with binary market questions.
.../crewai_subsequential_agent/crewai_agent_subquestions.py Adds handling of subquestions in prediction markets.
.../crewai_subsequential_agent/deploy.py New deployment functionalities for selecting and betting on markets.
.../known_outcome_agent/benchmark.py Adds a directive to ignore type checking.
.../known_outcome_agent/deploy.py Refactors market saturation check to an external module.
.../agents/utils.py Introduces a new saturation check function and API key management.
.../crewai_subsequential_agent/prompts.py Adds functionality for creating and evaluating outcomes based on probabilities.
.../tools/crewai_tools.py Introduces a new tool for internet searches using the Tavily search API.

Assessment against linked issues

Objective Addressed Explanation
Make one of the agents think about the question more thoroughly (#40)

Recent Review Details

Configuration used: CodeRabbit UI

Commits Files that changed from the base of the PR and between 64d8ac6 and 7d2fcc3.
Files selected for processing (1)
  • prediction_market_agent/agents/crewai_subsequential_agent/deploy.py (1 hunks)
Files skipped from review as they are similar to previous changes (1)
  • prediction_market_agent/agents/crewai_subsequential_agent/deploy.py

@gabrielfior
Contributor Author

Note that the notebooks were added to help the discussion and will not be merged into main.

Contributor

@evangriffiths evangriffiths left a comment

Nice work. This looks like fun to build!

I think this approach is good for improving one of the areas that Martin has mentioned - that the prediction is consistent with predictions for the rest of the probability space.
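To illustrate the consistency point: one simple way to keep per-outcome predictions coherent is to rescale them so that mutually exclusive, exhaustive outcomes sum to 1. This is only a minimal sketch under that assumption; the function name, the outcome labels, and the numbers are hypothetical and are not taken from this PR.

```python
from typing import Dict, Mapping


def normalize_outcome_probabilities(raw: Mapping[str, float]) -> Dict[str, float]:
    """Rescale raw per-outcome probabilities so that they sum to 1."""
    if not raw:
        raise ValueError("at least one outcome is required")
    if any(p < 0 for p in raw.values()):
        raise ValueError("probabilities must be non-negative")
    total = sum(raw.values())
    if total == 0:
        # No signal for any outcome: fall back to a uniform distribution.
        return {outcome: 1 / len(raw) for outcome in raw}
    return {outcome: p / total for outcome, p in raw.items()}


# Hypothetical raw predictions that overshoot (they sum to 1.3).
raw = {"above $100": 0.2, "between $50 and $100": 0.5, "below $50": 0.6}
normalized = normalize_outcome_probabilities(raw)
```

Plain rescaling is the simplest choice; it preserves the ratios between outcomes but does not say which raw estimate was wrong.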

But I don't think it helps with improving 'depth' of reasoning. I would suspect the prediction for each sub-outcome would perform similarly shallow reasoning (as do our existing agents like the evo agent). Martin was also interested in the idea of getting the agent to reason deeper about the question by generating sub-questions whose answers the main question depends on. For example, for "Will Carlos Alcaraz win the Miami Open by 5 April 2024?", the agent would ask about the probabilities of winning in the quarter-final and semi-final, the win rate of Alcaraz vs. the other finalist, etc., and then combine these probabilities in a kind of Bayesian way.
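As a rough illustration of combining sub-question answers, the sketch below uses the chain rule for the rounds and the law of total probability over possible finalists. It assumes each sub-question yields a conditional probability; the function names and all numbers are made up for the example and do not come from the agent in this PR.

```python
from math import prod
from typing import List, Sequence, Tuple


def chain_probability(conditional_steps: Sequence[float]) -> float:
    """P(all steps happen), where each entry is P(step | previous steps happened)."""
    for p in conditional_steps:
        if not 0.0 <= p <= 1.0:
            raise ValueError(f"not a probability: {p}")
    return prod(conditional_steps)


def total_probability(branches: List[Tuple[float, float]]) -> float:
    """Sum of P(branch) * P(event | branch) over mutually exclusive branches."""
    if abs(sum(weight for weight, _ in branches) - 1.0) > 1e-9:
        raise ValueError("branch probabilities must sum to 1")
    return sum(weight * p for weight, p in branches)


# Hypothetical numbers: two possible opponents in the final.
p_win_final = total_probability([(0.6, 0.55), (0.4, 0.70)])
# Quarter-final, semi-final given the quarter-final was won, then the final.
p_win_title = chain_probability([0.8, 0.7, p_win_final])
```

With these made-up inputs the final is won with probability 0.61 and the title with probability 0.3416, which makes it easy to see which sub-question moves the answer most.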

Luckily, I think these two improvements can be made together, so I'm not saying it should be one or the other. But I think it's worth thinking about this other type of enhancement at this point.

Also, curious to know, what is the token cost per prediction that you're seeing?

@gabrielfior
Contributor Author

I also added a benchmark script for the agent; an excerpt can be found below.


Comparison Report

Market Results

| Number of markets | Proportion resolved | Proportion YES | Proportion NO |
| --- | --- | --- | --- |
| 1 | 0 | 0 | 1 |

Agent Results

Summary Statistics

| Agents | MSE for p_yes | Mean confidence | % within ±0.05 | % within ±0.1 | % within ±0.2 | % correct outcome | % precision for yes | % precision for no | % recall for yes | % recall for no | confidence/p_yes error correlation | Mean info_utility | Proportion answerable | Proportion answered | Mean cost ($) | Mean time (s) |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| subsequential_questions | 0.25 | 1 | 0 | 0 | 0 | 100 | 0 | 100 | 0 | 100 | nan | | 1 | 1 | 0.000727 | 3.28748 |
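
For reading the summary columns, here is a minimal sketch of how three of them could be computed from predicted and reference p_yes values. The definitions are assumptions about the benchmark rather than its actual source: MSE is taken over p_yes, "% within" counts predictions inside a tolerance of the reference, and an outcome counts as YES when p_yes is above 0.5. The function names are hypothetical; the single data point is the market from this excerpt.

```python
from typing import Sequence


def mse_p_yes(predicted: Sequence[float], reference: Sequence[float]) -> float:
    """Mean squared error between predicted and reference p_yes."""
    return sum((p - r) ** 2 for p, r in zip(predicted, reference)) / len(predicted)


def percent_within(
    predicted: Sequence[float], reference: Sequence[float], tolerance: float
) -> float:
    """Share (in %) of predictions within +-tolerance of the reference."""
    hits = sum(abs(p - r) <= tolerance for p, r in zip(predicted, reference))
    return 100 * hits / len(predicted)


def percent_correct_outcome(
    predicted: Sequence[float], reference: Sequence[float], threshold: float = 0.5
) -> float:
    """Share (in %) of markets where both sides land on the same YES/NO outcome."""
    hits = sum((p > threshold) == (r > threshold) for p, r in zip(predicted, reference))
    return 100 * hits / len(predicted)


predicted = [0.00]  # agent p_yes for the single market
reference = [0.50]  # reference p_yes for the same market

mse = mse_p_yes(predicted, reference)
within_02 = percent_within(predicted, reference, 0.2)
correct = percent_correct_outcome(predicted, reference)
```

Under these assumed definitions the single market gives an MSE of 0.25, 0% within ±0.2, and 100% correct outcome, which agrees with the row above.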

Markets

| Market Question | subsequential_questions p_yes | reference p_yes |
| --- | --- | --- |
| Will the stock price of Donald Trump's media company exceed $100 on 1 April 2024? | 0.00 [no] | 0.50 [no] |

Expected value

| Agent | Mean expected returns | Median expected returns | Total expected returns |
| --- | --- | --- | --- |
| subsequential_questions | 0 | 0 | 0 |

| Market Question | subsequential_questions |
| --- | --- |
| Will the stock price of Donald Trump's media company exceed $100 on 1 April 2024? | 0 |

@gabrielfior gabrielfior marked this pull request as ready for review April 4, 2024 21:52
Contributor

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 4

Review Status

Configuration used: CodeRabbit UI

Commits Files that changed from the base of the PR and between 3ccf592 and a89ec31.
Files ignored due to path filters (2)
  • poetry.lock is excluded by !**/*.lock, !**/*.lock
  • pyproject.toml is excluded by !**/*.toml
Files selected for processing (8)
  • crewai_multiple_agent.ipynb (1 hunks)
  • prediction_market_agent/agents/crewai_subsequential_agent/benchmark.py (1 hunks)
  • prediction_market_agent/agents/crewai_subsequential_agent/crewai_agent_subquestions.py (1 hunks)
  • prediction_market_agent/agents/crewai_subsequential_agent/deploy.py (1 hunks)
  • prediction_market_agent/agents/crewai_subsequential_agent/prompts.py (1 hunks)
  • prediction_market_agent/agents/known_outcome_agent/benchmark.py (1 hunks)
  • prediction_market_agent/agents/known_outcome_agent/deploy.py (2 hunks)
  • prediction_market_agent/agents/utils.py (1 hunks)
Files not summarized due to errors (1)
  • crewai_multiple_agent.ipynb: Error: Message exceeds token limit
Files skipped from review due to trivial changes (1)
  • prediction_market_agent/agents/known_outcome_agent/benchmark.py
Additional comments not posted (30)
prediction_market_agent/agents/crewai_subsequential_agent/deploy.py (3)

15-18: Ensure the model version (gpt-3.5-turbo) is up-to-date and aligns with the project's requirements for AI models.


44-48: The method calculate_bet_amount only supports xDai markets. Ensure that this limitation is documented and consider implementing support for additional currencies if required by the project.


51-56: The main block uses hard-coded values for deployment parameters. Consider externalizing these values to configuration files or environment variables for better maintainability.
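
A minimal sketch of what externalizing such values could look like, reading them from environment variables with defaults. The variable names, the field names, and the defaults other than the gpt-3.5-turbo model mentioned above are hypothetical placeholders, not the parameters used in deploy.py.

```python
import os
from dataclasses import dataclass
from typing import Mapping, Optional


@dataclass(frozen=True)
class DeployConfig:
    model: str
    market_type: str
    sleep_seconds: int


def load_deploy_config(env: Optional[Mapping[str, str]] = None) -> DeployConfig:
    """Build the deployment config from environment variables (hypothetical names)."""
    env = os.environ if env is None else env
    return DeployConfig(
        model=env.get("AGENT_MODEL", "gpt-3.5-turbo"),
        # Placeholder default; the real market type is not specified here.
        market_type=env.get("AGENT_MARKET_TYPE", "example-market"),
        sleep_seconds=int(env.get("AGENT_SLEEP_SECONDS", "3600")),
    )


# Passing a mapping explicitly keeps the example independent of the real environment.
config = load_deploy_config({"AGENT_SLEEP_SECONDS": "600"})
```

Accepting an optional mapping makes the loader easy to test without touching the process environment.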

prediction_market_agent/agents/known_outcome_agent/deploy.py (4)

23-23: The import of market_is_saturated from utils is a good practice for code reusability. Ensure that the moved function is no longer used within this file to avoid redundancy.


1-1: The use of # type: ignore at the top of the file suggests there might be type hinting issues. Ensure that all type hints are correct and consider removing this directive if it's no longer necessary.


1-4: > 📝 NOTE

This review was outside the diff hunks, and no overlapping diff hunk was found. Original lines [37-89]

The method answer_binary_market has a complex logic for determining the market answer. Ensure that this logic is thoroughly tested, especially the error handling and the fallback to None when an answer cannot be determined.


1-4: > 📝 NOTE

This review was outside the diff hunks, and no overlapping diff hunk was found. Original lines [94-137]

The main block for deploying the agent contains hard-coded values and paths. Consider externalizing these to configuration files or environment variables for better maintainability and flexibility.

prediction_market_agent/agents/crewai_subsequential_agent/prompts.py (2)

1-98: Ensure that the prompts and expected output formats are aligned with the requirements of the CrewAI framework and are correctly formatted for the LLM to understand. Pay special attention to placeholders like [SCENARIO] and {scenario} to ensure they are used consistently and correctly.


1-98: Consider adding more examples to the prompts to cover a wider range of scenarios and improve the LLM's understanding of the task.

prediction_market_agent/agents/crewai_subsequential_agent/benchmark.py (2)

1-159: Ensure that the benchmarking script accurately reflects the performance of the CrewAI agent by verifying the correctness of the market building, prediction generation, and the final assertion on the mean-squared-error for p_yes.


1-159: Consider adding documentation or comments explaining the benchmarking process, especially the significance of the mean-squared-error assertion and how the benchmark results should be interpreted.

prediction_market_agent/agents/crewai_subsequential_agent/crewai_agent_subquestions.py (3)

25-152: Ensure that the CrewAIAgentSubquestions class and its methods are well-documented, especially the interaction between tasks, agents, and crews within the CrewAI framework. This will help future developers understand and maintain the code.


25-152: Consider adding error handling for the CrewAI framework interactions, especially for cases where tasks fail or return unexpected results. This will improve the robustness of the agent's decision-making process.


25-152: Verify that the asynchronous execution of tasks (async_execution=True) is correctly managed and that the results are correctly aggregated before making a final decision. This is crucial for the accuracy of the agent's predictions.
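
One way to approach the error-handling and aggregation points above, sketched without the CrewAI API: run the per-question estimates concurrently, keep the successful results, and record failures instead of letting one failed task abort the whole prediction. The `estimate` callable and the question strings are hypothetical stand-ins for whatever the agent actually executes.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed
from typing import Callable, Dict, Sequence, Tuple


def gather_estimates(
    questions: Sequence[str],
    estimate: Callable[[str], float],
    max_workers: int = 4,
) -> Tuple[Dict[str, float], Dict[str, Exception]]:
    """Run `estimate` for every question concurrently; split results and failures."""
    results: Dict[str, float] = {}
    failures: Dict[str, Exception] = {}
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = {pool.submit(estimate, question): question for question in questions}
        for future in as_completed(futures):
            question = futures[future]
            try:
                probability = future.result()
                if not 0.0 <= probability <= 1.0:
                    raise ValueError(f"not a probability: {probability}")
                results[question] = probability
            except Exception as exc:  # keep going; the caller decides how to react
                failures[question] = exc
    return results, failures


def fake_estimate(question: str) -> float:
    """Stand-in for a real sub-question task; fails for unknown questions."""
    canned = {"Reaches the final?": 0.56, "Wins the final?": 0.61}
    if question not in canned:
        raise RuntimeError("no estimate available")
    return canned[question]


results, failures = gather_estimates(
    ["Reaches the final?", "Wins the final?", "Unanswerable?"], fake_estimate
)
```

The caller can then decide whether to answer with partial results or skip the market when `failures` is not empty.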

crewai_multiple_agent.ipynb (16)

4-15: Imports are correctly organized and necessary for the notebook's functionality.


19-37: Loading environment variables using load_dotenv() is a secure practice for configuration.


40-47: Ensure SerperDevTool is effectively utilized in tasks where appropriate.


51-74: Consider adding comments to explain the purpose and functionality of each agent for clarity, especially if the notebook is intended for educational purposes or wider distribution.


51-115: Clarify the use of tools in the research_task definition. If SerperDevTool or other tools are intended to be used, consider uncommenting and properly integrating them.


126-130: Consider including other agents (analyst, writer) and their respective tasks in report_crew if applicable to the simulation's goals, to fully utilize the multi-agent system.


150-152: Consider enhancing result handling for clarity and context, especially if the notebook is intended for production or broader educational use.


185-230: Ensure the alternative approach for breaking down scenarios into possible outcomes is consistently integrated with the rest of the notebook's logic and objectives.


286-287: Consider providing additional examples or explanations to further showcase the alternative approach, especially if the notebook is intended for educational purposes.


426-440: Ensure that the tools assigned to agents, such as search_tool, are utilized effectively in their tasks to fully leverage the capabilities of the multi-agent system.


458-512: Review the verbose logging level in the Crew definition to ensure it's appropriate for the intended use case, as it may produce extensive output that could overwhelm users or obscure important information.


865-871: Enhance the result handling for clarity and context, especially if the notebook is intended for production or broader educational use. Consider using more structured output or visualizations to present the results.


891-896: Add comments to explain the purpose and functionality of the result handling and condition evaluation for clarity, especially if the notebook is intended for educational purposes or wider distribution.


908-908: Ensure that the report's conclusions are based on accurate and up-to-date information, especially if the notebook's analysis is used for decision-making or educational purposes.


925-932: Ensure that the outlined improvements for sentence generation and script execution are implemented systematically and tested thoroughly to enhance the notebook's functionality and accuracy.


940-969: Consider adding more examples or explanations to further illustrate the approach to analyzing prediction market questions, especially if the notebook is intended for a broad audience.

prediction_market_agent/agents/utils.py (resolved)
crewai_multiple_agent.ipynb (outdated, resolved)
crewai_multiple_agent.ipynb (outdated, resolved)
@kongzii
Contributor

kongzii commented Apr 5, 2024

Seems like the benchmark was run on only 1 market, which is not very useful. I guess at least 50 would be nice.

Contributor

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 1

Review Status

Configuration used: CodeRabbit UI

Commits Files that changed from the base of the PR and between a89ec31 and 766380e.
Files ignored due to path filters (1)
  • pyproject.toml is excluded by !**/*.toml
Files selected for processing (6)
  • prediction_market_agent/agents/crewai_subsequential_agent/benchmark.py (1 hunks)
  • prediction_market_agent/agents/crewai_subsequential_agent/crewai_agent_subquestions.py (1 hunks)
  • prediction_market_agent/agents/crewai_subsequential_agent/deploy.py (1 hunks)
  • prediction_market_agent/agents/crewai_subsequential_agent/prompts.py (1 hunks)
  • prediction_market_agent/agents/known_outcome_agent/deploy.py (2 hunks)
  • prediction_market_agent/tools/crewai_tools.py (1 hunks)
Files skipped from review as they are similar to previous changes (5)
  • prediction_market_agent/agents/crewai_subsequential_agent/benchmark.py
  • prediction_market_agent/agents/crewai_subsequential_agent/crewai_agent_subquestions.py
  • prediction_market_agent/agents/crewai_subsequential_agent/deploy.py
  • prediction_market_agent/agents/crewai_subsequential_agent/prompts.py
  • prediction_market_agent/agents/known_outcome_agent/deploy.py
Additional comments not posted (1)
prediction_market_agent/tools/crewai_tools.py (1)

9-11: LGTM! The TavilyDevToolSchema class correctly uses Pydantic for input validation, ensuring that search_query is mandatory.

prediction_market_agent/tools/crewai_tools.py (outdated, resolved)
Contributor

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 2

Contributor

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 2

prediction_market_agent/tools/crewai_tools.py (outdated, resolved)
prediction_market_agent/utils.py (resolved)
@gabrielfior
Contributor Author

Here are a few benchmarks for 20 markets and 50 markets.


20 markets

Comparison Report

Market Results

| Number of markets | Proportion resolved | Proportion YES | Proportion NO |
| --- | --- | --- | --- |
| 20 | 0 | 0.4 | 0.6 |

Agent Results

Summary Statistics

| Agents | MSE for p_yes | Mean confidence | % within ±0.05 | % within ±0.1 | % within ±0.2 | % correct outcome | % precision for yes | % precision for no | % recall for yes | % recall for no | confidence/p_yes error correlation | Mean info_utility | Proportion answerable | Proportion answered | Mean cost ($) | Mean time (s) |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| subsequential-questions-crewai | 0.0866736 | 0.8675 | 10 | 15 | 45 | 65 | 55.5556 | 72.7273 | 62.5 | 66.6667 | 0.0750478 | | 1 | 1 | | |
| random | 0.172009 | 0.521477 | 20 | 30 | 50 | 55 | 33.3333 | 58.8235 | 12.5 | 83.3333 | -0.272218 | | 1 | 1 | | |
| fixed-no | 0.278555 | 1 | 15 | 25 | 25 | 60 | 0 | 60 | 0 | 100 | nan | | 1 | 1 | | |
| fixed-yes | 0.415312 | 1 | 15 | 15 | 15 | 40 | 40 | 0 | 100 | 0 | nan | | 1 | 1 | | |

Markets

| Market Question | subsequential-questions-crewai p_yes | random p_yes | fixed-no p_yes | fixed-yes p_yes | reference p_yes |
| --- | --- | --- | --- | --- | --- |
| Will the LK-99 room temp, ambient pressure superconductivity pre-print replicate before 2025? | 0.20 [NO] | 0.07 [NO] | 0.00 [NO] | 1.00 [YES] | 0.03 [NO] |
| Will Biden be the 2024 Democratic Nominee? | 0.40 [NO] | 0.21 [NO] | 0.00 [NO] | 1.00 [YES] | 0.96 [YES] |
| Will Joe Biden win the 2024 US Presidential Election? | 0.69 [YES] | 0.54 [YES] | 0.00 [NO] | 1.00 [YES] | 0.48 [NO] |
| Will Andrew Tate be found guilty of human (sex) trafficking? | 0.10 [NO] | 0.38 [NO] | 0.00 [NO] | 1.00 [YES] | 0.60 [YES] |
| Will AI be a major topic during the 2024 presidential debates in the United States? (please read criteria) | 0.82 [YES] | 0.09 [NO] | 0.00 [NO] | 1.00 [YES] | 0.34 [NO] |
| In 2028, will an AI be able to generate a full high-quality movie to a prompt? | 0.32 [NO] | 0.33 [NO] | 0.00 [NO] | 1.00 [YES] | 0.36 [NO] |
| Will Donald Trump be the Republican nominee for president in 2024? | 0.95 [YES] | 0.12 [NO] | 0.00 [NO] | 1.00 [YES] | 0.98 [YES] |
| Will Donald Trump win the 2024 presidential election? | 0.40 [NO] | 0.37 [NO] | 0.00 [NO] | 1.00 [YES] | 0.51 [YES] |
| Will an AI get gold on any International Math Olympiad by 2025? | 0.68 [YES] | 0.02 [NO] | 0.00 [NO] | 1.00 [YES] | 0.20 [NO] |
| In 2028, will AI be at least as big a political issue as abortion? | 0.20 [NO] | 0.38 [NO] | 0.00 [NO] | 1.00 [YES] | 0.45 [NO] |
| Will OpenAI hint at or claim to have AGI by 2025 end? | 0.00 [NO] | 0.00 [NO] | 0.00 [NO] | 1.00 [YES] | 0.22 [NO] |
| Will either Joe Biden or Donald Trump be elected President in 2024? | 0.80 [YES] | 0.18 [NO] | 0.00 [NO] | 1.00 [YES] | 0.97 [YES] |
| Will the average global temperature in 2024 exceed 2023? | 0.75 [YES] | 0.58 [YES] | 0.00 [NO] | 1.00 [YES] | 0.60 [YES] |
| Will GPT-5 be released before 2025? | 0.90 [YES] | 0.26 [NO] | 0.00 [NO] | 1.00 [YES] | 0.67 [YES] |
| Will Congress pass a bill in 2024 to ban TikTok in the US or force it to change ownership? | 0.80 [YES] | 0.50 [NO] | 0.00 [NO] | 1.00 [YES] | 0.49 [NO] |
| Will Joe Biden get impeached in his first term? | 0.20 [NO] | 0.23 [NO] | 0.00 [NO] | 1.00 [YES] | 0.08 [NO] |
| Will Aella be romantically or sexually involved with Destiny by the end of 2024? | 0.13 [NO] | 0.39 [NO] | 0.00 [NO] | 1.00 [YES] | 0.07 [NO] |
| Did COVID-19 come from a laboratory? | 0.95 [YES] | 0.25 [NO] | 0.00 [NO] | 1.00 [YES] | 0.55 [YES] |
| Will AI wipe out humanity before the year 2030? | 0.35 [NO] | 0.97 [YES] | 0.00 [NO] | 1.00 [YES] | 0.03 [NO] |
| Will Threads have more daily active users than Twitter by the end of 2024? | 0.20 [NO] | 0.21 [NO] | 0.00 [NO] | 1.00 [YES] | 0.04 [NO] |

Expected value

| Agent | Mean expected returns | Median expected returns | Total expected returns |
| --- | --- | --- | --- |
| subsequential-questions-crewai | 29.2825 | 24.017 | 585.651 |
| random | 6.0579 | 5.99164 | 121.158 |
| fixed-no | 13.6758 | 6.53544 | 273.515 |
| fixed-yes | -13.6758 | -6.53544 | -273.515 |

| Market Question | subsequential-questions-crewai | random | fixed-no | fixed-yes |
| --- | --- | --- | --- | --- |
| Will the LK-99 room temp, ambient pressure superconductivity pre-print replicate before 2025? | 94.2138 | 94.2138 | 94.2138 | -94.2138 |
| Will Biden be the 2024 Democratic Nominee? | -92.3114 | -92.3114 | -92.3114 | 92.3114 |
| Will Joe Biden win the 2024 US Presidential Election? | -3.07088 | -3.07088 | 3.07088 | -3.07088 |
| Will Andrew Tate be found guilty of human (sex) trafficking? | -20 | -20 | -20 | 20 |
| Will AI be a major topic during the 2024 presidential debates in the United States? (please read criteria) | -32.847 | 32.847 | 32.847 | -32.847 |
| In 2028, will an AI be able to generate a full high-quality movie to a prompt? | 27.8136 | 27.8136 | 27.8136 | -27.8136 |
| Will Donald Trump be the Republican nominee for president in 2024? | 95.2844 | -95.2844 | -95.2844 | 95.2844 |
| Will Donald Trump win the 2024 presidential election? | -1.37533 | -1.37533 | -1.37533 | 1.37533 |
| Will an AI get gold on any International Math Olympiad by 2025? | -59.1427 | 59.1427 | 59.1427 | -59.1427 |
| In 2028, will AI be at least as big a political issue as abortion? | 10 | 10 | 10 | -10 |
| Will OpenAI hint at or claim to have AGI by 2025 end? | 55.4266 | 55.4266 | 55.4266 | -55.4266 |
| Will either Joe Biden or Donald Trump be elected President in 2024? | 93.5522 | -93.5522 | -93.5522 | 93.5522 |
| Will the average global temperature in 2024 exceed 2023? | 20.2203 | 20.2203 | -20.2203 | 20.2203 |
| Will GPT-5 be released before 2025? | 34.0547 | -34.0547 | -34.0547 | 34.0547 |
| Will Congress pass a bill in 2024 to ban TikTok in the US or force it to change ownership? | -1.98329 | 1.98329 | 1.98329 | -1.98329 |
| Will Joe Biden get impeached in his first term? | 84.8219 | 84.8219 | 84.8219 | -84.8219 |
| Will Aella be romantically or sexually involved with Destiny by the end of 2024? | 85.3588 | 85.3588 | 85.3588 | -85.3588 |
| Did COVID-19 come from a laboratory? | 10 | -10 | -10 | 10 |
| Will AI wipe out humanity before the year 2030? | 93.3281 | -93.3281 | 93.3281 | -93.3281 |
| Will Threads have more daily active users than Twitter by the end of 2024? | 92.3068 | 92.3068 | 92.3068 | -92.3068 |

50 markets

Comparison Report

Market Results

| Number of markets | Proportion resolved | Proportion YES | Proportion NO |
| --- | --- | --- | --- |
| 50 | 0 | 0.36 | 0.64 |

Agent Results

Summary Statistics

| Agents | MSE for p_yes | Mean confidence | % within ±0.05 | % within ±0.1 | % within ±0.2 | % correct outcome | % precision for yes | % precision for no | % recall for yes | % recall for no | confidence/p_yes error correlation | Mean info_utility | Proportion answerable | Proportion answered | Mean cost ($) | Mean time (s) |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| subsequential-questions-crewai | 0.139858 | 0.85125 | 8 | 16 | 38 | 62 | 48.2759 | 80.9524 | 77.7778 | 53.125 | 0.167525 | | 1 | 1 | | |
| random | 0.16555 | 0.426501 | 10 | 24 | 46 | 52 | 39.2857 | 68.1818 | 61.1111 | 46.875 | -0.0381193 | | 1 | 1 | | |
| fixed-no | 0.238667 | 1 | 10 | 20 | 28 | 64 | 0 | 64 | 0 | 100 | nan | | 1 | 1 | | |
| fixed-yes | 0.440751 | 1 | 10 | 10 | 12 | 36 | 36 | 0 | 100 | 0 | nan | | 1 | 1 | | |

Markets

| Market Question | subsequential-questions-crewai p_yes | random p_yes | fixed-no p_yes | fixed-yes p_yes | reference p_yes |
| --- | --- | --- | --- | --- | --- |
| Will the LK-99 room temp, ambient pressure superconductivity pre-print replicate before 2025? | 0.20 [NO] | 0.12 [NO] | 0.00 [NO] | 1.00 [YES] | 0.03 [NO] |
| Will Biden be the 2024 Democratic Nominee? | 0.85 [YES] | 0.09 [NO] | 0.00 [NO] | 1.00 [YES] | 0.96 [YES] |
| Will Joe Biden win the 2024 US Presidential Election? | 0.40 [NO] | 0.67 [YES] | 0.00 [NO] | 1.00 [YES] | 0.49 [NO] |
| Will Andrew Tate be found guilty of human (sex) trafficking? | 0.95 [YES] | 0.53 [YES] | 0.00 [NO] | 1.00 [YES] | 0.60 [YES] |
| Will AI be a major topic during the 2024 presidential debates in the United States? (please read criteria) | 0.80 [YES] | 0.15 [NO] | 0.00 [NO] | 1.00 [YES] | 0.34 [NO] |
| In 2028, will an AI be able to generate a full high-quality movie to a prompt? | 0.20 [NO] | 0.33 [NO] | 0.00 [NO] | 1.00 [YES] | 0.36 [NO] |
| Will Donald Trump be the Republican nominee for president in 2024? | 0.95 [YES] | 0.80 [YES] | 0.00 [NO] | 1.00 [YES] | 0.98 [YES] |
| Will Donald Trump win the 2024 presidential election? | 0.80 [YES] | 0.31 [NO] | 0.00 [NO] | 1.00 [YES] | 0.51 [YES] |
| Will an AI get gold on any International Math Olympiad by 2025? | 0.55 [YES] | 0.70 [YES] | 0.00 [NO] | 1.00 [YES] | 0.20 [NO] |
| In 2028, will AI be at least as big a political issue as abortion? | 0.80 [YES] | 0.88 [YES] | 0.00 [NO] | 1.00 [YES] | 0.45 [NO] |
| Will OpenAI hint at or claim to have AGI by 2025 end? | 0.70 [YES] | 0.19 [NO] | 0.00 [NO] | 1.00 [YES] | 0.22 [NO] |
| Will either Joe Biden or Donald Trump be elected President in 2024? | 0.80 [YES] | 0.55 [YES] | 0.00 [NO] | 1.00 [YES] | 0.97 [YES] |
| Will the average global temperature in 2024 exceed 2023? | 0.99 [YES] | 0.93 [YES] | 0.00 [NO] | 1.00 [YES] | 0.60 [YES] |
| Will GPT-5 be released before 2025? | 0.90 [YES] | 0.29 [NO] | 0.00 [NO] | 1.00 [YES] | 0.67 [YES] |
| Will Congress pass a bill in 2024 to ban TikTok in the US or force it to change ownership? | 0.20 [NO] | 0.59 [YES] | 0.00 [NO] | 1.00 [YES] | 0.49 [NO] |
| Will Joe Biden get impeached in his first term? | 0.21 [NO] | 0.02 [NO] | 0.00 [NO] | 1.00 [YES] | 0.08 [NO] |
| Will Aella be romantically or sexually involved with Destiny by the end of 2024? | 0.20 [NO] | 0.07 [NO] | 0.00 [NO] | 1.00 [YES] | 0.07 [NO] |
| Did COVID-19 come from a laboratory? | 0.91 [YES] | 0.47 [NO] | 0.00 [NO] | 1.00 [YES] | 0.55 [YES] |
| Will AI wipe out humanity before the year 2030? | 0.20 [NO] | 0.55 [YES] | 0.00 [NO] | 1.00 [YES] | 0.03 [NO] |
| Will Threads have more daily active users than Twitter by the end of 2024? | 0.15 [NO] | 0.39 [NO] | 0.00 [NO] | 1.00 [YES] | 0.04 [NO] |
| Will AI pass the Longbets version of the Turing test by the end of 2029? | 0.26 [NO] | 0.65 [YES] | 0.00 [NO] | 1.00 [YES] | 0.57 [YES] |
| Will a large language model beat a super grandmaster playing chess by 2028? | 0.20 [NO] | 0.30 [NO] | 0.00 [NO] | 1.00 [YES] | 0.40 [NO] |
| Will AI wipe out humanity before the year 2100 | 0.80 [YES] | 0.60 [YES] | 0.00 [NO] | 1.00 [YES] | 0.12 [NO] |
| Will Jimmy Carter become a centenarian? | 0.80 [YES] | 0.71 [YES] | 0.00 [NO] | 1.00 [YES] | 0.57 [YES] |
| Will a Democrat win the 2024 US presidential election? | 0.00 [NO] | 0.63 [YES] | 0.00 [NO] | 1.00 [YES] | 0.48 [NO] |
| Will a room-temperature, atmospheric pressure superconductor be discovered before 2030? | 0.70 [YES] | 0.51 [YES] | 0.00 [NO] | 1.00 [YES] | 0.08 [NO] |
| Will the US enter a recession by the end of 2024? | 0.35 [NO] | 0.15 [NO] | 0.00 [NO] | 1.00 [YES] | 0.15 [NO] |
| Will Linda Yaccarino be the CEO of X on April 13, 2024? | 0.90 [YES] | 0.60 [YES] | 0.00 [NO] | 1.00 [YES] | 0.98 [YES] |
| By the end of 2026, will we have transparency into any useful internal pattern within a Large Language Model whose semantics would have been unfamiliar to AI and cognitive science in 2006? | 0.85 [YES] | 0.20 [NO] | 0.00 [NO] | 1.00 [YES] | 0.49 [NO] |
| Will the Apple Vision Pro be successful enough to revive interest in mixed reality and the metaverse? | 0.85 [YES] | 0.46 [NO] | 0.00 [NO] | 1.00 [YES] | 0.28 [NO] |
| Will the Meissner effect be confirmed near room temperature in copper-substituted lead apatite? | 0.85 [YES] | 0.69 [YES] | 0.00 [NO] | 1.00 [YES] | 0.04 [NO] |
| Will this Yudkowsky tweet on AI video generation hold up in 2024? | 0.80 [YES] | 0.38 [NO] | 0.00 [NO] | 1.00 [YES] | 0.32 [NO] |
| Will Kizaru betray/defect the Marines? | 0.81 [YES] | 0.87 [YES] | 0.00 [NO] | 1.00 [YES] | 0.56 [YES] |
| Will there be an AI language model that surpasses ChatGPT and other OpenAI models before the end of 2024? | 0.85 [YES] | 0.59 [YES] | 0.00 [NO] | 1.00 [YES] | 0.30 [NO] |
| In a year, will we think that Sam Altman leaving OpenAI reduced AI risk? | 0.80 [YES] | 0.97 [YES] | 0.00 [NO] | 1.00 [YES] | 0.11 [NO] |
| Will Donald Trump win the 2024 US Presidential Election? | 0.20 [NO] | 0.89 [YES] | 0.00 [NO] | 1.00 [YES] | 0.51 [YES] |
| Will Sam Altman be the CEO of OpenAI at the end of 2024? | 0.85 [YES] | 0.19 [NO] | 0.00 [NO] | 1.00 [YES] | 0.96 [YES] |
| Will Hezbollah directly engage in combat operations against Israel? | 0.21 [NO] | 0.85 [YES] | 0.00 [NO] | 1.00 [YES] | 0.50 [YES] |
| Will China launch a full-scale invasion of Taiwan before 2030? | 0.20 [NO] | 0.49 [NO] | 0.00 [NO] | 1.00 [YES] | 0.23 [NO] |
| Will Sam Altman be a co-founder of a serious OpenAI competitor by EOY 2024? | 0.80 [YES] | 0.77 [YES] | 0.00 [NO] | 1.00 [YES] | 0.03 [NO] |
| Will there be a very reliable way of reading human thoughts by the end of 2024?🧠🕵️ | 0.20 [NO] | 1.00 [YES] | 0.00 [NO] | 1.00 [YES] | 0.12 [NO] |
| Is the "100% effective against solid tumors" cancer pill AOH1996 paper legit? [see description] | 0.00 [NO] | 0.57 [YES] | 0.00 [NO] | 1.00 [YES] | 0.23 [NO] |
| Will Google mostly catch up to OpenAI in LLM quality and neutralize ChatGPT's lead by the end of 2024? | 0.85 [YES] | 0.70 [YES] | 0.00 [NO] | 1.00 [YES] | 0.42 [NO] |
| Will this Yudkowsky tweet hold up? | 0.00 [NO] | 0.86 [YES] | 0.00 [NO] | 1.00 [YES] | 0.84 [YES] |
| Will GPT-4 be available to ChatGPT Free Users in 2024? | 0.80 [YES] | 0.33 [NO] | 0.00 [NO] | 1.00 [YES] | 0.70 [YES] |
| Will a reliable and general household robot be developed before January 1st, 2030? | 0.85 [YES] | 0.20 [NO] | 0.00 [NO] | 1.00 [YES] | 0.37 [NO] |
| Will AI wipe out humanity before the year 2040? | 0.10 [NO] | 0.98 [YES] | 0.00 [NO] | 1.00 [YES] | 0.06 [NO] |
| Was Ilya spooked by a capabilities advance? | 0.00 [NO] | 0.79 [YES] | 0.00 [NO] | 1.00 [YES] | 0.08 [NO] |
| Will Superalignment succeed? (self assessment) | 0.20 [NO] | 0.30 [NO] | 0.00 [NO] | 1.00 [YES] | 0.23 [NO] |
| Will Benjamin Netanyahu (Bibi) be the prime minister of Israel at the end of 2024 | 0.70 [YES] | 0.45 [NO] | 0.00 [NO] | 1.00 [YES] | 0.58 [YES] |

Expected value

| Agent | Mean expected returns | Median expected returns | Total expected returns |
| --- | --- | --- | --- |
| subsequential-questions-crewai | 15.9045 | 13.6694 | 795.226 |
| random | -0.435562 | 0.802601 | -21.7781 |
| fixed-no | 20.2083 | 22.7631 | 1010.42 |
| fixed-yes | -20.2083 | -22.7631 | -1010.42 |

| Market Question | subsequential-questions-crewai | random | fixed-no | fixed-yes |
| --- | --- | --- | --- | --- |
| Will the LK-99 room temp, ambient pressure superconductivity pre-print replicate before 2025? | 94.1633 | 94.1633 | 94.1633 | -94.1633 |
| Will Biden be the 2024 Democratic Nominee? | 92.3114 | -92.3114 | -92.3114 | 92.3114 |
| Will Joe Biden win the 2024 US Presidential Election? | 2.95499 | -2.95499 | 2.95499 | -2.95499 |
| Will Andrew Tate be found guilty of human (sex) trafficking? | 20 | 20 | -20 | 20 |
| Will AI be a major topic during the 2024 presidential debates in the United States? (please read criteria) | -32.847 | 32.847 | 32.847 | -32.847 |
| In 2028, will an AI be able to generate a full high-quality movie to a prompt? | 27.8136 | 27.8136 | 27.8136 | -27.8136 |
| Will Donald Trump be the Republican nominee for president in 2024? | 95.2844 | 95.2844 | -95.2844 | 95.2844 |
| Will Donald Trump win the 2024 presidential election? | 2 | -2 | -2 | 2 |
| Will an AI get gold on any International Math Olympiad by 2025? | -59.1427 | -59.1427 | 59.1427 | -59.1427 |
| In 2028, will AI be at least as big a political issue as abortion? | -10 | -10 | 10 | -10 |
| Will OpenAI hint at or claim to have AGI by 2025 end? | -55.4266 | 55.4266 | 55.4266 | -55.4266 |
| Will either Joe Biden or Donald Trump be elected President in 2024? | 93.5522 | 93.5522 | -93.5522 | 93.5522 |
| Will the average global temperature in 2024 exceed 2023? | 20.2203 | 20.2203 | -20.2203 | 20.2203 |
| Will GPT-5 be released before 2025? | 33.8315 | -33.8315 | -33.8315 | 33.8315 |
| Will Congress pass a bill in 2024 to ban TikTok in the US or force it to change ownership? | 1.98329 | -1.98329 | 1.98329 | -1.98329 |
| Will Joe Biden get impeached in his first term? | 84.8219 | 84.8219 | 84.8219 | -84.8219 |
| Will Aella be romantically or sexually involved with Destiny by the end of 2024? | 85.3588 | 85.3588 | 85.3588 | -85.3588 |
| Did COVID-19 come from a laboratory? | 9.67587 | -9.67587 | -9.67587 | 9.67587 |
| Will AI wipe out humanity before the year 2030? | 93.3281 | -93.3281 | 93.3281 | -93.3281 |
| Will Threads have more daily active users than Twitter by the end of 2024? | 92.3068 | 92.3068 | 92.3068 | -92.3068 |
| Will AI pass the Longbets version of the Turing test by the end of 2029? | -14.1888 | 14.1888 | -14.1888 | 14.1888 |
| Will a large language model beat a super grandmaster playing chess by 2028? | 19.3749 | 19.3749 | 19.3749 | -19.3749 |
| Will AI wipe out humanity before the year 2100 | -75.4511 | -75.4511 | 75.4511 | -75.4511 |
| Will Jimmy Carter become a centenarian? | 14.9105 | 14.9105 | -14.9105 | 14.9105 |
| Will a Democrat win the 2024 US presidential election? | 3.00503 | -3.00503 | 3.00503 | -3.00503 |
| Will a room-temperature, atmospheric pressure superconductor be discovered before 2030? | -84.3113 | -84.3113 | 84.3113 | -84.3113 |
| Will the US enter a recession by the end of 2024? | 70.7833 | 70.7833 | 70.7833 | -70.7833 |
| Will Linda Yaccarino be the CEO of X on April 13, 2024? | 96.5069 | 96.5069 | -96.5069 | 96.5069 |
| By the end of 2026, will we have transparency into any useful internal pattern within a Large Language Model whose semantics would have been unfamiliar to AI and cognitive science in 2006? | -2.02417 | 2.02417 | 2.02417 | -2.02417 |
| Will the Apple Vision Pro be successful enough to revive interest in mixed reality and the metaverse? | -44.1657 | 44.1657 | 44.1657 | -44.1657 |
| Will the Meissner effect be confirmed near room temperature in copper-substituted lead apatite? | -92.9598 | -92.9598 | 92.9598 | -92.9598 |
| Will this Yudkowsky tweet on AI video generation hold up in 2024? | -36.7308 | 36.7308 | 36.7308 | -36.7308 |
| Will Kizaru betray/defect the Marines? | 12.4284 | 12.4284 | -12.4284 | 12.4284 |
| Will there be an AI language model that surpasses ChatGPT and other OpenAI models before the end of 2024? | -39.1446 | -39.1446 | 39.1446 | -39.1446 |
| In a year, will we think that Sam Altman leaving OpenAI reduced AI risk? | -78.3376 | -78.3376 | 78.3376 | -78.3376 |
| Will Donald Trump win the 2024 US Presidential Election? | -1.45681 | 1.45681 | -1.45681 | 1.45681 |
| Will Sam Altman be the CEO of OpenAI at the end of 2024? | 92.8701 | -92.8701 | -92.8701 | 92.8701 |
| Will Hezbollah directly engage in combat operations against Israel? | -0.148389 | 0.148389 | -0.148389 | 0.148389 |
| Will China launch a full-scale invasion of Taiwan before 2030? | 54.3198 | 54.3198 | 54.3198 | -54.3198 |
| Will Sam Altman be a co-founder of a serious OpenAI competitor by EOY 2024? | -94.4222 | -94.4222 | 94.4222 | -94.4222 |
| Will there be a very reliable way of reading human thoughts by the end of 2024?🧠🕵️ | 75.2882 | -75.2882 | 75.2882 | -75.2882 |
| Is the "100% effective against solid tumors" cancer pill AOH1996 paper legit? [see description] | 54.7358 | -54.7358 | 54.7358 | -54.7358 |
| Will Google mostly catch up to OpenAI in LLM quality and neutralize ChatGPT's lead by the end of 2024? | -15.0331 | -15.0331 | 15.0331 | -15.0331 |
| Will this Yudkowsky tweet hold up? | -67.6783 | 67.6783 | -67.6783 | 67.6783 |
| Will GPT-4 be available to ChatGPT Free Users in 2024? | 39.473 | -39.473 | -39.473 | 39.473 |
| Will a reliable and general household robot be developed before January 1st, 2030? | -26.1514 | 26.1514 | 26.1514 | -26.1514 |
| Will AI wipe out humanity before the year 2040? | 88.9162 | -88.9162 | 88.9162 | -88.9162 |
| Was Ilya spooked by a capabilities advance? | 83.4589 | -83.4589 | 83.4589 | -83.4589 |
| Will Superalignment succeed? (self assessment) | 53.6813 | 53.6813 | 53.6813 | -53.6813 |
| Will Benjamin Netanyahu (Bibi) be the prime minister of Israel at the end of 2024 | 15.4881 | -15.4881 | -15.4881 | 15.4881 |
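
For anyone wanting to sanity-check the summary rows against the per-market values, here is a minimal sketch. It assumes the summary statistics are a plain mean, median, and sum over markets, which matches the reported numbers (for example, 15.9045 × 50 markets ≈ 795.226), but it is not taken from the benchmark code in this PR. The `summarize` helper and the `per_market` dictionary are hypothetical names, and only the first five table rows are included, so the printed figures will not equal the full-table summary.

```python
from statistics import mean, median

# Expected returns copied from the first five rows of the per-market table.
# This is only a subset, so the results differ from the 50-market summary.
per_market = {
    "subsequential-questions-crewai": [94.1633, 92.3114, 2.95499, 20.0, -32.847],
    "random": [94.1633, -92.3114, -2.95499, 20.0, 32.847],
    "fixed-no": [94.1633, -92.3114, 2.95499, -20.0, 32.847],
    "fixed-yes": [-94.1633, 92.3114, -2.95499, 20.0, -32.847],
}


def summarize(returns):
    """Hypothetical helper: aggregate one agent's per-market expected returns."""
    return {
        "mean": mean(returns),
        "median": median(returns),
        "total": sum(returns),
    }


summary = {agent: summarize(values) for agent, values in per_market.items()}

for agent, stats in summary.items():
    print(
        f"{agent:32s} mean={stats['mean']:.4f} "
        f"median={stats['median']:.4f} total={stats['total']:.4f}"
    )
```

In the table, fixed-yes is the exact negation of fixed-no in every row, so their totals should cancel; this gives a quick consistency check when copying values by hand.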

@gabrielfior gabrielfior merged commit efc1ef7 into main Apr 11, 2024
6 checks passed
@gabrielfior gabrielfior deleted the gabriel/alternative-outcomes branch April 11, 2024 15:54
Development

Successfully merging this pull request may close these issues.

Make one of the agents think about the question more thoroughly
3 participants