
Error when I run "python3 scripts/extract_repos.py --cwd ../mathlib4 --config configs/config_mathlib_v4.13.json --training_data " #6

Open
yuchen814 opened this issue Jan 8, 2025 · 12 comments


@yuchen814

Great work!

When I run "python3 scripts/extract_repos.py --cwd ../mathlib4 --config configs/config_mathlib_v4.13.json --training_data", I get an error like this:
✖ [335/339] Building scripts.training_data
trace: .> LEAN_PATH=././.lake/packages/batteries/.lake/build/lib:././.lake/packages/Qq/.lake/build/lib:././.lake/packages/aesop/.lake/build/lib:././.lake/packages/proofwidgets/.lake/build/lib:././.lake/packages/Cli/.lake/build/lib:././.lake/packages/importGraph/.lake/build/lib:././.lake/packages/LeanSearchClient/.lake/build/lib:././.lake/packages/mathlib/.lake/build/lib:././.lake/build/lib DYLD_LIBRARY_PATH= /Users/yuchen/.elan/toolchains/leanprover--lean4---v4.13.0-rc3/bin/lean -Dlinter.unusedVariables=false -Dlinter.deprecated=false ././././scripts/training_data.lean -R ./././. -o ././.lake/build/lib/scripts/training_data.olean -i ././.lake/build/lib/scripts/training_data.ilean -c ././.lake/build/ir/scripts/training_data.c --json
info: stderr:
failed to write '././.lake/build/lib/scripts/training_data.olean': 2 No such file or directory
error: Lean exited with code 1
[the same failure is logged twice more]
Some required builds logged failures:

  • scripts.training_data
    error: build failed
    0%| | 0/5397 [00:07<?, ?it/s]

How to solve this? Thanks for your help.

@hanwenzhu
Collaborator

Thank you for your interest in our project! I think the problem is that the argument for --cwd should be the path to ntp-toolkit, not the path to mathlib4. By default, ntp-toolkit will try to install the mathlib version specified by the JSON config, so you don't need to clone mathlib on your own.

@yuchen814
Author

Thanks for your prompt reply. You are right! But I ran into other problems.

When I run 'python3 scripts/extract_repos.py --cwd ../ntp-toolkit --config configs/config_mathlib_v4.13.json --training_data --full_proof_training_data --premises --state_comments --full_proof_training_data_states --training_data_with_premises --declarations --imports'

All tasks ran fine except 'training_data_with_premises'. Here is the error:

error: ././././scripts/training_data_with_premises.lean:263:91: application type mismatch
unfoldConstantName constName constantsMap
argument
constantsMap
has type
Std.HashMap Name ConstantInfo : Type
but is expected to have type
HashMap Name ConstantInfo : Type
error: Lean exited with code 1
Some required builds logged failures:

  • scripts.training_data_with_premises

When I remove 'training_data_with_premises' and run 'python3 scripts/extract_repos.py --cwd ../ntp-toolkit --config configs/config_mathlib_v4.13.json --training_data --full_proof_training_data --premises --state_comments --full_proof_training_data_states --declarations --imports'
At first, everything looked correct: I saw files appearing under Examples. However, I hit the following three problems:

  1. It suddenly interrupted after running for about three hours. The error is as follows:

concurrent.futures.process._RemoteTraceback:
"""
Traceback (most recent call last):
File "/Library/Frameworks/Python.framework/Versions/3.10/lib/python3.10/concurrent/futures/process.py", line 246, in _process_worker
r = call_item.fn(*call_item.args, **call_item.kwargs)
File "/Users/yuchen/Research/lean_research_ant/ntp-toolkit/scripts/run_pipeline.py", line 68, in _extract_module
_run_cmd(
File "/Users/yuchen/Research/lean_research_ant/ntp-toolkit/scripts/run_pipeline.py", line 44, in _run_cmd
subprocess.run(
File "/Library/Frameworks/Python.framework/Versions/3.10/lib/python3.10/subprocess.py", line 526, in run
raise CalledProcessError(retcode, process.args,
subprocess.CalledProcessError: Command '['lake', 'exe', 'full_proof_training_data', 'Examples.mathlib.StateComments.Batteries.Classes.RatCast']' returned non-zero exit status 1.
"""

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
File "/Users/yuchen/Research/lean_research_ant/ntp-toolkit/../ntp-toolkit/scripts/run_pipeline.py", line 150, in
completed.append(future.result())
File "/Library/Frameworks/Python.framework/Versions/3.10/lib/python3.10/concurrent/futures/_base.py", line 451, in result
return self.__get_result()
File "/Library/Frameworks/Python.framework/Versions/3.10/lib/python3.10/concurrent/futures/_base.py", line 403, in __get_result
raise self._exception
subprocess.CalledProcessError: Command '['lake', 'exe', 'full_proof_training_data', 'Examples.mathlib.StateComments.Batteries.Classes.RatCast']' returned non-zero exit status 1.
Traceback (most recent call last):
File "/Users/yuchen/Research/lean_research_ant/ntp-toolkit/scripts/extract_repos.py", line 257, in
_run(
File "/Users/yuchen/Research/lean_research_ant/ntp-toolkit/scripts/extract_repos.py", line 137, in _run
subprocess.run([
File "/Library/Frameworks/Python.framework/Versions/3.10/lib/python3.10/subprocess.py", line 526, in run
raise CalledProcessError(retcode, process.args,
subprocess.CalledProcessError: Command '['python3', '../ntp-toolkit/scripts/run_pipeline.py', '--output-base-dir', 'Examples/mathlib', '--cwd', '../ntp-toolkit', '--import-module', 'Mathlib', 'Init', 'Batteries', '--name', 'mathlib', '--task', 'training_data', 'full_proof_training_data', 'premises', 'state_comments', 'full_proof_training_data_states', 'imports']' returned non-zero exit status 1.

  2. I noticed that although there are new files in the folder "FullProofWithStates", no content has been written to any of them. Many files in other folders are also empty.

  3. How can I obtain data containing statistical information like the benchmark presented in the 'miniCTX' paper?

@yuchen814
Author

There was also another error earlier; could my fix for it have caused the other mistakes above?

error: ././././scripts/state_comments.lean:60:38: unknown constant 'Lean.HashSet.ofList'
error: Lean exited with code 1
Some required builds logged failures:

  • scripts.state_comments

I changed 'Lean.HashSet.ofList' to 'Std.HashSet.ofList', and the error was solved.

@hanwenzhu
Collaborator

hanwenzhu commented Jan 9, 2025

Hi @yuchen814, --state_comments and --training_data_with_premises specifically are not reliably supported across different Lean versions. Since I see you are using Lean & Mathlib 4.13, it is possible that changes in the Lean core broke this script (you need to change Lean.HashSet to Std.HashSet, etc.). If you need --state_comments, I can try to fix this and the issue you posted; your correction of this error is right and didn't lead to the other errors above. As for --training_data_with_premises, it is currently under active development in a different fork and is not ready for use.
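
For reference, the rename involved is roughly the following (a sketch only; the exact namespaces depend on your Lean/Batteries version, and the identifiers mirror the error messages above):

```lean
-- On older toolchains the hash collections lived in the Lean namespace:
--   Lean.HashSet.ofList, Lean.HashMap
-- On recent Lean core they moved to Std, so scripts need e.g.:
def namesToSet (names : List Lean.Name) : Std.HashSet Lean.Name :=
  Std.HashSet.ofList names
```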

To obtain miniCTX-style data, you can specify --premises --full_proof_training_data only, and then a script will be automatically run to convert to a miniCTX-style JSONL file. If you are evaluating a retrieval-based model, then you might also need --declarations --imports to also extract the possible premises to retrieve from. For training a model, you may also want --training_data.

@yuchen814
Author

Thank you very much!

After I run 'python3 scripts/extract_repos.py --cwd ../ntp-toolkit --config configs/config_mathlib_v4.13.json --training_data --full_proof_training_data --premises --state_comments --full_proof_training_data_states --declarations --imports'.

I got premises like this:
[screenshot of the extracted premises output]

Is this right? (I ran it for a few hours, but there were only three files.)

@hanwenzhu
Collaborator

hanwenzhu commented Jan 9, 2025

The output looks very odd and isn't what I would expect. Can you first git pull the newest commit I pushed, then remove the --state_comments --full_proof_training_data_states flags and see what the output is? Also, are there other files under Examples/Mathlib?

@yuchen814
Author

Thank you very much!

I've completed most of the tasks following your method. However, an error occurred at the end, when converting to miniCTX.

Traceback (most recent call last):
File "/Users/yuchen/Research/lean_research_ant/temp/ntp-toolkit/scripts/convert_minictx.py", line 98, in
raise ValueError(f"Unrecognized repository path in {file_path}")
ValueError: Unrecognized repository path in /Users/yuchen/.elan/toolchains/leanprover--lean4---v4.13.0-rc3/bin/../src/lean/Init/Control/Basic.lean

I found the relevant code:
[screenshot of the convert_minictx.py check]
Is there any way to solve this? Or is it okay to skip entries when there is a mismatch?

@hanwenzhu
Collaborator

Oh right, I'm sorry, I forgot about this logic in the code. I pushed a new commit to try to fix this. Yes, it is okay to skip when there is a mismatch. This happened because you tried to convert a theorem/definition from the Lean core, which this logic didn't cover because of my mistake. Can you git pull and run the conversion again (you can call it directly with python scripts/convert_minictx.py Examples/Mathlib Examples/Mathlib/minictx.jsonl)?

@yuchen814
Author

Thanks for your help!
How long does the convert script take to execute? I've been running it for two days and it hasn't finished yet. There are currently 25,000 entries. (I changed the file writing mode to append mode.)

@hanwenzhu
Collaborator

hanwenzhu commented Jan 13, 2025

@yuchen814 I think the reason is potentially that your computer ran out of memory. Can you check the memory usage? If the theorems list in the script grows too large, it might spill to swap, and then appending to it could be slow. If that's the case, maybe I should change theorems.append in the script to write to out_file directly.
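
The suggested change amounts to streaming each entry to the JSONL file as it is produced, rather than accumulating a theorems list in memory first. A minimal sketch (the function name and entry structure here are illustrative, not the actual convert_minictx.py code):

```python
import json

def convert_streaming(entries, out_path):
    """Write each entry to the JSONL file immediately instead of
    appending to an in-memory list. Memory usage stays flat no
    matter how many entries mathlib produces."""
    count = 0
    with open(out_path, "w") as out_file:
        for entry in entries:
            out_file.write(json.dumps(entry) + "\n")  # out_file.write directly
            count += 1
    return count
```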

I personally did not try running the convert script on all of mathlib, because miniCTX is meant to be used on other, smaller projects (see the miniCTX paper). If you are extracting data for training a model, then the output files in Examples/Mathlib probably already suffice without conversion, and you can also use scripts/instruction_tuning.py.

@yuchen814
Author

Thank you for your help!

@hanwenzhu
Collaborator

> Thank you for your help!

Happy to help :) Let me know if you have any other questions or feedback!
