
Error when I run "python3 scripts/extract_repos.py --cwd ../mathlib4 --config configs/config_mathlib_v4.13.json --training_data " #6

Open
yuchen814 opened this issue Jan 8, 2025 · 12 comments


@yuchen814

Great work!

When I run "python3 scripts/extract_repos.py --cwd ../mathlib4 --config configs/config_mathlib_v4.13.json --training_data", I get an error like this:
✖ [335/339] Building scripts.training_data
trace: .> LEAN_PATH=././.lake/packages/batteries/.lake/build/lib:././.lake/packages/Qq/.lake/build/lib:././.lake/packages/aesop/.lake/build/lib:././.lake/packages/proofwidgets/.lake/build/lib:././.lake/packages/Cli/.lake/build/lib:././.lake/packages/importGraph/.lake/build/lib:././.lake/packages/LeanSearchClient/.lake/build/lib:././.lake/packages/mathlib/.lake/build/lib:././.lake/build/lib DYLD_LIBRARY_PATH= /Users/yuchen/.elan/toolchains/leanprover--lean4---v4.13.0-rc3/bin/lean -Dlinter.unusedVariables=false -Dlinter.deprecated=false ././././scripts/training_data.lean -R ./././. -o ././.lake/build/lib/scripts/training_data.olean -i ././.lake/build/lib/scripts/training_data.ilean -c ././.lake/build/ir/scripts/training_data.c --json
info: stderr:
failed to write '././.lake/build/lib/scripts/training_data.olean': 2 No such file or directory
error: Lean exited with code 1
[the same failure is logged twice more]
Some required builds logged failures:

  • scripts.training_data
    error: build failed
    0%| | 0/5397 [00:07<?, ?it/s]

How to solve this? Thanks for your help.

@hanwenzhu
Collaborator

Thank you for your interest in our project! I think the problem is that the argument for --cwd should be the path to ntp-toolkit, not the path to mathlib4. By default, ntp-toolkit will try to install the mathlib version specified by the JSON config, so you don't need to clone mathlib on your own.

@yuchen814
Author

Thanks for your prompt reply. You are right! But I ran into other problems.

When I run 'python3 scripts/extract_repos.py --cwd ../ntp-toolkit --config configs/config_mathlib_v4.13.json --training_data --full_proof_training_data --premises --state_comments --full_proof_training_data_states --training_data_with_premises --declarations --imports'

All tasks ran fine except 'training_data_with_premises'. Here is the error:

error: ././././scripts/training_data_with_premises.lean:263:91: application type mismatch
unfoldConstantName constName constantsMap
argument
constantsMap
has type
Std.HashMap Name ConstantInfo : Type
but is expected to have type
HashMap Name ConstantInfo : Type
error: Lean exited with code 1
Some required builds logged failures:

  • scripts.training_data_with_premises

When I remove 'training_data_with_premises' and run 'python3 scripts/extract_repos.py --cwd ../ntp-toolkit --config configs/config_mathlib_v4.13.json --training_data --full_proof_training_data --premises --state_comments --full_proof_training_data_states --declarations --imports'
At first, everything looked correct: I saw files appearing under Examples. However, I hit the following three problems:

  1. It suddenly interrupted after running for about three hours. The error is as follows:

concurrent.futures.process._RemoteTraceback:
"""
Traceback (most recent call last):
File "/Library/Frameworks/Python.framework/Versions/3.10/lib/python3.10/concurrent/futures/process.py", line 246, in _process_worker
r = call_item.fn(*call_item.args, **call_item.kwargs)
File "/Users/yuchen/Research/lean_research_ant/ntp-toolkit/scripts/run_pipeline.py", line 68, in _extract_module
_run_cmd(
File "/Users/yuchen/Research/lean_research_ant/ntp-toolkit/scripts/run_pipeline.py", line 44, in _run_cmd
subprocess.run(
File "/Library/Frameworks/Python.framework/Versions/3.10/lib/python3.10/subprocess.py", line 526, in run
raise CalledProcessError(retcode, process.args,
subprocess.CalledProcessError: Command '['lake', 'exe', 'full_proof_training_data', 'Examples.mathlib.StateComments.Batteries.Classes.RatCast']' returned non-zero exit status 1.
"""

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
File "/Users/yuchen/Research/lean_research_ant/ntp-toolkit/../ntp-toolkit/scripts/run_pipeline.py", line 150, in
completed.append(future.result())
File "/Library/Frameworks/Python.framework/Versions/3.10/lib/python3.10/concurrent/futures/_base.py", line 451, in result
return self.__get_result()
File "/Library/Frameworks/Python.framework/Versions/3.10/lib/python3.10/concurrent/futures/_base.py", line 403, in __get_result
raise self._exception
subprocess.CalledProcessError: Command '['lake', 'exe', 'full_proof_training_data', 'Examples.mathlib.StateComments.Batteries.Classes.RatCast']' returned non-zero exit status 1.
Traceback (most recent call last):
File "/Users/yuchen/Research/lean_research_ant/ntp-toolkit/scripts/extract_repos.py", line 257, in
_run(
File "/Users/yuchen/Research/lean_research_ant/ntp-toolkit/scripts/extract_repos.py", line 137, in _run
subprocess.run([
File "/Library/Frameworks/Python.framework/Versions/3.10/lib/python3.10/subprocess.py", line 526, in run
raise CalledProcessError(retcode, process.args,
subprocess.CalledProcessError: Command '['python3', '../ntp-toolkit/scripts/run_pipeline.py', '--output-base-dir', 'Examples/mathlib', '--cwd', '../ntp-toolkit', '--import-module', 'Mathlib', 'Init', 'Batteries', '--name', 'mathlib', '--task', 'training_data', 'full_proof_training_data', 'premises', 'state_comments', 'full_proof_training_data_states', 'imports']' returned non-zero exit status 1.

  2. I noticed that although there are new files in the folder "FullProofWithStates", no content has been written to any of them. Many files in other folders are also empty.

  3. How can I obtain data containing statistical information like the benchmark presented in the 'miniCTX' paper?

@yuchen814
Author

There was also another error earlier; could my fix for it have caused the other mistakes above?

error: ././././scripts/state_comments.lean:60:38: unknown constant 'Lean.HashSet.ofList'
error: Lean exited with code 1
Some required builds logged failures:

  • scripts.state_comments

I changed 'Lean.HashSet.ofList' to 'Std.HashSet.ofList', and the error was solved.

@hanwenzhu
Collaborator

hanwenzhu commented Jan 9, 2025

Hi @yuchen814, --state_comments and --training_data_with_premises specifically are not reliably supported across different Lean versions. Since I see you are using Lean & Mathlib 4.13, it is possible that changes in the Lean core broke this script (you need to change Lean.HashSet to Std.HashSet, etc.). If you need --state_comments, I can try to fix this and the issue you posted; your correction of this error is right and didn't lead to the other errors above. As for --training_data_with_premises, it is currently under active development in a different fork and is not ready for use.
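
For reference, the rename involved is roughly the following (a sketch only; the exact namespaces depend on your Lean/Batteries version, and the identifiers mirror the error messages above):

```lean
-- On older toolchains the hash collections lived in the Lean namespace:
--   Lean.HashSet.ofList, Lean.HashMap
-- On recent Lean core they moved to Std, so scripts need e.g.:
def namesToSet (names : List Lean.Name) : Std.HashSet Lean.Name :=
  Std.HashSet.ofList names
```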

To obtain miniCTX-style data, you can specify --premises --full_proof_training_data only, and then a script will be automatically run to convert to a miniCTX-style JSONL file. If you are evaluating a retrieval-based model, then you might also need --declarations --imports to also extract the possible premises to retrieve from. For training a model, you may also want --training_data.

@yuchen814
Author

Thank you very much!

After I run 'python3 scripts/extract_repos.py --cwd ../ntp-toolkit --config configs/config_mathlib_v4.13.json --training_data --full_proof_training_data --premises --state_comments --full_proof_training_data_states --declarations --imports'.

I got premises like this:
[screenshot of the extracted premises output]

Is this right? (I ran it for a few hours, but there were only three files.)

@hanwenzhu
Collaborator

hanwenzhu commented Jan 9, 2025

The output looks very odd and isn't what I would expect. Can you first git pull the newest commit I pushed, then remove the --state_comments --full_proof_training_data_states flags and see what the output is? Also, are there other files under Examples/Mathlib?

@yuchen814
Author

Thank you very much!

I've completed most of the tasks following your method. However, an error occurred at the end, when converting to miniCTX.

Traceback (most recent call last):
File "/Users/yuchen/Research/lean_research_ant/temp/ntp-toolkit/scripts/convert_minictx.py", line 98, in
raise ValueError(f"Unrecognized repository path in {file_path}")
ValueError: Unrecognized repository path in /Users/yuchen/.elan/toolchains/leanprover--lean4---v4.13.0-rc3/bin/../src/lean/Init/Control/Basic.lean

I found the relevant code:
[screenshot of the convert_minictx.py check]
Is there any way to solve this? Or is it okay to skip entries when there is a mismatch?

@hanwenzhu
Collaborator

Oh right, I'm sorry, I forgot about this logic in the code. I pushed a new commit to try to fix this. Yes, it is okay to skip when there is a mismatch. This happened because you tried to convert a theorem/definition from the Lean core, which this logic didn't cover because of my mistake. Can you git pull and run the conversion again (you can call it directly with python scripts/convert_minictx.py Examples/Mathlib Examples/Mathlib/minictx.jsonl)?

@yuchen814
Author

Thanks for your help!
How long does the convert script take to execute? I've been running it for two days and it hasn't finished yet. There are currently 25,000 entries. (I changed the file writing mode to append mode.)

@hanwenzhu
Collaborator

hanwenzhu commented Jan 13, 2025

@yuchen814 I think the reason is potentially that your computer ran out of memory. Can you check the memory usage? If the theorems list in the script grows too large, it might spill to swap, and then appending to it could be slow. If that's the case, maybe I should change theorems.append in the script to write to out_file directly.
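
The suggested change amounts to streaming each entry to the JSONL file as it is produced, rather than accumulating a theorems list in memory first. A minimal sketch (the function name and entry structure here are illustrative, not the actual convert_minictx.py code):

```python
import json

def convert_streaming(entries, out_path):
    """Write each entry to the JSONL file immediately instead of
    appending to an in-memory list. Memory usage stays flat no
    matter how many entries mathlib produces."""
    count = 0
    with open(out_path, "w") as out_file:
        for entry in entries:
            out_file.write(json.dumps(entry) + "\n")  # out_file.write directly
            count += 1
    return count
```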

I personally did not try running the convert script on all of mathlib, because miniCTX is meant to be used on other, smaller projects (see the miniCTX paper). If you are extracting data for training a model, then the output files in Examples/Mathlib probably already suffice without conversion, and you can also use scripts/instruction_tuning.py.

@yuchen814
Author

Thank you for your help!

@hanwenzhu
Collaborator

> Thank you for your help!

Happy to help :) Let me know if you have any other questions or feedback!
