Question: How to set up the files for the example run? #8

Open
satvikkg opened this issue Oct 25, 2024 · 3 comments · May be fixed by #10

Comments

@satvikkg

Hi, great job on the package. I had no issues setting up the environment, but I am finding it hard to run the example.
Here are the steps that I followed. Please let me know how to set it up correctly.
Step 1: I created a directory called test and, inside it, a new directory called input.
Step 2: I copied the contents of a3fe/a3fe/data/example_run_dir/input into the test/input folder.
Step 3: I modified the run_somd.sh file as follows:
#!/bin/bash
SBATCH -o=somd-array-gpu-%A.%a.out # I uncommented this line and added = after -o
#SBATCH -p RTXA6000
#SBATCH -n 1
#SBATCH --time 24:00:00
#SBATCH --gres=gpu:0

       lam=$1
       echo "lambda is: " $lam

       srun somd-freenrg -C template_config.cfg -l $lam -p CUDA

Step 4: I created a test.py file in the test folder with the following code:
import a3fe as a3
calc = a3.Calculation(ensemble_size=5)
calc.setup()
calc.get_optimal_lam_vals()
calc.run(adaptive=False, runtime = 5) # Run non-adaptively for 5 ns per replicate
calc.wait()
calc.set_equilibration_time(1) # Discard the first ns of simulation time
calc.analyse()
calc.save()

Step 5: I ran the python file using python3 test.py

This is the error log:
INFO - 2024-10-25 09:50:18,779 - Calculation_0 - Found all required input files for preparation stage parameterised
INFO - 2024-10-25 09:50:18,781 - Calculation_0 - Modifying/ creating legs
INFO - 2024-10-25 09:50:18,781 - Calculation_0 - Setting up bound leg...
INFO - 2024-10-25 09:50:18,782 - Leg (type = BOUND)_1 - Found all required input files for preparation stage parameterised
INFO - 2024-10-25 09:50:18,784 - Leg (type = BOUND)_1 - Setting up leg...
INFO - 2024-10-25 09:50:18,784 - Leg (type = BOUND)_1 - Creating stage input directories...
INFO - 2024-10-25 09:50:18,785 - Leg (type = BOUND)_1 - Solvating input structure. Submitting through SLURM...
╭─────────────────────────────── Traceback (most recent call last) ────────────────────────────────╮
│ /testrun/test.py:3 in <module> │
│ │
│ 1 import a3fe as a3 │
│ 2 calc = a3.Calculation(ensemble_size=5) │
│ ❱ 3 calc.setup() │
│ 4 calc.get_optimal_lam_vals() │
│ 5 calc.run(adaptive=False, runtime = 5) # Run non-adaptively for 5 ns per replicate │
│ 6 calc.wait() │
│ │
│ /opt/conda/envs/a3fe/lib/python3.12/site-packages/a3fe/run/calculation.py:201 in setup │
│ │
│ 198 │ │ │ │ stream_log_level=self.stream_log_level, │
│ 199 │ │ │ ) │
│ 200 │ │ │ self.legs.append(leg) │
│ ❱ 201 │ │ │ leg.setup(configs[leg_type]) │
│ 202 │ │ │
│ 203 │ │ # Save the state │
│ 204 │ │ self.setup_complete = True │
│ │
│ /opt/conda/envs/a3fe/lib/python3.12/site-packages/a3fe/run/leg.py:229 in setup │
│ │
│ 226 │ │ if self.prep_stage == _PreparationStage.STRUCTURES_ONLY: │
│ 227 │ │ │ self.parameterise_input(slurm=cfg.slurm) │
│ 228 │ │ if self.prep_stage == _PreparationStage.PARAMETERISED: │
│ ❱ 229 │ │ │ self.solvate_input(slurm=cfg.slurm) # This also adds ions │
│ 230 │ │ if self.prep_stage == _PreparationStage.SOLVATED: │
│ 231 │ │ │ system = self.minimise_input(slurm=cfg.slurm) │
│ 232 │ │ if self.prep_stage == _PreparationStage.MINIMISED: │
│ │
│ /opt/conda/envs/a3fe/lib/python3.12/site-packages/a3fe/run/leg.py:465 in solvate_input │
│ │
│ 462 │ │ │ │ fn = _system_prep.slurm_solvate_free │
│ 463 │ │ │ else: │
│ 464 │ │ │ │ raise ValueError("Invalid leg type.") │
│ ❱ 465 │ │ │ self._run_slurm(fn, wait=True, run_dir=self.input_dir, job_name=job_name) │
│ 466 │ │ │ │
│ 467 │ │ │ # Check that the required input files have been produced, since slurm can fa │
│ 468 │ │ │ for file in _PreparationStage.SOLVATED.get_simulation_input_files( │
│ │
│ /opt/conda/envs/a3fe/lib/python3.12/site-packages/a3fe/run/leg.py:883 in _run_slurm │
│ │
│ 880 │ │ │ f"{run_dir}", │
│ 881 │ │ │ f"{run_dir}/{job_name}.sh", │
│ 882 │ │ ] # The virtual queue adds sbatch │
│ ❱ 883 │ │ slurm_file_base = _get_slurm_file_base(slurm_file) │
│ 884 │ │ job = self.virtual_queue.submit(cmd_list, slurm_file_base=slurm_file_base) │
│ 885 │ │ self._logger.info(f"Submitted job {job}") │
│ 886 │ │ self.jobs.append(job) │
│ │
│ /opt/conda/envs/a3fe/lib/python3.12/site-packages/a3fe/read/_process_slurm_files.py:37 in │
│ get_slurm_file_base │
│ │
│ 34 │ │ │ │ │ │ return _os.path.join(base_dir, slurm_pattern) │
│ 35 │ │
│ 36 │ # We haven't returned - raise an error │
│ ❱ 37 │ raise RuntimeError(f"Could not find slurm output file name in {slurm_file}") │
│ 38 │
╰──────────────────────────────────────────────────────────────────────────────────────────────────╯
RuntimeError: Could not find slurm output file name in /testrun/input/solvate_bound.sh

I would really appreciate it if you could let me know how to use a3fe correctly. Thanks!

@fjclark
Collaborator

fjclark commented Oct 25, 2024

Hi @satvikkg, thanks! Glad the environment setup was painless.

Thanks for the super clear description of what you ran! The issue is that you shouldn't make this modification:

SBATCH -o=somd-array-gpu-%A.%a.out # I uncommented this line and added = after -o

because SLURM only reads options from lines that begin with #SBATCH, and won't read options after a bare SBATCH (i.e. the "commented" lines are functional). From https://slurm.schedmd.com/sbatch.html: "The batch script may contain options preceded with "#SBATCH" before any executable commands in the script. sbatch will stop processing further #SBATCH directives once the first non-comment non-whitespace line has been reached in the script."

So just undo your change to the output line in run_somd.sh, i.e.

#SBATCH -o somd-array-gpu-%A.%a.out

and start again. Make sure to delete the Calculation.pkl file before rerunning, otherwise the (broken) calculation state will be read from this when you instantiate the class.
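To illustrate the rule quoted from the sbatch documentation, here is a minimal Python sketch. It is not how SLURM or a3fe parse the file; the function name `sbatch_directives` is hypothetical, and the matching is simplified (it ignores leading whitespace and does not interpret the options themselves).

```python
def sbatch_directives(script_text):
    """Return the option strings of the #SBATCH lines that would be processed.

    Simplified reading of the sbatch rule: comment lines are scanned for
    "#SBATCH" until the first non-comment, non-blank line is reached.
    """
    directives = []
    for line in script_text.splitlines():
        stripped = line.strip()
        if not stripped:
            continue  # blank lines do not end the directive section
        if not stripped.startswith("#"):
            break  # first executable line: later #SBATCH lines are ignored
        parts = stripped.split(None, 1)
        if parts[0] == "#SBATCH":
            directives.append(parts[1] if len(parts) > 1 else "")
    return directives


# The modified script: the bare "SBATCH" line counts as an executable command,
# so no directives are picked up at all.
broken = "#!/bin/bash\nSBATCH -o=somd-array-gpu-%A.%a.out\n#SBATCH -p RTXA6000\n"

# The original script: both directives are seen.
fixed = "#!/bin/bash\n#SBATCH -o somd-array-gpu-%A.%a.out\n#SBATCH -p RTXA6000\n"
```

Under this reading, uncommenting the first directive also hides every #SBATCH line below it.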

Hope that works and let me know if you have any more issues!

@satvikkg
Author

I reverted my earlier changes and this time ran it in the /a3fe/a3fe/data/example_run_dir folder, which contains bound, free, input, and test.py.

This is the error I got:
INFO:matplotlib.font_manager:generated new fontManager
INFO:numexpr.utils:Note: NumExpr detected 32 cores but "NUMEXPR_MAX_THREADS" not set, so enforcing safe limit of 16.
INFO:numexpr.utils:NumExpr defaulting to 16 threads.
INFO - 2024-10-25 10:56:37,088 - Calculation_0 - Found all required input files for preparation stage parameterised
INFO - 2024-10-25 10:56:37,089 - Calculation_0 - Modifying/ creating legs
INFO - 2024-10-25 10:56:37,089 - Calculation_0 - Setting up bound leg...
INFO - 2024-10-25 10:56:37,090 - Leg (type = BOUND)_1 - Found all required input files for preparation stage parameterised
INFO - 2024-10-25 10:56:37,092 - Leg (type = BOUND)_1 - Setting up leg...
INFO - 2024-10-25 10:56:37,092 - Leg (type = BOUND)_1 - Creating stage input directories...
INFO - 2024-10-25 10:56:37,092 - Leg (type = BOUND)_1 - Solvating input structure. Submitting through SLURM...
INFO - 2024-10-25 10:56:37,093 - VirtualQueue - Job (virtual_job_id = 0, slurm_job_id= None), status = JobStatus.QUEUED submitted
╭─────────────────────────────── Traceback (most recent call last) ────────────────────────────────╮
│ /a3fe/a3fe/data/example_run_dir/test.py:3 in <module> │
│ │
│ 1 import a3fe as a3 │
│ 2 calc = a3.Calculation(ensemble_size=5) │
│ ❱ 3 calc.setup() │
│ 4 calc.get_optimal_lam_vals() │
│ 5 calc.run(adaptive=False, runtime = 5) # Run non-adaptively for 5 ns per replicate │
│ 6 calc.wait() │
│ │
│ /opt/conda/envs/a3fe/lib/python3.12/site-packages/a3fe/run/calculation.py:201 in setup │
│ │
│ 198 │ │ │ │ stream_log_level=self.stream_log_level, │
│ 199 │ │ │ ) │
│ 200 │ │ │ self.legs.append(leg) │
│ ❱ 201 │ │ │ leg.setup(configs[leg_type]) │
│ 202 │ │ │
│ 203 │ │ # Save the state │
│ 204 │ │ self.setup_complete = True │
│ │
│ /opt/conda/envs/a3fe/lib/python3.12/site-packages/a3fe/run/leg.py:229 in setup │
│ │
│ 226 │ │ if self.prep_stage == _PreparationStage.STRUCTURES_ONLY: │
│ 227 │ │ │ self.parameterise_input(slurm=cfg.slurm) │
│ 228 │ │ if self.prep_stage == _PreparationStage.PARAMETERISED: │
│ ❱ 229 │ │ │ self.solvate_input(slurm=cfg.slurm) # This also adds ions │
│ 230 │ │ if self.prep_stage == _PreparationStage.SOLVATED: │
│ 231 │ │ │ system = self.minimise_input(slurm=cfg.slurm) │
│ 232 │ │ if self.prep_stage == _PreparationStage.MINIMISED: │
│ │
│ /opt/conda/envs/a3fe/lib/python3.12/site-packages/a3fe/run/leg.py:465 in solvate_input │
│ │
│ 462 │ │ │ │ fn = _system_prep.slurm_solvate_free │
│ 463 │ │ │ else: │
│ 464 │ │ │ │ raise ValueError("Invalid leg type.") │
│ ❱ 465 │ │ │ self._run_slurm(fn, wait=True, run_dir=self.input_dir, job_name=job_name) │
│ 466 │ │ │ │
│ 467 │ │ │ # Check that the required input files have been produced, since slurm can fa │
│ 468 │ │ │ for file in _PreparationStage.SOLVATED.get_simulation_input_files( │
│ │
│ /opt/conda/envs/a3fe/lib/python3.12/site-packages/a3fe/run/leg.py:884 in _run_slurm │
│ │
│ 881 │ │ │ f"{run_dir}/{job_name}.sh", │
│ 882 │ │ ] # The virtual queue adds sbatch │
│ 883 │ │ slurm_file_base = _get_slurm_file_base(slurm_file) │
│ ❱ 884 │ │ job = self.virtual_queue.submit(cmd_list, slurm_file_base=slurm_file_base) │
│ 885 │ │ self._logger.info(f"Submitted job {job}") │
│ 886 │ │ self.jobs.append(job) │
│ 887 │ │ # Update the virtual queue to submit the job │
│ │
│ /opt/conda/envs/a3fe/lib/python3.12/site-packages/a3fe/run/_virtual_queue.py:169 in submit │
│ │
│ 166 │ │ self._pre_queue.append(job) │
│ 167 │ │ self._logger.info(f"{job} submitted") │
│ 168 │ │ # Now update - the job will be moved to the real queue if there is space │
│ ❱ 169 │ │ self.update() │
│ 170 │ │ return job │
│ 171 │ │
│ 172 │ def kill(self, job: Job) -> None: │
│ │
│ /opt/conda/envs/a3fe/lib/python3.12/site-packages/a3fe/run/_virtual_queue.py:278 in update │
│ │
│ 275 │ │ """Remove jobs from the queue if they have finished, then move jobs from │
│ 276 │ │ the pre-queue to the real queue if there is space.""" │
│ 277 │ │ # First, remove jobs from the queue if they have finished │
│ ❱ 278 │ │ running_slurm_job_ids = self._read_slurm_queue() │
│ 279 │ │ n_running_slurm_jobs = len(running_slurm_job_ids) │
│ 280 │ │ # Remove completed jobs from the queues and update their status │
│ 281 │ │ for job in self._slurm_queue: │
│ │
│ /opt/conda/envs/a3fe/lib/python3.12/site-packages/a3fe/run/_virtual_queue.py:229 in │
│ _read_slurm_queue │
│ │
│ 226 │ │ │ ] │
│ 227 │ │ │ return running_slurm_job_ids │
│ 228 │ │ │
│ ❱ 229 │ │ return _read_slurm_queue_inner() │
│ 230 │ │
│ 231 │ def _submit_job(self, job_command_list: _List[str]) -> int: │
│ 232 │ │ """ │
│ │
│ /opt/conda/envs/a3fe/lib/python3.12/site-packages/a3fe/run/_utils.py:79 in newfn │
│ │
│ 76 │ │ │ attempt = 0 │
│ 77 │ │ │ while attempt < times: │
│ 78 │ │ │ │ try: │
│ ❱ 79 │ │ │ │ │ return func(*args, **kwargs) │
│ 80 │ │ │ │ except exceptions as e: │
│ 81 │ │ │ │ │ logger.error( │
│ 82 │ │ │ │ │ │ f"Exception thrown when attempting to run {func}, attempt " │
│ │
│ /opt/conda/envs/a3fe/lib/python3.12/site-packages/a3fe/run/_virtual_queue.py:211 in │
│ _read_slurm_queue_inner │
│ │
│ 208 │ │ │ ] │
│ 209 │ │ │ │
│ 210 │ │ │ # Create a pipe for the first command │
│ ❱ 211 │ │ │ process = _subprocess.Popen(commands[0], stdout=_subprocess.PIPE) │
│ 212 │ │ │ │
│ 213 │ │ │ # Chain the commands │
│ 214 │ │ │ for cmd in commands[1:]: │
│ │
│ /opt/conda/envs/a3fe/lib/python3.12/subprocess.py:1026 in __init__ │
│ │
│ 1023 │ │ │ │ │ self.stderr = io.TextIOWrapper(self.stderr, │
│ 1024 │ │ │ │ │ │ │ encoding=encoding, errors=errors) │
│ 1025 │ │ │ │
│ ❱ 1026 │ │ │ self._execute_child(args, executable, preexec_fn, close_fds, │
│ 1027 │ │ │ │ │ │ │ │ pass_fds, cwd, env, │
│ 1028 │ │ │ │ │ │ │ │ startupinfo, creationflags, shell, │
│ 1029 │ │ │ │ │ │ │ │ p2cread, p2cwrite, │
│ │
│ /opt/conda/envs/a3fe/lib/python3.12/subprocess.py:1885 in _execute_child │
│ │
│ 1882 │ │ │ │ │ │ │ for dir in os.get_exec_path(env)) │
│ 1883 │ │ │ │ │ fds_to_keep = set(pass_fds) │
│ 1884 │ │ │ │ │ fds_to_keep.add(errpipe_write) │
│ ❱ 1885 │ │ │ │ │ self.pid = _fork_exec( │
│ 1886 │ │ │ │ │ │ │ args, executable_list, │
│ 1887 │ │ │ │ │ │ │ close_fds, tuple(sorted(map(int, fds_to_keep))), │
│ 1888 │ │ │ │ │ │ │ cwd, env_list, │
╰──────────────────────────────────────────────────────────────────────────────────────────────────╯
TypeError: expected str, bytes or os.PathLike object, not NoneType

@fjclark
Collaborator

fjclark commented Oct 25, 2024

Sorry this still isn't working.

It looks like the issue is here:
process = _subprocess.Popen(commands[0], stdout=_subprocess.PIPE)

commands[0] is ["squeue", "-h", "-u", _os.getenv("USER")], so the issue is likely that os.getenv("USER") is returning None. A short-term hack is to run

export USER="your_username"

before running a3fe, but this isn't a nice solution.
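As a rough sketch of why an unset USER variable ends in this TypeError: looking up a missing environment variable yields None, and that None lands in the argument list passed to Popen. The helper names `build_squeue_command` and `missing_args` below are hypothetical and are not part of a3fe; the command list mirrors the one quoted above.

```python
import os


def build_squeue_command(env=None):
    """Build the squeue command quoted above, reading USER from `env`.

    `env` defaults to the real environment; passing a dict makes the
    behaviour easy to check without touching os.environ.
    """
    env = os.environ if env is None else env
    return ["squeue", "-h", "-u", env.get("USER")]


def missing_args(cmd):
    """Return the positions of any None entries in a command list."""
    return [i for i, arg in enumerate(cmd) if arg is None]


# With USER unset, the last argument is None, which Popen cannot execute.
print(missing_args(build_squeue_command({})))
# With USER set (the export workaround), every argument is a string.
print(missing_args(build_squeue_command({"USER": "your_username"})))
```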

Could you please:

  • Let me know which OS you are using
  • Run echo $USER in your shell and confirm that this does not return your username.
  • Confirm that python -c "import getpass; print(getpass.getuser())" prints your username.

I will then update the code to use getpass.getuser() instead of evaluating the USER environment variable, which should be more robust.
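A minimal sketch of what that change might look like, assuming the same squeue arguments as the command quoted above. The function name `build_squeue_command` is hypothetical, and this is not the actual patch.

```python
import getpass


def build_squeue_command():
    """Build the squeue command using getpass.getuser() for the username.

    getpass.getuser() checks the LOGNAME, USER, LNAME and USERNAME
    environment variables in turn and, where supported, falls back to the
    password database. It either returns a name or raises an exception,
    so None never ends up in the command list.
    """
    return ["squeue", "-h", "-u", getpass.getuser()]
```

The practical difference is that a missing USER variable no longer matters as long as one of the other lookups succeeds.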

Thanks.

fjclark added a commit that referenced this issue Nov 18, 2024
@fjclark fjclark linked a pull request Nov 18, 2024 that will close this issue