Submit script for tianhe2 SLURM #93

Open
PrometheusPi opened this issue Feb 14, 2018 · 93 comments
@PrometheusPi
Member

@QJohn2017 needs a setup script for tianhe2 (see #89). Since tianhe2 uses SLURM for scheduling, either the ./prepare_job script needs to be adapted into a ./prepare_job_tianhe2.sh script that creates a SLURM submit file, or we go directly with a submit file that focuses on MPI jobs only (since tianhe2 is large and probably has to handle quite a lot of jobs, MPI is probably the better choice for this system).

Additionally, submit scripts for other clusters (taurus, PizDaint, etc.) should be provided.

@QJohn2017 Are you planning to submit with MPI only or are you also considering running SLURM array jobs?
If only an MPI job is planned, a simple submit script should be sufficient.

@PrometheusPi
Member Author

@QJohn2017 some questions regarding your setup on tianhe2:

  • What is the name of the queue you are using/planning to use?
  • How many cores does a node have?
  • How much memory is available on each node?
  • Does the system require you to set specific hardware specifiers?

If you could provide an example submit file, that would help a lot.

@QJohn2017

@PrometheusPi
In my supercomputer center, a node has 24 cores, and I don't know the available memory on each node.
The following is one of my submit files, which I use for the EPOCH program on tianhe2:
#!/bin/bash echo Data | yhrun -N 4 -n 96 ./bin/epoch2d

@QJohn2017

@PrometheusPi
sorry, the submit command is
#!/bin/bash
echo Data | yhrun -N 4 -n 96 ./bin/epoch2d

Another submit command is
#!/bin/bash
yhrun -N 4 -n 96 ../src/lapin lapin.inp

@QJohn2017

@PrometheusPi
#!/bin/bash
yhrun -N 4 -n 96 ../src/lapin lapin.inp

@QJohn2017

@PrometheusPi ,

When I use the submit commands above, one of the output files is named
slurm-4399462.out
where the number is the job ID

@PrometheusPi
Member Author

Do you submit these scripts via the command sbatch?
Like for example:

siom003@login3 sbatch my_submit_file.sh

You do not need any specifications via bash comments, such as the following?

#SBATCH --partition=normal
#SBATCH --time=24:00:00

#SBATCH --job-name=NameOfMyJob
#SBATCH --nodes=8
#SBATCH --ntasks-per-node=1
#SBATCH --cpus-per-task=12
#SBATCH --ntasks-per-core=1
#SBATCH -o stdout
#SBATCH -e stderr

What does the help file of sbatch -h return?

@PrometheusPi
Member Author

@QJohn2017
Is yhrun an MPI launch command like mpiexec, mpirun or srun?

@PrometheusPi
Member Author

Could you try running yhrun -N 2 -n 48 ./path_to_clara_src/executable?

@QJohn2017

@PrometheusPi
How do you know siom003@login3?
siom003 is our group name at the Shanghai supercomputer center.

@QJohn2017

@PrometheusPi ,
On tianhe2, I create a submit script, e.g. named test.sh, and then submit the job with the command
yhbatch -N 4 ./test.sh

@QJohn2017

@PrometheusPi ,
yhrun is used when I submit a job without a .sh script file.
yhbatch is used when a .sh submit script file is used.
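
The two commands fit together as in the following minimal sketch, which prints a submit script for a given number of nodes. It assumes 24 cores per node (as reported above) and one MPI task per core; the function name write_submit_script and its variables are hypothetical and not part of clara2.

```shell
#!/bin/bash
# Hypothetical helper: print a submit script that starts one task per core.
# Assumption: every node has 24 cores, as reported for this center.
write_submit_script() {
    nodes="$1"        # number of nodes to request
    program="$2"      # path to the program that should be started
    cores_per_node=24
    ntasks=$((nodes * cores_per_node))
    printf '#!/bin/bash\n'
    printf 'yhrun -N %s -n %s %s\n' "$nodes" "$ntasks" "$program"
}

# Example: 4 nodes give 4 * 24 = 96 tasks.
# Redirect the output to a file (e.g. > test.sh) to obtain the submit script.
write_submit_script 4 ./bin/epoch2d
```

The resulting file would then be submitted with yhbatch -N 4 ./test.sh, so that the node count given to yhbatch matches the one used by yhrun inside the script.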

@PrometheusPi
Member Author

The shell prompt siom003@login3 is the one you used in your first issue - I just copied and pasted it to resemble a known prompt, in order to emphasize that the entry is intended to work directly on the command line.
However, if this is the computer system of your group, then sbatch might not exist there - it might only be available on a login node of the supercomputer.

@PrometheusPi
Member Author

Remark for myself:
sbatch is called yhbatch
srun is called yhrun
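
A minimal sketch of how a submit helper could cope with this renaming. It assumes that the renamed commands follow the pattern "yh" plus the SLURM name without its leading "s" and accept the same options, which is not verified at this point; the function name pick_slurm_cmd is hypothetical.

```shell
#!/bin/bash
# Hypothetical helper: print the tianhe2 name of a SLURM command if it is
# installed, otherwise fall back to the standard SLURM name.
pick_slurm_cmd() {
    standard="$1"                  # e.g. "sbatch" or "srun"
    renamed="yh${standard#s}"      # sbatch -> yhbatch, srun -> yhrun
    if command -v "$renamed" >/dev/null 2>&1; then
        echo "$renamed"
    else
        echo "$standard"
    fi
}

SUBMIT_CMD=$(pick_slurm_cmd sbatch)
RUN_CMD=$(pick_slurm_cmd srun)
echo "submit with: $SUBMIT_CMD, launch with: $RUN_CMD"
```

With this, the same submit script could be used on tianhe2 and on clusters with the standard SLURM command names.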

@PrometheusPi
Member Author

Okay - Sorry - I got confused:
The initial error message in #89 is from your Shanghai supercomputer center.
The second message in #89 with the warnings is from tianhe2.
Is this correct?

Then please excuse my confusion and the use of siom003@login3 - then this should be executed not in Shanghai but on tianhe2.

@QJohn2017

@PrometheusPi ,
Yes, the initial error message in #89 is from the Shanghai supercomputer center. The other error messages all come from tianhe2.

@QJohn2017

@PrometheusPi ,
I want to run this program on tianhe2 first. Once the program runs correctly on tianhe2, I will consider running it at the Shanghai supercomputer center.

@PrometheusPi
Member Author

PrometheusPi commented Feb 14, 2018

Okay - then please excuse my confusion. I thought it was the same machine but different nodes.

Then let us start with finding a setup for tianhe2 first.

I know tianhe2 has a SLURM submit system with the above-mentioned renaming.
You mentioned that you submit via yhbatch -N 4 ./test.sh.

In test.sh, do you specify anything via bash comments #SBATCH ... or #YHBATCH?

@PrometheusPi
Member Author

Could you copy the help of yhbatch to this issue?
(This can probably be done by yhbatch -h or yhbatch --help or man yhbatch)
I assume everything will be similar or equal to sbatch, but I don't know.

@QJohn2017

QJohn2017 commented Feb 14, 2018

@PrometheusPi ,
Just now, I submitted the job with a script file named test.sh. The commands in test.sh are
#!/bin/bash
yhrun -N 2 -n 48 executable

When I enter the following command to submit the job
yhbatch -N 2 ./test.sh

the following errors appear:

executable: error while loading shared libraries: libfftw3.so.3: cannot open shared object file: No such file or directory
[... the same line repeated, 48 times in total ...]
executable: error while loading shared libraries: libfftw3.so.3: cannot open shared object file: No such file or directory
yhrun: error: cn10352: tasks 0-23: Exited with exit code 127

@QJohn2017

QJohn2017 commented Feb 14, 2018

@PrometheusPi ,
When I enter the command yhbatch -h, the output is as follows:

[ac_siom_jsliu_1@ln3%tianhe2-C src]$ yhbatch -h
Usage: sbatch [OPTIONS...] executable [args...]

Parallel run options:
  -a, --array=indexes        job array index values
  -A, --account=name          charge job to specified account
      --begin=time            defer job until HH:MM MM/DD/YY
  -c, --cpus-per-task=ncpus   number of cpus required per task
      --comment=name          arbitrary comment
  -d, --dependency=type:jobid defer job until condition on jobid is satisfied
  -D, --workdir=directory     set working directory for batch script
  -e, --error=err             file for batch script's standard error
      --export[=names]        specify environment variables to export
      --export-file=file|fd   specify environment variables file or file descriptor to export
      --get-user-env          load environment from local cluster
      --gid=group_id          group ID to run job as (user root only)
      --gres=list             required generic resources
  -H, --hold                  submit job in held state
  -i, --input=in              file for batch script's standard input
  -I, --immediate             exit if resources are not immediately available
      --jobid=id              run under already allocated job
  -J, --job-name=jobname      name of job
  -k, --no-kill               do not kill job on node failure
  -L, --licenses=names        required license, comma separated
  -m, --distribution=type     distribution method for processes to nodes
                              (type = block|cyclic|arbitrary)
  -M, --clusters=names        Comma separated list of clusters to issue
                              commands to.  Default is current cluster.
                              Name of 'all' will submit to run on all clusters.
      --mail-type=type        notify on state change: BEGIN, END, FAIL or ALL
      --mail-user=user        who to send email notification for job state
                              changes
  -n, --ntasks=ntasks         number of tasks to run
      --nice[=value]          decrease scheduling priority by value
      --no-requeue            if set, do not permit the job to be requeued
      --ntasks-per-node=n     number of tasks to invoke on each node
  -N, --nodes=N               number of nodes on which to run (N = min[-max])
  -o, --output=out            file for batch script's standard output
  -O, --overcommit            overcommit resources
  -p, --partition=partition   partition requested
      --profile=value         enable acct_gather_profile for detailed data
                              value is all or none or any combination of
                              energy, lustre, network or task
      --propagate[=rlimits]   propagate all [or specific list of] rlimits
      --qos=qos               quality of service
  -Q, --quiet                 quiet mode (suppress informational messages)
      --requeue               if set, permit the job to be requeued
  -t, --time=minutes          time limit
      --time-min=minutes      minimum time limit (if distinct)
  -s, --share                 share nodes with other jobs
      --uid=user_id           user ID to run job as (user root only)
  -v, --verbose               verbose mode (multiple -v's increase verbosity)
      --wrap[=command string] wrap commmand string in a sh script and submit
      --switches=max-switches{@max-time-to-wait}
                              Optimum switches and max time to wait for optimum
      --ignore-pbs            Ignore #PBS options in the batch script

Constraint options:
      --contiguous            demand a contiguous range of nodes
  -C, --constraint=list       specify a list of constraints
  -F, --nodefile=filename     request a specific list of hosts
      --mem=MB                minimum amount of real memory
      --mincpus=n             minimum number of logical processors (threads) per node
      --reservation=name      allocate resources from named reservation
      --tmp=MB                minimum amount of temporary disk
  -w, --nodelist=hosts...     request a specific list of hosts
  -x, --exclude=hosts...      exclude a specific list of hosts

Consumable resources related options:
      --exclusive             allocate nodes in exclusive mode when
                              cpu consumable resource is enabled
      --mem-per-cpu=MB        maximum amount of real memory per allocated
                              cpu required by the job.
                              --mem >= --mem-per-cpu if --mem is specified.

Affinity/Multi-core options: (when the task/affinity plugin is enabled)
  -B  --extra-node-info=S[:C[:T]]            Expands to:
       --sockets-per-node=S   number of sockets per node to allocate
       --cores-per-socket=C   number of cores per socket to allocate
       --threads-per-core=T   number of threads per core to allocate
                              each field can be 'min' or wildcard '*'
                              total cpus requested = (N x S x C x T)

      --ntasks-per-core=n     number of tasks to invoke on each core
      --ntasks-per-socket=n   number of tasks to invoke on each socket
      --cpu_bind=             Bind tasks to CPUs
                              (see "--cpu_bind=help" for options)
      --hint=                 Bind tasks according to application hints
                              (see "--hint=help" for options)
      --mem_bind=             Bind memory to locality domains (ldom)
                              (see "--mem_bind=help" for options)


Help options:
  -h, --help                  show this help message
  -u, --usage                 display brief usage message

Other options:
  -V, --version               output version information and exit

[ac_siom_jsliu_1@ln3%tianhe2-C src]$

@PrometheusPi
Member Author

Great, it looks like clara2 was put on the cluster.
But when executed, the fftw library was not found.

Could you please run ldd on executable?
In the environment where you compiled clara2, all links listed by ldd should point to a known place.
What does ldd executable return on the computer where you compiled the code?

If you add the ldd line to your test.sh as follows:

#!/bin/bash
ldd executable
yhrun -N 2 -n 48 executable

the same output should appear.

However, I assume that libfftw3.so.3 will point to ==> not found.
Is this correct?
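
This check could also be automated inside the submit script. A minimal sketch, assuming ldd marks unresolved libraries as "name => not found" (as glibc's ldd does); the function name missing_libs is hypothetical and the sample input is made up for illustration.

```shell
#!/bin/bash
# Hypothetical helper: read ldd output on stdin and print the names of
# all libraries that the dynamic loader could not resolve.
missing_libs() {
    awk '/=> not found/ { print $1 }'
}

# Example with canned (made-up) ldd output. In a submit script this would be
#   ldd ./executable | missing_libs
printf '\tlibfftw3.so.3 => not found\n\tlibm.so.6 => /lib64/libm.so.6 (0x0000000000000000)\n' | missing_libs
```

If the function prints anything, the submit script could stop before calling yhrun instead of failing once per task.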

@PrometheusPi
Member Author

Thank you for the help file. So far, everything looks the same as for sbatch.

@QJohn2017

QJohn2017 commented Feb 14, 2018

@PrometheusPi
When I input the command ldd executable, then the results are as follows:

[ac_siom_jsliu_1@ln3%tianhe2-C src]$ ldd executable
        linux-vdso.so.1 =>  (0x00007ffffedff000)
        libfftw3.so.3 => /usr/lib64/libfftw3.so.3 (0x0000003246000000)
        libm.so.6 => /lib64/libm.so.6 (0x0000003246800000)
        libz.so.1 => /lib64/libz.so.1 (0x0000003246c00000)
        libpthread.so.0 => /lib64/libpthread.so.0 (0x0000003246400000)
        librt.so.1 => /lib64/librt.so.1 (0x0000003247000000)
        libstdc++.so.6 => /usr/lib64/libstdc++.so.6 (0x000000324d400000)
        libgcc_s.so.1 => /lib64/libgcc_s.so.1 (0x000000324bc00000)
        libc.so.6 => /lib64/libc.so.6 (0x0000003245c00000)
        /lib64/ld-linux-x86-64.so.2 (0x0000003245800000)
        libdl.so.2 => /lib64/libdl.so.2 (0x00002aacd857e000)
[ac_siom_jsliu_1@ln3%tianhe2-C src]$ ^C
[ac_siom_jsliu_1@ln3%tianhe2-C src]$

@PrometheusPi
Member Author

PrometheusPi commented Feb 14, 2018

Great, and what does ldd return when executed in a submit script?

@QJohn2017

QJohn2017 commented Feb 14, 2018

When the commands in the test.sh file are changed to

#!/bin/bash
ldd executable
yhrun -N 2 -n 48 executable

the output becomes:

	linux-vdso.so.1 =>  (0x00007fff8247e000)
	libfftw3.so.3 => not found
	libm.so.6 => /lib64/libm.so.6 (0x0000003921600000)
	libz.so.1 => /lib64/libz.so.1 (0x0000003922600000)
	libpthread.so.0 => /lib64/libpthread.so.0 (0x0000003922200000)
	librt.so.1 => /lib64/librt.so.1 (0x0000003922a00000)
	libstdc++.so.6 => /usr/lib64/libstdc++.so.6 (0x0000003928200000)
	libgcc_s.so.1 => /lib64/libgcc_s.so.1 (0x0000003924e00000)
	libc.so.6 => /lib64/libc.so.6 (0x0000003921a00000)
	/lib64/ld-linux-x86-64.so.2 (0x0000003921200000)
	libdl.so.2 => /lib64/libdl.so.2 (0x0000003921e00000)
executable: error while loading shared libraries: libfftw3.so.3: cannot open shared object file: No such file or directory
...
executable: error while loading shared libraries: libfftw3.so.3: cannot open shared object file: No such file or directory
yhrun: error: cn9974: tasks 0-23: Exited with exit code 127

@PrometheusPi
Member Author

PrometheusPi commented Feb 14, 2018

Okay - it looks as if your bash environment differs between the node you compiled on and the node the job ran on via yhbatch.

Could you please copy the output of

echo $LD_LIBRARY_PATH

as returned on the compile node to this issue.

And then additionally post the result returned by yhbatch ./test.sh for the following test.sh file:

#!/bin/bash
echo $LD_LIBRARY_PATH
yhrun -N 2 -n 48 executable

There is probably some difference in the LD_LIBRARY_PATH.
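
Long path lists are hard to compare by eye. A minimal sketch that prints one entry per line, so the two results can be compared with diff; the function name path_entries, the file names in the comments, and the example paths are hypothetical.

```shell
#!/bin/bash
# Hypothetical helper: print the entries of a colon-separated path list,
# one per line, so that two lists can be compared with diff.
path_entries() {
    printf '%s\n' "$1" | tr ':' '\n'
}

# On the compile node:   path_entries "$LD_LIBRARY_PATH" > ldpath_login.txt
# In the submit script:  path_entries "$LD_LIBRARY_PATH" > ldpath_compute.txt
# Afterwards:            diff ldpath_login.txt ldpath_compute.txt
path_entries "/opt/example/lib:/usr/local/lib"
```

Any line that diff reports only for the compile node is a directory that the job environment does not search.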

@QJohn2017

QJohn2017 commented Feb 14, 2018

The command ldd ./executable returns:

[ac_siom_jsliu_1@ln0%tianhe2-C src]$ ldd ./executable
        linux-vdso.so.1 =>  (0x00007fffb91ff000)
        libfftw3.so.3 => /usr/lib64/libfftw3.so.3 (0x0000003223c00000)
        libm.so.6 => /lib64/libm.so.6 (0x0000003224400000)
        libz.so.1 => /lib64/libz.so.1 (0x0000003224800000)
        libpthread.so.0 => /lib64/libpthread.so.0 (0x0000003224000000)
        librt.so.1 => /lib64/librt.so.1 (0x0000003224c00000)
        libstdc++.so.6 => /usr/lib64/libstdc++.so.6 (0x000000322ac00000)
        libgcc_s.so.1 => /lib64/libgcc_s.so.1 (0x0000003229000000)
        libc.so.6 => /lib64/libc.so.6 (0x0000003223800000)
        /lib64/ld-linux-x86-64.so.2 (0x0000003223400000)
        libdl.so.2 => /lib64/libdl.so.2 (0x00002ad51a3ee000)
[ac_siom_jsliu_1@ln0%tianhe2-C src]$

@PrometheusPi
Member Author

Thank you for posting the ldd output. That is what I was afraid of: even after loading the module for fftw, it still links to the old path.
Could you (in the session where you did module load fftw/3.3.5) give the result of:

echo $LD_LIBRARY_PATH

@QJohn2017

@PrometheusPi
Okay, I will ask the tianhe2 support how to load the fftw module on the compute nodes. However, this week is our Spring Festival, so I think the support staff are on holiday and will probably respond to me in a week or more.

@QJohn2017

QJohn2017 commented Feb 14, 2018

After I load the fftw module, and input the command echo $LD_LIBRARY_PATH, it returns:

[ac_siom_jsliu_1@ln0%tianhe2-C src]$ module load fftw/3.3.5
[ac_siom_jsliu_1@ln0%tianhe2-C src]$ echo $LD_LIBRARY_PATH
/WORK/app/fftw/3.3.5/lib:/usr/local/mpi3/lib:/WORK/app/hdf5/1.8.12/02/lib:/WORK/app/intel/Compiler/11.1/059/lib/intel64:/opt/intel/Compiler/11.1/059/ipp/em64t/sharedlib:/opt/intel/Compiler/11.1/059/mkl/lib/em64t:/opt/intel/composer_xe_2013_sp1.2.144/compiler/lib/intel64:/opt/intel/mic/coi/host-linux-release/lib:/opt/intel/mic/myo/lib:/opt/intel/composer_xe_2013_sp1.2.144/mpirt/lib/intel64:/opt/intel/composer_xe_2013_sp1.2.144/ipp/../compiler/lib/intel64:/opt/intel/composer_xe_2013_sp1.2.144/ipp/lib/intel64:/opt/intel/mic/coi/host-linux-release/lib:/opt/intel/mic/myo/lib:/opt/intel/composer_xe_2013_sp1.2.144/compiler/lib/intel64:/opt/intel/composer_xe_2013_sp1.2.144/mkl/lib/intel64:/opt/intel/composer_xe_2013_sp1.2.144/tbb/lib/intel64/gcc4.4
[ac_siom_jsliu_1@ln0%tianhe2-C src]$

@PrometheusPi
Member Author

Thank you for uploading the LD_LIBRARY_PATH again. It looks correct. It starts with /WORK/app/fftw/3.3.5/lib:.
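
The same check could be done in a submit script without the module system. A minimal sketch, under the unconfirmed assumption that /WORK/app/fftw/3.3.5/lib is also mounted on the compute nodes; the function name path_contains and the variable FFTW_LIB are hypothetical.

```shell
#!/bin/bash
# Hypothetical helper: succeed if directory $2 is an entry of the
# colon-separated list $1.
path_contains() {
    case ":$1:" in
        *":$2:"*) return 0 ;;
        *)        return 1 ;;
    esac
}

# Assumption: the fftw directory reported on the login node is also
# visible on the compute nodes; this has not been confirmed.
FFTW_LIB=/WORK/app/fftw/3.3.5/lib
if ! path_contains "${LD_LIBRARY_PATH:-}" "$FFTW_LIB"; then
    # Prepend the directory, adding a colon only if the list was not empty.
    export LD_LIBRARY_PATH="$FFTW_LIB${LD_LIBRARY_PATH:+:$LD_LIBRARY_PATH}"
fi
echo "$LD_LIBRARY_PATH"
```

Whether this is enough depends on whether the executable needs further libraries that only the module provides.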

Okay - then let's come back to this discussion once support has replied.

If I have another idea how to avoid the modules, I will post it here.
Hope to hear from you (and the support) soon.
Thanks for sticking with me for so long - it must be really late for you.

@QJohn2017

@PrometheusPi ,
Okay, thank you very much for taking so much time to help me solve this problem. Once support replies to me, I will try again and post the results here. Thank you again!

@QJohn2017

QJohn2017 commented Feb 20, 2018

Hi, @PrometheusPi ,
I have asked the support of the tianhe2 supercomputer center, and they replied that the submit script file should be:

#!/bin/bash

source /WORK/app/toolshs/cnmodule.sh

module load fftw/3.3.4-double

yhrun -N 2 -n 48 executable

Following their suggestion, I changed the submit script file as above, but there are still some errors:

executable: symbol lookup error: /opt/intel/composer_xe_2013_sp1.2.144/compiler/lib/intel64/libirng.so: undefined symbol: __intel_cpu_feature_indicator
[... the same line repeated, 48 times in total ...]
executable: symbol lookup error: /opt/intel/composer_xe_2013_sp1.2.144/compiler/lib/intel64/libirng.so: undefined symbol: __intel_cpu_feature_indicator
yhrun: error: cn9816: tasks 24-47: Exited with exit code 127

@QJohn2017

QJohn2017 commented Feb 22, 2018

@PrometheusPi ,
When I sent the error messages above to the tianhe2 support team, they suggested changing the submit script to:

#!/bin/bash


source /WORK/app/toolshs/cnmodule.sh

module load fftw/3.3.4-double

source /HOME/intel/composer_xe_2013_sp1.2.144/bin/compilervars.sh intel64

yhrun -N 2 -n 48 executable

When I submitted the job again, the output was as follows:

start
start
start
start
start
start
start
start
start
start
start
start
start
start
start
start
start
start
start
start
start
start
start
start
start
start
start
start
start
start
start
start
start
start
start
start
start
start
start
start
start
start
start
start
start
start
start
start
Number of tasks= 48 My rank= 33
this is job    33 of  2001 jobs in the array (on (null) = rank: 33)
Number of tasks= 48 My rank= 38
this is job    38 of  2001 jobs in the array (on (null) = rank: 38)
Number of tasks= 48 My rank= 44
this is job    44 of  2001 jobs in the array (on (null) = rank: 44)
Number of tasks= 48 My rank= 3
this is job     3 of  2001 jobs in the array (on (null) = rank: 3)
Number of tasks= 48 My rank= 46
this is job    46 of  2001 jobs in the array (on (null) = rank: 46)
Number of tasks= 48 My rank= 4
this is job     4 of  2001 jobs in the array (on (null) = rank: 4)
Number of tasks= 48 My rank= 13
this is job    13 of  2001 jobs in the array (on (null) = rank: 13)
Number of tasks= 48 My rank= 17
this is job    17 of  2001 jobs in the array (on (null) = rank: 17)
Number of tasks= 48 My rank= 29
this is job    29 of  2001 jobs in the array (on (null) = rank: 29)
Number of tasks= 48 My rank= 19
this is job    19 of  2001 jobs in the array (on (null) = rank: 19)
Number of tasks= 48 My rank= 35
this is job    35 of  2001 jobs in the array (on (null) = rank: 35)
Number of tasks= 48 My rank= 1
this is job     1 of  2001 jobs in the array (on (null) = rank: 1)
Number of tasks= 48 My rank= 24
this is job    24 of  2001 jobs in the array (on (null) = rank: 24)
Number of tasks= 48 My rank= 27
this is job    27 of  2001 jobs in the array (on (null) = rank: 27)
Number of tasks= 48 My rank= 34
this is job    34 of  2001 jobs in the array (on (null) = rank: 34)
Number of tasks= 48 My rank= 20
this is job    20 of  2001 jobs in the array (on (null) = rank: 20)
Number of tasks= 48 My rank= 23
this is job    23 of  2001 jobs in the array (on (null) = rank: 23)
Number of tasks= 48 My rank= 25
this is job    25 of  2001 jobs in the array (on (null) = rank: 25)
Number of tasks= 48 My rank= 0
this is job     0 of  2001 jobs in the array (on (null) = rank: 0)
Number of tasks= 48 My rank= 28
this is job    28 of  2001 jobs in the array (on (null) = rank: 28)
Number of tasks= 48 My rank= 30
this is job    30 of  2001 jobs in the array (on (null) = rank: 30)
Number of tasks= 48 My rank= 37
this is job    37 of  2001 jobs in the array (on (null) = rank: 37)
Number of tasks= 48 My rank= 2
this is job     2 of  2001 jobs in the array (on (null) = rank: 2)
Number of tasks= 48 My rank= 5
this is job     5 of  2001 jobs in the array (on (null) = rank: 5)
Number of tasks= 48 My rank= 11
this is job    11 of  2001 jobs in the array (on (null) = rank: 11)
Number of tasks= 48 My rank= 26
this is job    26 of  2001 jobs in the array (on (null) = rank: 26)
Number of tasks= 48 My rank= 7
this is job     7 of  2001 jobs in the array (on (null) = rank: 7)
Number of tasks= 48 My rank= 6
this is job     6 of  2001 jobs in the array (on (null) = rank: 6)
Number of tasks= 48 My rank= 43
this is job    43 of  2001 jobs in the array (on (null) = rank: 43)
Number of tasks= 48 My rank= 8
this is job     8 of  2001 jobs in the array (on (null) = rank: 8)
Number of tasks= 48 My rank= 31
this is job    31 of  2001 jobs in the array (on (null) = rank: 31)
Number of tasks= 48 My rank= 16
this is job    16 of  2001 jobs in the array (on (null) = rank: 16)
Number of tasks= 48 My rank= 32
this is job    32 of  2001 jobs in the array (on (null) = rank: 32)
Number of tasks= 48 My rank= 15
this is job    15 of  2001 jobs in the array (on (null) = rank: 15)
Number of tasks= 48 My rank= 41
this is job    41 of  2001 jobs in the array (on (null) = rank: 41)
Number of tasks= 48 My rank= 9
this is job     9 of  2001 jobs in the array (on (null) = rank: 9)
Number of tasks= 48 My rank= 14
this is job    14 of  2001 jobs in the array (on (null) = rank: 14)
Number of tasks= 48 My rank= 39
this is job    39 of  2001 jobs in the array (on (null) = rank: 39)
Number of tasks= 48 My rank= 42
this is job    42 of  2001 jobs in the array (on (null) = rank: 42)
Number of tasks= 48 My rank= 10
this is job    10 of  2001 jobs in the array (on (null) = rank: 10)
Number of tasks= 48 My rank= 40
this is job    40 of  2001 jobs in the array (on (null) = rank: 40)
Number of tasks= 48 My rank= 18
this is job    18 of  2001 jobs in the array (on (null) = rank: 18)
Number of tasks= 48 My rank= 45
this is job    45 of  2001 jobs in the array (on (null) = rank: 45)
Number of tasks= 48 My rank= 21
this is job    21 of  2001 jobs in the array (on (null) = rank: 21)
Number of tasks= 48 My rank= 36
this is job    36 of  2001 jobs in the array (on (null) = rank: 36)
Number of tasks= 48 My rank= 22
this is job    22 of  2001 jobs in the array (on (null) = rank: 22)
Number of tasks= 48 My rank= 12
this is job    12 of  2001 jobs in the array (on (null) = rank: 12)
Number of tasks= 48 My rank= 47
this is job    47 of  2001 jobs in the array (on (null) = rank: 47)

It seems right, but I don't know what the next step is. Please let me know, thank you!

@PrometheusPi
Member Author

PrometheusPi commented Feb 23, 2018

Hi @QJohn2017,

thank you for the update. Please excuse my late reply, it was a busy week at work.

Glad to hear that it seems to work!

Interesting that sourcing the file

source /HOME/intel/composer_xe_2013_sp1.2.144/bin/compilervars.sh intel64

helped to execute the script.

Did they tell you what configuration this file changed?
Could you post the content of the file here?

I assume that loading a module for the icc compiler and for OpenMPI, both during compilation and during execution, would lead to a similar result.

It looks like the initial 48 trajectories are going to be processed.
Did you specify existing trajectories in settings.hpp?

Was there an error output (commonly a file name containing stderr)?
It would be great if you could run ldd executable before running yhrun -N 2 -n 48 executable so we can check whether the code is linked against the correct libraries. Right now, it appears to find all libraries.
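
A minimal sketch of such a check, assuming the binary is called executable as above. The helper name missing_libs is hypothetical. ldd marks libraries it cannot resolve with "not found", so filtering its output for that string lists what the loader is missing:

```shell
#!/bin/bash
# Hypothetical helper: reads `ldd` output on stdin and prints the names of
# libraries that the dynamic loader could not resolve.
# Returns 0 if nothing is missing, 1 otherwise.
missing_libs() {
    local missing
    # ldd prints unresolved entries as "<name> => not found"
    missing=$(awk '/not found/ {print $1}')
    if [ -n "$missing" ]; then
        printf '%s\n' "$missing"
        return 1
    fi
    return 0
}

# Intended use inside the submit script, before the yhrun line:
#   ldd executable | missing_libs || echo "unresolved libraries listed above" >&2
```

Reading from stdin keeps the helper independent of where ldd is run, so the same filter can be applied to ldd output saved from a compute node.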

The (null) you see comes from a different setting of environment variables. In this piece of code, clara2 extracts MYHOSTNAME from the system. I assume the variable is just called HOSTNAME in the way you run clara2.
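
One possible workaround, as a sketch only: it assumes that clara2 reads the host name from an environment variable named MYHOSTNAME, as described above, and that the variable is simply unset on tianhe2. The lines below set it in the submit script if it is missing, falling back to HOSTNAME and then to uname -n:

```shell
#!/bin/bash
# Assumption: clara2 looks up MYHOSTNAME in the environment.
# If it is unset or empty, fall back to HOSTNAME, and finally to the node
# name reported by `uname -n`.
: "${MYHOSTNAME:=${HOSTNAME:-$(uname -n)}}"
export MYHOSTNAME
```

Note that a value exported in the batch script is the name of the node running that script. Whether yhrun passes it on to every rank unchanged is an assumption here, so ranks on other nodes may report the same name.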

In between these lines of code, I redirected the output into files named like this.
Thus, this redirection might not work on your system. It is very UNIX/LINUX specific.
Were files named my_output.txt-... and/or my_error.txt-... created in the directory where you executed clara2?

Were result files named my_spectrum_trace*.dat produced?

My first guess would be that clara2 did not find the input trajectories.
What did you specify there?

Shall we go through all the parameters in parameters.hpp?

@PrometheusPi
Member Author

PrometheusPi commented Feb 23, 2018

@TheresaBruemmer reported a similar issue via email: a missing link to libiomp5.so on the Maxwell/DESY cluster. She forwarded me an email by @belfhi, who seems to have a solution for Maxwell. (Since @belfhi used the master branch, he independently fixed the bug we fixed with pull request #90, which is currently only available in dev.)

@belfhi Are you willing to submit a pull request with your solution for Maxwell, or may I cherry-pick your solution to make it available in the main repo?

@PrometheusPi
Member Author

PrometheusPi commented Feb 24, 2018

@TheresaBruemmer Perhaps sourcing the compiler variables as @QJohn2017 did will help you when using icc.
I thought that loading icc as a module should be enough, but I do not know how Maxwell is configured. Perhaps you need to set the compiler variables by hand like:

source /opt/intel/bin/compilervars.sh intel64

or try to locate libiomp5.so via

find / -iname "libiomp5.so"

and add the path to LD_LIBRARY_PATH by hand.

EDIT:
a link on finding libiomp5.so
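
As a rough sketch of that last step: the helper name prepend_ld_path and the search root /opt are assumptions and will differ between clusters. The function prepends a directory to LD_LIBRARY_PATH only if it is not already listed:

```shell
#!/bin/bash
# Hypothetical helper: prepend a directory to LD_LIBRARY_PATH unless it is
# already one of its entries.
prepend_ld_path() {
    case ":${LD_LIBRARY_PATH:-}:" in
        *":$1:"*) ;;  # already listed, nothing to do
        *) LD_LIBRARY_PATH="$1${LD_LIBRARY_PATH:+:$LD_LIBRARY_PATH}" ;;
    esac
    export LD_LIBRARY_PATH
}

# Intended use (searching a subtree such as /opt is faster than searching /;
# the right search root depends on the cluster):
#   lib=$(find /opt -name 'libiomp5.so' 2>/dev/null | head -n 1)
#   [ -n "$lib" ] && prepend_ld_path "$(dirname "$lib")"
```

Prepending rather than appending means the located directory takes priority over other entries that might contain a different libiomp5.so.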

@QJohn2017

QJohn2017 commented Feb 24, 2018

Hi, @PrometheusPi
When I run the command vi /HOME/intel/composer_xe_2013_sp1.2.144/bin/compilervars.sh intel64, it shows the following:

#!/bin/sh
#
# Copyright  (C) 1985-2014 Intel Corporation. All rights reserved.
#
# The information and source code contained herein is the exclusive property
# of Intel Corporation and may not be disclosed, examined, or reproduced in
# whole or in part without explicit written authorization from the Company.
#

PROD_DIR=/HOME/intel/composer_xe_2013_sp1.2.144

if [ "$1" != "ia32" -a "$1" != "intel64" ]; then
  echo "ERROR: Unknown switch '$1'. Accepted values: ia32, intel64"
  return 1;
fi


if [ -e $PROD_DIR/pkg_bin/idbvars.sh ]; then
   . $PROD_DIR/pkg_bin/idbvars.sh $1
fi
if [ -e $PROD_DIR/pkg_bin/debuggervars.sh ]; then
   . $PROD_DIR/pkg_bin/debuggervars.sh $1
fi
if [ -e $PROD_DIR/tbb/bin/tbbvars.sh ]; then
   . $PROD_DIR/tbb/bin/tbbvars.sh $1
fi
if [ -e $PROD_DIR/mkl/bin/mklvars.sh ]; then
   . $PROD_DIR/mkl/bin/mklvars.sh $1
fi
if [ -e $PROD_DIR/ipp/bin/ippvars.sh ]; then
   . $PROD_DIR/ipp/bin/ippvars.sh $1
fi
if [ -e $PROD_DIR/pkg_bin/compilervars_arch.sh ]; then
   . $PROD_DIR/pkg_bin/compilervars_arch.sh $1
fi

@belfhi

belfhi commented Feb 24, 2018

Hello,
I fixed the bug with the input and output, changed the makefile to use the Intel compilers, and included my own compiled fftw3 library (I hadn't realized it was already installed).
Also, the code works fine for me until it tries to open the files, which I don't have. But go ahead and cherry-pick the changes you want to apply.
@QJohn2017 You need to source the file, not edit it with vi.

@QJohn2017

@PrometheusPi ,
I didn't specify the location of the trajectory files, and I don't know the format and contents of the trajectory files. Could you give me an example of a trajectory file? There are 2000 files named my_output.txt and my_error.txt.

@QJohn2017

@PrometheusPi, no output file named *.dat was produced.

@QJohn2017

@PrometheusPi, can you help me go through the main parameters in settings.hpp?

@QJohn2017

QJohn2017 commented Feb 24, 2018

@PrometheusPi ,
When I run ldd executable before yhrun -N 2 -n 48 executable, it returns:

	linux-vdso.so.1 =>  (0x00007fffac1ff000)
	libfftw3.so.3 => /WORK/app/fftw/3.3.4-double/lib/libfftw3.so.3 (0x00002b41456ed000)
	libm.so.6 => /lib64/libm.so.6 (0x0000003921600000)
	libz.so.1 => /lib64/libz.so.1 (0x0000003922600000)
	libpthread.so.0 => /lib64/libpthread.so.0 (0x0000003922200000)
	librt.so.1 => /lib64/librt.so.1 (0x0000003922a00000)
	libiomp5.so => /HOME/intel/composer_xe_2013_sp1.2.144/compiler/lib/intel64/libiomp5.so (0x00002b4145a2f000)
	libstdc++.so.6 => /usr/lib64/libstdc++.so.6 (0x0000003928200000)
	libgcc_s.so.1 => /lib64/libgcc_s.so.1 (0x0000003924e00000)
	libc.so.6 => /lib64/libc.so.6 (0x0000003921a00000)
	/lib64/ld-linux-x86-64.so.2 (0x0000003921200000)
	libdl.so.2 => /lib64/libdl.so.2 (0x0000003921e00000)
	libimf.so => /HOME/intel/composer_xe_2013_sp1.2.144/compiler/lib/intel64/libimf.so (0x00002b4145d49000)
	libsvml.so => /HOME/intel/composer_xe_2013_sp1.2.144/compiler/lib/intel64/libsvml.so (0x00002b414620c000)
	libirng.so => /HOME/intel/composer_xe_2013_sp1.2.144/compiler/lib/intel64/libirng.so (0x00002b4146e08000)
	libintlc.so.5 => /HOME/intel/composer_xe_2013_sp1.2.144/compiler/lib/intel64/libintlc.so.5 (0x00002b414700f000)
start
start
start
start
start
start
start
start
start
start
start
start
start
start
start
start
start
start
start
start
start
start
start
start
start
start
start
start
start
start
start
start
start
start
start
start
start
start
start
start
start
start
start
start
start
start
start
start
Number of tasks= 48 My rank= 2
this is job     2 of  2001 jobs in the array (on (null) = rank: 2)
Number of tasks= 48 My rank= 1
this is job     1 of  2001 jobs in the array (on (null) = rank: 1)
Number of tasks= 48 My rank= 24
this is job    24 of  2001 jobs in the array (on (null) = rank: 24)
Number of tasks= 48 My rank= 29
this is job    29 of  2001 jobs in the array (on (null) = rank: 29)
Number of tasks= 48 My rank= 3
this is job     3 of  2001 jobs in the array (on (null) = rank: 3)
Number of tasks= 48 My rank= 30
this is job    30 of  2001 jobs in the array (on (null) = rank: 30)
Number of tasks= 48 My rank= 0
this is job     0 of  2001 jobs in the array (on (null) = rank: 0)
Number of tasks= 48 My rank= 10
this is job    10 of  2001 jobs in the array (on (null) = rank: 10)
Number of tasks= 48 My rank= 27
this is job    27 of  2001 jobs in the array (on (null) = rank: 27)
Number of tasks= 48 My rank= 11
this is job    11 of  2001 jobs in the array (on (null) = rank: 11)
Number of tasks= 48 My rank= 12
this is job    12 of  2001 jobs in the array (on (null) = rank: 12)
Number of tasks= 48 My rank= 14
this is job    14 of  2001 jobs in the array (on (null) = rank: 14)
Number of tasks= 48 My rank= 20
this is job    20 of  2001 jobs in the array (on (null) = rank: 20)
Number of tasks= 48 My rank= 36
this is job    36 of  2001 jobs in the array (on (null) = rank: 36)
Number of tasks= 48 My rank= 37
this is job    37 of  2001 jobs in the array (on (null) = rank: 37)
Number of tasks= 48 My rank= 25
this is job    25 of  2001 jobs in the array (on (null) = rank: 25)
Number of tasks= 48 My rank= 44
this is job    44 of  2001 jobs in the array (on (null) = rank: 44)
Number of tasks= 48 My rank= 47
this is job    47 of  2001 jobs in the array (on (null) = rank: 47)
Number of tasks= 48 My rank= 26
this is job    26 of  2001 jobs in the array (on (null) = rank: 26)
Number of tasks= 48 My rank= 28
this is job    28 of  2001 jobs in the array (on (null) = rank: 28)
Number of tasks= 48 My rank= 4
this is job     4 of  2001 jobs in the array (on (null) = rank: 4)
Number of tasks= 48 My rank= 6
this is job     6 of  2001 jobs in the array (on (null) = rank: 6)
Number of tasks= 48 My rank= 8
this is job     8 of  2001 jobs in the array (on (null) = rank: 8)
Number of tasks= 48 My rank= 9
this is job     9 of  2001 jobs in the array (on (null) = rank: 9)
Number of tasks= 48 My rank= 16
this is job    16 of  2001 jobs in the array (on (null) = rank: 16)
Number of tasks= 48 My rank= 5
this is job     5 of  2001 jobs in the array (on (null) = rank: 5)
Number of tasks= 48 My rank= 18
this is job    18 of  2001 jobs in the array (on (null) = rank: 18)
Number of tasks= 48 My rank= 7
this is job     7 of  2001 jobs in the array (on (null) = rank: 7)
Number of tasks= 48 My rank= 15
this is job    15 of  2001 jobs in the array (on (null) = rank: 15)
Number of tasks= 48 My rank= 19
this is job    19 of  2001 jobs in the array (on (null) = rank: 19)
Number of tasks= 48 My rank= 13
this is job    13 of  2001 jobs in the array (on (null) = rank: 13)
Number of tasks= 48 My rank= 21
this is job    21 of  2001 jobs in the array (on (null) = rank: 21)
Number of tasks= 48 My rank= 17
this is job    17 of  2001 jobs in the array (on (null) = rank: 17)
Number of tasks= 48 My rank= 38
this is job    38 of  2001 jobs in the array (on (null) = rank: 38)
Number of tasks= 48 My rank= 22
this is job    22 of  2001 jobs in the array (on (null) = rank: 22)
Number of tasks= 48 My rank= 23
this is job    23 of  2001 jobs in the array (on (null) = rank: 23)
Number of tasks= 48 My rank= 35
this is job    35 of  2001 jobs in the array (on (null) = rank: 35)
Number of tasks= 48 My rank= 34
this is job    34 of  2001 jobs in the array (on (null) = rank: 34)
Number of tasks= 48 My rank= 39
this is job    39 of  2001 jobs in the array (on (null) = rank: 39)
Number of tasks= 48 My rank= 32
this is job    32 of  2001 jobs in the array (on (null) = rank: 32)
Number of tasks= 48 My rank= 33
this is job    33 of  2001 jobs in the array (on (null) = rank: 33)
Number of tasks= 48 My rank= 40
this is job    40 of  2001 jobs in the array (on (null) = rank: 40)
Number of tasks= 48 My rank= 41
this is job    41 of  2001 jobs in the array (on (null) = rank: 41)
Number of tasks= 48 My rank= 31
this is job    31 of  2001 jobs in the array (on (null) = rank: 31)
Number of tasks= 48 My rank= 43
this is job    43 of  2001 jobs in the array (on (null) = rank: 43)
Number of tasks= 48 My rank= 42
this is job    42 of  2001 jobs in the array (on (null) = rank: 42)
Number of tasks= 48 My rank= 45
this is job    45 of  2001 jobs in the array (on (null) = rank: 45)
Number of tasks= 48 My rank= 46
this is job    46 of  2001 jobs in the array (on (null) = rank: 46)

@PrometheusPi
Member Author

@QJohn2017 Okay, thank you. That looks good.

@PrometheusPi
Member Author

@QJohn2017
I assume that my_output.txt will contain something like

check: filename: ...

but no:

load file: ...

This means that no trajectory was found.

In the my_error.txt you probably find something like

error occured in rank ...

if the stream redirect does not work as on our system.

Thus we need to adjust the parameters.
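
A quick way to check this across all ranks, as a sketch: the helper name count_loaded is hypothetical, and the file pattern my_output.txt-* is taken from the naming mentioned earlier in this thread. It counts how many output files contain a "load file:" line:

```shell
#!/bin/bash
# Hypothetical helper: count how many of the given files contain a
# "load file:" line, i.e. how many ranks actually loaded a trajectory.
count_loaded() {
    # grep -l prints the name of each matching file once; a non-zero grep
    # status (no match at all) is ignored so that the count is simply 0.
    { grep -l 'load file:' "$@" 2>/dev/null || true; } | wc -l
}

# Intended use in the directory where clara2 was run:
#   count_loaded my_output.txt-*
```

A count of 0 would be consistent with the guess above that no trajectory was found.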

@TheresaBruemmer

@PrometheusPi
It seems I set a wrong path, so no output files were created. I have fixed that now, so my_spectrum_trace****.txt files are created. However, I am still missing the my_spectrum_all***.txt file.

@PrometheusPi
Member Author

@QJohn2017 To keep this issue clean and focused only on the technical setup on tianhe2, I moved the discussion on how to set up clara2 to issue #96. Feel free to ask further questions there. If you still encounter issues with the compilation or execution on tianhe2, please comment here in this issue. I will not close this until you have a working version of clara2 on tianhe2.

@PrometheusPi
Member Author

@belfhi Thank you for allowing me to cherry-pick your solution. I will try to incorporate your work as soon as possible.

@PrometheusPi
Member Author

@TheresaBruemmer Feel free to switch to #96 if you encounter issues with the setup of clara2.

@PrometheusPi
Member Author

@QJohn2017 Great to hear in #96 that your simulation ran 🎉
Does this mean we can close this issue?

If so, would you be so kind as to provide an example submit file, so that I can add a module setup for tianhe2 similar to the ones for the hypnos and Maxwell clusters?
(Since you installed fftw yourself, a rough description of how other users could do that would also be very helpful.)

Or you could provide this yourself as your first pull request to clara2. Choose whichever you prefer. Thank you in advance.

Your feedback so far was already very helpful - many thanks.

@QJohn2017

QJohn2017 commented Mar 5, 2018

@PrometheusPi ,
The contents of the submit file test.sh in my tianhe2 are as follows:

#!/bin/bash
source /WORK/app/toolshs/cnmodule.sh
module load fftw/3.3.4-double
source /HOME/intel/composer_xe_2013_sp1.2.144/bin/compilervars.sh intel64
ldd executable
yhrun -N 2 -n 48 executable

The command to submit the job is yhbatch -N 2 ./test.sh
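
For other node counts, the same file could be generated rather than edited by hand. This is a sketch only: the helper name write_submit_script is hypothetical, and it assumes 24 cores per node (as reported earlier in this thread) with one MPI rank per core. The module and compilervars.sh paths are copied from the submit file above, and the ldd line is left out because it was only needed for debugging:

```shell
#!/bin/bash
# Hypothetical helper: write a tianhe2 submit file for a given node count.
# Usage: write_submit_script <nodes> <output file>
write_submit_script() {
    nodes=$1
    outfile=$2
    # Assumption: 24 cores per node, one MPI rank per core
    tasks=$((nodes * 24))
    cat > "$outfile" <<EOF
#!/bin/bash
source /WORK/app/toolshs/cnmodule.sh
module load fftw/3.3.4-double
source /HOME/intel/composer_xe_2013_sp1.2.144/bin/compilervars.sh intel64
yhrun -N ${nodes} -n ${tasks} executable
EOF
    chmod +x "$outfile"
}

# Intended use:
#   write_submit_script 2 test.sh
#   yhbatch -N 2 ./test.sh
```

Passing -N on the yhbatch command line, as in the original command, avoids relying on a directive prefix inside the script.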

@PrometheusPi
Member Author

@QJohn2017 Thank you! I will provide a default source file as soon as possible.

Do you know which prefix a submit script uses to pass arguments such as -N 2?
Is it #SBATCH -N 2 or is it #YHBATCH -N 2?

@QJohn2017

@PrometheusPi, sorry, I don't know the prefix for the submit script. I guess it may be #YHBATCH -N 2.

@PrometheusPi
Member Author

@QJohn2017 Then I would suggest that I write a default submit script based on #YHBATCH. If you use it and it does not work, let me know and I will fix the script.

@QJohn2017

@PrometheusPi, okay, thank you!
