cmake build multiple targets #103

Merged
merged 8 commits into block-hczhai:master on Jul 8, 2024

Conversation

hczhai (Contributor) commented Jul 7, 2024

@chillenb

This PR handles the multiple-target build, based on your doublebuild branch. This has become a necessary feature because another user asked me about the following case: their code is written in C++, depends on the block2 shared library libblock2.so, and also has an interface to dmrgscf (based on pyscf), which uses the Python extension of block2. So they need two build targets of block2 for their GitHub Actions.

There are some adjustments (856b42e) to address a few new problems due to this change:

  1. While compiling multiple targets now requires less time, we also need to ensure that compiling a single target does not take more time than it did under the previous scheme. Since the Python library target takes significantly longer to compile than the other parts, it should start at the beginning (in the parallel compilation case) rather than after the "instantiation/b2_core" part finishes. To ensure this, there is a pylib_obj intermediate target.
  2. My test shows that precompiling headers does not necessarily save time (depending on the disk IO speed, parallelization, and cmake options used). But on GitHub Actions it does save a significant amount of time when building wheels, so the default is set to ON. TARGET_PRECOMPILE_HEADERS(b2_exe REUSE_FROM b2_core) is avoided, because it would prevent the parallelization introduced in 1, and re-precompiling the headers multiple times in parallel does not cost any extra wall-clock time.
  3. There is a new option BUILD_CORE. If BUILD_CORE=OFF, it recovers the previous scheme, namely, every target builds its own instantiation part (for backward compatibility). If BUILD_CORE=ON and all other targets are off, this can be useful for syntax checking after modifying the C++ code. The default is ON, to save time when building multiple targets (see the sketch after this list).
  4. Some repeated lines are consolidated, so CMakeLists.txt is 27 lines shorter even after introducing 1, 2, and 3 (to keep the added code complexity in check).
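
As a rough illustration of points 2 and 3, the option layout might look like the following sketch. The option and target names (BUILD_CORE, USE_PCH, b2_core, b2_exe) come from this discussion; CORE_SRCS, EXE_SRCS, and COMMON_HEADERS are placeholder variables, not the actual CMakeLists.txt contents.

```cmake
# Sketch only, not the real CMakeLists.txt.
option(BUILD_CORE "Build the shared instantiation objects once and reuse them" ON)
option(USE_PCH "Precompile the common headers" ON)

if(BUILD_CORE)
  # Compile the expensive template-instantiation part only once ...
  add_library(b2_core OBJECT ${CORE_SRCS})
  # ... and let every other target pull in its object files.
  set(CORE_OBJ $<TARGET_OBJECTS:b2_core>)
else()
  # Backward-compatible scheme: every target compiles its own instantiations.
  set(CORE_OBJ ${CORE_SRCS})
endif()

add_executable(b2_exe ${EXE_SRCS} ${CORE_OBJ})

if(USE_PCH)
  # Each target precompiles its own headers; REUSE_FROM b2_core is avoided so
  # that b2_exe does not have to wait for b2_core's PCH, and re-precompiling
  # the headers in parallel costs no extra wall-clock time.
  target_precompile_headers(b2_exe PRIVATE ${COMMON_HEADERS})
  if(BUILD_CORE)
    target_precompile_headers(b2_core PRIVATE ${COMMON_HEADERS})
  endif()
endif()
```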

Thanks for your work in making this possible. If you have any additional suggestions, please let me know.

chillenb (Contributor) commented Jul 8, 2024

I am happy that my experiment won't go to waste! It's great that you were able to adapt it to the needs of the Forte developer(s) and make the CMake more compact. My only suggestion is to consider renaming BUILD_CORE to REUSE_OBJECTS (or something else) so that new users do not accidentally assume it affects the functionality of block2.

While compiling multiple targets now requires less time, we also need to ensure that compiling a single target does not take more time than it did under the previous scheme.

I recently observed the following behavior: Ninja seems to build multiple targets concurrently, while Make cannot do this (e.g. make will wait for BUILD_CORE to finish, and only then start BUILD_EXE, etc, while Ninja will compile all sources at the same time). As a result, with highly parallel builds and BUILD_CORE=ON, Make takes longer than Ninja. Does this happen for you as well?

note: Ninja doesn't seem to tell gcc how many jobs it should use for LTO.
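
If the missing LTO parallelism ever matters in practice, one possible workaround (a sketch, not part of this PR) is to ask GCC for parallel LTO explicitly, since Ninja typically does not provide the GNU make jobserver that GCC's LTO driver would otherwise use:

```cmake
# Sketch of a possible workaround, not part of this PR: request parallel LTO
# from GCC directly. -flto=auto (GCC >= 10) chooses a job count on its own;
# older GCC versions need an explicit number such as -flto=8.
if(CMAKE_CXX_COMPILER_ID STREQUAL "GNU")
  add_link_options(-flto=auto)
endif()
```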

My test shows that precompiling headers does not necessarily save time (depending on the disk IO speed, parallelization, and cmake options used).

It does seem like the speedup (or lack thereof) depends on many factors. Probably the compiler version also has an effect, but I've only tried gcc 12. Out of curiosity, did you do these tests on a very powerful machine?

hczhai (Contributor, Author) commented Jul 8, 2024

My only suggestion is to consider renaming BUILD_CORE to REUSE_OBJECTS

OK.

Ninja seems to build multiple targets concurrently, while Make cannot do this (e.g. make will wait for BUILD_CORE to finish, and only then start BUILD_EXE, etc, while Ninja will compile all sources at the same time).

So according to your description, Ninja is smarter than Make at building multiple targets concurrently, but Make is smarter than Ninja at keeping LTO parallel. Personally, I use Make all the time, so I do not have much to say about Ninja.

As a result, with highly parallel builds and BUILD_CORE=ON, Make takes longer than Ninja.

So let's say we just use Make. Previously, b2_py (old name b2_pylib) depended on b2_core, so Make would only touch b2_py after b2_core finished (which is slow). This PR solves that problem by introducing an intermediate object target called b2_py_obj. Now b2_py depends on both b2_py_obj and b2_core, but b2_py_obj does not depend on b2_core, so Make builds b2_core and b2_py_obj concurrently. See code. Then I guess we have no problems using Make.
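
A minimal sketch of that dependency layout, using the target names from this comment (CORE_SRCS and PYEXT_SRCS are placeholders for the real source lists):

```cmake
# Sketch only: b2_core, b2_py_obj, and b2_py are the names used in this
# comment; CORE_SRCS and PYEXT_SRCS stand in for the real source lists.
add_library(b2_core OBJECT ${CORE_SRCS})      # slow instantiation part
add_library(b2_py_obj OBJECT ${PYEXT_SRCS})   # Python extension sources

# b2_py_obj has no dependency on b2_core, so Make (and Ninja) can compile
# both object sets at the same time; only the final link of b2_py has to
# wait for both of them.
add_library(b2_py MODULE
  $<TARGET_OBJECTS:b2_py_obj>
  $<TARGET_OBJECTS:b2_core>)
```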

It does seem like the speedup (or lack thereof) depends on many factors. Probably the compiler version also has an effect, but I've only tried gcc 12. Out of curiosity, did you do these tests on a very powerful machine?

This is on an old machine that I normally use for code development. With gcc 9.2, GNU Make 3.82, cmake 3.26, block2 up to this commit c4004f9, the options cmake .. -DUSE_MKL=ON -DBUILD_LIB=ON -DLARGE_BOND=ON -DUSE_SU2SZ=OFF -DUSE_SANY=ON -DUSE_PCH=?, and 28 threads:

When -DUSE_PCH=ON: real 3m15.843s; user 26m33.753s; sys 6m15.471s.
When -DUSE_PCH=OFF: real 2m41.251s; user 31m13.084s; sys 1m20.528s.

I guess it is mainly a disk IO problem, because the IO speed of this machine is kind of slow. The precompiled header file is 342 MB, much larger than the source files, so saving and loading precompiled headers can cost additional IO time.

For GitHub Actions (with gcc 9.4), there is a significant speedup:

With -DUSE_PCH=ON: 16m 0s (see log).
With -DUSE_PCH=OFF: 19m 41s (see log).

hczhai merged commit 52db3db into block-hczhai:master on Jul 8, 2024 (33 checks passed)

chillenb (Contributor) commented Jul 8, 2024

So let's say we just use Make. Previously, b2_py (old name b2_pylib) depended on b2_core, so Make would only touch b2_py after b2_core finished (which is slow). This PR solves that problem by introducing an intermediate object target called b2_py_obj. Now b2_py depends on both b2_py_obj and b2_core, but b2_py_obj does not depend on b2_core, so Make builds b2_core and b2_py_obj concurrently. See code. Then I guess we have no problems using Make.

I see! That is a nice way to do it. It looks like main.cpp also got the same treatment. Then there should be no difference between make and ninja.
Though even if one of them were a bit faster than the other, I would not mind at all. The nice logic of your CMake is enough for me :)

I guess it is mainly a disk IO problem, because the IO speed of this machine is kind of slow. The precompiled header file is 342 MB, much larger than the source files, so saving and loading precompiled headers can cost additional IO time.

Thanks for sharing those test details. I'm surprised that the precompiled header file is so big. GitHub Actions must have pretty fast disk IO!
