diff --git a/README.md b/README.md index 9aecd57e67..f4b3e99fce 100644 --- a/README.md +++ b/README.md @@ -18,8 +18,8 @@ MIOpen supports two programming models - * ROCm cmake modules can be installed from [here](https://github.com/RadeonOpenCompute/rocm-cmake) * [Half](http://half.sourceforge.net/) - IEEE 754-based half-precision floating point library * [Boost](http://www.boost.org/) at least version 1.58 - * MIOpen uses `boost-system` and `boost-filesystem` packages to enable persistent [kernel cache](https://github.com/ROCmSoftwarePlatform/MIOpen/blob/master/doc/src/cache.md) -* [rocBlas](https://github.com/ROCmSoftwarePlatform/rocBLAS) Minimum version 2.0.0 (recommended version 2.2.0) + * MIOpen uses `boost-system` and `boost-filesystem` packages to enable persistent [kernel cache](https://rocmsoftwareplatform.github.io/MIOpen/doc/html/cache.html) +* [rocBlas](https://github.com/ROCmSoftwarePlatform/rocBLAS) Minimum version branch [master-rocm-2.6](https://github.com/ROCmSoftwarePlatform/rocBLAS/tree/master-rocm-2.6) ## Installing MIOpen with pre-built packages @@ -44,12 +44,12 @@ cmake -P install_deps.cmake --prefix /some/local/dir ``` This prefix can used to specify the dependency path during the configuration phase using the `CMAKE_PREFIX_PATH`. -MIOpen's HIP backend uses [rocBlas](https://github.com/ROCmSoftwarePlatform/rocBLAS) by default. Users can intall rocBlas minimum release by using `apt-get install rocblas`. To disable using rocBlas set the configuration flag `-DMIOPEN_USE_ROCBLAS=Off`. rocBlas is *not* available for the OpenCL backend. +MIOpen's HIP backend uses [rocBlas](https://github.com/ROCmSoftwarePlatform/rocBLAS) by default. Users can install rocBlas minimum release by using `apt-get install rocblas`. To disable using rocBlas set the configuration flag `-DMIOPEN_USE_ROCBLAS=Off`. rocBlas is *not* available for the OpenCL backend. ## Installing minimum dependencies in ROCm environment -Users who are working in a fully installed and up to date ROCm environment may not wish to additionally install rocm-cmake, clang-ocl, MIOpenGEMM, or rocBLAS. This can be done by simpily inserting the command `--minimum` into the cmake command as shown below: +Users who are working in a fully installed and up to date ROCm environment may not wish to additionally install rocm-cmake, clang-ocl, MIOpenGEMM, or rocBLAS. This can be done by simply inserting the command `--minimum` into the cmake command as shown below: ``` cmake -P install_deps.cmake --minimum --prefix /some/local/dir @@ -91,6 +91,7 @@ cmake -DMIOPEN_BACKEND=OpenCL -DCMAKE_PREFIX_PATH=/some/local/dir .. Set the C++ compiler to `hcc`. ``` +export CXX= cmake -DMIOPEN_BACKEND=HIP -DCMAKE_PREFIX_PATH=";;" .. ``` An example cmake step can be: @@ -118,7 +119,7 @@ Database paths can be explicitly customized by means of `MIOPEN_SYSTEM_DB_PATH` If the user installs a new version of MIOpen, it is recommended that the user move, or delete their old user database file. The user can find the file with the suffix `*.updb.txt` in the user perf db path. -More information about the performance database can be found [here](https://github.com/ROCmSoftwarePlatform/MIOpen/blob/master/doc/src/perfdatabase.md). +More information about the performance database can be found [here](https://rocmsoftwareplatform.github.io/MIOpen/doc/html/perfdatabase.html). ### Persistent Program Cache @@ -127,7 +128,7 @@ MIOpen by default caches the device programs in the location `~/.cache/miopen/`. 
Users can also disable the cache during runtime using the environmental variable set as `MIOPEN_DISABLE_CACHE=1`. -If the compiler changes, or the user modifies the kernels then the cache must be deleted for the MIOpen version in use; e.g., `rm -rf ~/.cache/miopen/`. More information about the cache can be found [here](https://github.com/ROCmSoftwarePlatform/MIOpen/blob/master/doc/src/cache.md). +If the compiler changes, or the user modifies the kernels then the cache must be deleted for the MIOpen version in use; e.g., `rm -rf ~/.cache/miopen/`. More information about the cache can be found [here](https://rocmsoftwareplatform.github.io/MIOpen/doc/html/cache.html). ### Changing the cmake configuration @@ -158,7 +159,7 @@ The driver can be built using the `MIOpenDriver` target: ` cmake --build . --config Release --target MIOpenDriver ` **OR** ` make MIOpenDriver ` -Documentation on how to run the driver is [here](https://github.com/ROCmSoftwarePlatform/MIOpen/blob/master/driver/README.md). +Documentation on how to run the driver is [here](https://rocmsoftwareplatform.github.io/MIOpen/doc/html/driver.html). ## Running the tests @@ -187,7 +188,10 @@ HTML and PDFs are generated using [Sphinx](http://www.sphinx-doc.org/en/stable/i Requirements for both Sphinx, Breathe, and the ReadTheDocs theme can be filled for these in the MIOpen/doc folder: -`pip install -r ./requirements.txt` +``` +pip install -r ./requirements.txt +``` + Depending on your setup `sudo` may be required for the pip install. @@ -214,7 +218,7 @@ sudo apt-get install libboost-system-dev sudo apt-get install libboost-filesystem-dev ``` -*Note:* MIOpen by default will attempt to build with Boost staticially linked libraries. If it is needed, the user can build with dynamically linked Boost libraries by using this flag during the configruation stage: +*Note:* MIOpen by default will attempt to build with Boost statically linked libraries. If it is needed, the user can build with dynamically linked Boost libraries by using this flag during the configruation stage: ``` -DBoost_USE_STATIC_LIBS=Off ``` @@ -226,9 +230,13 @@ The `half` header needs to be installed from [here](http://half.sourceforge.net/ ## Using docker The easiest way is to use docker. You can build the top-level docker file: +``` +docker build -t miopen . +``` - docker build -t miopen . +Then to enter the development environment use `docker run`: +``` +docker run --device='/dev/kfd' --device='/dev/dri' -v=`pwd`:/data -w /data --group-add video -it miopen +``` -Then to enter the developement environment use `docker run`: - docker run --device='/dev/kfd' --device='/dev/dri' -v=`pwd`:/data -w /data --group-add video -it miopen diff --git a/doc/src/DebugAndLogging.md b/doc/src/DebugAndLogging.md index 9e119241ad..79e30fb48e 100644 --- a/doc/src/DebugAndLogging.md +++ b/doc/src/DebugAndLogging.md @@ -28,7 +28,8 @@ All logging messages output to standard error stream (`stderr`). The following e > **_NOTE:_ When asking for technical support, please include the console log obtained with the following settings:** > ``` > export MIOPEN_ENABLE_LOGGING=1 -> export MIOPEN_LOG_LEVEL=5 +> export MIOPEN_ENABLE_LOGGING_CMD=1 +> export MIOPEN_LOG_LEVEL=6 > ``` * `MIOPEN_ENABLE_LOGGING_MPMT` - When enabled, each log line is prefixed with information which allows the user to identify records printed from different processes and/or threads. Useful for debugging multi-process/multi-threaded apps. 
@@ -43,12 +44,13 @@ The following list of environment variables allow for enabling/disabling various > 0, no, false, disable, disabled - to disable kernels/algorithm > ``` -If a variable is not set, then MIOpen behaves as if it is set to `enabled`, unless otherwise specified. So all kinds of kernels/algorithms are enabled by default and variables can be used for disabling them. +If a variable is not set, then MIOpen behaves as if it is set to `enabled`, unless otherwise specified. So all kinds of kernels/algorithms are enabled by default and the below variables can be used for disabling them. The exception to this rule is `MIOPEN_DEBUG_CONV_IMPLICIT_GEMM` which is disabled by default: * `MIOPEN_DEBUG_CONV_FFT` – FFT convolution algorithm. * `MIOPEN_DEBUG_CONV_DIRECT` – Direct convolution algorithm. * `MIOPEN_DEBUG_CONV_GEMM` - GEMM convolution algorithm. These are implemented on top of miopengemm or rocBlas. * `MIOPEN_DEBUG_GCN_ASM_KERNELS` – Kernels written in assembly language. So far, the most of the assembly kernels are implementing the Direct convolution algorithm. +* `MIOPEN_DEBUG_CONV_IMPLICIT_GEMM` – FP32 implicit GEMM convolution algorithm, disabled by default due to compatibility issue with older compiler. Set to 1 to turn on implicit GEMM algorithm. * `MIOPEN_DEBUG_AMD_ROCM_PRECOMPILED_BINARIES` - Binary kernels. Right now all the binary kernels are Winograd ones, however, not all Winograds are binaries. To disable all Winograd algorithms, the following two vars can be used: * `MIOPEN_DEBUG_AMD_WINOGRAD_3X3` - FP32 Winograd Fwd/Bwd, filter size fixed to 3x3. * `MIOPEN_DEBUG_AMD_WINOGRAD_RXS` - FP32 and FP16 Winograd Fwd/Bwd, variable filter size. diff --git a/doc/src/Getting_Started_FusionAPI.md b/doc/src/Getting_Started_FusionAPI.md index 8ecedcbcb5..7bc4ad5e24 100644 --- a/doc/src/Getting_Started_FusionAPI.md +++ b/doc/src/Getting_Started_FusionAPI.md @@ -190,231 +190,11 @@ Once the fusion plan object is destroyed, all the operations created are destroy The tables below outlines the supported fusions for fp32 and fp16 as well as any applicable constraints. **(C = convolution, B = bias, N = batch normalization, A = activation)** -### Convolution based FP32 Fusion for Inference - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
-**Single Precision Floating Point**
-
-| Combination | Conv Algo | Stride | Filter Dims | N Mode* | Activations | Other Constraints |
-| ----------- | --------- | ------ | ----------- | ------- | ----------- | ----------------- |
-| CBNA | Direct | 1 and 2 | 3x3, 5x5, 7x7, 9x9, 11x11 | All | All | stride and padding must be either 1 or 2 |
-| CBA | Direct | | 1x1 | | All | stride/padding not supported |
-| CBA | Winograd | 1 | 1x1, 2x2 | N/A | Relu, Leaky Relu | c >= 18 |
-| | | 1 | 3x3 | | Relu, Leaky Relu | c >= 18 and c is even |
-| | | 1 | 4x4, 5x5, 6x6 | | Relu, Leaky Relu | 4 x c >= 18 |
-| | | 1 | 7x7, 8x8, 9x9 | | Relu, Leaky Relu | 12 x c >= 18 |
-| | | 1 | 10x10, 11x11, 12x12 | | Relu, Leaky Relu | 16 x c >= 18 |
-| | | 1 | larger filter sizes | | Relu, Leaky Relu | none |
-| | | 2 | 1x1 | | Relu, Leaky Relu | 2 x c >= 18 |
-| | | 2 | 2x2, 3x3, 4x4, 5x5, 6x6 | | Relu, Leaky Relu | 4 x c >= 18 |
-| | | 2 | 7x7 | | Relu, Leaky Relu | 12 x c >= 18 |
-| | | 2 | 8x8, 9x9, 10x10, 11x11, 12x12 | | Relu, Leaky Relu | 16 x c >= 18 |
-| | | 2 | larger filter sizes | | Relu, Leaky Relu | none |
-| NA | - | - | | All | All | Padding not supported |
-
-*N mode is either spatial, or per activation. For CBA other asymmetric kernels are supported as well, but are not enumerated here for brevity.
-
-### Convolution based FP16 Fusion for Inference
-
-**Half Precision Floating Point**
-
-| Combination | Conv Algo | Stride | Filter Dims | N Mode* | Activations | Other Constraints |
-| ----------- | --------- | ------ | ----------- | ------- | ----------- | ----------------- |
-| CBNA | Direct | 1 and 2 | 3x3, 5x5, 7x7, 9x9, 11x11 | All | All | stride and padding must be either 1 or 2 |
-| CBA | Direct | | 1x1 | | All | stride/padding not supported |
-
-*N mode is either spatial, or per activation.
-
-### Batch Normalization based fusion for FP32 and FP16 for Inference and Training
-
-| Combination | N mode* | Activations | Constraints |
-| ----------- | ------- | ----------- | ----------- |
-| NA for inference | All | All | None |
-| NA forward training | All | All | None |
-| NA backward training | All | All | None |
-
-*N mode is either spatial, or per activation.
-
+![Convolution based fp32 fusion](fp32fusions.png) +![Convolution based fp16 fusion](fp16fusions.png) + ## Performance Comparison to Non-Fused Kernels diff --git a/doc/src/apireference.rst b/doc/src/apireference.rst index 67d32b9f4d..80b9fe974b 100644 --- a/doc/src/apireference.rst +++ b/doc/src/apireference.rst @@ -7,6 +7,7 @@ API Reference :maxdepth: 4 :caption: Contents: + datatypes handle tensor activation @@ -16,4 +17,5 @@ API Reference lrn pooling softmax - fusion \ No newline at end of file + fusion + loss \ No newline at end of file diff --git a/doc/src/datatypes.md b/doc/src/datatypes.md new file mode 100644 index 0000000000..1a46e5fd62 --- /dev/null +++ b/doc/src/datatypes.md @@ -0,0 +1,38 @@ + +# Datatypes + + +MIOpen contains several datatypes at different levels of support. The enumerated datatypes are shown below: + +``` +typedef enum { + miopenHalf = 0, + miopenFloat = 1, + miopenInt32 = 2, + miopenInt8 = 3, + miopenInt8x4 = 4, + miopenBFloat16 = 5, +} miopenDataType_t; +``` + +Of these types, only `miopenFloat` and `miopenHalf` are fully supported across all layers in MIOpen. Please see the individual layers in the API reference section for specific datatype support and limitations. + +Type descriptions: + * `miopenHalf` - 16-bit floating point + * `miopenFloat` - 32-bit floating point + * `miopenInt32` - 32-bit integer, used primarily for `int8` convolution outputs + * `miopenInt8` - 8-bit integer, currently only supported by the `int8` convolution forward path, tensor set, tensor copy, tensor cast, tensor transform, tensor transpose, and im2col. + * `miopenInt8x4` - 8-bit 4 element vector type used primarily with the `int8` convolution forward path. + * `miopenBFloat16` - brain float fp-16 (8-bit exponent, 7-bit fraction), currently only supported by convolutions, tensor set, and tensor copy. + + +Note: In addition to the standard datatypes above, pooling contains its own indexing datatypes: +``` +typedef enum { + miopenIndexUint8 = 0, + miopenIndexUint16 = 1, + miopenIndexUint32 = 2, + miopenIndexUint64 = 3, +} miopenIndexType_t; +``` + diff --git a/doc/src/driverTableCrop.png b/doc/src/driverTableCrop.png new file mode 100644 index 0000000000..0fd6395f2b Binary files /dev/null and b/doc/src/driverTableCrop.png differ diff --git a/doc/src/find_and_immediate.md b/doc/src/find_and_immediate.md new file mode 100644 index 0000000000..8da241798f --- /dev/null +++ b/doc/src/find_and_immediate.md @@ -0,0 +1,160 @@ +Find and Immediate Mode +======================= + + + +## Find API + +MIOpen contains several convolution algorithms for each stage of training or inference. Prior to MIOpen version 2.0, users needed to call Find methods in order to generate a set of applicable algorithms. + +A typical workflow for the find stage: + +``` +miopenConvolutionForwardGetWorkSpaceSize(handle, + weightTensorDesc, + inputTensorDesc, + convDesc, + outputTensorDesc, + &maxWorkSpaceSize); + +// < allocate workspace > + + +// NOTE: +// miopenFindConvolution*() call is expensive in terms of execution time and required workspace. +// Therefore it is highly recommended to save off the selected algorithm and workspace required so that +// they can be reused later within the lifetime of the same MIOpen handle object. +// In this way, there should be no need to invoke miopenFind*() more than once per application lifetime.
+ +miopenFindConvolutionForwardAlgorithm(handle, + inputTensorDesc, + input_device_mem, + weightTensorDesc, + weight_device_mem, + convDesc, + outputTensorDesc, + output_device_mem, + request_algo_count, + &ret_algo_count, + perf_results, + workspace_device_mem, + maxWorkSpaceSize, + 1); + +// < select fastest algorithm > + +// < free previously allocated workspace and allocate workspace required for the selected algorithm> + +miopenConvolutionForward(handle, &alpha, + inputTensorDesc, + input_device_mem, + weightTensorDesc, + weight_device_mem, + convDesc, + perf_results[0].fwd_algo, // use the fastest algo + &beta, + outputTensorDesc, + output_device_mem, + workspace_device_mem, + perf_results[0].memory); //workspace size +``` + + +The results of Find() are returned in an array of `miopenConvAlgoPerf_t` structs in order of performance, with the fastest at index 0. + +This call sequence is executed once per session as it is inherently expensive. Of those, `miopenFindConvolution*()` is the most expensive call. It caches its own results on disk, so the subsequent calls during the same MIOpen session will execute faster. However, it is better to remember results of `miopenFindConvolution*()` in the application, as recommended above. + + + + +## Immediate Mode API + +MIOpen v2.0 introduces the immediate mode, which removes the requirement for the `miopenFindConvolution*()` calls and their associated runtime costs. In this mode, the user can query the MIOpen runtime for all the supported _solutions_ for a given convolution configuration. These solutions may use the same algorithm or different ones. The sequence of operations in immediate mode is similar to launching regular convolutions in MIOpen, i.e., through the use of the `miopenFindConvolution*()` API. However, in this case the different APIs have much lower runtime cost. A typical convolution call would be similar to the following sequence of calls: + +* The user constructs the MIOpen handle and relevant descriptors such as the convolution descriptor as usual. +* With the above data structures, the user calls `miopenConvolution*GetSolutionCount` to get the **maximum** number of supported solutions for the convolution descriptor in question. +* The count obtained above is used to allocate memory for the `miopenConvSolution_t` structure introduced in MIOpen v2.0. +* The user calls `miopenConvolution*GetSolution` to populate the `miopenConvSolution_t` structures allocated above. The returned list is ordered by performance, thus the first element would be the fastest. +* While the above structure returns the amount of workspace required for an algorithm, the user may inquire the amount of workspace required for a known solution id by using the `miopenConvolution*GetSolutionWorkspaceSize` API call. However, this is not a requirement, since the structure returned by `miopenConvolution*GetSolution` would already have this information. +* Now the user may initiate the convolution operation in _immediate_ mode by calling `miopenConvolution*Immediate`, which would populate the output tensor descriptor with the respective convolution result. However, the first call to `miopenConvolution*Immediate` may consume more time since the kernel may not be present in the kernel cache and may need to be compiled.
+* Optionally, the user may compile the solution of choice by calling `miopenConvolution*CompileSolution`, which would ensure that the kernel represented by the chosen solution is populated in the kernel cache a priori, removing the necessity for compiling the kernel in question. + + +``` +miopenConvolutionForwardGetSolutionCount(handle, + weightTensorDesc, + inputTensorDesc, + convDesc, + outputTensorDesc, + &solutionCount); + + +// < allocate an array of miopenConvSolution_t of size solutionCount > + + +miopenConvolutionForwardGetSolution(handle, + weightTensorDesc, + inputTensorDesc, + convDesc, + outputTensorDesc, + solutionCount, + &actualCount, + solutions); + +// < select a solution from solutions array > + +miopenConvolutionForwardGetSolutionWorkspaceSize(handle, + weightTensorDesc, + inputTensorDesc, + convDesc, + outputTensorDesc, + selected->solution_id, + &ws_size); + +// < allocate solution workspace of size ws_size > + + +// This stage is optional +miopenConvolutionForwardCompileSolution(handle, + weightTensorDesc, + inputTensorDesc, + convDesc, + outputTensorDesc, + selected->solution_id); + + + + miopenConvolutionForwardImmediate(handle, + weightTensor, + weight_device_mem, + inputTensorDesc, + input_device_mem, + convDesc, + outputTensorDesc, + output_device_mem, + workspace_device_mem, + ws_size, + selected->solution_id); +``` + +## Immediate Mode Fall Back + +The immediate mode is underpinned by the [Find-Db](https://rocmsoftwareplatform.github.io/MIOpen/doc/html/finddb.html); however, it may not contain every configuration of interest. Immediate mode's behavior when encountering a database miss is to fall back to a GEMM algorithm. The GEMM algorithm will handle most cases; however, if the user requires performance, they should run the Find stage at least once. In the fallback path, `miopenConvolution*GetSolution` returns only one `miopenConvSolution_t` structure, and its `time` member contains a negative value. Future releases will implement a more robust heuristic-based fallback, which is expected to provide better (but still non-optimal) performance. + + + +## Limitations of Immediate Mode + +### Architectural Limitations +The system Find-Db has only been populated for the following architectures: + * gfx906 with 64 CUs + * gfx906 with 60 CUs + * gfx900 with 64 CUs + * gfx900 with 56 CUs + +If the user's architecture is not listed above, they will need to run the Find API once on their system per application in order to take advantage of immediate mode's more efficient behavior. + + +### Backend Limitations + +OpenCL support for immediate mode via the fallback is limited to fp32 datatypes. This is because the current release's fallback path goes through GEMM, which on the OpenCL backend is serviced through MIOpenGEMM -- which itself only supports fp32. The HIP backend uses rocBLAS as its fallback path, which supports a richer set of datatypes. diff --git a/doc/src/finddb.md b/doc/src/finddb.md new file mode 100644 index 0000000000..39d8cfd455 --- /dev/null +++ b/doc/src/finddb.md @@ -0,0 +1,44 @@ +Find-Db Database +================ + +Prior to MIOpen 2.0, users utilized calls such as `miopenFindConvolution*Algorithm()` to gather a set of convolution algorithms in the form of an array of `miopenConvSolution_t` structs. This process is time consuming because it requires online benchmarking of competing algorithms. In MIOpen 2.0 an [immediate mode](https://rocmsoftwareplatform.github.io/MIOpen/doc/html/find_and_immediate.html) is introduced.
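+As a rough illustrative sketch only (the helper below is not part of the MIOpen API or its documentation; its name and structure are assumptions), an application can check whether an immediate mode query was answered from Find-Db or from the GEMM fallback described in the Immediate Mode Fall Back section above: on a database miss, `miopenConvolution*GetSolution` returns a single `miopenConvSolution_t` whose `time` member is negative.
+
+```
+#include <miopen/miopen.h>
+#include <stdbool.h>
+#include <stddef.h>
+
+/* Illustrative helper (assumed name): returns true when the returned solution
+   list looks like the Find-Db fallback, i.e. a single entry whose reported
+   time is negative. */
+static bool solutions_from_fallback(const miopenConvSolution_t* solutions,
+                                    size_t returned_count)
+{
+    return returned_count == 1 && solutions[0].time < 0;
+}
+```
+
+When this check is true, running the Find stage once for the configuration and re-querying will typically yield better-performing solutions.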
+ +Immediate mode is based on a database which contains the results of calls to the legacy Find() stage. This database is called `Find-Db`. It consists of two parts: +- **System Find-Db**, a system-wide storage which holds the pre-run values for the most applicable configurations, +- **User Find-Db**, a per-user storage which is intended to hold results for arbitrary user-run configurations. It also performs double duty as a cache for the Find() stage. + +The User Find-Db **always takes precedence** over System Find-Db. + +By default, System Find-Db resides within MIOpen's install location, while User Find-Db resides in the user's home directory. See [Setting up locations](https://rocmsoftwareplatform.github.io/MIOpen/doc/html/install.html#setting-up-locations) for more information. + + * The System Find-Db is *not* modified upon installation of MIOpen. + * There are separate Find databases for HIP and OpenCL backends. + +### Populating the User Find-Db + +MIOpen collects Find-db information during the following MIOpen API calls: +- `miopenFindConvolutionForwardAlgorithm()` +- `miopenFindConvolutionBackwardDataAlgorithm()` +- `miopenFindConvolutionBackwardWeightsAlgorithm()` + +During the call, find data entries are collected for one _problem configuration_ (implicitly defined by the tensor descriptors and convolution descriptor passed to the API function). + + +### Updating MIOpen and the User Find-Db + +When the user installs a new version of MIOpen, the new version will _ignore_ old **User find-db** files. Thus, the user is _not required_ to move or delete their old User find-db files. However, the user may wish to re-collect the information into their brand new **User find-db**. This should be done in the same way as it was done with the previous version of the library -- _if_ it was done. This would keep Immediate mode optimized. + + +### Disabling Find-Db + +By default, MIOpen will use the Find-Db. Users can disable the Find-Db by setting the environmental variable `MIOPEN_DEBUG_DISABLE_FIND_DB` to 1: +``` +export MIOPEN_DEBUG_DISABLE_FIND_DB=1 +``` + +**Note:** The System Find-Db has the ability to be cached into memory and may increase performance dramatically. To enable this option, use the cmake configuration flag: +``` +-DMIOPEN_DEBUG_FIND_DB_CACHING=On +``` + + diff --git a/doc/src/fp16fusions.png b/doc/src/fp16fusions.png new file mode 100644 index 0000000000..425a29c371 Binary files /dev/null and b/doc/src/fp16fusions.png differ diff --git a/doc/src/fp32fusions.png b/doc/src/fp32fusions.png new file mode 100644 index 0000000000..bf8ac94139 Binary files /dev/null and b/doc/src/fp32fusions.png differ diff --git a/doc/src/index.rst b/doc/src/index.rst index 2509db6110..8c12c07aa7 100644 --- a/doc/src/index.rst +++ b/doc/src/index.rst @@ -19,6 +19,8 @@ Sources and binaries can be found at `MIOpen's GitHub site