HIP-Basic Inline Assembly Example

Description

This program showcases an implementation of a simple matrix transpose kernel, which uses inline assembly and works on both AMD and NVIDIA hardware.

By using inline assembly in your kernels, you may be able to gain extra performance. It could also enable you to use special GPU hardware features which are not available through compiler intrinsics.

For more insights, see the following blog posts by Ben Sander: "The Art of AMDGCN Assembly: How to Bend the Machine to Your Will" and "AMD GCN Assembly: Cross-Lane Operations".

For more information, see the AMD ISA documentation for current architectures and the User Guide for LLVM AMDGPU Back-end.

Application flow

1. A number of variables are defined to control the problem details and the kernel launch parameters.
2. The input matrix is set up in host memory.
3. The necessary amount of device memory is allocated and the input is copied to the device.
4. The GPU transposition kernel is launched with the previously defined arguments.
5. The kernel uses different inline assembly for its data movement, depending on the target platform.
6. The transposed matrix is copied back to the host and all device memory is freed.
7. The elements of the result matrix are compared with the expected result, and the outcome of the comparison is printed to the standard output.
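The host-side parts of this flow (setting up the input and checking the result) can be sketched in plain C++. This is a minimal sketch, not the code from `main.hip`: the helper names `transpose_reference` and `count_mismatches` are hypothetical, and a square, row-major matrix is assumed.

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

// Hypothetical helper: CPU reference transpose of a square, row-major
// matrix with `width` rows and `width` columns (assumed layout).
std::vector<float> transpose_reference(const std::vector<float>& in, std::size_t width)
{
    std::vector<float> out(in.size());
    for(std::size_t y = 0; y < width; ++y)
    {
        for(std::size_t x = 0; x < width; ++x)
        {
            out[x * width + y] = in[y * width + x];
        }
    }
    return out;
}

// Hypothetical helper: counts the elements that differ between the matrix
// copied back from the device and the expected result. Exact comparison is
// used because a transpose only moves values and does no arithmetic.
std::size_t count_mismatches(const std::vector<float>& result,
                             const std::vector<float>& expected)
{
    const std::size_t common = std::min(result.size(), expected.size());
    // Any difference in length counts as that many mismatching elements.
    std::size_t errors = std::max(result.size(), expected.size()) - common;
    for(std::size_t i = 0; i < common; ++i)
    {
        if(result[i] != expected[i])
        {
            ++errors;
        }
    }
    return errors;
}
```

In a full program, the vector copied back from the device would be passed as `result`, and a mismatch count of zero would be reported as a successful validation.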

Key APIs and Concepts

Using inline assembly in GPU kernels is similar to using inline assembly in host-side code. The volatile qualifier tells the compiler not to remove the assembly statement during optimization.

asm volatile("v_mov_b32_e32 %0, %1" : "=v"(variable_0) : "v"(variable_1));

However, since the instruction set differs between GPU architectures, you usually want to guard each assembly statement with the preprocessor macros that the compiler defines for the target platform or architecture, so that one source file supports multiple architectures (see the gpu_arch example for more fine-grained architecture control).
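One way to select the instruction per platform is sketched below. It is an illustration under stated assumptions rather than the kernel from `main.hip`: the function `move_value` and the macro `SKETCH_DEVICE` are hypothetical names, and the sketch assumes that the toolchain defines `__HIP_PLATFORM_AMD__` or `__HIP_PLATFORM_NVIDIA__`, as `hipcc` normally does. The final `#else` branch is an addition so that the same file also builds as plain host C++.

```cpp
// SKETCH_DEVICE (hypothetical macro) expands to __device__ only when the
// file is compiled as HIP; a plain C++ compiler sees an ordinary function.
#if defined(__HIPCC__)
    #include <hip/hip_runtime.h>
    #define SKETCH_DEVICE __device__
#else
    #define SKETCH_DEVICE
#endif

// Hypothetical helper: copies one 32-bit float through a platform-specific
// move instruction when compiling device code.
SKETCH_DEVICE inline float move_value(float src)
{
    float dst;
#if defined(__HIP_PLATFORM_AMD__) && defined(__HIP_DEVICE_COMPILE__)
    // AMD GCN/CDNA: the "v" constraint places the operand in a vector register.
    asm volatile("v_mov_b32_e32 %0, %1" : "=v"(dst) : "v"(src));
#elif defined(__HIP_PLATFORM_NVIDIA__) && defined(__CUDA_ARCH__)
    // NVIDIA PTX: the "f" constraint places the operand in a 32-bit float register.
    asm volatile("mov.f32 %0, %1;" : "=f"(dst) : "f"(src));
#else
    // Host pass or unknown platform (assumed fallback): plain assignment.
    dst = src;
#endif
    return dst;
}
```

Inside a transpose kernel, such a helper could be used as `out[x * width + y] = move_value(in[y * width + x]);`. Restricting the assembly to the device compilation pass keeps GPU mnemonics away from the host compiler.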

Demonstrated API Calls

HIP runtime

Device symbols

Host symbols