[OpenMP] Memory transfer optimizations #180

Open
4 tasks
jdoerfert opened this issue Mar 14, 2020 · 1 comment
Labels
gsoc (Google Summer of Code) · help wanted (Indicates that a maintainer wants help. Not [good first issue].) · openmp

Comments

jdoerfert (Member):

When we have memory transfers from the host to a device, or any long-running (I/O) operation that can be split into a begin part and a wait part, we can try to hide the latency. (For now this is focused on memory transfers in OpenMP target offloading, but the scheme should apply to CUDA and other languages as well.)

Given a blocking cross-device memory transfer such as blocking_memcpy_host2device(Dst, Src, N), we first want to split it into two parts, the "issue" and the "wait", something like:
handle = async_issue_memcpy_host2device(Dst, Src, N); wait(handle, Dst, Src, N). Then we want to move the two calls apart, causing the issue to be executed earlier and the wait later. The code we can legally move in between then has a chance to execute while the memcpy is being performed, effectively hiding the latency. Note that this also works if we start with an async version.
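
As a schematic before/after (the calls compute_unrelated and use_device_buffer are placeholders for surrounding code that does not touch the buffers, not part of any runtime):

```c
/* Before: the transfer blocks until the copy has completed. */
blocking_memcpy_host2device(Dst, Src, N);
compute_unrelated(A, B);        /* touches neither Src nor Dst */
use_device_buffer(Dst);

/* After splitting and moving the two parts apart: the unrelated
   work now overlaps with the transfer, hiding part of its latency. */
handle = async_issue_memcpy_host2device(Dst, Src, N);
compute_unrelated(A, B);        /* runs while the copy is in flight */
wait(handle, Dst, Src, N);
use_device_buffer(Dst);
```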

  • Determine whether we need new API functions (in the OpenMP runtime) to represent the two parts of a blocking memory transfer after we split it. (Note that the async versions might be sufficient or could be extended; a hypothetical interface sketch follows after this list.) Perform the required action.
  • Recognize calls to the APIs that we want to support in the OpenMPOpt pass and identify the memory regions involved.
  • Implement an optimization that, given a (set of) instructions and a set of memory ranges, moves the instruction(s) earlier or later, respectively, in the execution path (see the code-motion sketch after this list).
  • Try to generalize the optimization and apply it to other use cases, e.g. earlier task issuing.
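
For the first item, one possible shape of a split interface is sketched below. The names mimic libomptarget's __tgt_ naming style but are invented here for illustration; the existing asynchronous entry points may well be sufficient or extensible instead:

```c
#include <stddef.h>

/* Opaque handle identifying an in-flight transfer (hypothetical type). */
typedef void *__tgt_transfer_handle_t;

/* Start the host-to-device copy and return immediately. */
__tgt_transfer_handle_t __tgt_issue_memcpy_host2device(int DeviceId, void *Dst,
                                                       const void *Src, size_t N);

/* Block until the transfer identified by Handle has completed. */
void __tgt_wait_transfer(int DeviceId, __tgt_transfer_handle_t Handle);
```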
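For the code-motion item, a minimal sketch of how hoisting the issue call could look inside OpenMPOpt is given below. It assumes the issue call and the source/destination memory locations have already been identified; hoistIssue and the single-basic-block backwards scan are illustrative assumptions, not the actual pass implementation:

```cpp
#include "llvm/ADT/STLExtras.h"
#include "llvm/Analysis/AliasAnalysis.h"
#include "llvm/Analysis/MemoryLocation.h"
#include "llvm/IR/Instructions.h"

using namespace llvm;

// Move the "issue" call as far up its basic block as is provably safe.
static void hoistIssue(CallInst *Issue, const MemoryLocation &SrcLoc,
                       const MemoryLocation &DstLoc, AAResults &AA) {
  Instruction *InsertPt = Issue;
  for (Instruction *I = Issue->getPrevNode(); I; I = I->getPrevNode()) {
    // Stop at anything that may touch either buffer, or that we should not
    // speculate the issue call across (other calls, throwing instructions).
    if (isModOrRefSet(AA.getModRefInfo(I, SrcLoc)) ||
        isModOrRefSet(AA.getModRefInfo(I, DstLoc)) ||
        isa<CallBase>(I) || I->mayThrow())
      break;
    // The call's operands must still dominate it after the move.
    if (any_of(Issue->operands(),
               [I](const Use &U) { return U.get() == I; }))
      break;
    InsertPt = I;
  }
  if (InsertPt != Issue)
    Issue->moveBefore(InsertPt);
}
```

The "wait" call would be sunk analogously, scanning forward instead of backward.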
@jdoerfert added the help wanted, A-openmp, and gsoc (Google Summer of Code) labels Mar 14, 2020
@asl asl removed the A-openmp label Apr 3, 2020
am11 pushed a commit to am11/llvm-project that referenced this issue Mar 29, 2022
trevor-m pushed a commit to trevor-m/llvm-project that referenced this issue Apr 20, 2023
trevor-m pushed a commit to trevor-m/llvm-project that referenced this issue Apr 20, 2023
@Endilll Endilll added the openmp label Jan 20, 2024
llvmbot (Collaborator) commented Jan 20, 2024:

@llvm/issue-subscribers-openmp

Author: Johannes Doerfert (jdoerfert)

