Revision as of 14:30, 26 August 2016 editFmadd (talk | contribs)Extended confirmed users10,468 editsNo edit summary← Previous edit | Latest revision as of 17:39, 2 December 2023 edit undoCitation bot (talk | contribs)Bots5,460,005 edits Misc citation tidying. | Use this bot. Report bugs. | #UCB_CommandLine | ||
(35 intermediate revisions by 16 users not shown) | |||
Line 1: | Line 1: | ||
{{Short description|Type of memory addressing}} | |||
{{about|the vector addressing type|the I/O method| |
{{about|the vector addressing type|the I/O method|vectored I/O|register gather/scatter, part of ]|permute instruction}} | ||
'''Gather/scatter''' is a type of ] addressing that at once collects (gathers) from, or stores (scatters) data to, multiple, arbitrary indices. Examples of its use include ] ] operations,<ref>{{cite journal |last1=Lewis |first1=John G. |last2=Simon |first2=Horst D. |title=The Impact of Hardware Gather/Scatter on Sparse Gaussian Elimination |journal=SIAM Journal on Scientific and Statistical Computing |date=1 March 1988 |volume=9 |issue=2 |pages=304–311 |doi=10.1137/0909019}}</ref> sorting algorithms, ]s,<ref name="he">{{cite book |last1=He |first1=Bingsheng |last2=Govindaraju |first2=Naga K. |last3=Luo |first3=Qiong |last4=Smith |first4=Burton |title=Proceedings of the 2007 ACM/IEEE conference on Supercomputing |chapter=Efficient gather and scatter operations on graphics processors |date=2007 |pages=1–12 |doi=10.1145/1362622.1362684|isbn=9781595937643 |s2cid=2928233 |url=http://repository.ust.hk/ir/bitstream/1783.1-3280/1/scatter_sc07.pdf }}</ref> and some computational graph theory problems.<ref>{{cite book |last1=Kumar |first1=Manoj |last2=Serrano |first2=Mauricio |last3=Moreira |first3=Jose |last4=Pattnaik |first4=Pratap |last5=Horn |first5=W P |last6=Jann |first6=Joefon |last7=Tanase |first7=Gabriel |title=2016 IEEE High Performance Extreme Computing Conference (HPEC) |chapter=Efficient implementation of scatter-gather operations for large scale graph analytics |date=September 2016 |pages=1–7 |doi=10.1109/HPEC.2016.7761578|isbn=978-1-5090-3525-0 |s2cid=10566760 }}</ref> It is the vector equivalent of ], with gather involving indexed reads, and scatter, indexed writes. ]s (and some ] units in ]s) have hardware support for gather and scatter operations, as do many ] systems, allowing large data sets to be transferred to ] more rapidly. | |||
'''Gather-scatter''' is a type of memory addressing that often arises when addressing | |||
vectors in ] ] operations. It is the | |||
The concept is somewhat similar to ], which is sometimes also referred to as scatter-gather I/O. This system differs in that it is used to map multiple sources of data from contiguous structures into a single stream for reading or writing. A common example is writing out a series of ], which in most ]s would be stored in separate memory locations. | |||
vector-equivalent of ], with gather involving indexed | |||
reads and scatter indexed writes. ]s (and some ] units in ]s) have | |||
hardware support for gather-scatter operations, providing instructions such as | |||
''Load Vector Indexed'' for gather and ''Store Vector Indexed'' for scatter. | |||
==Definitions== | ==Definitions== | ||
===Gather=== | ===Gather=== | ||
A ] <math>y</math> holding <math>N</math> non-empty elements can be represented by two densely populated vectors of length <math>N</math>; <math>x</math> containing the non-empty elements of <math>y</math>, and <math>idx</math> giving the index in <math>y</math> where <math>x</math>'s element is located. The gather of <math>y</math> into <math>x</math>, denoted <math>x \leftarrow y|_x</math>, assigns <math>x(i)=y(idx(i))</math> with <math>idx</math> having already been calculated.<ref></ref> Assuming no ] between x, y,idx, a ] implementation is | |||
A ] <math>y</math> | |||
holding <math>N</math> non-empty elements | |||
can be represented by two densely-populated vectors of length <math>N</math>; | |||
<math>x</math> containing the non-empty elements of <math>y</math>, | |||
and <math>idx</math> giving the index in <math>y</math> where <math>x</math>'s element is located. | |||
The gather of <math>y</math> into <math>x</math>, | |||
denoted <math>x \leftarrow y|_x</math>, | |||
assigns <math>x(i)=y(idx(i))</math> with <math>idx</math> having already been calculated.<ref></ref> | |||
A C implementation is | |||
< |
<syntaxhighlight lang="c"> | ||
for (i=0; i<N; ++i) | for (i = 0; i < N; ++i) | ||
x = y]; | x = y]; | ||
</syntaxhighlight> | |||
</source> | |||
===Scatter=== | ===Scatter=== | ||
The sparse scatter, denoted <math>y|_x \leftarrow x</math> is the reverse operation. | The sparse scatter, denoted <math>y|_x \leftarrow x</math> is the reverse operation. It copies the values of <math>x</math> into the corresponding locations in the sparsely populated vector <math>y</math>, i.e. <math>y(idx(i))=x(i)</math>. | ||
It copies the values of <math>x</math> into the corresponding | |||
<syntaxhighlight lang="c"> | |||
locations in the sparsely-populated vector <math>y</math>, i.e. <math>y(idx(i))=x(i)</math>. | |||
⚫ | for (i = 0; i < N; ++i) | ||
⚫ | y] = x; | ||
</syntaxhighlight> | |||
== Support == | |||
Scatter/gather units were also a part of most vector computers, notably the ]. In this case, the purpose was to efficiently store values in the limited resource of the vector registers. For instance, the Cray-1 had eight 64-word vector registers, so data that contained values that had no effect on the outcome, like zeros in an addition, were using up valuable space that would be better used. By gathering non-zero values into the registers, and scattering the results back out, the registers could be used much more efficiently, leading to higher performance. Such machines generally implemented two access models, scatter/gather and "stride", the latter designed to quickly load contiguous data.<ref>{{cite tech report |title=A Seymour Cray Perspective |first=Gordon |last=Bell |date=25 January 1998 |url=https://gordonbell.azurewebsites.net/craytalk/sld063.htm}}</ref> This basic layout was widely copied in later ] designs, especially on the variety of models from Japan. | |||
As ] design improved during the 1990s, commodity CPUs began to add vector processing units. At first these tended to be simple, sometimes overlaying the CPU's general purpose registers, but over time these evolved into increasingly powerful systems that met and then surpassed the units in high-end supercomputers. By this time, scatter/gather instructions had been added to many of these designs. | |||
] CPUs which support the ] instruction set can gather 32-bit and 64-bit elements with memory offsets from a base address. A second register determines whether the particular element is loaded, and faults occurring from invalid memory accesses by masked-out elements are suppressed.<ref name="kusswurm">{{cite book |last1=Kusswurm |first1=Daniel |title=Modern parallel programming with C++ and Assembly language : X86 SIMD development using AVX, AVX2, and AVX-512 |date=2022 |publisher=Apress Media |isbn=978-1-4842-7917-5}}</ref>{{rp|503–4}} The ] instruction set also contains (potentially masked) scatter operations.<ref name="kusswurm" />{{rp|539}}<ref name="hossain">{{cite book |last1=Hossain |first1=Md Maruf |last2=Saule |first2=Erik |title=50th International Conference on Parallel Processing Workshop |chapter=Impact of AVX-512 Instructions on Graph Partitioning Problems |date=9 August 2021 |pages=1–9 |doi=10.1145/3458744.3473362|isbn=9781450384414 |s2cid=237350994 }}</ref> The ] instruction set's ] includes gather and scatter operations on 8-, 16-, 32- and 64-bit elements.<ref name="zhong">{{cite book |last1=Zhong |first1=Dong |last2=Shamis |first2=Pavel |last3=Cao |first3=Qinglei |last4=Bosilca |first4=George |last5=Sumimoto |first5=Shinji |last6=Miura |first6=Kenichi |last7=Dongarra |first7=Jack |title=2020 20th IEEE/ACM International Symposium on Cluster, Cloud and Internet Computing (CCGRID) |chapter=Using Arm Scalable Vector Extension to Optimize OPEN MPI |date=May 2020 |pages=222–231 |doi=10.1109/CCGrid49817.2020.00-71 |isbn=978-1-7281-6095-5 |s2cid=220604878 |chapter-url=https://netlib.org/utk/people/JackDongarra/PAPERS/using-arm.pdf}}</ref><ref>{{cite web |title=What is the Scalable Vector Extension? |url=https://developer.arm.com/documentation/101726/0400/Learn-about-the-Scalable-Vector-Extension--SVE-/What-is-the-Scalable-Vector-Extension- |website=ARM Developer |access-date=19 November 2022}}</ref> ] has hardware support for gather/scatter.<ref>{{cite book |last1=Gainaru |first1=Ana |last2=Graham |first2=Richard L. |last3=Polyakov |first3=Artem |last4=Shainer |first4=Gilad |title=Proceedings of the 23rd European MPI Users' Group Meeting |chapter=Using InfiniBand Hardware Gather-Scatter Capabilities to Optimize MPI All-to-All |date=25 September 2016 |pages=167–179 |doi=10.1145/2966884.2966918|isbn=9781450342346 |s2cid=15880901 }}</ref> | |||
Without instruction-level gather/scatter, efficient implementations may need to be tuned for optimal performance, for example with ]ing; libraries such as OpenMPI may provide such primitives.<ref name="he" /><ref name="zhong"/> | |||
<source lang="C"> | |||
⚫ | for (i=0; i<N; ++i) | ||
⚫ | y] = x; | ||
</source> | |||
== See also == | == See also == | ||
Line 45: | Line 42: | ||
] | ] | ||
] | |||
] |
Latest revision as of 17:39, 2 December 2023
Type of memory addressing This article is about the vector addressing type. For the I/O method, see vectored I/O. For register gather/scatter, part of vector processing, see permute instruction.Gather/scatter is a type of memory addressing that at once collects (gathers) from, or stores (scatters) data to, multiple, arbitrary indices. Examples of its use include sparse linear algebra operations, sorting algorithms, fast Fourier transforms, and some computational graph theory problems. It is the vector equivalent of register indirect addressing, with gather involving indexed reads, and scatter, indexed writes. Vector processors (and some SIMD units in CPUs) have hardware support for gather and scatter operations, as do many input/output systems, allowing large data sets to be transferred to main memory more rapidly.
The concept is somewhat similar to vectored I/O, which is sometimes also referred to as scatter-gather I/O. This system differs in that it is used to map multiple sources of data from contiguous structures into a single stream for reading or writing. A common example is writing out a series of strings, which in most programming languages would be stored in separate memory locations.
Definitions
Gather
A sparsely populated vector holding non-empty elements can be represented by two densely populated vectors of length ; containing the non-empty elements of , and giving the index in where 's element is located. The gather of into , denoted , assigns with having already been calculated. Assuming no pointer aliasing between x, y,idx, a C implementation is
for (i = 0; i < N; ++i) x = y];
Scatter
The sparse scatter, denoted is the reverse operation. It copies the values of into the corresponding locations in the sparsely populated vector , i.e. .
for (i = 0; i < N; ++i) y] = x;
Support
Scatter/gather units were also a part of most vector computers, notably the Cray-1. In this case, the purpose was to efficiently store values in the limited resource of the vector registers. For instance, the Cray-1 had eight 64-word vector registers, so data that contained values that had no effect on the outcome, like zeros in an addition, were using up valuable space that would be better used. By gathering non-zero values into the registers, and scattering the results back out, the registers could be used much more efficiently, leading to higher performance. Such machines generally implemented two access models, scatter/gather and "stride", the latter designed to quickly load contiguous data. This basic layout was widely copied in later supercomputer designs, especially on the variety of models from Japan.
As microprocessor design improved during the 1990s, commodity CPUs began to add vector processing units. At first these tended to be simple, sometimes overlaying the CPU's general purpose registers, but over time these evolved into increasingly powerful systems that met and then surpassed the units in high-end supercomputers. By this time, scatter/gather instructions had been added to many of these designs.
x86-64 CPUs which support the AVX2 instruction set can gather 32-bit and 64-bit elements with memory offsets from a base address. A second register determines whether the particular element is loaded, and faults occurring from invalid memory accesses by masked-out elements are suppressed. The AVX-512 instruction set also contains (potentially masked) scatter operations. The ARM instruction set's Scalable Vector Extension includes gather and scatter operations on 8-, 16-, 32- and 64-bit elements. InfiniBand has hardware support for gather/scatter.
Without instruction-level gather/scatter, efficient implementations may need to be tuned for optimal performance, for example with prefetching; libraries such as OpenMPI may provide such primitives.
See also
References
- Lewis, John G.; Simon, Horst D. (1 March 1988). "The Impact of Hardware Gather/Scatter on Sparse Gaussian Elimination". SIAM Journal on Scientific and Statistical Computing. 9 (2): 304–311. doi:10.1137/0909019.
- ^ He, Bingsheng; Govindaraju, Naga K.; Luo, Qiong; Smith, Burton (2007). "Efficient gather and scatter operations on graphics processors". Proceedings of the 2007 ACM/IEEE conference on Supercomputing (PDF). pp. 1–12. doi:10.1145/1362622.1362684. ISBN 9781595937643. S2CID 2928233.
- Kumar, Manoj; Serrano, Mauricio; Moreira, Jose; Pattnaik, Pratap; Horn, W P; Jann, Joefon; Tanase, Gabriel (September 2016). "Efficient implementation of scatter-gather operations for large scale graph analytics". 2016 IEEE High Performance Extreme Computing Conference (HPEC). pp. 1–7. doi:10.1109/HPEC.2016.7761578. ISBN 978-1-5090-3525-0. S2CID 10566760.
- BLAS Technical Forum standard, Chapter 3: Sparse BLAS.
- Bell, Gordon (25 January 1998). A Seymour Cray Perspective (Technical report).
- ^ Kusswurm, Daniel (2022). Modern parallel programming with C++ and Assembly language : X86 SIMD development using AVX, AVX2, and AVX-512. Apress Media. ISBN 978-1-4842-7917-5.
- Hossain, Md Maruf; Saule, Erik (9 August 2021). "Impact of AVX-512 Instructions on Graph Partitioning Problems". 50th International Conference on Parallel Processing Workshop. pp. 1–9. doi:10.1145/3458744.3473362. ISBN 9781450384414. S2CID 237350994.
- ^ Zhong, Dong; Shamis, Pavel; Cao, Qinglei; Bosilca, George; Sumimoto, Shinji; Miura, Kenichi; Dongarra, Jack (May 2020). "Using Arm Scalable Vector Extension to Optimize OPEN MPI" (PDF). 2020 20th IEEE/ACM International Symposium on Cluster, Cloud and Internet Computing (CCGRID). pp. 222–231. doi:10.1109/CCGrid49817.2020.00-71. ISBN 978-1-7281-6095-5. S2CID 220604878.
- "What is the Scalable Vector Extension?". ARM Developer. Retrieved 19 November 2022.
- Gainaru, Ana; Graham, Richard L.; Polyakov, Artem; Shainer, Gilad (25 September 2016). "Using InfiniBand Hardware Gather-Scatter Capabilities to Optimize MPI All-to-All". Proceedings of the 23rd European MPI Users' Group Meeting. pp. 167–179. doi:10.1145/2966884.2966918. ISBN 9781450342346. S2CID 15880901.