Sage Journals: Discover world-class research

Abstract

Recently, the Fast Fourier Transform (FFT) was introduced to study pattern characterization in textiles. Accelerators are a promising platform for large-scale FFT computation. However, current accelerated FFT implementations compute only on the accelerator and use the CPU as a mere task controller. Additionally, a specified interface for FFT is built into the China accelerator (CA), which severely restricts FFT parallelization and performance. Hence, we transform four multiplications and six additions into six Fused Multiply Add (FMA) operations, reducing total float operations by 40%, assisted by the FMA units with which each Vector Processing Unit (VPE) is equipped. Moreover, we propose an Interface Adapter (IA) to cater to the specified interface and a Fused Algorithm for Interface Adapter (FAIA) to fully use both the CA and CPU to compute large-scale FFT with coordination. Experimental results on a 16-node heterogeneous system showed a performance of over 2.28 TFlops for a 2³⁶-point FFT.

Keywords

China Accelerator, Computer Vision, FFT, Fused Algorithm for Interface Adapter, Fused Multiply Add, Interface Adapter

Introduction

The Fast Fourier Transform (FFT) is one of the most widely-used numerical algorithms today in science and engineering domains for advanced textile research, large-scale physics simulations, signal processing, and mechanical responses of materials with complex structures. ^1–6 In the field of textiles and clothing, FFT can be used in online textile and clothing quality inspection and it is essential in controlling and updating textile and clothing quality for cotton textile corporations. FFT images can also be used as elements for textile image design for items such as clothing.

Because of color sense with expansion and contraction, the real Fourier transform can automatically generate patterns with fine structure. Based on the FFT with Fused Multiply Add (FMA) for the China accelerator (CA), a clothing image program can automatically generate FFT images. Thus, samples of fashion and textile images using patterns of points and their combinations can be designed efficiently.

Zeman proposed conjugate gradient FFT-based solvers to resolve the linear elastic problem,⁷ and Kabel used composite voxels in FFT-based homogenization to consider the interface, which can really improve the local quality of the calculated strain and stress fields; homogeneous reference material should be introduced into FFT-based methods.⁸ However, large scientific and engineering computations would spend the majority of execution time on large size FFTs, which require a large amount of computing resources.

Fortunately, accelerators such as the General Purpose Graphical Processing Unit (GPGPU),^1,2,9–11 Many Integrated Core (MIC),^12–14 Digital Signal Processor (DSP),^3,4,15–18 and Field Programmable Gate Array (FPGA)^5,6,19–28 have recently proved to be a more promising platform for solving FFT problems, since accelerators have far more parallel computing resources and can often achieve an order of magnitude performance improvement over CPUs.

Efforts have largely focused on GPGPU-accelerated FFTs such as CUFFT from Nvidia,^29 a GPU-based out-of-card FFT library from Gu et al.,^30 FFT implementations based on a GPU cluster, hybrid GPU/CPU FFT for large FFT problems,^8,9,25,31 and FFT accelerated using MICs.^12–14 However, this literature has emphasized the huge computing resources offered by accelerators and has barely taken computing complexity and specified restrictions on the size of FFTs into account.

Different from traditional accelerators such as GPGPU and MIC, CA is a self-controlled high-performance accelerator³² that could be widely used in physics simulations, signal processing, and data compression.

Besides having a different architecture and programming model from GPGPU and MIC, the CA has a specified FFT interface built in, which severely restricts FFT parallelization and performance. Hence, a customized FFT for the CA is proposed to take full advantage of both the accelerator and the CPU to compute large-scale FFTs with coordination.

Overview of FFT Algorithm

FFT is a fast and efficient algorithm for computing the Discrete Fourier transform (DFT) and the Inverse Discrete Fourier transform (IDFT). FFT algorithms recursively decompose an N-point DFT into several smaller DFTs, and this divide-and-conquer approach, proposed by Cooley and Tukey,^33,34 reduces the operational complexity of a DFT from O(N²) to O(N log N). In this study, we exploit both the symmetry and the periodicity of the complex exponential $W_{N} = e^{- j 2 \pi / N}$, known as the twiddle factor.
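As an illustration of the divide-and-conquer idea described above, the following is a minimal Python sketch of a recursive radix-2 decimation-in-time FFT, checked against a direct O(N²) DFT. The function names `dft_naive` and `fft_radix2` are hypothetical and chosen for this example only; this is a textbook formulation, not the kernel built into the CA.

```python
import cmath


def dft_naive(x):
    """Direct O(N^2) DFT: X[k] = sum_n x[n] * W_N^(k*n), with W_N = exp(-2j*pi/N)."""
    n = len(x)
    return [
        sum(x[j] * cmath.exp(-2j * cmath.pi * j * k / n) for j in range(n))
        for k in range(n)
    ]


def fft_radix2(x):
    """Recursive radix-2 decimation-in-time FFT; len(x) must be a power of two."""
    n = len(x)
    if n < 1 or n & (n - 1):
        raise ValueError("length must be a power of two")
    if n == 1:
        return [complex(x[0])]
    even = fft_radix2(x[0::2])  # N/2-point DFT of even-indexed samples
    odd = fft_radix2(x[1::2])   # N/2-point DFT of odd-indexed samples
    out = [0j] * n
    for k in range(n // 2):
        w = cmath.exp(-2j * cmath.pi * k / n)  # twiddle factor W_N^k
        t = w * odd[k]
        out[k] = even[k] + t           # periodicity of the half-size DFTs
        out[k + n // 2] = even[k] - t  # symmetry: W_N^(k + N/2) = -W_N^k
    return out
```

For any power-of-two input length, `fft_radix2(x)` should agree with `dft_naive(x)` up to floating-point rounding.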

Architecture of CA

The CA is a self-controlled high-performance accelerator made by the National University of Defense Technology of China.

As illustrated in Fig. 1, the CA consists of one CPU node, six DSP nodes, one IO node, a GC (Global Cache) partitioned across the nodes, and four MCUs (Memory Control Units); all nodes are connected by a ring interconnect. Each DSP node is composed of two computing cores, one SubGC (Sub-Global Cache), and a sync unit for synchronization with the GC. Each computing core contains 16 VPEs, every VPE contains three FMAs, and every two SubGCs connect to an MCU.

Fig. 1

Architecture of China Accelerator (CA).
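The unit counts given in the description above multiply out as follows. This short Python sketch only restates the stated figures (six DSP nodes, two cores per node, 16 VPEs per core, three FMAs per VPE); the constant names are hypothetical.

```python
# Figures taken from the architecture description; names are illustrative only.
DSP_NODES = 6
CORES_PER_DSP_NODE = 2
VPES_PER_CORE = 16
FMAS_PER_VPE = 3

cores = DSP_NODES * CORES_PER_DSP_NODE
vpes = cores * VPES_PER_CORE
fmas = vpes * FMAS_PER_VPE

print(cores, vpes, fmas)  # prints: 12 192 576
```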

Details on FMA

As illustrated in Fig. 2, each FMA consists of one DFU (Data Fetching Unit), one MMU (Mantissa Multiplication Unit), one FMA (Fused Multiply Add) module, and one normalization unit. Each FMA module is composed of one float MAC (Multiply-Accumulate) unit, one double MAC unit, and one addition unit. Each unit is detailed as follows.

Fig. 2

Architecture of FMA.

The DFU fetches the operands, separates each input into its sign, exponent, and mantissa, and outputs them to the MMU. The MMU receives the high and low parts of the mantissas of the multiplication operands, and four float multipliers then execute the mantissa multiplication in parallel. The FMA module performs the alignment shift according to the exponent of each operand. Finally, the mantissa of the product is added to the aligned mantissa of the addend to obtain the result mantissa, which is then output.
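To make the field separation performed by the DFU concrete, the sketch below splits a single-precision value into sign, exponent, and mantissa. It assumes the IEEE 754 binary32 layout (1 sign bit, 8 exponent bits, 23 mantissa bits); the source does not state which floating-point format the CA uses, and `split_float32` is a hypothetical helper name.

```python
import struct


def split_float32(value):
    """Return (sign, biased_exponent, mantissa) of value as IEEE 754 binary32.

    Assumption: the operand uses the binary32 layout; this mirrors the
    field separation step in software only, not the hardware datapath.
    """
    bits = struct.unpack(">I", struct.pack(">f", value))[0]
    sign = bits >> 31                # 1 bit
    exponent = (bits >> 23) & 0xFF   # 8 bits, biased by 127
    mantissa = bits & 0x7FFFFF       # 23 bits, implicit leading 1 not stored
    return sign, exponent, mantissa


print(split_float32(1.0))   # prints: (0, 127, 0)
print(split_float32(-2.5))  # prints: (1, 128, 2097152)
```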

Restrictions on FFT

The CA attains high performance through a customized FFT kernel with a specified and fixed interface: the smaller DFT data subsequence is fixed at 2²² bytes, that is, 256 x 1024 complex values or 512 x 1024 float values. This severely limits the flexibility and generality of FFTs ported to the CA, but in return the customized FFT kernel interface maximizes the efficiency of the CA and attains the expected FFT performance. Table I lists the details of the customized FFT kernel interface built into the CA according to its architecture and design philosophy.

Table I.

FFT Kernel Built in CA

DFT size	Data layout	Running on CA	Performance
2²² bytes	256 x 1024 complex	√	Very good
2²² bytes	512 x 1024 float	√	Very good
2²² bytes	….	√	Very good
Others	-	x	-

As detailed in Table I, only a DFT subsequence of exactly 2²² bytes (including 256 x 1024 complex values and 512 x 1024 float values) runs on the CA and attains the expected performance. DFT subsequences of any other size cannot run on the CA because of the customized FFT kernel, which damages flexibility and generality.

The specified interface therefore results in an unfriendly configuration and restricts parallelization for FFT. Accordingly, an Interface Adapter (IA) configured in FAIA is recommended to provide a friendly configuration for FFT. Furthermore, FAIA also coordinates the CA and the CPU to compute large-scale FFTs.

Reduction Computations for FFT with FMA

We can also represent the N-point DFT X(k) with the N/2-point DFTs X₁(k) and X₂(k), which are periodic with period N/2. The butterfly computation transforms two complex input points into two complex output points to compute the FFT. The Radix-2 FFT butterfly splits the N-point data sequence into two N/2-point data sequences, for the even and odd numbered input samples, respectively, as demonstrated in Fig. 3.

Fig. 3

Radix-2 FFT butterfly computation flow.

Each butterfly requires two complex additions and one complex multiplication, because W_N^k X₂(k) occurs in both the top and bottom halves of the butterfly and can be computed once. In practice, one complex addition becomes two float additions, and one complex multiplication becomes four float multiplications and two float additions.
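The operation breakdown above can be written out with real arithmetic only. The following Python sketch is a plain illustration (the function name `butterfly` and the returned count dictionary are assumptions for this example): it computes the twiddle product once and reuses it in both halves, which is where the four multiplications and six additions come from.

```python
def butterfly(a, b, w):
    """Radix-2 butterfly: return (a + w*b, a - w*b) and the float operation counts."""
    ar, ai = a.real, a.imag
    br, bi = b.real, b.imag
    wr, wi = w.real, w.imag

    # One complex multiplication t = w * b: 4 multiplications, 2 additions.
    tr = wr * br - wi * bi
    ti = wr * bi + wi * br

    # Two complex additions (top and bottom halves): 4 additions.
    top = complex(ar + tr, ai + ti)
    bottom = complex(ar - tr, ai - ti)

    counts = {"mul": 4, "add": 6}  # tallied by hand from the statements above
    return top, bottom, counts
```

A call such as `butterfly(1 + 2j, 3 - 1j, 0.6 + 0.8j)` should match `a + w * b` and `a - w * b` computed with Python's built-in complex arithmetic.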

Complexity for Traditional FFT

As illustrated in Fig. 3, the twiddle factor is $W_N^0$ for level one of the butterfly computation; the twiddle factors are $W_N^0$ and $W_N^{N/4}$ for level two; and for level three they are $W_N^0$, $W_N^{N/8}$, $W_N^{2N/8}$, and $W_N^{3N/8}$. Similarly, for level M, the twiddle factors are $W_N^0$, $W_N^1$, ..., $W_N^{N/2-1}$.

Theoretically, for an N = 2^M point DFT, there are M = log₂ N levels of butterfly computation flows, and each level has N/2 butterfly computations, in which each butterfly requires two complex additions and one complex multiplication. Generally, one complex addition becomes two float additions, and one complex multiplication becomes four float multiplications and two float additions. Accordingly, the maximal FLoat Operation Per Butterfly (FLOPB) is computed as Eq. 1.

\begin{matrix} \mathrm{FLOPB} = 1 \times (4 + 2) + 2 \times 2 \\ = 4 + 6 \\ = 10 \end{matrix}

Eq. 1

According to Eq. 1, there are four float multiplications and six float additions per butterfly. Hence, the Total FLOPB (TFLOPB) for an N = 2^M point DFT is represented as Eq. 2.

\begin{matrix} \mathrm{TFLOPB} = M \times \mathrm{FLOPB} \times \frac{N}{2} \\ = 5 N \log_{2} N \end{matrix}

Eq. 2
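As a worked instance of Eq. 2, the Python sketch below evaluates M × FLOPB × N/2 for a power-of-two N. The helper name `tflopb` is hypothetical; the default of 10 operations per butterfly comes from Eq. 1.

```python
def tflopb(n, flops_per_butterfly=10):
    """Total float operations for an n-point radix-2 FFT, following Eq. 2."""
    m = n.bit_length() - 1  # M = log2(N) when N is a power of two
    if n < 2 or (1 << m) != n:
        raise ValueError("n must be a power of two, at least 2")
    return m * flops_per_butterfly * (n // 2)


print(tflopb(1024))     # prints: 51200, which equals 5 * 1024 * 10
print(tflopb(2 ** 20))  # prints: 104857600
```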

As demonstrated in Eq. 2, the N-point DFT requires a rather large amount of computation. Generally, there are two ways to solve the large-scale FFT problem. One is accelerating the FFT with accelerators such as GPU, MIC, FPGA, and DSP. The other is reducing computing complexity, as Cooley and Tukey did in reducing the complexity of a DFT from O(N²) to O(N log N). Cooley and Tukey paid attention only to the mathematical algorithm and barely took architectural optimization into account.

Reduction Computations Assisted by FMA

High performance processors contain hundreds of vector functional units such as MAC units, which fuse one independent multiplication and one addition into a single MAC operation and thus reduce the number of float operations.

Fortunately, by taking advantage of the architectural properties of the accelerator, especially the FMA units of the CA, the operation count of a DFT is further reduced from 5N log₂ N to 4N log₂ N, based on the per-butterfly counts shown in Eq. 3.

\begin{matrix} \text{top half butterfly:} \\ \mathrm{FLOPB} = 1 \times (4 + 2) + 1 \times 2 \\ = 4 + 4 \\ = 8 \\ \text{bottom half butterfly:} \\ \mathrm{FLOPB} = 1 \times (4 + 2) + 1 \times 2 \\ = 4 + 4 \\ = 8 \end{matrix}

Eq. 3

Each butterfly computation can be separated into a top half and a bottom half. Each half butterfly contains four float multiplications and four float additions, which can be assembled directly into four FMA operations. Therefore, there are eight FMA operations per butterfly, as demonstrated in Fig. 4.

Fig. 4

Operations fused with FMA.

Hence, the optimized TFLOPB for N = 2^M point DFT is shown in Eq. 4.

\begin{matrix} \text{opt-TFLOPB} = M \times \mathrm{FLOPB} \times \frac{N}{2} \\ = 4 N \log_{2} N \end{matrix}

Eq. 4

The FMA equipped in the CA is an optimized MAC unit, which integrates one multiplication and one addition into one FMA operation. Accordingly, the FMA assembles four float multiplications and four float additions into four FMAs for each of the top and bottom halves. Thus, eight FMAs are sufficient for each butterfly.
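One way to arrange a butterfly as four FMAs per half is sketched below in Python. This is an illustrative arrangement under stated assumptions, not the CA kernel: `fma` is a software stand-in (it is not a single-rounding hardware operation), operand negation is assumed to be free, and the names `butterfly_fma` and `opt_tflopb` are hypothetical.

```python
def fma(a, b, c):
    """Software stand-in for a hardware fused multiply-add: a * b + c."""
    return a * b + c


def butterfly_fma(a, b, w):
    """Radix-2 butterfly (a + w*b, a - w*b) written as 4 + 4 FMA operations."""
    ar, ai = a.real, a.imag
    br, bi = b.real, b.imag
    wr, wi = w.real, w.imag

    # Top half, a + w*b: four FMAs (4 multiplications fused with 4 additions).
    top_re = fma(-wi, bi, fma(wr, br, ar))
    top_im = fma(wi, br, fma(wr, bi, ai))

    # Bottom half, a - w*b: four FMAs.
    bot_re = fma(wi, bi, fma(-wr, br, ar))
    bot_im = fma(-wi, br, fma(-wr, bi, ai))

    return complex(top_re, top_im), complex(bot_re, bot_im)


def opt_tflopb(n, fmas_per_butterfly=8):
    """Operation count M * 8 * N/2 = 4 * N * log2(N) for a power-of-two n."""
    m = n.bit_length() - 1
    if n < 2 or (1 << m) != n:
        raise ValueError("n must be a power of two, at least 2")
    return m * fmas_per_butterfly * (n // 2)


print(opt_tflopb(1024))  # prints: 40960, which equals 4 * 1024 * 10
```

Against the 5N log₂ N count of the traditional butterfly, the ratio for any power-of-two N is 8/10.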

Leveraging between CPU and CA

The CA is used to accelerate large-scale FFTs; however, the size of the DFT subsequence is specified and fixed at 2²² bytes by the kernel built into the CA to maximize its efficiency. In other words, the CA only accepts 256 x 1024 complex or 512 x 1024 float data; smaller or larger DFTs are not accelerated, or even computed, by the CA.

Interface Adapter for FFT

To provide a friendly testing configuration and fully utilize both the CA and CPU to solve large-scale FFT problems with coordination, we use the IA for FFT to cater to the specified FFT interface built into the CA.

With the assistance of the IA, the size of the DFT subsequence is no longer confined to 2²² bytes, and the conditions for calling the FFT kernel are weaker than those of the original interface. Hence, a DFT subsequence of 2²² bytes or more is scheduled to call the FFT kernel through the IA, while sizes of less than 2²² bytes are ignored by the CA. The comparisons with and without the IA are listed in Table II.

Table II.

FFT Called by CA with IA

Complex (8 bytes)	Float (4 bytes)	Accelerated by CA without IA	Accelerated by CA with IA	Comments
<256*1024	<512*1024	×	×	FFT running on CPU only, because the smaller DFT data subsequence is less than 2²² bytes
=256*1024	=512*1024	√	√	FFT running on CA only
>256*1024	>512*1024	×	√	FFT running on both CPU and CA, assisted by IA
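The three rows of Table II can be read as a small decision rule. The Python sketch below encodes it for complex input; the function name `runs_on` and the returned labels are hypothetical, and the fallback to the CPU for oversized inputs without the IA is an assumption (the table only states that the CA does not accelerate them).

```python
KERNEL_COMPLEX_POINTS = 256 * 1024  # fixed kernel size for complex data (Table II)


def runs_on(n_complex_points, with_ia):
    """Return where an FFT of the given size runs, following Table II."""
    if n_complex_points == KERNEL_COMPLEX_POINTS:
        return "CA"        # exact kernel size: accepted with or without IA
    if n_complex_points > KERNEL_COMPLEX_POINTS and with_ia:
        return "CPU+CA"    # IA splits the work between both devices
    return "CPU"           # too small, or too large without IA (assumed fallback)


print(runs_on(256 * 1024, with_ia=False))  # prints: CA
print(runs_on(2 ** 20, with_ia=True))      # prints: CPU+CA
```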

FAIA Scheduling Strategy

Generally, there are two major classes of scheduling strategy for speeding up a heterogeneous FFT.

Static Division

Static Division (SD) has been used in heterogeneous supercomputers equipped with CPUs and GPUs, such as TH-1A.³⁵ SD gives priority to the FFT and explores the optimal ratio of the DFT sequence to assign to the GPU, as shown in Eq. 5.

\begin{array}{l} \kappa = \frac{T_{gpu}}{T_{cpu} + T_{gpu}} \times 100\% \\ \text{sub-}S_{gpu} = \kappa \times S \end{array}

Eq. 5

T_cpu and T_gpu represent computing time for DFT by CPU and GPU, respectively. S is the whole DFT sequence, and sub-S_gpu is the portion distributed to GPU acceleration.
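A direct transcription of Eq. 5 into Python is sketched below. It follows the formula as printed, with κ returned as a fraction rather than a percentage; the function name `static_division` and the rounding of the GPU share down to a whole number of points are assumptions made for this example.

```python
def static_division(total_points, t_cpu, t_gpu):
    """Split a DFT sequence of total_points between GPU and CPU using Eq. 5.

    Returns (kappa, sub_s_gpu, sub_s_cpu). kappa is the ratio from Eq. 5
    expressed as a fraction; the GPU share is rounded down (an assumption).
    """
    kappa = t_gpu / (t_cpu + t_gpu)
    sub_s_gpu = int(kappa * total_points)
    sub_s_cpu = total_points - sub_s_gpu
    return kappa, sub_s_gpu, sub_s_cpu


print(static_division(1000, t_cpu=1.0, t_gpu=3.0))  # prints: (0.75, 750, 250)
```

Because the ratio is fixed before execution, the split cannot react to the performance actually observed at run time.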

Dynamic Schedule

Dynamic Schedule (DS) has also been adopted for TH-2.³⁶ It gives priority to the accelerator and dispatches smaller DFT subsequences to the accelerators or the CPU according to the working queue status of the CPU and accelerator.

SD is a simple and effective way to accelerate FFT in a heterogeneous system. However, SD must select the ratio in advance based on theoretical peak performance, which usually differs from the performance achieved in practice. It also lacks flexibility, because the division ratio is decided in advance and never revised. Comparatively, DS has better flexibility than SD, but introduces complicated scheduling algorithms.

Both SD and DS can be adopted in heterogeneous systems, and neither places restrictions on the size of the FFT kernel. However, neither SD nor DS is suitable for the CA, since it only accepts 256 x 1024 complex or 512 x 1024 float data; otherwise, the DFT would not even be computed by the CA. Therefore, a Fused Algorithm based on the Interface Adapter (FAIA) prunes the whole FFT sequence into DFT subsequences of the specified size to cater to the specified FFT interface built into the CA. This is demonstrated in Fig. 5a for 1D FFT input data and in Fig. 5b for 2D FFT input data. Furthermore, FAIA also makes full use of the CA and CPU through coordination.

As demonstrated in Fig. 5, a large-scale FFT sequence is pruned into Sub-S_FT size subsequences as data inputs for the CA, and non-Sub-S_FT size subsequence data is computed by the CPU. Sub-S_FT and non-Sub-S_FT should satisfy Eq. 6.

\begin{array}{l} S = k \times \text{Sub-}S_{FT} + w \times \text{Non-Sub-}S_{FT} \\ \text{Non-Sub-}S_{FT} = \text{Sub-}S_{CPU} \\ \eta = \frac{\text{Peak performance}_{\text{China accelerator}}}{\text{Peak performance}_{CPU}} = \frac{\text{Sub-}S_{FT}}{\text{Sub-}S_{CPU}} \end{array}

Eq. 6

Fig. 5

Orchestrating FFT with coordination by using Sub-S_FT accelerated by CA and non-Sub-S_FT computed by CPU to cater to the specified interface. (a) 1D FFT input data and (b) 2D FFT input data. Only the Sub-S_FT size dataset is accelerated by the CA, and the non-Sub-S_FT size dataset runs on the CPU.

S is the size of the N-point FFT sequence, k = 0,1,2…, and w= 0,1,2… Moreover, FAIA is proposed based on integrating DS and SD strategies to fully take advantage of both the CA and CPU coordination to compute the large-scale FFT as listed in Algorithm 1.

Algorithm 1.

Pseudocode on FAIA

1	init CQ //CPU queue
2	init AQ //CA queue
3	activate CA and CPU and wait for computing
4	get architecture parameters Sub-S_FT and $\eta = \frac{\text{Peak performance}_{\text{China accelerator}}}{\text{Peak performance}_{CPU}}$
5	Sub-S_CPU = Sub-S_FT / η
6	prune current large-scale DFT sequence S into one DFT subsequence Sub-S_FT and remainder S_R
7	S = S_R //label remainder S_R as S
8	if (!AQ) then
9	  AQ ← Sub-S_FT //queuing Sub-S_FT into AQ
10	else
11	  if (S_R < Sub-S_CPU) then
12	    CQ ← S_R //queuing S_R into CQ
13	    go to step 30
14	  else
15	    cut Sub-S_CPU out from S_R, leaving remainder S′_R
16	    if (!CQ) then
17	      CQ ← Sub-S_CPU //queuing Sub-S_CPU into CQ
18	      S_R = S′_R
19	    end if
20	  end if
21	  S = S_R
22	end if
23	evaluate waiting time T_AQ for AQ
24	evaluate waiting time T_CQ for CQ
25	if (T_AQ ≤ T_CQ) then
26	  go to step 6
27	else
28	  go to step 11
29	end if
30	while (!AQ && !CQ) then
31	  finished

As detailed in Algorithm 1, FAIA divides a large-scale FFT sequence into smaller DFT subsequences according to the specified interface and architectural parameters, and uses a performance evaluation model to decide whether the current subsequence is delivered to the CA queue or the CPU queue according to the status of the queues.

In practice, FAIA is a combined static and dynamic balancing algorithm based on the IA that takes full advantage of CA and CPU coordination to compute large-scale FFTs.
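The scheduling idea in Algorithm 1 can be approximated by a much simpler greedy simulation, sketched below in Python. This is not a line-by-line implementation of Algorithm 1: the function name `faia_schedule` and the cost model are assumptions (the CPU is taken to process one point per time unit and the CA η points per time unit, and queue waiting time is estimated as accumulated work). It keeps the two constraints stated in the text: the CA only receives chunks of exactly Sub-S_FT points, and CPU chunks have size Sub-S_CPU = Sub-S_FT / η.

```python
def faia_schedule(total, sub_s_ft, eta):
    """Greedy sketch of FAIA: return (ca_chunks, cpu_chunks) covering total points.

    Assumed cost model: CPU handles 1 point per time unit, CA handles eta
    points per time unit; t_aq / t_cq estimate each queue's waiting time.
    """
    sub_s_cpu = max(1, int(sub_s_ft / eta))  # step 5: Sub-S_CPU = Sub-S_FT / eta
    remaining = total
    ca_chunks, cpu_chunks = [], []
    t_aq = t_cq = 0.0

    while remaining > 0:
        # The CA only accepts full Sub-S_FT chunks, and only when its queue
        # is expected to drain no later than the CPU queue.
        if remaining >= sub_s_ft and t_aq <= t_cq:
            ca_chunks.append(sub_s_ft)
            remaining -= sub_s_ft
            t_aq += sub_s_ft / eta
        else:
            chunk = min(sub_s_cpu, remaining)  # a short remainder goes to the CPU
            cpu_chunks.append(chunk)
            remaining -= chunk
            t_cq += chunk

    return ca_chunks, cpu_chunks


print(faia_schedule(total=20, sub_s_ft=8, eta=4))  # prints: ([8, 8], [2, 2])
```

In this toy run the CA and CPU queues alternate, since each CPU chunk is sized to take about as long as one CA chunk under the assumed cost model.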

Experiments and Analysis

Testing Heterogeneous System

The testing heterogeneous system was built with an array of CNs (Computing Nodes), and each CN contains a computing blade, an accelerating board, and a peripheral component interconnect express (PCIE) bus, as demonstrated in Fig. 6. In the validating heterogeneous system, the computing blade and accelerating board can be attached and detached on demand. Each accelerating board holds four CAs connected to the computing blade over PCIE, and each computing blade holds four CPUs interconnected with a high-speed intranet by an NIO (Network Input and Output) card.

Fig. 6

Architecture of the testing node.

Using the testing heterogeneous system, the architectural optimizations with FMA and the FAIA algorithm with IA were validated. In the experiments, the entire validating heterogeneous system was composed of 16 CNs (Fig. 6). Details of the basic experimental environment configuration of a CN are listed in Table III.

Table III.

Details on Testing Environment

System	Component	Attribute	Number
Hardware	CPU	Intel(R) Xeon(R) CPU E5-2692 v2 @ 2.20 GHz	4
Hardware	CA	FT-GPDSP 2000b @ 1.25 GHz	4
Hardware	Memory sub-system	8 x Samsung 8G DDR3 1333 MHz	4
Software	OS	Linux kylin-phytium+	-
Software	Compiler	Intel icc 15.0.0 + phytium Compiler	-

IA Testing

The proposed IA encapsulates the fixed FFT interface into an ordinary one. FFT testing comparisons for the IA are listed in Table II. Without the IA, when the FFT input size N is not equal to 256 x 1024 complex or 512 x 1024 float values, the FFT cannot be accelerated by the CA, because of the strict requirements of the FFT kernel with the fixed interface built into the CA. With the IA, we could test the FFT as usual: the FFT is accelerated by the CA whenever N is greater than or equal to 256 x 1024 complex or 512 x 1024 float values, which is a much weaker requirement than without the IA.

FAIA Validation

The focus of the IA is to cater to the specified FFT interface built into the CA, but the IA is weak in coordinating the computing load between the CPU and CA. Therefore, we propose FAIA, which integrates the DS and SD strategies to take full advantage of both the CA and CPU and compute large-scale FFTs with coordination between them, as detailed in Algorithm 1. The performance of the FAIA validation is illustrated in Fig. 7.

Fig. 7

Performance comparisons for FAIA validation.

Performance Improvement with FAIA

To validate FAIA comprehensively, besides comparing it with the SD and DS scheduling strategies, we compared its performance with the well-known six-step algorithm,^37,38 an optimal cache-oblivious algorithm for high performance processors such as GPU, MIC, and other accelerators. The comparison is presented in Fig. 8.

Fig. 8

Performance improvement to six-step method.

When the FFT sequence was smaller than 2²⁰-point, FAIA performance was slightly inferior to the six-step algorithm. As the size of the FFT sequence increased from 2²⁰-point to 2²²-point, FAIA performance moved progressively closer to the six-step algorithm. Furthermore, when the FFT sequence was greater than 2²²-point, FAIA performance was superior to the six-step algorithm. For smaller DFT subsequences (less than 2²⁰ bytes), the FFT ran on the CPU only, without CA acceleration, for both FAIA and the six-step method. As the FFT sequence increased to 2²² point, the FFT subsequence was gradually accelerated by the CA for FAIA, but was still ignored by the CA for the six-step algorithm. When the FFT sequence was longer than 2²² point, the FFT ran on both the CPU and CA with IA for FAIA, and FFT using FAIA gave better performance than the six-step algorithm without IA.

Conclusions

Although FFT optimizations have been previously reported, this study details the complexity reduction of the DFT using FMA with an IA for the China Accelerator (CA).

Assisted by architectural optimizations, the FMA assembles the four float multiplications and four float additions of each half butterfly into four FMA operations, which further reduces the operation count of the FFT from 5N log₂ N to 4N log₂ N.

To provide a friendly testing configuration and fully use both the CA and CPU to complete large-scale FFT problems, an IA was proposed to cater to the specified FFT interface built into the CA. Furthermore, we proposed the FAIA algorithm, based on integrated DS and SD strategies, to take full advantage of both the CA and CPU to compute large-scale FFTs with coordination.

Experimental results validated that we successfully achieved a performance of over 2.28 TFlops on 16 nodes of the validating heterogeneous system composed of a CPU and CA (16 nodes, 11.25 TFlops/node, 180 TFlops peak performance) for a 2³⁶-point FFT, and FAIA performance was superior to or close to the famous and optimal six-step algorithm for a heterogeneous system composed of the CPU and CA.

Footnotes

Acknowledgements

This work is partly supported by the National Numerical Wind Tunnel Key Project of China Grant No. NNW2019ZT6-B21, partly supported by the Specialized Research Fund for State Key Laboratories of Space Weather, Chinese Academy of Sciences, and partly supported by National Natural Science Foundation of China Grant No. 61602495.

References

1. Wood, E. J. Textile Research Journal 1990, 60 (4), 212–220.

2. Bing; Wang. Composite Structures 2018, 192, 255–263.

3. R.-C.; Chen, H.-M.; Huang, C.-C.; Chiang, C.-T. Realization of interpolated fft algorithm on dsp for accurate harmonic analysis, Proceedings of the 4th IASTED Asian Conference on Power and Energy Systems, AsiaPES, 2008.

4. Arafa, A. A.; Saleh, H. I.; Ashour; Salem. FFT- and DWT-Based FPGA realization of pulse shape discrimination in PET system, Proceedings of the DTIS'09-2009 4th IEEE International Conference on Design and Technology of Integrated Systems in Nanoscale Era.

5. Sanchez, M. A.; Garrido; Lopez-Vallejo; Grajal. IEEE Transactions on Aerospace and Electronic Systems 2008, 44 (4), 1567–1585.

6. Jose Rangel-Magdaleno; Jesus; et al. IEEE Transactions on Instrumentation and Measurement 2010, 59 (12), 3184–3194.

7. Zeman; Vondřejc; Novak. J. Comput. Phys. 2010, 229 (21), 8065–8071.

8. Kabel; Merkert; Schneider. Comput. Methods Appl. Mech. Eng. 2015, 294, 168–188.

9. Radmanović, M. M.; Gajić, D. B.; Stanković, R. S. Journal of Multiple-Valued Logic and Soft Computing 2016, 26 (3–5), 417–438.

10. Hanawa; Fujii; Fujita; et al. Evaluation of FFT for GPU cluster using tightly coupled accelerators architecture, Proceedings of IEEE International Conference on Cluster Computing, CLUSTER 2015, 2015, pp 635–641.

11. Wang; Chandrasekaran; Chapman. CusFFT: A High-Performance Sparse Fast Fourier Transform Algorithm on GPUs, Proceedings of 2016 IEEE 30th International Parallel and Distributed Processing Symposium, IPDPS 2016, 2016, pp 963–972.

12. Khelifi; Massicotte; Savaria. Towards efficient and concurrent FFTs implementation on Intel Xeon/MIC clusters for LTE and HPC, Proceedings of IEEE International Symposium on Circuits and Systems, ISCAS 2016, 2016, pp 2611–2614.

13. Liu, Y.-Q.; Li; Zhang, Y.-Q.; et al. Journal of Computer Science and Technology 2014, 29 (6), 989–1002.

14. Murano; Shimobaba; et al. Computer Physics Communications 2014, 185 (10), 2742–2757.

15. Tang; Wang; Chung, J.-G.; et al. High-speed assembly FFT implementation with memory reference reduction on DSP processors, Proceedings of 11th IEEE International Conference on Electronics, Circuits and Systems, ICECS 2004, 2004, pp 547–550.

16. Kharin; Vityazev; et al. Parallel FFT implementation on TMS320c66x multicore DSP, Proceedings of the 6th European Embedded Design in Education and Research Conference, 2014, pp 46–49.

17. Lei; Chen; Peng. Computer Research and Development 2016, 53 (7), 1438–1446.

18. Wang; Tang; Jiang; et al. IEEE Transactions on Signal Processing 2007, 55 (5 II), 2338–2349.

19. Kumar, Ch. V.; Sastry, K. R. K. Design and implementation of FFT pruning algorithm on FPGA, Proceedings of the 7th International Conference Confluence 2017 on Cloud Computing, Data Science and Engineering, 2017, pp 739–743.

20. T. V.; Panat, A. R. FPGA implementation of FFT processor using vedic algorithm, 2013 IEEE International Conference on Computational Intelligence and Computing Research, IEEE ICCIC, 2013.

21. Arafa, A. A.; Saleh, H. I.; Ashour; et al. FFT- and DWT-Based FPGA realization of pulse shape discrimination in PET system, Proceedings of the 2009 4th IEEE International Conference on Design and Technology of Integrated Systems in Nanoscale Era, 2009, pp 299–302.

22. Zou; Qiu; Song. FPGA implementation of efficient FFT algorithm based on complex sequence, Proceedings of 2010 IEEE International Conference on Intelligent Computing and Intelligent Systems, ICIS 2010, 2010, 2, pp 614–617.

23. Nash, J. G. High-throughput programmable systolic array FFT architecture and FPGA implementations, Proceedings of 2014 International Conference on Computing, Networking and Communications, ICNC 2014, 2014, pp 878–884.

24. Wang, S. Y.; Zhou; Niu, L. H. Design and realization of the FFT processor based on FPGA IP core, WIT Transactions on Information and Communication Technologies, 2014, 60, pp 437–446.

25. Duan; Wang; Li; et al. Floating-point mixed-radix FFT core generation for FPGA and comparison with GPU and CPU, Proceedings of International Conference on Field-Programmable Technology, FPT 2011, 2011.

26. Sanchez, M. A.; Garrido; et al. IEEE Transactions on Aerospace and Electronic Systems 2008, 44 (4), 1567–1585.

27. Chen; Qu; Luo; et al. International Journal of Computational Intelligence Systems 2011, 4 (6), 1131–1139.

28. Zhang, D.-L.; Huang; Song, Y.-K.; et al. International Journal of Control and Automation 2014, 7 (6), 177–188.

29. Nvidia CUFFT Library. http://developer.nvidia.com (accessed November 2020).

30. Gu; Siegel; Li. Using gpus to compute large out-of-card ffts. In Proceedings of the international conference on Supercomputing, ICS '11; ACM: New York, NY, USA, 2011; pp 255–264.

31. Ogata; Endo; Maruyama, N.; Matsuoka. An efficient, model-based cpu-gpu heterogeneous fft library, Proceedings of the IEEE International Symposium on Parallel and Distributed Processing, IPDPS 2008, 2008, pp 1–10.

32. Y. T.; et al. The Applications Leveraging Supercomputing Systems, ISC 2015.

33. Cooley; Tukey. Math. Comput. 1965, 19, 297–301.

34. Ukidave; Schirner; Kaeli. Fast Fourier Transform (FFT) on CPUs, Northeastern University, Boston, MA, USA, 2014.

35. Jun, Y. X.; Xiang-Ke; Kail; Hu, T. Q.-F. Journal of Computer Science and Technology 2011, 26 (3), 344–351.

36. Yang, D. Y.-F.; Wang, C.-Q.; Feng; Hui-Zhan. Journal of Northeastern University 2014, 35 (10), 102–107.

37. Takahashi. Implementation of Parallel 1-D FFT on GPU Clusters, 2013 IEEE 16th International Conference on Computational Science and Engineering (CSE), 2013, pp 174–180.

38. Takahashi; Uno; Yokokawa. An Implementation of Parallel 1-D FFT on the K Computer, Proceedings of the 2012 IEEE 14th International Conference on High Performance Computing and Communication & 2012 IEEE 9th International Conference on Embedded Software and Systems, 2012, pp 344–350.

Customizing FFT with FMA for China Accelerator
