Sage Journals: Discover world-class research

Abstract

To enhance 3D simulation efficiency for large-scale patterned fabric structures, a GPU-based yarn-level parallel computing framework is developed, in which individual yarns are adopted as the fundamental units of parallelization. Two key algorithms are devised within this framework: a parallel spline trajectory fitting algorithm and a parallel yarn mesh generation algorithm. Comparative experiments with conventional serial methods show that the proposed parallel strategy achieves a speedup of up to 10-fold. Moreover, the method effectively alleviates the speed imbalance across different stages of fabric simulation. The algorithms are implemented in OpenCL and perform robustly across different GPU hardware. Simulation case studies confirm that the method applies to a wide range of fabric types, including both woven and knitted large-patterned fabrics, underscoring its substantial potential for real-world deployment.

Keywords

GPU acceleration, parallel algorithm, large-scale patterned fabric, fabric simulation, fabric structure

Introduction

With the rapid development of computer graphics and high-performance computing, three-dimensional (3D) simulation has emerged as a key method in both research and industry.¹ In fields such as fabric computer-aided design (CAD), game development, and animation production, fabric-structure simulation has been widely recognized for its value.²⁻⁴ This simulation significantly reduces the need for physical prototyping in the traditional textile industry. It also enhances the visual realism of fabrics in virtual environments. However, in textile design and rendering, achieving an optimal balance between realism and computational efficiency remains a challenge.⁵ The simulation of large-patterned fabrics is particularly demanding. The complexity of the motifs and the strict requirements for detail lead to high computational demands. Additionally, conventional methods often result in slow simulation speeds.⁶ Therefore, it is crucial to leverage the powerful parallel computing capabilities of Graphics Processing Units (GPUs) for the simulation of large-patterned fabrics.

Due to the limitations of computer hardware performance, two-dimensional color blocks were commonly used in early fabric simulation research to abstract the yarn topology. Simplified lighting models were also employed to simulate the three-dimensional visual effects of the fabric surface.⁷ Özdemir and Başer⁸ assumed that the cross-sectional shape of the yarns in woven fabrics is elliptical. Real yarn images were captured as the basic unit for fabric simulation, and the yarn centerline trajectories were constructed based on an elastic curve model. The fabric structure simulated by this method was relatively accurate, but the results ignored yarn interaction forces. Li et al.⁹ considered this phenomenon and used a spring particle model to simulate the deformation behavior of the Jacquardtronic lace structure. In a subsequent study,¹⁰ they also improved the Blinn-Phong lighting model¹¹ to capture the appearance features along the yarn axis, further enhancing the realism of the lace structure. While computationally feasible for large-patterned fabrics, these 2D methods have not been able to represent the 3D spatial arrangement of yarns or the microstructural geometry of fabric surfaces. This imposes a realism bottleneck and hinders applications in weaving simulation and virtual try-on.¹²

To enhance realism in virtual large-patterned fabrics, simulation must overcome the spatial abstraction limitations of 2D systems through yarn-level solid modeling.¹³ Zheng et al.¹⁴ developed a yarn-level spring particle model. Their approach incorporated fabric structural features, achieving accurate relaxation deformation. However, computational constraints limit this model to small fabric areas. Deng et al.¹⁵ created a woven honeycomb model using Rhino/Grasshopper, which significantly enhanced surface texture realism. Yet, due to insufficient simulation efficiency, the approach had to rely on texture mapping for large textiles, such as sofas and clothing. Liu et al.¹⁶ and Xu et al.¹⁷ conducted similar research, even considering fiber-scale fabric simulation. Song et al.¹⁸ proposed a Grooved Ribbon Yarn model. They optimized large-scale coil scene computations using a minimal period replication method. This model improved speed by simplifying the yarn’s geometric structure. However, it did not address parallelization issues before fabric rendering, such as yarn trajectory and mesh generation, which are serial processes. As a result, the parallel computing power of GPUs was not fully utilized.¹⁹ A large body of existing GPU-accelerated cloth simulation research is primarily based on cloth-level abstractions, where fabrics are modeled as continuous surfaces to efficiently solve their dynamic physical behavior.²⁰,²¹ Such methods have achieved significant progress in modeling draping behavior and dynamic responses. Due to the limitations of this abstraction level, yarn-level structures are typically not explicitly represented. As a result, geometric preprocessing stages, including yarn trajectory generation and mesh construction, are still largely performed using serial computation, and relatively limited research has been conducted on parallel acceleration for these processes. Significant acceleration of dynamic simulation has been achieved through the massive parallelization of the solution process of the governing physical laws. In addition, several mature 3D geometric modeling tools and frameworks for textiles (such as TexGen²² and TexMind²³) have been recognized for their effectiveness in generating high-fidelity yarn and fabric structural models.²⁴ However, serial computation has typically been employed, and the parallelization potential of the geometric preprocessing stage has not been thoroughly explored.

In summary, existing 3D methods are feasible for fabrics with smaller pattern repeat units, but their efficiency falls short of industrial real-time requirements for large-patterned fabrics. To address this limitation, we propose a yarn-level parallel computing framework designed to overcome serial bottlenecks in yarn trajectory interpolation and mesh generation. The framework incorporates two key algorithms: the Spline Trajectory Parallel Construction Algorithm and the Mesh Parallel Generation Algorithm. The former eliminates sequential dependencies across yarns during trajectory interpolation by treating each yarn as an independent parallel unit. The latter removes serialization in the mesh generation stage by enabling parallel processing of yarn cross-sectional discretization and topology construction through dynamic GPU thread allocation. Implemented within a fabric 3D simulation system, the framework efficiently produces highly realistic virtual representations of large-patterned textiles.

Yarn-level parallel computing framework

Yarn-level fabric 3D simulation typically involves the looping or intertwining of yarns to form the fabric, with yarns as the basic unit. It can be divided into three stages: yarn geometry generation, buffer creation, and fabric rendering. While the rendering stage fully exploits the GPU’s highly parallel rasterization pipeline, the yarn geometry generation stage has traditionally been executed serially on the CPU due to strong sequential dependencies in trajectory interpolation and mesh construction. From a computational perspective, this serial paradigm fails to leverage the GPU’s massive parallel architecture, which is particularly efficient for linear algebra and vectorized operations when independent data units are available.²⁵ In large-patterned fabric scenarios, where thousands of yarns exhibit similar geometric operations but differ only in input parameters, this mismatch between algorithm structure and hardware architecture leads to a severe performance bottleneck. To address this challenge, a yarn-level parallel computing framework is proposed to restructure yarn geometry generation into GPU-friendly parallel workloads, composed of the Spline Trajectory Parallel Construction Algorithm and the Mesh Parallel Generation Algorithm.

In this framework, yarns serve as the fundamental units of parallelism. All parallel algorithms in this work assign one GPU thread to each yarn, allowing trajectory interpolation and mesh generation to be performed independently across yarns. Parallel speedup is obtained by processing multiple yarns concurrently, while the internal computations within each thread follow the same serial procedures as in the CPU-based implementation.

Spline trajectory parallel construction algorithm

During the generation of virtual yarn trajectories, the key control points of each yarn (e.g. interlacing points and contact points) are first determined based on the fabric structure and process parameters.²⁶ Subsequently, these control points are interpolated or fitted to obtain a smooth trajectory of the yarn centerline. Let P denote the set of all yarn control points, and n the total number of yarns. The subset P_i (0 ⩽ i < n) represents the control points of the i-th yarn. Let p denote the set of interpolated value points. Taking the cubic Cardinal spline as an example, the generation process of these shaping points can be expressed as follows:

p_{i}(t) = \begin{bmatrix} t^{3} & t^{2} & t & 1 \end{bmatrix} \begin{bmatrix} -s & 2 - s & s - 2 & s \\ 2s & s - 3 & 3 - 2s & -s \\ -s & 0 & s & 0 \\ 0 & 1 & 0 & 0 \end{bmatrix} \begin{bmatrix} P_{i, j - 1} \\ P_{i, j} \\ P_{i, j + 1} \\ P_{i, j + 2} \end{bmatrix}

(1)

In equation (1), t controls the interpolation progress and takes values in [0,1]. The points P_{i,j−1}, …, P_{i,j+2} (1 ⩽ j < m−2, where m is the number of control points of the i-th yarn) represent four consecutive control points. The parameter s can be expressed in terms of the interpolation parameter u (with a value range of [0,1]) as follows:

s = (1 - u) / 2

(2)

Figure 1 illustrates the generation process of discrete shaping points representing the yarn trajectory. Since the Cardinal spline is a piecewise cubic interpolation, each curve segment is generated from four consecutive control points. Therefore, two virtual endpoints, v_s and v_e, are added at the beginning and end of the control point sequence, respectively.

Figure 1.

Yarn trajectory spline interpolation process.

To ensure that the curve possesses a smooth tangent direction at both the starting and ending points, the construction of v_s can be expressed as follows:

v_{s} = 2 P_{i, 0} - P_{i, 1}

(3)

Similarly, the construction of v_e can be expressed as follows:

v_{e} = 2 P_{i, m - 1} - P_{i, m - 2}

(4)

Based on the above equations, the yarn trajectory generation process can be represented by Algorithm 1.

Algorithm 1 Serial Cardinal Spline Interpolation for Curve Fitting

Require: P (control points), u (tension), n (number of yarns), K (interpolation density)
Ensure: p (yarn trajectory)
1. for i ← 0 to n-1 do
2. m ← |P_i| ▷ Number of control points
3. Add virtual endpoints v_s(head) and v_e(tail) ▷ According to equations (3) and (4)
4. points ← {v_s} ⊕ P_i ⊕{v_e} ▷ Extended point array (size m + 2)
5. for j ← 0 to m-2 do ▷ Process each curve segment
6. Q_{i, j-1:j+2} ← points[j: j+4] ▷ Extract 4 consecutive control points; Q here is equivalent to P in equation (1)
7. for k ← 0 to K do ▷ Generate interpolated points
8. t ← k / K
9. Calculate s according to equation (2), then p_i(t) according to equation (1)
10. Append p_i(t) to p ▷ Store the interpolated point
11. end for
12. end for
13. end for
14. return p
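
The serial procedure above can be sketched in Python for a single yarn. This is only an illustrative sketch, not the authors' OpenCL/C# implementation: the function names `cardinal_point` and `interpolate_yarn` are hypothetical, and the sampling rule t = k / (K + 1) is an assumption chosen so that each yarn yields (m − 1) × K + m points, the per-yarn output size used for the offsets in Algorithm 2 (Algorithm 1 itself writes t ← k / K).

```python
from typing import List, Tuple

Point = Tuple[float, float, float]


def cardinal_point(q: List[Point], t: float, s: float) -> Point:
    """Evaluate equation (1) for four consecutive control points q at parameter t."""
    t2, t3 = t * t, t * t * t
    # Blending weights: [t^3 t^2 t 1] multiplied by the Cardinal matrix.
    w = (
        -s * t3 + 2 * s * t2 - s * t,
        (2 - s) * t3 + (s - 3) * t2 + 1,
        (s - 2) * t3 + (3 - 2 * s) * t2 + s * t,
        s * t3 - s * t2,
    )
    return tuple(sum(w[a] * q[a][c] for a in range(4)) for c in range(3))


def interpolate_yarn(ctrl: List[Point], u: float, K: int) -> List[Point]:
    """Serially interpolate one yarn (needs m >= 2); returns (m - 1) * K + m points."""
    m = len(ctrl)
    s = (1.0 - u) / 2.0  # equation (2)
    # Virtual endpoints from equations (3) and (4).
    v_s = tuple(2 * a - b for a, b in zip(ctrl[0], ctrl[1]))
    v_e = tuple(2 * a - b for a, b in zip(ctrl[m - 1], ctrl[m - 2]))
    points = [v_s] + list(ctrl) + [v_e]
    out = []
    for j in range(m - 1):  # one curve segment per pair of neighbouring control points
        q = points[j:j + 4]
        for k in range(K + 1):  # segment start plus K interior samples (assumed rule)
            out.append(cardinal_point(q, k / (K + 1), s))
    out.append(tuple(float(c) for c in ctrl[m - 1]))  # close the last segment
    return out
```

With u = 0 (s = 0.5) the curve is a Catmull-Rom spline, so evenly spaced collinear control points are expected to produce evenly spaced samples, which is a convenient sanity check.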

The interpolation process in Algorithm 1 is mainly influenced by three parameters: n, m, and K. If the time required for a single interpolation is denoted as ∆t_i, the total runtime T_i of Algorithm 1 can be expressed as follows, assuming that the number of control points m is identical for each yarn:

T_{i} = n \times (m - 1) \times (K + 1) \times Δ t_{i}

(5)

As shown in equation (5), the time complexity of Algorithm 1 is O(n × m × K). Taking woven fabrics as an example, and without applying instancing or duplication algorithms, if the minimum repeating unit of the fabric is replicated λ times along both the warp and weft directions, n increases by a factor of λ and, with the control point density kept constant, m also increases by a factor of λ because each yarn becomes λ times longer. Consequently, T_i grows quadratically with λ. As the fabric size increases, the execution time of the algorithm rises rapidly (for each doubling of the fabric size, T_i increases by approximately four times). For fabrics with small repeating units, the efficiency of the algorithm remains acceptable; however, it is not suitable for large-pattern fabrics. To address this limitation, the traditional serial spline interpolation algorithm was restructured using GPU acceleration, and a parallel yarn trajectory construction algorithm was proposed, as presented in Algorithm 2.

Algorithm 2 GPU-Parallel Cardinal Spline Interpolation (Yarn-Level Parallelism)

Require: P (control points), u (tension), n (number of yarns), K (interpolation density)
Ensure: p (yarn trajectory)
1. Precompute Ω^p [0..n] ▷ Output buffer offsets per yarn
2. for i ← 0 to n-1 do
3. m ← |P_i| ▷ Number of control points
4. Ω^p[i + 1] ← Ω^p[i] + (m-1) × K + m ▷ Calculate offset for yarn i+1, with Ω^p[0] = 0 as the starting point
5. end for
6. Allocate global output buffer p of size Ω^p[n]
7. (τ, Ω^τ) ← Flatten(P) ▷ τ: 1D array storing control points as (x,y,z) triplets; Ω^τ[i]: starting control-point index of yarn i in τ
8. Launch GPU kernel with n threads ▷ One thread per yarn
9. function Kernel (τ, Ω^τ, Ω^p, n, K, u, p)
10. i ← parallel_id ▷ Unique yarn identifier in [0, n-1]
11. if i ⩾ n then return
12. end if
13. startIdx ← Ω^τ[i] ▷ Start control-point index of yarn i in τ
14. m ←Ω^τ[i+1] - Ω^τ[i] ▷ Number of control points of yarn i
15. base ← 3 × startIdx ▷ Float index in τ
16. for j ← 0 to m-1 do
17. points[j] ← (τ[base + 3j], τ[base + 3j + 1], τ[base + 3j + 2])
18. end for
19. Generate interpolation points for yarn i using Algorithm 1’s method
20. end function
21. return p
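
The host-side preparation in steps 1 to 7 and the per-thread indexing in steps 13 to 18 can be sketched as follows. This is a minimal Python sketch under stated assumptions rather than the authors' OpenCL code: the helper names `output_offsets`, `flatten`, and `read_yarn` are hypothetical, and Ω^τ is assumed to hold n + 1 entries so that the kernel can compute m as Ω^τ[i+1] − Ω^τ[i].

```python
from itertools import accumulate
from typing import List, Sequence, Tuple

Point = Tuple[float, float, float]


def output_offsets(counts: Sequence[int], K: int) -> List[int]:
    """Omega^p: exclusive prefix sum of per-yarn output sizes (m - 1) * K + m."""
    sizes = [(m - 1) * K + m for m in counts]
    return [0] + list(accumulate(sizes))


def flatten(P: Sequence[Sequence[Point]]) -> Tuple[List[float], List[int]]:
    """Return (tau, Omega^tau): flat xyz floats and per-yarn start indices."""
    tau: List[float] = []
    omega_tau = [0]
    for yarn in P:
        for x, y, z in yarn:
            tau.extend((x, y, z))
        omega_tau.append(omega_tau[-1] + len(yarn))
    return tau, omega_tau


def read_yarn(tau: Sequence[float], omega_tau: Sequence[int], i: int) -> List[Point]:
    """What thread i does in steps 13-18: rebuild its own control points."""
    start = omega_tau[i]
    m = omega_tau[i + 1] - start
    base = 3 * start  # float index of the yarn's first coordinate in tau
    return [(tau[base + 3 * j], tau[base + 3 * j + 1], tau[base + 3 * j + 2])
            for j in range(m)]
```

Because every thread reads from and writes to index ranges fixed by the two offset arrays, no synchronization between yarns is needed in this layout.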

In Algorithm 2, Ω^p denotes the offset array that records the starting position of each yarn trajectory in the global output buffer p. Similarly, Ω^τ stores the starting control-point index of each yarn in the flattened control-point array τ. The array τ is a one-dimensional array, in which P is flattened and stored prior to the execution of the parallel algorithm to facilitate efficient GPU access. Within the kernel, each thread processes one yarn independently. Given the starting control-point index startIdx = Ω^τ[i] and the number of control points m, the corresponding 3D control points are explicitly reconstructed from τ using a simple indexed loop, as shown in Algorithm 2. Let the time for a single interpolation in Algorithm 2 be denoted as ∆t_i′, then the total runtime T_i′ of Algorithm 2 can be expressed as follows:

T_{i}' = (m - 1) \times (K + 1) \times \Delta t_{i}'

(6)

In the parallel execution phase, each logical processing unit is assigned a unique parallel_id in the range [0, n−1], corresponding to the index of the yarn it processes. The key difference between this algorithm and Algorithm 1 lies in replacing the original serial, yarn-by-yarn trajectory fitting process with the parallel generation of n yarn trajectories.

Mesh parallel generation algorithm

After fitting the yarn centerline trajectory using splines, it is still necessary to generate yarn mesh data—including vertex coordinates, normal vectors, and topological information for subsequent rendering. This process is illustrated in Figure 2.

Figure 2.

Yarn mesh generation and rendering flowchart.

As shown in Figure 2, the core of the yarn mesh data generation process lies in the computation of vertex geometric coordinates. In this study, the yarn is modeled as a tubular structure with a circular cross-section. The generation of cross-sectional circumference data is computed according to the spatial circle parametric equation¹⁷:

\begin{cases} x = c_{x} + r \frac{B}{\sqrt{A^{2} + B^{2}}} \cos\theta + r \frac{A C}{\sqrt{A^{2} + B^{2}} \sqrt{A^{2} + B^{2} + C^{2}}} \sin\theta \\ y = c_{y} - r \frac{A}{\sqrt{A^{2} + B^{2}}} \cos\theta + r \frac{B C}{\sqrt{A^{2} + B^{2}} \sqrt{A^{2} + B^{2} + C^{2}}} \sin\theta \\ z = c_{z} - r \frac{\sqrt{A^{2} + B^{2}}}{\sqrt{A^{2} + B^{2} + C^{2}}} \sin\theta \end{cases} \quad 0 \leq \theta \leq 2\pi

(7)

In equation (7), (c_x, c_y, c_z) represents the coordinate of the center of the circle, (A,B,C) is the normal vector of the spatial circle, r is the radius of the circle, and θ is the central angle. The traditional yarn mesh generation process can thus be represented by Algorithm 3.

Algorithm 3 Serial Yarn Mesh Construction

Require: p (yarn trajectory), r (section radius), n_c (number of cross-sectional circle subdivisions)
Ensure: M (mesh data)
1. for i ← 0 to n-1 do
2. χ← |p_i| ▷ |p_i|=(m-1) × K +m
3. for j ← 0 to χ-1 do
4. if j =χ-1 then ▷ Last point
5. (A, B, C) ← p_i[j] - p_i[j-1]
6. else ▷ Other points
7. (A, B, C) ← p_i[j +1] - p_i[j]
8. end if
9. for k ←0 to n_c-1 do
10. θ ← (2π / n_c) × k
11. Calculate C_k (the circumference point) according to equation (7)
12. N_k ← C_k - p_i[j] ▷ Computation of the vertex normal vector N_k
13. Append C_k and N_k to M_i ▷ Temporarily store vertex data
14. end for
15. end for
16. Append M_i to M ▷ Reorder M_i vertices for triangulation before merging to M
17. end for
18. return M
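
The innermost loop of Algorithm 3 (one cross-sectional ring of vertices and normals) can be illustrated with a short Python sketch of equation (7). The function name `ring` is hypothetical, and the explicit check for A = B = 0 is an addition of this sketch: equation (7) divides by √(A² + B²), so a tangent parallel to the z-axis would need separate handling that the text does not describe.

```python
import math
from typing import List, Tuple

Vec = Tuple[float, float, float]


def ring(center: Vec, normal: Vec, r: float, n_c: int) -> List[Tuple[Vec, Vec]]:
    """Return n_c (vertex, normal) pairs on the circle of equation (7)."""
    cx, cy, cz = center
    A, B, C = normal
    h = math.hypot(A, B)  # sqrt(A^2 + B^2); zero if the yarn runs along z
    if h == 0.0:
        raise ValueError("equation (7) is undefined when A = B = 0")
    d = math.sqrt(A * A + B * B + C * C)
    out = []
    for k in range(n_c):
        theta = 2.0 * math.pi * k / n_c  # evenly spaced central angles
        c, s = math.cos(theta), math.sin(theta)
        x = cx + r * (B / h) * c + r * (A * C / (h * d)) * s
        y = cy - r * (A / h) * c + r * (B * C / (h * d)) * s
        z = cz - r * (h / d) * s
        vertex = (x, y, z)
        out.append((vertex, (x - cx, y - cy, z - cz)))  # N_k = C_k - p_i[j]
    return out
```

Each returned normal has length r and is perpendicular to (A, B, C), which follows from the two in-plane basis vectors implied by equation (7) being orthonormal.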

Algorithm 3 is influenced by three parameters: n, χ (the number of discrete points along the trajectory), and n_c. The number of circumferential divisions n_c is a constant factor that does not increase with the problem size, and χ is obtained through interpolation from the number of control points m, essentially making it equivalent to m. Let the time for generating a single mesh vertex be denoted as ∆t_m. Then, referring to equation (5), the execution time of Algorithm 3 can be expressed as follows:

T_{m} = n \times χ \times n_{c} \times Δ t_{m}

(8)

Given that its input size is significantly larger than that of Algorithm 1, its performance bottleneck is expected to be more severe in the rendering of large-pattern fabrics. To address this, the algorithm was restructured using a parallelization approach, as presented in Algorithm 4. Similar to Algorithm 2, Algorithm 4 adopts a yarn-level parallelization strategy. During execution, one GPU thread is assigned to each yarn to construct its mesh independently. The input trajectory data are stored in a flattened array τ, and a corresponding offset array Ω^τ provides the starting index of each yarn’s trajectory. The mesh data generated by different threads are written to disjoint segments of the global buffer M, as determined by the precomputed offset array Ω^M, which eliminates write conflicts and enables efficient parallel execution. Let the time for generating a single mesh vertex in Algorithm 4 be denoted as ∆t_m′. Then, the total execution time T_m′ can be expressed as follows:

T_{m}' = χ \times n_{c} \times \Delta t_{m}'

(9)

To provide a comprehensive overview of the implementation strategy, Figure 3 illustrates the schematic comparison between the traditional serial pipeline and the proposed parallel framework.

Algorithm 4 GPU-Parallel Yarn Mesh Construction (Yarn-Level Parallelism)

Require: p (yarn trajectory), r (section radius), n (number of yarns), n_c (number of cross-sectional circle subdivisions)
Ensure: M (mesh data)
1. Precompute Ω^M [0..n] ▷ Start indices per yarn
2. for i ← 0 to n-1 do
3. χ ← |p_i| ▷ Number of discrete trajectory points in yarn i
4. Ω^M[i + 1] ← Ω^M[i] +χ× n_c ▷ Calculate offset for yarn i+1, with Ω^M [0] = 0 as the starting point
5. end for
6. (τ, Ω^τ) ← Flatten (p) ▷ τ: 1D array storing trajectory points as (x,y,z) triplets; Ω^τ[i]: starting point index of yarn i in τ
7. Launch GPU kernel with n threads ▷ One thread per yarn
8. function Kernel (τ, Ω^τ, Ω^M, n, r, n_c, M)
9. i ← parallel_id ▷ Yarn identifier [0, n-1]
10. if i ⩾ n then return
11. end if
12. startIdx ← Ω^τ[i] ▷ Start trajectory-point index of yarn i in τ
13. χ ←Ω^τ[i+1] - Ω^τ[i] ▷ Number of discrete trajectory points in yarn i
14. base ← 3 × startIdx ▷ Float index in τ
15. for j ← 0 to χ -1 do
16. points[j] ← (τ[base + 3j], τ[base + 3j + 1], τ[base + 3j + 2])
17. end for
18. Generate mesh data for yarn i using Algorithm 3’s method
19. end function
20. return M
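
The conflict-free write pattern of Algorithm 4 can be emulated on the CPU with a short Python sketch. It is illustrative only: `mesh_offsets`, `kernel`, and `run_all` are hypothetical names, the sequential loop stands in for the concurrent GPU launch, and each slot stores the yarn index as a placeholder instead of a real vertex, purely to make the ownership of each buffer segment visible.

```python
from typing import List, Sequence


def mesh_offsets(chi: Sequence[int], n_c: int) -> List[int]:
    """Omega^M: start index of each yarn's vertices in the global buffer M."""
    omega = [0]
    for c in chi:
        omega.append(omega[-1] + c * n_c)
    return omega


def kernel(i: int, omega_tau: Sequence[int], omega_m: Sequence[int],
           n_c: int, M: List[int]) -> None:
    """Stand-in for one GPU thread: fill only yarn i's slice of M."""
    chi = omega_tau[i + 1] - omega_tau[i]  # trajectory points of yarn i
    write = omega_m[i]
    for _ in range(chi * n_c):
        M[write] = i  # placeholder for a vertex built from equation (7)
        write += 1


def run_all(chi: Sequence[int], n_c: int) -> List[int]:
    """Build the offset arrays, allocate M, and run one 'thread' per yarn."""
    omega_tau = [0]
    for c in chi:
        omega_tau.append(omega_tau[-1] + c)
    omega_m = mesh_offsets(chi, n_c)
    M = [-1] * omega_m[-1]
    for i in range(len(chi)):  # on the GPU these iterations run concurrently
        kernel(i, omega_tau, omega_m, n_c, M)
    return M
```

Since thread i only touches M[Ω^M[i]] up to M[Ω^M[i+1] − 1], the order in which the threads run does not affect the result.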

Figure 3.

Schematic comparison of the simulation pipelines: (a) serial CPU pipeline and (b) parallel GPU pipeline.

As depicted in Figure 3, the serial approach (encompassing Algorithms 1 and 3) relies on a sequential CPU execution loop. Crucially, this architecture requires all yarn meshes to be fully computed on the host before a batch transfer of the entire geometric dataset to the GPU buffer can occur. In contrast, the parallel pipeline shown in Figure 3 (implementing Algorithms 2 and 4) leverages the GPU’s massive parallelism. By mapping independent yarns to concurrent compute threads, the proposed framework executes trajectory construction and mesh generation in parallel directly on the device, thereby eliminating the serial bottleneck and minimizing data transfer latency.

Results and discussion

Performance analysis of parallel algorithms

To evaluate the performance of the proposed method in practical applications, Algorithm 2 and Algorithm 4 were implemented based on the OpenCL heterogeneous computing framework using C#. To ensure reproducibility, a yarn-level parallelization strategy was adopted. Specifically, the GPU kernel execution was configured such that each compute thread is mapped to a single yarn, processing its associated control points and geometric primitives sequentially. This strategy aligns with the physical structure of the fabric and effectively reduces memory access conflicts between adjacent yarns. The performance of the algorithms was then assessed using woven fabric as a benchmark.

Performance comparison of parallel and serial algorithms

To quantify the acceleration performance of the parallel algorithms, five virtual woven fabric samples were selected as test samples. These samples have identical weave patterns but vary in size, as shown in Figure 4. In Figure 4, the parameter K for all virtual woven fabrics is set to 1, and n_c is set to 6. It is worth noting that the interpolation density K and the number of cross-sectional circle subdivisions n_c play critical roles in the simulation process. The parameter K determines the number of interpolated points between two adjacent control points. A larger K results in smoother yarn trajectories but significantly increases the computational cost. Considering that the density of control points in our model is already sufficient, we set K = 1 to maintain trajectory smoothness while avoiding unnecessary computational overhead. Similarly, n_c controls the discretization of the yarn cross-sectional circle. Increasing n_c improves the roundness of the yarn cross-section, but the number of mesh facets grows in proportion to n_c, which burdens memory and rendering performance. Through comparative tests, we found that n_c = 6 provides a reasonable trade-off between visual fidelity and computational efficiency, which is why this value is adopted in the simulations.

Figure 4.

Virtual woven fabrics in different sizes: (a) 163 × 150, (b) 489 × 450, (c) 815 × 750, (d) 1141 × 1050, and (e) 1630 × 1500.

Let the total number of control points for all yarns be denoted as m′, and the number of shaping points as χ′. The main parameters used in the simulation process are shown in Table 1.

Table 1.

Parameters for fabric rendering at different sizes.

Weave pattern dimensions	n	m′	χ′	Number of triangular facets
163 × 150	313	97,485	194,659	2,332,176
489 × 450	939	879,259	1,757,581	21,079,728
815 × 750	1565	2,443,433	4,885,303	58,604,880
1141 × 1050	2191	4,790,007	9,577,825	114,907,632
1630 × 1500	3130	9,776,868	19,550,608	234,569,760

Experiments were conducted on the fabric structures shown in Figure 4 using a device equipped with an Intel Core i7-8700K processor (clock speed of 3.70 GHz) and an NVIDIA GeForce RTX 5070 Ti GPU. The experimental results are shown in Table 2, where the execution times of each algorithm are presented in milliseconds.

Table 2.

Comparison of execution time between serial and parallel algorithms.

Weave pattern dimensions	Trajectory construction: Algorithm 1 (ms)	Trajectory construction: Algorithm 2 (ms)	Mesh generation: Algorithm 3 (ms)	Mesh generation: Algorithm 4 (ms)
163 × 150	131	492	777	453
489 × 450	791	364	6335	702
815 × 750	1632	434	13,544	1577
1141 × 1050	3133	614	29,126	2932
1630 × 1500	6436	907	61,108	6175

As shown in Table 2, the acceleration benefits of the parallel algorithm are not significantly observed in small-scale fabric patterns, such as 163 × 150. In some cases, the execution time is even higher than that of the serial algorithm. This phenomenon arises because, at small data scales, the fixed overhead of GPU parallel computation (including OpenCL context initialization, data transfer, and resource cleanup) outweighs the computational gains of the algorithm itself. Table 3 presents the fixed overhead of Algorithms 2 and 4 under different weave pattern dimensions. As shown, the execution time of context initialization varies little across scales. However, when the weave size is small, this fixed cost can dominate the total runtime, making the parallel algorithm less efficient than its serial counterpart.

Table 3.

Fixed overhead of parallel algorithms.

Weave pattern dimensions	Algorithm 2: context initialization	Algorithm 2: data transfer	Algorithm 2: others	Algorithm 4: context initialization	Algorithm 4: data transfer	Algorithm 4: others
163 × 150	285	38	86	276	62	71
489 × 450	261	41	30	302	97	128
815 × 750	292	61	54	286	235	227
1141 × 1050	299	164	78	299	986	526
1630 × 1500	284	372	101	312	1894	874

Additionally, the acceleration ratio of Algorithm 2 is slightly lower than that of Algorithm 4. This difference is attributed to the variation in input scales. The trajectory points χ processed by Algorithm 4 are generated by interpolating m control points from Algorithm 2. To objectively evaluate scalability and facilitate cross-platform comparison, computational throughput (primitives processed per second) is adopted as a normalized metric. For the largest dataset (1630 × 1500), Algorithm 4 achieves a peak throughput of approximately 37.99 million triangular facets per second. In contrast, the serial CPU implementation saturates at roughly 3.84 million facets per second. This nearly tenfold increase in normalized throughput demonstrates the effective utilization of GPU resources in high-load scenarios, confirming the superior scalability of the proposed method.
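
The speedup and throughput figures quoted above can be recomputed directly from Tables 1 and 2. The short Python script below is only a worked check of that arithmetic; the variable and function names are illustrative, and the numbers are copied from the tables.

```python
# Execution times in milliseconds, copied from Table 2.
sizes = ["163 x 150", "489 x 450", "815 x 750", "1141 x 1050", "1630 x 1500"]
alg1 = [131, 791, 1632, 3133, 6436]      # serial trajectory construction
alg2 = [492, 364, 434, 614, 907]         # parallel trajectory construction
alg3 = [777, 6335, 13544, 29126, 61108]  # serial mesh generation
alg4 = [453, 702, 1577, 2932, 6175]      # parallel mesh generation
facets_largest = 234_569_760             # Table 1, 1630 x 1500

# Speedup = serial time / parallel time for each fabric size.
traj_speedup = [a / b for a, b in zip(alg1, alg2)]
mesh_speedup = [a / b for a, b in zip(alg3, alg4)]


def mfacets_per_s(facets: int, ms: float) -> float:
    """Throughput in millions of triangular facets per second."""
    return facets / (ms / 1000.0) / 1e6


serial_tp = mfacets_per_s(facets_largest, alg3[-1])
parallel_tp = mfacets_per_s(facets_largest, alg4[-1])

for name, a, b in zip(sizes, traj_speedup, mesh_speedup):
    print(f"{name:>12}: trajectory x{a:.2f}, mesh x{b:.2f}")
print(f"throughput: serial {serial_tp:.2f} M/s, parallel {parallel_tp:.2f} M/s")
```

For the smallest fabric the trajectory speedup is below 1, which corresponds to the case discussed earlier in which the fixed GPU overhead outweighs the parallel gain.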

Although equations (6) and (9) theoretically offer an n-fold gain over equations (5) and (8), a single GPU thread is slower than a CPU core for tasks with higher computational complexity. Consequently, ∆t_i′ and ∆t_m′ are much larger than ∆t_i and ∆t_m, preventing the expected n-fold acceleration. The experiments show that, with adequate computational resources, the proposed parallel algorithm exhibits significant acceleration advantages as the data scale and algorithm complexity increase. The performance gain increases non-linearly with the problem scale.

Verification of the improvement effect of speed imbalance

To verify whether the parallel algorithm improves the speed imbalance between yarn geometry generation and subsequent stages in fabric simulation, tests were conducted. The running times for the three major simulation stages on different virtual woven fabrics were compared, as shown in Figure 5. Both serial and parallel algorithms are shown for the yarn geometry generation phase. The yarn-level parallel algorithm in the geometry generation phase requires complex mathematical operations such as spline fitting, matrix calculations, and iterative computations. Its computational efficiency remains lower than that of the highly optimized rendering pipeline. After targeted optimizations, the execution time of the parallel algorithm for yarn geometry generation has been significantly reduced. The optimized version achieves a speedup of up to tenfold in this phase.

Figure 5.

Comparison of execution times for different phases of the simulation process.

Moreover, the optimized parallel algorithm reduced the execution time of the yarn geometry generation phase, bringing it closer to the time required for the buffer creation phase. The magnitudes of both are now comparable, effectively alleviating the speed imbalance issue between different stages. In the traditional serial algorithm, the high computational complexity of the yarn geometry generation phase often results in significantly longer processing times compared to other stages, which in turn affects the overall efficiency of the fabric simulation process. However, through parallel optimization, the computational efficiency of the yarn geometry generation phase has been greatly improved. This has led to a more balanced time consumption across all simulation stages, providing more efficient computational support for fabric simulation.

Performance of parallel algorithms on different GPU hardware

To evaluate the performance of the parallel algorithms on different GPU hardware, four different GPUs were selected for testing. The tests were conducted on the various virtual woven fabrics shown in Figure 4. The results, which represent the sum of the execution times of Algorithms 2 and 4, are presented in Figure 6.

Figure 6.

Performance comparison of different GPUs.

As demonstrated in Figure 6, the parallel algorithm implemented via the OpenCL cross-platform framework can be efficiently executed across diverse GPU architectures. The computational advantages of parallelism are effectively leveraged on both AMD-based integrated graphics and NVIDIA-based discrete graphics, demonstrating broad applicability in heterogeneous computing environments. The observed performance differences across GPU architectures can be attributed to variations in hardware characteristics, including the number of processing cores, memory bandwidth, and overall parallel throughput. The proposed yarn-level parallel algorithms expose a large number of independent workloads across yarns. As a result, GPUs with higher core counts and greater memory bandwidth are able to benefit more significantly from this parallelism. In contrast, GPUs with fewer compute resources exhibit relatively smaller speed-ups, although acceleration over the CPU-based implementation is still consistently observed. These results indicate that the proposed framework scales with the available degree of GPU parallelism and is able to effectively exploit the underlying hardware capabilities.

For integrated GPUs, although computational capabilities are constrained compared to discrete graphics (particularly when processing large-scale datasets), significant acceleration is still achieved relative to serial CPU computation. Consequently, the proposed parallel algorithm exhibits robust hardware adaptability and is characterized as device-agnostic. This attribute enables widespread deployment across multiple hardware platforms, providing a feasible and efficient solution for 3D simulation technology in the textile domain.

Large-scale patterned fabric simulation

The proposed yarn-level parallel algorithm targets the yarn geometry generation phase, and its acceleration efficiency depends strongly on scale: performance gains grow with the dimensions of the fabric structure. Because the algorithm takes individual yarns as its unit of parallelization rather than a particular fabric construction, its applicability is independent of fabric type. To validate this generality, the parallel algorithm was applied to 3D structural simulations of both large-pattern woven fabrics and knitted fabrics. The resulting structural visualization is presented in Figure 7.

Figure 7.

Rendering effects of large-scale patterned fabrics: (a) woven fabric simulation, (b) weft-knitted fabric simulation, and (c) warp-knitted fabric simulation.

In this figure, (a) depicts a woven fabric in which the patterned weave segments (green) are 3/1 warp-faced twill weaves and the ground weave segments (white) are 1/3 weft-faced twill weaves. The complete fabric is composed of 2880 warp yarns and 1840 weft yarns, generating approximately 254.47 million triangular facets. (b) is a weft-knitted single jacquard fabric, knitted with knit stitches and float stitches. The whole fabric consists of 2880 yarns, resulting in 331.71 million facets. (c) shows a warp-knitted Jacquard fabric knitted with two guide bars. The ground structure of this fabric is a tricot stitch (JB1 (Jacquard guide bar): 1-0/1-2//; GB2: 1-2/13D-0//;), on which the Jacquard guide bar is offset to form a pattern effect (orange part). The whole fabric consists of 2520 yarns and produces the largest mesh model, with 381.71 million facets. With the proposed parallel algorithm, the total execution time of a complete 3D simulation is reduced to approximately 10 s, even for large-pattern woven and knitted fabrics with massive yarn counts. This demonstrates the algorithm's broad applicability across diverse fabric types and confirms its efficiency advantages in processing complex patterned textiles.
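The per-yarn workload implied by these figures can be checked with simple arithmetic. The short sketch below uses only the yarn counts and facet totals quoted above. The dictionary layout and function name are illustrative choices, and the averages are derived values rather than results reported in the experiments.

```python
# Yarn counts and facet totals as quoted for Figure 7(a)-(c).
fabrics = {
    "woven": {"yarns": 2880 + 1840, "facets": 254_470_000},
    "weft_knitted": {"yarns": 2880, "facets": 331_710_000},
    "warp_knitted": {"yarns": 2520, "facets": 381_710_000},
}


def facets_per_yarn(yarns, facets):
    """Average number of triangular facets generated for one yarn."""
    return facets / yarns


averages = {name: facets_per_yarn(**data) for name, data in fabrics.items()}

for name, value in averages.items():
    print(f"{name}: about {value:,.0f} facets per yarn")
```

The averages come to roughly 54 thousand facets per yarn for the woven fabric, 115 thousand for the weft-knitted fabric, and 151 thousand for the warp-knitted fabric, so each yarn represents a substantial independent workload.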

Conclusion

In this work, a GPU-based yarn-level parallel computing framework is presented, significantly enhancing computational efficiency for 3D simulations of large-patterned fabric structures. The framework introduces a yarn-decoupled approach and achieves yarn-level parallelization through two key algorithms: the Spline Trajectory Parallel Construction Algorithm and the Mesh Parallel Generation Algorithm. This design eliminates serial dependencies across yarns and enables concurrent trajectory interpolation and mesh generation. Experimental results show that the framework achieves up to a tenfold speedup in yarn geometry generation, reducing the total simulation time to approximately 10 s. By effectively overcoming the computational bottlenecks of the serial workflow, it provides a robust and scalable solution for large-scale, high-precision fabric simulation.

The proposed framework has direct application value in virtual prototyping and woven or knitted structure analysis, supporting rapid evaluation of complex textile patterns and accelerating the design-to-manufacturing pipeline. The current formulation assumes that yarns can be processed independently, which is applicable to scenarios where geometric preprocessing is performed under low deformation and yarn–yarn mechanical interactions or collisions are negligible. Consequently, we explicitly frame the proposed method as a geometric acceleration framework. While it does not yet address yarn–yarn collisions or penetrations, it serves as a critical foundation for future integration with physical or contact-based models. Extending the framework to incorporate such physical coupling effects will be an important direction for future work to further enhance the realism of large-patterned fabric simulations.
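The independence assumption stated above can be made concrete with a small sketch. The code below is not the OpenCL implementation and does not reproduce the Mesh Parallel Generation Algorithm. It swaps in a CPU thread pool and a deliberately simple tube mesher to show that each yarn's mesh depends only on that yarn's own centerline. All function names, the radius, and the number of sides are hypothetical, and the fixed helper axis used to build each cross-section frame is a simplification that may twist on strongly curved yarns.

```python
import math
from concurrent.futures import ThreadPoolExecutor


def _sub(a, b):
    return (a[0] - b[0], a[1] - b[1], a[2] - b[2])


def _cross(a, b):
    return (a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0])


def _unit(v):
    length = math.sqrt(v[0] ** 2 + v[1] ** 2 + v[2] ** 2)
    return (v[0] / length, v[1] / length, v[2] / length)


def tube_mesh(centerline, radius=0.1, sides=8):
    """Build a triangle tube around one yarn centerline (distinct 3D points)."""
    vertices, triangles = [], []
    last = len(centerline) - 1
    for i, p in enumerate(centerline):
        # Central-difference tangent, one-sided at the two ends.
        t = _unit(_sub(centerline[min(i + 1, last)], centerline[max(i - 1, 0)]))
        # Any helper axis not parallel to the tangent gives a usable frame.
        helper = (0.0, 0.0, 1.0) if abs(t[2]) < 0.9 else (1.0, 0.0, 0.0)
        n = _unit(_cross(t, helper))
        b = _cross(t, n)
        for k in range(sides):
            angle = 2.0 * math.pi * k / sides
            c, s = math.cos(angle), math.sin(angle)
            vertices.append(
                tuple(p[j] + radius * (c * n[j] + s * b[j]) for j in range(3)))
    # Two triangles per quad between consecutive cross-section rings.
    for i in range(last):
        for k in range(sides):
            v0 = i * sides + k
            v1 = i * sides + (k + 1) % sides
            triangles.append((v0, v1, v0 + sides))
            triangles.append((v1, v1 + sides, v0 + sides))
    return vertices, triangles


def generate_fabric_meshes(centerlines, workers=4):
    """Mesh every yarn independently; no yarn reads another yarn's data."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(tube_mesh, centerlines))


# Six hypothetical wavy yarns with 20 centerline points each.
yarns = [[(x * 0.5, float(y), 0.2 * math.sin(x + y)) for x in range(20)]
         for y in range(6)]
meshes = generate_fabric_meshes(yarns)
print(len(meshes), len(meshes[0][0]), len(meshes[0][1]))
```

Because no yarn reads another yarn's data, the pooled result is identical to a serial loop over the yarns. The thread pool here only illustrates the decoupling and is not expected to reproduce GPU-level speedups.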

Footnotes

Appendix

Declaration of conflicting interests

The authors declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.

Funding

The authors disclosed receipt of the following financial support for the research, authorship, and/or publication of this article: This work was supported by the Postgraduate Research & Practice Innovation Program of Jiangsu Province (grant number KYCX25_2689); the Wuxi Science and Technology Development Fund Project (grant number K20241032); the Fundamental Research Funds for the Central Universities (grant number JUSRP202501003).

ORCID iDs

Hui Xu

Haisang Liu

Data availability statement

The data that support the findings of this study are available from the corresponding author, upon reasonable request.


GPU-based 3D simulation of large-scale patterned fabric structures
