Massive Particles: Particle-based Simulations on Multiple GPUs

Takahiro Harada (Havok), [email protected]
Issei Masaie (Prometech Software), [email protected]
Seiichi Koshizuka (The University of Tokyo), [email protected]
Yoichiro Kawaguchi (The University of Tokyo), [email protected]

Figure 1: Screenshots from a real-time simulation using 500,000 particles.

1 Abstract

To the best of our knowledge, no previous study has used multiple GPUs for particle-based simulation, although several researchers have used a single GPU. In this study, a particle-based simulation is parallelized on multiple GPUs. Several challenges have to be overcome to accomplish this. The simulation must not contain serial computations that could become a bottleneck. Unlike methods with fixed connectivity, particle methods cannot assign a fixed set of data to each GPU: because particles move freely, the data has to be reassigned every iteration. It is also difficult for a particle-based simulation to scale with the number of GPUs because the overhead of the parallelization can be high. We overcame these hurdles by employing a peer-to-peer computation model among the GPUs rather than a server-client model; each GPU dynamically manages its own data without a server. A sliced grid was used to lower the traffic among GPUs. As a result, the simulation speed scales well with the number of GPUs, and the method makes it possible to simulate millions of particles in real time. The proposed method is effective not only for simulations on GPUs but also for those on CPUs. The contributions of this study also include a sorting technique that exploits the coherency between time steps, introduced to further increase the performance on a single GPU.

2 Methods

When a particle-based simulation is parallelized on multiple GPUs, it is not obvious how to manage the data across them. The simplest computation model would be a single processor that manages the data of all the GPUs. We did not employ this model because the managing processor can become a bottleneck of the parallelization. Instead, we employed a model in which each GPU manages its own data: the computational domain is divided into as many subdomains as there are GPUs, and each GPU is responsible for the particles in its subdomain. Compared to grid-based simulations, particle-based simulations have to manage the data of each GPU dynamically, because the simulation entities move freely through the computational domain. Two kinds of particles have to be sent to neighboring GPUs: particles that leave the subdomain of a GPU, and particles located at the edge of the subdomain. The latter are necessary to compute the physical values of particles near the subdomain boundary. These particles could be sent only when they are required, but such sends are inefficient because they involve a large number of transfers of small granularity. A sketch of how a GPU's particles are classified for an exchange is shown below.

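As a concrete illustration, the following is a minimal host-side sketch of how a GPU's particles might be classified before an exchange, assuming a one-dimensional domain decomposition along the x axis; the names (Particle, Subdomain, classifyForExchange) and the structure are illustrative assumptions, not an API from the paper. In the actual system the selection runs on the GPU and reuses the neighbor-search grid, as described below.

#include <vector>

struct Particle { float x, y, z; };

// Subdomain owned by one GPU: [xMin, xMax) along the split axis,
// with a margin equal to the particle interaction radius. Particles
// inside the margin are the "edge" (ghost) particles neighbors need.
struct Subdomain { float xMin, xMax, ghostWidth; };

struct ExchangeSets {
    std::vector<Particle> migrateLeft, migrateRight; // left the subdomain
    std::vector<Particle> ghostLeft,   ghostRight;   // stay, but copies go out
};

// Classify particles into the two kinds that must be sent to neighbors:
// (1) particles that left the subdomain, (2) particles near its edges.
// Both sets are packed into buffers and sent once per iteration.
ExchangeSets classifyForExchange(const std::vector<Particle>& particles,
                                 const Subdomain& sd)
{
    ExchangeSets out;
    for (const Particle& p : particles) {
        if (p.x < sd.xMin)       out.migrateLeft.push_back(p);
        else if (p.x >= sd.xMax) out.migrateRight.push_back(p);
        else {
            if (p.x <  sd.xMin + sd.ghostWidth) out.ghostLeft.push_back(p);
            if (p.x >= sd.xMax - sd.ghostWidth) out.ghostRight.push_back(p);
        }
    }
    return out;
}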

Table 1: Comparison of calculation times (in milliseconds). "Sim" is the simulation time alone; "Total" also includes the communication between GPUs.

                 1 GPU            2 GPUs           4 GPUs
  N of P      Sim    Total     Sim    Total     Sim    Total
  200K      17.57    17.57   10.56    11.31    6.22     7.06
  400K      33.53    33.53   17.23    18.44    9.68    13.97
  600K      53.82    53.82   24.98    27.09   13.89    18.31
  800K      71.96    71.96   31.17    37.75   19.99    24.49
  1M        93.76    93.76   44.45    46.59   23.19    30.69

Thus, ghost regions were introduced to improve the efficiency of the transfer. With ghost regions, the particles to be sent are packed into a buffer and sent once per iteration. Care has to be taken in how these particles are selected, because the selection itself is an overhead of the parallelization. We reused the grid that is constructed to accelerate the neighboring-particle search to select the data to be sent dynamically. A sliced grid was introduced to make the neighboring-particle search efficient [Harada et al. 2007]. The sliced grid improves not only the memory efficiency but also the computational efficiency by increasing the locality of the data stored in the grid. It is also effective for our problem because it reduces the number of empty voxels, which results in less data being transferred between GPUs; a sketch of the structure follows. This paper also presents a block transition sort, which exploits the coherency between frames and is therefore suited to almost-sorted lists. The block transition sort was applied to the particle-based simulation to further improve the computation speed by increasing the locality of the data.
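The sliced grid itself is detailed in [Harada et al. 2007]; the following 2D sketch conveys the idea under our reading of that paper: each slice along one axis stores only the occupied cell range along the other axis, so empty voxels outside that range are never allocated. All names and layout choices here are illustrative assumptions, and coordinates are assumed nonnegative.

#include <algorithm>
#include <cstdint>
#include <limits>
#include <vector>

struct P2 { float x, y; };

// A 2D sliced grid: instead of a full bounding-box grid, each x slice
// keeps only the y range actually occupied by particles. This saves
// memory and shrinks the per-iteration data sent between GPUs.
struct SlicedGrid {
    float cellSize;
    float xOrigin;
    std::vector<int>      yMinCell;   // per slice: lowest occupied y cell
    std::vector<int>      yCount;     // per slice: number of y cells kept
    std::vector<uint32_t> sliceStart; // prefix sum: first cell id per slice
};

SlicedGrid buildSlicedGrid(const std::vector<P2>& pts, float cellSize,
                           float xOrigin, int numSlices)
{
    SlicedGrid g{cellSize, xOrigin,
                 std::vector<int>(numSlices, std::numeric_limits<int>::max()),
                 std::vector<int>(numSlices, 0),
                 std::vector<uint32_t>(numSlices + 1, 0)};
    std::vector<int> yMax(numSlices, std::numeric_limits<int>::min());

    // Pass 1: find the occupied y range of every slice.
    for (const P2& p : pts) {
        int sx = static_cast<int>((p.x - xOrigin) / cellSize);
        int cy = static_cast<int>(p.y / cellSize);
        if (sx < 0 || sx >= numSlices) continue;
        g.yMinCell[sx] = std::min(g.yMinCell[sx], cy);
        yMax[sx]       = std::max(yMax[sx], cy);
    }
    // Pass 2: prefix sum so cell ids are packed tightly over occupied cells.
    for (int s = 0; s < numSlices; ++s) {
        g.yCount[s] = (yMax[s] >= g.yMinCell[s]) ? (yMax[s] - g.yMinCell[s] + 1) : 0;
        g.sliceStart[s + 1] = g.sliceStart[s] + static_cast<uint32_t>(g.yCount[s]);
    }
    return g;
}

// Packed cell id of a particle; assumes the particle lies inside the grid.
uint32_t cellId(const SlicedGrid& g, const P2& p)
{
    int sx = static_cast<int>((p.x - g.xOrigin) / g.cellSize);
    int cy = static_cast<int>(p.y / g.cellSize);
    return g.sliceStart[sx] + static_cast<uint32_t>(cy - g.yMinCell[sx]);
}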

3 Results

The program was written in C++ and CUDA [NVIDIA] on a PC equipped with a GeForce 8800 GT and a Tesla S870 (four C870 GPUs). Fig. 1 shows screenshots from a real-time simulation using 500,000 particles. The particles were simulated on the four Tesla GPUs while the remaining GPU was used for rendering. The host invoked five CPU threads, each controlling one GPU. An iteration of the simulation took about 30 ms. Several scenes were simulated on 1, 2, and 4 GPUs, and the measured times are shown in Table 1. The simulation times scale well with the number of GPUs. However, the efficiency decreases slightly in the total computation time, which also includes the communication between GPUs, even though the amount of data sent is optimized in our method. The cause is the data transfer itself: with 2 GPUs, each GPU sends data to only one neighbor, whereas with 4 GPUs, two of the four GPUs have to send data to two neighbors, which results in a longer transfer time than in the 2-GPU case.
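For reference, the host-side threading described above might look like the following sketch using the CUDA runtime API; stepSimulation is a hypothetical placeholder for the per-GPU kernel launches and neighbor exchanges, not code from the paper.

#include <thread>
#include <vector>
#include <cuda_runtime.h>

// Hypothetical per-GPU work: launch the simulation kernels for this
// device's subdomain, then exchange migrating and ghost particles.
void stepSimulation(int device, int numSteps)
{
    cudaSetDevice(device); // bind this host thread to one GPU
    for (int i = 0; i < numSteps; ++i) {
        // ... kernel launches and neighbor exchange for this subdomain ...
    }
}

int main()
{
    int deviceCount = 0;
    cudaGetDeviceCount(&deviceCount);

    // One host thread per simulation GPU, as in the text; a fifth thread
    // (not shown) could drive a separate rendering GPU.
    std::vector<std::thread> workers;
    for (int d = 0; d < deviceCount; ++d)
        workers.emplace_back(stepSimulation, d, 100);
    for (std::thread& t : workers) t.join();
    return 0;
}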

References

HARADA, T., KOSHIZUKA, S., AND KAWAGUCHI, Y. 2007. Sliced data structure for particle-based simulations on GPUs. In Proc. of GRAPHITE, 55–62.

NVIDIA. Compute Unified Device Architecture. http://www.nvidia.com/object/cuda_home.html
