• gpuSPHASE

  • Referenced in 6 articles [sw22289]
  • memory caching implementation for 2D SPH using CUDA. Smoothed particle hydrodynamics (SPH) is a meshless ... graphics devices, is developed. It is optimized for simulations that must be executed with thousands ... caching algorithm for Compute Unified Device Architecture (CUDA) shared memory is proposed and implemented...
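The shared-memory caching idea named in this entry — staging reusable particle data in CUDA shared memory so neighbouring threads do not re-read it from global memory — can be illustrated with a minimal sketch. The tile size, float2 layout, and kernel below are illustrative assumptions, not the gpuSPHASE implementation.

```cuda
// Minimal sketch of caching particle positions in CUDA shared memory.
// Launch with blockDim.x == TILE. Illustrative only, not gpuSPHASE code.
#define TILE 128

__global__ void density_sketch(const float2 *pos, float *rho, int n, float h2)
{
    __shared__ float2 cache[TILE];              // shared-memory tile of positions
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    float2 pi = (i < n) ? pos[i] : make_float2(0.f, 0.f);
    float sum = 0.f;

    for (int base = 0; base < n; base += TILE) {
        int j = base + threadIdx.x;
        cache[threadIdx.x] = (j < n) ? pos[j] : make_float2(0.f, 0.f);
        __syncthreads();                        // wait until the tile is loaded

        for (int k = 0; k < TILE && base + k < n; ++k) {
            float dx = pi.x - cache[k].x;
            float dy = pi.y - cache[k].y;
            float r2 = dx * dx + dy * dy;
            if (r2 < h2) sum += h2 - r2;        // placeholder smoothing-kernel weight
        }
        __syncthreads();                        // before the tile is overwritten
    }
    if (i < n) rho[i] = sum;
}
```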
  • CCMpred

  • Referenced in 1 article [sw17014]
  • introduce CCMpred, our performance-optimized PLM implementation in C and CUDA C. Using graphics cards...
  • RAPIDS

  • Referenced in 1 article [sw35094]
  • experience. RAPIDS utilizes NVIDIA CUDA® primitives for low-level compute optimization, and exposes GPU parallelism...
  • VLI

  • Referenced in 1 article [sw33112]
  • coefficients are provided. The kernels were manually optimized in assembly language ... NVIDIA CUDA with inline PTX assembly. This way we make optimal use of today...
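The inline-PTX technique mentioned in this entry embeds PTX instructions inside CUDA C to reach features, such as the carry flag, that plain C/C++ cannot express. The sketch below shows a 128-bit addition built from two 64-bit limbs; the function name and limb layout are assumptions, not VLI's API.

```cuda
// CUDA C with inline PTX: add two 128-bit integers stored as two 64-bit
// limbs each, using the hardware carry chain (add.cc / addc).
// Illustrative sketch, not VLI's actual kernels.
__device__ void add128(unsigned long long a_lo, unsigned long long a_hi,
                       unsigned long long b_lo, unsigned long long b_hi,
                       unsigned long long *r_lo, unsigned long long *r_hi)
{
    unsigned long long lo, hi;
    asm("add.cc.u64  %0, %2, %3;\n\t"   // low limb, sets the carry flag
        "addc.u64    %1, %4, %5;\n\t"   // high limb, consumes the carry
        : "=l"(lo), "=l"(hi)
        : "l"(a_lo), "l"(b_lo), "l"(a_hi), "l"(b_hi));
    *r_lo = lo;
    *r_hi = hi;
}
```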
  • Spinsim

  • Referenced in 1 article [sw41824]
  • Spinsim: a GPU-optimized Python package for simulating spin-half and spin-one quantum systems ... NVIDIA CUDA-compatible systems using GPU parallelization. Along with other optimizations, this allows for speed...
  • Ocelot

  • Referenced in 3 articles [sw09713]
  • Ocelot: a dynamic optimization framework for bulk-synchronous applications in heterogeneous systems. Ocelot ... data parallel execution model used by NVIDIA CUDA applications onto diverse multithreaded platforms. Ocelot includes ... against over 130 applications taken from the CUDA SDK, the UIUC Parboil benchmarks ... explicitly parallel applications and traditional dynamic compiler optimizations are revisited for this new class...
  • TTC

  • Referenced in 6 articles [sw15828]
  • source domain-specific parallel compiler. TTC generates optimized parallel C++/CUDA C code that achieves ... Knights Corner as well as different CUDA-based GPUs such as NVIDIA’s Kepler...
  • CAMPARY

  • Referenced in 5 articles [sw15156]
  • precision floating-point arithmetic library using the CUDA programming language for the NVidia GPU platform ... offers the simplicity of using hardware highly optimized floating-point operations, while also allowing...
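A common building block for extending precision on top of hardware floating point, as described in this entry, is the error-free transformation: the rounded result and its exact rounding error are both recovered from a few native operations. The device functions below are a generic sketch of TwoSum and an FMA-based TwoProd, not CAMPARY's actual primitives or API.

```cuda
// Error-free transformations: hardware result plus an exactly
// representable error term. Illustrative sketch, not CAMPARY's API.
__device__ void two_sum(double a, double b, double *s, double *e)
{
    *s = a + b;                        // rounded sum
    double bv = *s - a;
    *e = (a - (*s - bv)) + (b - bv);   // exact rounding error (Knuth's TwoSum)
}

__device__ void two_prod(double a, double b, double *p, double *e)
{
    *p = a * b;                        // rounded product
    *e = fma(a, b, -*p);               // exact error via fused multiply-add
}
```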
  • DeepTCR

  • Referenced in 1 article [sw34134]
  • optimal training times, we suggest training these algorithms on GPUs (requiring CUDA, cuDNN...
  • Gunrock

  • Referenced in 3 articles [sw27063]
  • Gunrock is a CUDA library for graph processing designed specifically for the GPU. It uses ... coupling high performance GPU computing primitives and optimization strategies with a high-level programming model...
  • DiffTaichi

  • Referenced in 1 article [sw40893]
  • language in gradient-based learning and optimization tasks on 10 different physical simulators. For example ... shorter than the hand-engineered CUDA version yet runs as fast, and is 188x faster ... differentiable programs, neural network controllers are typically optimized within only tens of iterations...
  • GOMC

  • Referenced in 2 articles [sw27534]
  • physical properties of complex fluids. GPU Optimized Monte Carlo (GOMC) is open-source software ... oriented C++, and uses OpenMP and NVIDIA CUDA to allow for execution on multi-core...
  • LightSeq

  • Referenced in 1 article [sw35775]
  • library is built on top of the official CUDA libraries (cuBLAS, Thrust, CUB) and custom kernel ... functions which are specially fused and optimized for these widely used models. In addition...
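Kernel fusion of the kind described in this entry collapses several element-wise steps into a single launch so intermediate tensors never round-trip through global memory. The kernel below, fusing a bias add with a tanh-approximation GELU, is an illustrative sketch and not taken from LightSeq.

```cuda
// Fused bias-add + GELU in one pass over the activations, so the
// intermediate (x + bias) tensor is never written to global memory.
// Illustrative sketch only, not a LightSeq kernel.
__global__ void fused_bias_gelu(float *x, const float *bias, int rows, int cols)
{
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx >= rows * cols) return;
    float v = x[idx] + bias[idx % cols];               // bias broadcast per column
    float c = 0.7978845608028654f;                     // sqrt(2/pi)
    float t = tanhf(c * (v + 0.044715f * v * v * v));  // tanh-based GELU approximation
    x[idx] = 0.5f * v * (1.0f + t);
}
```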
  • Cumapz

  • Referenced in 1 article [sw09714]
  • compare the memory performance of a CUDA program. CuMAPz can help programmers explore different ways ... using shared and global memories, and optimize their program for memory behavior. CuMAPz models several...
  • NnmfPack

  • Referenced in 1 article [sw32956]
  • NnmfPack was designed for Linux environments, and optimized for shared memory architectures, including standard multicore ... architecture) and NVIDIA Fermi/Kepler architectures. OpenMP, CUDA and Intel’s tools have been used...
  • RETURNN

  • Referenced in 2 articles [sw26580]
  • modern recurrent neural network architectures. It is optimized for fast and reliable training of recurrent ... recurrent neural networks including our own fast CUDA kernel; Multidimensional LSTM (GPU only, there...
  • somoclu

  • Referenced in 2 articles [sw23271]
  • also able to boost training by using CUDA if graphics processing units are available ... from fast execution, memory use is highly optimized, enabling training large emergent maps even...
  • SAGE

  • Referenced in 2 articles [sw29919]
  • compiler that automatically generates a set of CUDA kernels with varying levels of approximation with ... user. The SAGE compiler employs three optimization techniques to generate approximate kernels that exploit...
  • CUDACLAW

  • Referenced in 1 article [sw39313]
  • PDEs to be solved via a CUDA-independent serial Riemann solver and the framework takes ... different directions, and includes a number of optimizations to minimize access to global memory...
  • cuTT

  • Referenced in 1 article [sw19961]
  • CUDA Compatible GPUs. We introduce the CUDA Tensor Transpose (cuTT) library that implements high-performance ... high performance by (a) utilizing two GPU-optimized transpose algorithms that both use a shared...
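The shared-memory transpose strategy mentioned in this entry typically stages a tile in shared memory, padded to avoid bank conflicts, so that both the global-memory read and write remain coalesced. The 2D kernel below is a generic sketch of that pattern; tile size and kernel are illustrative assumptions, not cuTT's tensor-transpose algorithms.

```cuda
#define TILE_DIM 32
#define BLOCK_ROWS 8

// Tiled matrix transpose: coalesced reads and writes through a padded
// shared-memory tile. Launch with blockDim = (TILE_DIM, BLOCK_ROWS).
// Illustrative sketch, not cuTT's kernels.
__global__ void transpose_sketch(float *out, const float *in, int width, int height)
{
    __shared__ float tile[TILE_DIM][TILE_DIM + 1];   // +1 column avoids bank conflicts

    int x = blockIdx.x * TILE_DIM + threadIdx.x;
    int y = blockIdx.y * TILE_DIM + threadIdx.y;
    for (int j = 0; j < TILE_DIM; j += BLOCK_ROWS)
        if (x < width && y + j < height)
            tile[threadIdx.y + j][threadIdx.x] = in[(y + j) * width + x];
    __syncthreads();

    x = blockIdx.y * TILE_DIM + threadIdx.x;         // swap block offsets for the output
    y = blockIdx.x * TILE_DIM + threadIdx.y;
    for (int j = 0; j < TILE_DIM; j += BLOCK_ROWS)
        if (x < height && y + j < width)
            out[(y + j) * height + x] = tile[threadIdx.x][threadIdx.y + j];
}
```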