Papers

(30) A. R. Brodtkorb. Solving Hyperbolic Conservation Laws on GPUs, in preparation, preprint available on request, 2023.

Abstract: GPUs have gone from something esoteric to mainstream within scientific computing over the last decade. The aim of this paper is to demonstrate that development of solvers for hyperbolic conservation laws on GPUs can be simple, easy, and efficient using modern development techniques. This paper starts with a short introduction to hyperbolic conservation laws, and techniques for computing solutions to such equations. Then, we show how these techniques can be implemented on a GPU with relative ease while keeping performance high. The source code is released under an open source license, with the hope that it can serve as a tutorial or building block for other researchers to build more advanced high-performance solvers.

(29) A. R. Brodtkorb, A. Benedictow, H. Klein, A. Kylling, A. Nyiri, A. Valdebenito, E. Sollum. Estimating volcanic ash emissions using retrieved satellite ash columns and inverse ash transport modelling using VolcanicAshInversion v1.2.1, within the operational eEMEP volcanic plume forecasting system (version rv4_17), 2023.
EGU preprint | Source code (Zenodo) | Satellite data (Zenodo) | Forward simulations (Zenodo) | Presentation

Abstract: This paper describes the inversion procedure being used operationally at the Norwegian Meteorological Institute for estimating ash emission rates from satellite images. The overall procedure consists of five stages: (1) run emission scenario simulations; (2) generate a priori emission estimates; (3) match observations with emission simulations; (4) build system of linear equations; and (5) solve overdetermined system. We go through the mathematical foundations for the inversion procedure, performance for synthetic cases, and performance for real-world cases. The novelties of this paper include pruning of the linear system of equations used in the inversion and inclusion of observations of ash cloud height from the SLSTR instrument. The source code used in this work is freely available under an open source license, and can be used for other similar applications.

(28) A. R. Brodtkorb, M. L. Sætra. Simulating the Euler equations on multiple GPUs using Python, Frontiers in Physics, DOI [Draft (PDF)], 2022.

Abstract: GPUs have become a household name in high performance computing (HPC) systems over the last 15 years. However, programming GPUs is still largely a manual and arduous task, which requires expert knowledge on the physics, mathematics, and computer science involved. Even though there have been large advances in automatic parallelization and GPU execution of serial code, it is still difficult to fully utilize the GPU hardware with such approaches. Many core numeric GPU codes are therefore still written using C++ or Fortran for the host code.

It is a coming challenge that the C++ and Fortran literacy amongst graduates is falling, and there is substantial research that shows that using high-level languages such as Python increases productivity and lowers mistakes and bugs in the source code.

We have developed a simulation software based on PyOpenCL and mpi4py in Python for solving the Euler equations on Cartesian grids. Our framework utilizes the GPU, and can automatically run on clusters using MPI as well as on shared memory architectures. The framework furthermore allows the programmer to implement low-level details in CUDA, which is important to achieve peak performance, whilst still benefiting from the productivity of Python. We show that we are able to achieve perfect strong and weak scaling on both shared and distributed memory GPU clusters.

(27) A. R. Brodtkorb, H. H. Holm. Coastal Ocean Forecasting on the GPU using a Two-Dimensional Finite-Volume Scheme, Tellus A: Dynamic Meteorology & Oceanography, arXiv preprint, 2021.

Abstract: Flexible simulation of the ocean is a challenging task that requires accurate numerical schemes that capture the relevant physical properties and run efficiently without the need for large supercomputers. In this work, we take a modern high-resolution finite-volume scheme for solving the rotational shallow-water equations, and extend it with features required to run real-world ocean simulations efficiently on the GPU. The chosen scheme, first proposed by Chertock et al. (2017), is well-balanced with respect to the so-called geostrophic balance, and our contributions include an efficient reformulation that makes it well-suited for massively parallel implementations, varying north vector required for large scale domains, dynamic dry states combined with a static land mask, bottom shear stress, and wind forcing. Further, we build a simulation framework that uses real-world data (land mask, bathymetry, wind stress, and initial and external boundary conditions) directly from three-dimensional operational ocean current forecasts that are made publicly available by the Norwegian Meteorological Institute. We show numerical experiments using three different real-world domains, and show that our simulation framework provides tidal predictions almost as well as the operational model in the available time range. We also demonstrate numerical convergence using a non-trivial case and show appropriate weak scaling of the computational performance of our GPU implementation.

(26) H. Nilsen, A. Moncorge, K. Bao, O. Møyner, K.-A. Lie and A. R. Brodtkorb. Comparison Between Algebraic Multigrid and Multilevel Multiscale Methods for Reservoir Simulation, Proceedings ECMOR XVII, Sep 2020, Volume 2020, p.1 - 17.
Article

Abstract: Multiscale methods for solving strongly heterogeneous systems in reservoirs have a long history from the early ideas used on incompressible flow to the newly released version in commercial simulation. Much effort has been put into making the MsFV method work for fully unstructured multiphase problems. The MsRSB version is a newly developed version, which tackles most of the "real" world problems. It is, to our knowledge, the only multiscale method that has been released in a commercial simulator. You can alternatively see the method as a variant of smoothed aggregation or as an iterative approach to AMG with energy minimizing basis functions. This will be discussed in detail.

So far, most work on comparing MsRSB with AMG methods has been on qualitative performance measures like iteration number rather than on pure runtime on fair code implementation. We discuss the theoretical performance and show the practical performance for our implementation. Here, we compare performance of pure AMG, standard two-level MsRSB with pure AMG as coarse solver, as well as a new truly multilevel MsRSB scheme. Our implementation uses the DUNE-ISTL framework. To limit the scope of the discussion we restrict our assessment to AMG with aggregation and smoothed aggregation and the MsRSB method. These three methods are closely related and are primarily distinguished in a preconditioner setting by the coarsening factors used, and the degree of smoothing applied to the basis. We also compare with other state-of-the-art AMG implementations, but do not investigate combinations of them with the MsRSB method. For the MsRSB method, we also discuss practical considerations in different parallelization regimes including domain decomposition using MPI, shared memory using OpenMP, and GPU acceleration with CUDA.

All comparisons will focus on the setting in which many similar systems should be solved, e.g. during a large-scale, multiphase flow simulation. That is, our emphasis is on the performance of updating a preconditioner and on the apply time for the preconditioner relative to the convergence rate. Performance of the solvers will be tested for pure parabolic/elliptic problems that either arise as part of a sequential splitting procedure or as a pseudo-elliptic preconditioner/solver as a part of a CPR preconditioner for a multiphase system, for which block ILU0 is used as the outer smoother.

(25) H. H. Holm, M. L. Sætra, A. R. Brodtkorb. Data Assimilation for Ocean Drift Trajectories Using Massive Ensembles and GPUs, Finite Volumes for Complex Applications IX, 2020. [Draft (PDF)].

Abstract: In this work, we perform fully non-linear data assimilation of ocean drift trajectories using a cluster of GPUs. We use an ensemble of over 1000 members and the sequential importance resampling algorithm to assimilate observations of drift trajectories into the physical variables of the underlying shallow water simulation model. Our results show an improved drift trajectory forecast using data assimilation for a complex and realistic simulation scenario, and the implementation exhibits good weak and strong scaling.
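The sequential importance resampling step mentioned in the abstract can be illustrated with a minimal sketch. This is not the paper's implementation: it assumes scalar ensemble members, a Gaussian observation likelihood, and a hypothetical helper name `sir_step`, and it runs on the CPU with the Python standard library only.

```python
import math
import random

def sir_step(ensemble, observation, obs_std):
    """One sequential importance resampling step for scalar states.

    Assumption for illustration: the observation error is Gaussian with
    standard deviation obs_std.
    """
    # Importance weight of each member: likelihood of the observation.
    weights = [math.exp(-0.5 * ((x - observation) / obs_std) ** 2)
               for x in ensemble]
    if sum(weights) == 0.0:
        # Every member is numerically impossible; keep the ensemble as-is.
        return list(ensemble)
    # Resample with replacement, proportionally to the weights.
    return random.choices(ensemble, weights=weights, k=len(ensemble))

random.seed(0)
prior = [random.gauss(0.0, 2.0) for _ in range(1000)]
posterior = sir_step(prior, observation=1.5, obs_std=0.5)
```

Members close to the observation are duplicated and unlikely members are dropped, so the ensemble size stays constant from one assimilation step to the next.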

(24) H. H. Holm, A. R. Brodtkorb, M. L. Sætra. GPU Computing with Python: Performance, Energy Efficiency and Usability, MDPI Computation 8(1), 4, Special issue on Energy-Efficient Computing on Parallel Architectures, 2020. [DOI]

Abstract: In this work, we examine the performance, energy efficiency and usability when using Python for developing high performance computing codes running on the graphics processing unit (GPU). We investigate the portability of performance and energy efficiency between CUDA and OpenCL; between GPU generations; and between low-end, mid-range and high-end GPUs. Our findings show that the impact of using Python is negligible for our applications, and furthermore, CUDA and OpenCL applications tuned to an equivalent level can in many cases obtain the same computational performance. Our experiments show that performance in general varies more between different GPUs than between using CUDA and OpenCL. We also show that tuning for performance is a good way of tuning for energy efficiency, but that specific tuning is needed to obtain optimal energy efficiency.

(23) H.H. Holm, A.R. Brodtkorb, and M.L. Sætra. Performance and energy efficiency of CUDA and OpenCL for GPU computing using Python. In Proceedings of the ParCo 2019 conference, 2019. arXiv preprint

Abstract: We have developed a shallow water simulator in Python that utilizes the GPU for computational performance. To access the GPU, there are today several different programming languages and models, and two popular choices are NVIDIA CUDA and OpenCL. CUDA is a proprietary language that can run only on NVIDIA GPUs, whilst OpenCL is an open standard available for GPUs from several vendors, in addition to other architectures such as Intel CPUs and FPGAs.

In this paper, we report our experiences using both PyCUDA and PyOpenCL for computational science. There are benefits to both approaches, and we summarize and compare some of these in a table. We compare achieved performance of equivalent code, availability and maturity of tools and the development ecosystems, and other relevant metrics.

(22) A. R. Brodtkorb. Agile Supervision of Bachelor, Master, and PhD. theses, Nordic Journal of STEM Education, 2019. [Draft (PDF)] | [Paper (DOI)]

Abstract: Supervision of students can often require a significant mental effort from the supervisor to be effective for the students. When supervising one or two students it is relatively easy to remember the context for each student's work from one meeting to the next. However, with a large number of students the mental complexity of the supervision becomes significant. This paper presents the author's experience with using planning boards for supervision of students on bachelor, master, and PhD level to tackle this problem.

(21) H.H. Holm, A.R. Brodtkorb, G. Broström, K.H. Christensen, and M.L. Sætra. Evaluation of selected finite-difference and finite-volume approaches to rotational shallow-water flow. Communications in Computational Physics, 2019. Paper (ntnuopen)

Abstract: Many scientific disciplines and practical problems involve computational fluid dynamics and simulation of water bodies based on the shallow-water equations. Herein, we target the challenge of validating numerical schemes and their implementations for oceanographic flow. We propose a set of test cases for the shallow-water equations in a rotational frame that represent important physical properties as seen from an oceanographic viewpoint. These are intended to be used by researchers developing new numerical schemes for oceanographic applications, and the test cases are demonstrated using four existing numerical schemes from the literature. The results from the test cases are presented in a systematic manner, together with quantitative and qualitative properties of the numerical schemes.

(20) A. R. Brodtkorb. Simplified Ocean Models on the GPU, Norsk Informatikkonferanse, 2018. Paper (bibsys)

Abstract: This article describes the implementation of three different simplified ocean models on a GPU (graphics processing unit) using Python and PyOpenCL. The three models are all based on solving the shallow water equations on Cartesian grids, and our work is motivated by the aim of running very large ensembles of forecast models for fully nonlinear data assimilation. The models are the linearized shallow water equations, the non-linear shallow water equations, and the two-layer non-linear shallow water equations, respectively, and they contain progressively more physical properties of the ocean dynamics. We show how these models are discretized to run efficiently on a graphics processing unit, discuss how to implement them, and show some simulation results. The implementation is available online under an open source license, and may serve as a starting point for others to implement similar oceanographic models.

(19) T. Gierlinger, A.R. Brodtkorb, A. Stumpf, M. Weilera, and F. Michel. Visualization of marine sand dune displacements utilizing modern GPU techniques, In The International Archives of the Photogrammetry, Remote Sensing and Spatial Information Sciences, 2015. Paper (DOI)

Abstract: Quantifying and visualizing deformation and material fluxes is an indispensable tool for many geoscientific applications at different scales comprising for example global convective models (Burstedde et al., 2013), co-seismic slip (Leprince et al., 2007) or local slope deformation (Stumpf et al., 2014b). Within the European project IQmulus (http://www.iqmulus.eu) a special focus is laid on the efficient detection and visualization of submarine sand dune displacements. In this paper we present our approaches on the visualization of the calculated displacements utilizing modern GPU techniques to enable the user to interactively analyse intermediate and final results within the whole workflow.

(18) M. L. Sætra, A. R. Brodtkorb, K-A. Lie, Efficient GPU-Implementation of Adaptive Mesh Refinement for the Shallow-Water Equations, Journal of Scientific Computing, 2014.
[Draft (PDF)] | Paper (DOI)

Abstract: The shallow-water equations model hydrostatic flow below a free surface for cases in which the ratio between the vertical and horizontal length scales is small and are used to describe waves in lakes, rivers, oceans, and the atmosphere. The equations admit discontinuous solutions, and numerical solutions are typically computed using high-resolution schemes. For many practical problems, there is a need to increase the grid resolution locally to capture complicated structures or steep gradients in the solution. An efficient method to this end is adaptive mesh refinement (AMR), which recursively refines the grid in parts of the domain and adaptively updates the refinement as the simulation progresses. Several authors have demonstrated that the explicit stencil computations of high-resolution schemes map particularly well to many-core architectures seen in hardware accelerators such as graphics processing units (GPUs). Herein, we present the first full GPU-implementation of a block-based AMR method for the second-order Kurganov–Petrova central scheme. We discuss implementation details, potential pitfalls, and key insights, and present a series of performance and accuracy tests. Although it is only presented for a particular case herein, we believe our approach to GPU-implementation of AMR is transferable to other hyperbolic conservation laws, numerical schemes, and architectures similar to the GPU.

(17) T. A. Haufmann, A. Berge, A. R. Brodtkorb, K. Kaspersen and A. Kim, Real-time online camera synchronization for volume carving on GPU, IEEE International Conference on Advanced Video and Signal-Based Surveillance (AVSS), 2013.
[Draft (PDF)] | [Paper (DOI)]

Abstract: Volume carving is a well-known technique for reconstructing a 3D scene from a set of 2D images, using features, usually foreground estimations, detected in individual cameras, and camera parameters to backproject the 2D images into 3D. Spatial calibration of the cameras is trivial, but the resulting carved volume is very sensitive to temporal offsets between the cameras. Automatic synchronization between the cameras is therefore desired. In this paper, we present a highly efficient implementation of volume carving and synchronization on a heterogeneous system fitted with commodity GPUs.
An online, real-time synchronization system is described and evaluated on surveillance video of an indoor scene. Improvements to the state-of-the-art CPU-based algorithms are described.

(16) A. Berge, A. R. Brodtkorb, T. A. Haufmann, K. Kaspersen and A. Kim, Recommendations and guidelines for image processing on heterogeneous hardware, Technical report, 2013. Report (PDF) [Youtube video] [Recorded presentation]

Abstract: This report gives an introduction to using GPUs for computer vision. We start by giving an introduction to GPUs, followed by a state-of-the-art survey of computer vision on GPUs. We then present our implementation of a real-time system for running low-level image processing algorithms on the GPU, based on live H.264 data originating from commodity-level IP cameras.

(15) A. R. Brodtkorb, T. R. Hagen, C. Schulz and G. Hasle, GPU Computing in Discrete Optimization Part I: Introduction to the GPU, EURO Journal on Transportation and Logistics, 2013.
[Draft (PDF)] [Paper (Springer)]

(14) C. Schulz, G. Hasle, A. R. Brodtkorb and T. R. Hagen, GPU Computing in Discrete Optimization Part II: Survey Focused on Routing Problems, EURO Journal on Transportation and Logistics, 2013.
[Draft (PDF)] [Paper (Springer)]

Abstract: Today there is still a large gap between the performance of current optimization technology and the requirements of real world applications. However, hardware development nowadays no longer results in higher speed for sequential algorithms but in increased parallelism in terms of multi-core architectures and massively parallel accelerators like GPUs. The gap has therefore to be closed by utilizing this parallelism and all available hardware. Modern commodity PCs include both a multi-core CPU and at least one GPU, providing a low cost, easily accessible heterogeneous environment for optimization algorithms. This has led to several studies of how certain optimization algorithms can use the GPU to accelerate their computations. This paper begins with a short historical introduction to modern mainstream computer architectures and the evolution of modern GPUs. To facilitate the development of optimization algorithms that utilize the GPU efficiently, we provide a thorough discussion of best practice and strategies for the development of scalable, high performance GPU code. The heterogeneous aspect of using both the CPU and the GPU for computations is considered as well. In the second part of the paper we provide a general survey of the existing literature on heterogeneous computing in discrete optimization, followed by an in-depth critical discussion of selected papers on routing problems. We hope that the lessons that arise from the combination of the strategies for heterogeneous computing with the study of existing literature will stimulate further high quality research related to the development of efficient and powerful new heterogeneous optimization algorithms. Our point of view regarding those lessons and future research completes the paper.

(13) A. R. Brodtkorb and M. L. Sætra, Explicit Shallow Water Simulations on GPUs: Guidelines and Best Practices, in Proceedings of the XIX International Conference on Computational Methods for Water Resources, 2012.
[Draft (PDF)] | [Paper (PDF)]

Abstract: Graphics processing units have now been used for scientific calculations for over a decade, going from early proof-of-concepts to industrial use today. The inherent reason is that graphics processors are far more powerful than CPUs when it comes to both floating point operations and memory bandwidth, illustrated by the fact that three of the top 500 supercomputers in the world now utilize GPUs. In this paper, we present guidelines and best practices for harvesting the power of graphics processing units for shallow water simulations through stencil computations.

(12) A. R. Brodtkorb, M. L. Sætra and T. R. Hagen, GPU Programming Strategies and Trends in GPU Computing, Journal of Parallel and Distributed Computing, Volume 73, Issue 1, January 2013, Pages 4–13, DOI: 10.1016/j.jpdc.2012.04.003.
[Draft (PDF)] | [Paper (Elsevier)]

Abstract: Over the last decade, there has been a growing interest in the use of graphics processing units (GPUs) for non-graphics applications. From early academic proof-of-concept papers around the year 2000, the use of GPUs has now matured to a point where there are countless industrial applications. Together with the expanding use of GPUs, we have also seen a tremendous development in the programming languages and tools, and getting started programming GPUs has never been easier. However, whilst getting started with GPU programming can be simple, being able to fully utilize GPU hardware is an art that can take months and years to master. In this article, we give an overview of GPU programming strategies, with a focus on efficient hardware utilization. We give general advice in addition to step-by-step approaches to locating and removing bottlenecks through profile driven development. We conclude the article with our view on current and future trends.

(11) M. L. Sætra and A. R. Brodtkorb, Shallow Water Simulations on Multiple GPUs, Proceedings of the Para 2010 Conference Part II, Lecture Notes in Computer Science 7134 (2012), pp 56--66, DOI: 10.1007/978-3-642-28145-7_6.
[Draft (PDF)] | [Paper (Springer)]

Abstract: We present a state-of-the-art shallow water simulator running on multiple GPUs. Our implementation is based on an explicit high-resolution finite volume scheme for the shallow water equations, suitable for modeling dam breaks and flooding. We use row domain decomposition to enable multi-GPU computations, and perform traditional CUDA block decomposition within each GPU for further parallelism. Our implementation shows near perfect weak and strong scaling, and enables simulation of domains consisting of up to 378 million cells at a rate of almost 400 megacells per second on the four GPUs of a Tesla S1070. Our experiments with the more recent Fermi architecture give an estimate of over 1 gigacells per second performance.
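As an illustration of row domain decomposition, the sketch below splits the rows of a grid into contiguous bands, one per GPU, and extends each band with ghost rows that would be exchanged with its neighbours. The function name `row_decomposition` and the default of two ghost rows are assumptions made for this example, not details taken from the paper.

```python
def row_decomposition(ny, num_gpus, ghost_rows=2):
    """Split ny grid rows into contiguous bands, one per GPU.

    Returns a list of (start, stop, halo_start, halo_stop) tuples, where
    [start, stop) are the rows a GPU owns and [halo_start, halo_stop) also
    includes the ghost rows needed from neighbouring bands.
    """
    base, extra = divmod(ny, num_gpus)
    bands = []
    start = 0
    for gpu in range(num_gpus):
        # The first `extra` GPUs get one additional row each.
        rows = base + (1 if gpu < extra else 0)
        stop = start + rows
        # Clamp the halo at the physical domain boundary.
        halo_start = max(0, start - ghost_rows)
        halo_stop = min(ny, stop + ghost_rows)
        bands.append((start, stop, halo_start, halo_stop))
        start = stop
    return bands

bands = row_decomposition(ny=10, num_gpus=4)
# bands == [(0, 3, 0, 5), (3, 6, 1, 8), (6, 8, 4, 10), (8, 10, 6, 10)]
```

Splitting along rows keeps each band contiguous in a row-major array, so a ghost-row exchange between neighbouring GPUs is a single contiguous copy.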

(10) A. R. Brodtkorb, Scientific Computing on Heterogeneous Architectures, Ph.D. thesis, University of Oslo, ISSN 1501-7710, No. 1031, 2010.
[Thesis (PDF)] [Slides (PDF)]

Abstract: The CPU has traditionally been the computational work horse in scientific computing, but we have seen a tremendous increase in the use of accelerators, such as Graphics Processing Units (GPUs), in the last decade. These architectures are used because they consume less power and offer higher performance than equivalent CPU solutions. They are typically also far less expensive, as more CPUs, and even clusters, are required to match their performance. Even though these accelerators are powerful in terms of floating point operations per second, they are considerably more primitive in terms of capabilities. For example, they cannot even open a file on disk without the use of the CPU. Thus, most applications can benefit from using accelerators to perform heavy computation, whilst running complex tasks on the CPU. This use of different compute resources is often referred to as heterogeneous computing, and we explore the use of heterogeneous architectures for scientific computing in this thesis. Through six papers, we present qualitative and quantitative comparisons of different heterogeneous architectures, the use of GPUs to accelerate linear algebra operations in MATLAB, and efficient shallow water simulation on GPUs. Our results show that the use of heterogeneous architectures can give large performance gains.

(9) A. R. Brodtkorb, M. L. Sætra, and M. Altinakar, Efficient Shallow Water Simulations on GPUs: Implementation, Visualization, Verification, and Validation, Computers & Fluids, 55, (2011), pp 1--12. DOI: 10.1016/j.compfluid.2011.10.012.
[Draft (PDF)] [Paper (Elsevier)]

Abstract: In this paper, we present an efficient implementation of a state-of-the-art high-resolution explicit scheme for the shallow water equations on graphics processing units. The selected scheme is well balanced, supports dry states, and suits the execution model of graphics processing units well. We verify and validate our implementation and show that efficient use of single precision hardware is sufficiently accurate. Our framework further supports real-time visualization with both photo-realistic and non-photo-realistic display of the physical quantities. We present performance results showing that we can accurately simulate the first 4000 seconds of the Malpasset dam break case in 27 seconds.

(8) A. R. Brodtkorb, T. R. Hagen, K.-A. Lie and J. R. Natvig, Simulation and Visualization of the Saint-Venant System using GPUs, Computing and Visualization in Science, special issue on Hot topics in Computational Engineering, 13(7), (2011), pp. 341--353, DOI: 10.1007/s00791-010-0149-x.
[Draft (PDF)] [Paper (Springer)]

Abstract: We consider three high-resolution schemes for computing shallow-water waves as described by the Saint-Venant system and discuss how to develop highly efficient implementations using graphical processing units (GPUs). The schemes are well-balanced for lake-at-rest problems, handle dry states, and support linear friction models. The first two schemes handle dry states by switching variables in the reconstruction step, so that bilinear reconstructions are computed using physical variables for small water depths and conserved variables elsewhere. In the third scheme, reconstructed slopes are modified in cells containing dry zones to ensure non-negative values at integration points. We discuss how single and double-precision arithmetics affect accuracy and efficiency, discuss scalability and resource utilization for our implementations, and demonstrate that all three schemes map very well to current GPU hardware. We have also implemented direct and close-to-photo-realistic visualization of simulation results on the GPU, giving visual simulations with interactive speeds for reasonably-sized grids.

(7) A. R. Brodtkorb, C. Dyken, T. R. Hagen, J. M. Hjelmervik and O. O. Storaasli, State-of-the-Art in Heterogeneous Computing, Scientific Programming, 18(1) (2010), pp. 1--33
[Paper (PDF)] [Paper (IOS Press)]

Abstract: Node level heterogeneous architectures have become attractive during the last decade for several reasons: compared to traditional symmetric CPUs, they offer high peak performance and are energy and/or cost efficient. With the increase of fine-grained parallelism in high-performance computing, as well as the introduction of parallelism in workstations, there is an acute need for a good overview and understanding of these architectures. We give an overview of the state-of-the-art in heterogeneous computing, focusing on three commonly found architectures: the Cell Broadband Engine Architecture, graphics processing units (GPUs), and field programmable gate arrays (FPGAs). We present a review of hardware, available software tools, and an overview of state-of-the-art techniques and algorithms. Furthermore, we present a qualitative and quantitative comparison of the architectures, and give our view on the future of heterogeneous computing.

(6) A. R. Brodtkorb, An Asynchronous API for Numerical Linear Algebra, Scalable Computing: Practice and Experience, special issue on Recent Developments in Multi-Core Computing Systems, 9(3) (2008), pp. 153--163.
[Draft (PDF)] [Paper (SCPE)]

Abstract: We present a task-parallel asynchronous API for numerical linear algebra that utilizes multiple CPUs, multiple GPUs, or a combination of both. Furthermore, we present a wrapper of this interface for use in MATLAB. Our API imposes only small overheads, scales perfectly to two processor cores, and shows even better performance when utilizing computational resources on the GPU.

(5) A. R. Brodtkorb and T. R. Hagen, A Comparison of Three Commodity-Level Parallel Architectures: Multi-core CPU, the Cell BE and the GPU, Seventh International Conference on Mathematical Methods for Curves and Surfaces, Lecture Notes in Computer Science, 5862 (2010), pp. 70--80
[Draft (PDF)] [Paper (Springer)]

Abstract: We explore three commodity parallel architectures: multi-core CPUs, the Cell BE processor, and graphics processing units. We have implemented four algorithms on these three architectures: solving the heat equation, inpainting using the heat equation, computing the Mandelbrot set, and MJPEG movie compression. We use these four algorithms to exemplify the benefits and drawbacks of each parallel architecture.

(4) A. R. Brodtkorb, The Graphics Processor as a Mathematical Coprocessor in MATLAB, The Second International Conference on Complex, Intelligent and Software Intensive Systems, pp. 822--827, March 2008, DOI: 10.1109/CISIS.2008.68.
[Draft (PDF)] [Paper (DOI)]

Abstract: We present an interface to the graphics processing unit (GPU) from MATLAB, and four algorithms from numerical linear algebra available through this interface: matrix-matrix multiplication, Gauss-Jordan elimination, PLU factorization, and tridiagonal Gaussian elimination. In addition to being a high-level abstraction to the GPU, the interface offers background processing, enabling computations to be executed on the CPU simultaneously. The algorithms are shown to be up to 31 times faster than highly optimized CPU code. The algorithms have only been tested on single precision hardware, but will easily run on new double precision hardware.
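For readers unfamiliar with tridiagonal Gaussian elimination, the following is a minimal serial sketch (often called the Thomas algorithm). It is a plain CPU illustration without pivoting, not the GPU implementation described in the paper, and the function name `solve_tridiagonal` and its argument layout are assumptions made for this example.

```python
def solve_tridiagonal(a, b, c, d):
    """Solve a tridiagonal system by Gaussian elimination without pivoting.

    a: sub-diagonal (a[0] is unused), b: main diagonal,
    c: super-diagonal (c[-1] is unused), d: right-hand side.
    All four lists have the same length n.
    """
    n = len(d)
    cp = [0.0] * n  # modified super-diagonal
    dp = [0.0] * n  # modified right-hand side
    cp[0] = c[0] / b[0]
    dp[0] = d[0] / b[0]
    # Forward elimination of the sub-diagonal.
    for i in range(1, n):
        denom = b[i] - a[i] * cp[i - 1]
        cp[i] = c[i] / denom if i < n - 1 else 0.0
        dp[i] = (d[i] - a[i] * dp[i - 1]) / denom
    # Back substitution.
    x = [0.0] * n
    x[-1] = dp[-1]
    for i in range(n - 2, -1, -1):
        x[i] = dp[i] - cp[i] * x[i + 1]
    return x

# 3x3 system with diagonal 2 and off-diagonals -1; exact solution is [1, 2, 3].
x = solve_tridiagonal(a=[0.0, -1.0, -1.0], b=[2.0, 2.0, 2.0],
                      c=[-1.0, -1.0, 0.0], d=[0.0, 0.0, 4.0])
```

Because no pivoting is performed, this sketch is only safe for systems such as diagonally dominant ones, where the eliminated diagonal never becomes zero.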

(3) A. R. Brodtkorb, A MATLAB Interface to the GPU, Master’s thesis, Department of Informatics, Faculty of Mathematics and Natural Sciences, University of Oslo, May 2007.
[Thesis (DUO)] [Thesis (PDF)]

Abstract: This thesis delves into the field of general purpose computation on graphics processing units (GPGPU). A MATLAB interface for solving numerical linear algebra on the graphics processing unit (GPU), and three algorithms from numerical linear algebra are presented. The algorithms are shown to be faster than the highly efficient ATLAS implementations used in MATLAB. In addition, the interface allows background processing on the GPU, enabling it to be used as a mathematical coprocessor. The computations are shown to be sufficiently accurate, and solving the shallow water equations implicitly is demonstrated with both the CPU and the GPU utilized for maximum performance. A comparison of the interface and other high-level languages for GPGPU is also presented.

(2) A. R. Brodtkorb, T. Fladby and M. L. Sætra, PLU factorization on a Cluster of GPUs Using Fast Ethernet, Technical report, 2007.
[Paper (PDF)]

Abstract: In this white paper, we present a novel approach to solve linear systems of equations on a cluster using the PLU factorization. We use the graphics processing unit (GPU) as the main computational engine at each node, and a block-cyclic data distribution to solve the system. The local computation is a new way of solving the PLU factorization on the GPU. It utilizes the full four-way vectorized arithmetic found in most GPUs, and a new pivoting strategy. The global algorithm uses the message passing interface (MPI) for communication between nodes. We show that our algorithm is highly efficient on the local nodes, but bounded by the relatively slow network. A faster network will eliminate this bottleneck, and the speed of the local computations shows promising results.

(1) A. R. Brodtkorb, Matrix-Matrix Multiplication in MATLAB using the GPU, Technical report, 2006.
[Paper (PDF)]

Abstract: The use of GPUs as the main computing resource has yielded great speed-up factors in several fields including solving differential equations, linear algebra, signal processing and database queries. There have been several attempts at implementing efficient algorithms for matrix-matrix products with varying results. In-depth analysis of the algorithms has been presented as well. In this paper I review the work done in the field, and present a crude implementation of matrix-matrix products using the GPU. The implementation is run in MATLAB.