Modern graphics processing units (GPUs) contain hundreds of arithmetic units and can be harnessed to provide tremendous acceleration for numerically intensive scientific applications such as molecular modeling. The increased capabilities and flexibility of recent GPU hardware, combined with high-level GPU programming languages such as CUDA and OpenCL, have unlocked this computational power and made it accessible to computational scientists. The key to effective GPU computing is the design and implementation of data-parallel algorithms that scale to hundreds of tightly coupled processing units. Many molecular modeling applications are well suited to GPUs because of their extensive computational requirements and because they lend themselves to data-parallel implementations. Several exemplary results from our GPU computing work are presented in Klaus Schulten's Keynote Lecture from the 2010 GPU Technology Conference.
NCSA Blue Waters GPU-accelerated Supercomputer

Elusive HIV-1 Capsid Structure Determination Accelerated by GPUs

Human immunodeficiency virus type 1 (HIV-1) is the major cause of AIDS, for which treatments need to be developed continuously because the virus quickly becomes resistant to new drugs. When the virus infects a human cell, it releases its capsid into the cell, a closed, stable container protecting the viral genetic material. However, interaction with the cell at some point triggers an instability of the capsid, leading to a well-timed release of the genetic material, which then merges with the cell's genes and begins to control the cell. The dual role of the capsid, to be functionally both stable and unstable, makes it in principle an ideal target for antiviral drugs; in fact, treatments of other viral infections successfully target the respective capsids. The size of the HIV-1 capsid (about 1,300 proteins) and its irregular shape had so far prevented the resolution of a full atomic-level capsid structure. However, in a tour de force effort, groups of experimental and computational scientists have now resolved the capsid's chemical structure (deposited in the Protein Data Bank under the accession codes 3J3Q and 3J3Y). As reported recently (see also journal cover), the researchers combined NMR structure analysis, electron microscopy, and data-guided molecular dynamics simulations to obtain and characterize the HIV-1 capsid, using VMD to prepare and analyze simulations performed with NAMD on Blue Waters, one of the most powerful computers worldwide. The discovery can now guide the design of novel drugs for enhanced antiviral therapy.

Molecular Dynamics

Continuing advances in high performance computing technology have rapidly expanded the domain of biomolecular simulation from isolated proteins in solvent to complex aggregates, often in a lipid environment. Such systems routinely comprise 100,000 atoms, and several published NAMD simulations have exceeded 10,000,000 atoms. Studying the function of even the simplest biomolecular machines requires simulations of 100 ns or longer, even when employing simulation techniques for accelerating processes of interest. One of the most time-consuming calculations in a typical molecular dynamics simulation is the evaluation of forces between atoms that do not share bonds. Thanks to their high degree of parallelism and floating point arithmetic capability, GPUs can evaluate these forces at performance levels twenty times that of a single CPU core. This twenty-fold acceleration decreases the runtime of the non-bonded force evaluation to the point where it can be overlapped with the bonded force and PME long-range force calculations on the CPU. These and other CPU-bound operations must be ported to the GPU before further acceleration of the entire NAMD application can be realized.
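The non-bonded force evaluation described above can be illustrated with a minimal CPU-side sketch. It is not NAMD's implementation: the function name, the single set of Lennard-Jones parameters shared by all atoms, the cutoff value, and the unit convention (Coulomb constant in kcal·Å/(mol·e²)) are assumptions chosen for illustration. The point is that every pair term is independent of the others, which is what makes the calculation data-parallel.

```python
import math

def nonbonded_forces(positions, charges, bonded_pairs, cutoff=12.0,
                     epsilon=0.1, sigma=3.4, coulomb_const=332.0636):
    """Sum Coulomb and Lennard-Jones forces over all pairs that do not share a bond.

    Illustrative assumptions: one (epsilon, sigma) pair for every atom,
    a plain distance cutoff, and an all-pairs double loop.
    """
    n = len(positions)
    forces = [[0.0, 0.0, 0.0] for _ in range(n)]
    excluded = {frozenset(pair) for pair in bonded_pairs}
    for i in range(n):
        for j in range(i + 1, n):
            if frozenset((i, j)) in excluded:
                continue  # bonded atoms are handled by the bonded terms
            d = [positions[i][k] - positions[j][k] for k in range(3)]
            r2 = sum(c * c for c in d)
            if r2 > cutoff * cutoff:
                continue
            r = math.sqrt(r2)
            # Coulomb force magnitude: C * q_i * q_j / r^2
            f_coulomb = coulomb_const * charges[i] * charges[j] / r2
            # Lennard-Jones force magnitude: 24 * eps * (2 (s/r)^12 - (s/r)^6) / r
            sr6 = (sigma * sigma / r2) ** 3
            f_lj = 24.0 * epsilon * (2.0 * sr6 * sr6 - sr6) / r
            scale = (f_coulomb + f_lj) / r
            for k in range(3):
                forces[i][k] += scale * d[k]   # force on atom i
                forces[j][k] -= scale * d[k]   # equal and opposite force on atom j
    return forces
```

A GPU version would typically assign atoms or groups of atoms to threads instead of running the double loop serially; the arithmetic per pair stays the same.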

Power Profiling

Our team has recently developed hardware and software for measuring and optimizing the power efficiency of VMD and other computational biology applications on mobile computing hardware, ensuring that VMD runs well on battery-powered devices. By measuring power consumption of both the host computer and either integrated or discrete GPUs, one can measure the power consumed during different phases of application execution, and optimize the trade-off between performance and energy efficiency.
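As a rough illustration of per-phase power measurement, the sketch below integrates sampled power readings over time to obtain energy. The sample format, function names, and phase boundaries are hypothetical and do not describe the actual measurement hardware or software.

```python
def energy_joules(samples):
    """Integrate (time_s, power_w) samples with the trapezoidal rule."""
    total = 0.0
    for (t0, p0), (t1, p1) in zip(samples, samples[1:]):
        total += 0.5 * (p0 + p1) * (t1 - t0)
    return total

def energy_by_phase(samples, phases):
    """Energy per named phase; phases maps a name to a (start_s, end_s) window."""
    return {
        name: energy_joules([s for s in samples if start <= s[0] <= end])
        for name, (start, end) in phases.items()
    }

# Hypothetical readings: 10 W while loading data, rising to 20 W while rendering.
readings = [(0.0, 10.0), (1.0, 10.0), (2.0, 20.0)]
per_phase = energy_by_phase(readings, {"load": (0.0, 1.0), "render": (1.0, 2.0)})
```

Comparing the energy of a phase against its runtime is one simple way to reason about the trade-off between performance and energy efficiency.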

Multi-Resolution Molecular Surface Visualization

Molecular surface visualization allows researchers to see where structures are exposed to solvent, where structures come into contact, and to view the overall architecture of large biomolecular complexes such as trans-membrane channels and virus capsids. Recently, we have developed a new GPU-accelerated multi-resolution molecular surface representation, enabling smooth interactive animation of moderate sized biomolecular complexes consisting of a few hundred thousand to one million atoms, and interactive display of molecular surfaces for multi-million atom complexes, e.g. large virus capsids. The GPU-accelerated QuickSurf representation in VMD achieves performance orders of magnitude faster than the conventional Surf and MSMS representations, and makes VMD the first molecular visualization tool capable of achieving smooth animations of surface representations for systems of up to one million atoms.

Molecular Orbital Display

Visualization of molecular orbitals (MOs) is important for analyzing the results of quantum chemistry simulations. The functions describing the MOs are computed on a three-dimensional lattice, and the resulting data can then be used for plotting isocontours or isosurfaces for visualization as well as for other types of analyses. Existing software packages that render MOs perform calculations on the CPU and require runtimes of tens to hundreds of seconds depending on the complexity of the molecular system.

We have developed novel data-parallel algorithms for computing MOs on modern graphics processing units (GPUs) using CUDA. As recently reported, the fastest GPU algorithm achieves up to a 125-fold speedup over an optimized CPU implementation running on one CPU core. We have implemented these algorithms within the popular molecular visualization program VMD, which can now produce high quality MO renderings for large systems in less than a second, and achieves the first-ever interactive animations of quantum chemistry simulation trajectories using only on-the-fly calculation.
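The lattice evaluation of an MO can be sketched on the CPU as follows. This is a simplified illustration, not the VMD algorithm: it assumes the orbital is a linear combination of unnormalized s-type Gaussian functions, whereas real quantum chemistry basis sets also include contracted functions and higher angular momentum terms. The function name and argument layout are hypothetical.

```python
import math

def molecular_orbital_on_lattice(basis, origin, spacing, dims):
    """Evaluate an orbital on a regular 3-D lattice.

    basis : list of (coefficient, exponent, (cx, cy, cz)), one entry per
            s-type Gaussian (a simplifying assumption)
    Returns a nested list indexed as grid[ix][iy][iz].
    """
    nx, ny, nz = dims
    ox, oy, oz = origin
    grid = [[[0.0] * nz for _ in range(ny)] for _ in range(nx)]
    for ix in range(nx):
        x = ox + ix * spacing
        for iy in range(ny):
            y = oy + iy * spacing
            for iz in range(nz):
                z = oz + iz * spacing
                value = 0.0
                # Each lattice point is independent of all others, so on a
                # GPU this inner sum would be the work of a single thread.
                for coeff, alpha, (cx, cy, cz) in basis:
                    r2 = (x - cx) ** 2 + (y - cy) ** 2 + (z - cz) ** 2
                    value += coeff * math.exp(-alpha * r2)
                grid[ix][iy][iz] = value
    return grid
```

The resulting lattice values are what an isosurface extraction step would consume for display.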

Ion Placement

To best reproduce physiological conditions, molecular dynamics simulations must be run in the presence of appropriate ions. Generally such simulations are performed in the presence of sodium chloride, although in some cases (such as simulations including nucleic acids) other ions such as magnesium are necessary. Although many tools, such as the VMD Autoionize plugin, can place a random distribution of ions, molecules that require counterions for their stability are better treated using ion placement methods that take the electrostatics of the solute into account. One such method is to place important counterions at minima in the electrostatic potential field generated by the biomolecule of interest, iteratively updating the potential field after each ion is placed.

While this method of ion placement is simple and computes ion positions matched to the specific target molecule, it can be very computationally demanding for large structures because it requires calculation of the electrostatic potential at all points on a high-resolution 3-D lattice in the neighborhood of the solute. Coulomb-based ionization of very large structures such as viruses could require several days even using moderately sized clusters of computers. However, the calculation of a function on a lattice where all points are independent is an ideal application for GPU acceleration, and as recently reported in the Journal of Computational Chemistry, the use of GPUs to accelerate Coulomb-based ion placement leads to speedups of 100 times or more, allowing large structures to be properly ionized in less than an hour on a single desktop computer.
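The iterative placement loop can be sketched as below. This is a minimal illustration under stated assumptions, not the published implementation: the function name is hypothetical, the Coulomb constant is omitted (arbitrary units), lattice points closer than `min_dist` to any charge are excluded to avoid the singularity, and each ion is placed where `ion_charge * potential` is lowest, which reduces to the potential minimum for a positive ion.

```python
import math

def place_ions(atoms, lattice_points, n_ions, ion_charge=1.0, min_dist=2.0):
    """Place ions one at a time at the lowest-energy free lattice point.

    atoms          : list of (x, y, z, charge) for the solute
    lattice_points : list of (x, y, z) candidate positions
    Returns the list of chosen ion positions.
    """
    # Initial potential from the solute; None marks excluded points.
    potential = []
    for p in lattice_points:
        value, allowed = 0.0, True
        for (x, y, z, q) in atoms:
            r = math.dist(p, (x, y, z))
            if r < min_dist:
                allowed = False
                break
            value += q / r
        potential.append(value if allowed else None)

    ions = []
    for _ in range(n_ions):
        candidates = [i for i, v in enumerate(potential) if v is not None]
        best = min(candidates, key=lambda i: ion_charge * potential[i])
        position = lattice_points[best]
        ions.append(position)
        # Update the field with the new ion before placing the next one.
        for i, p in enumerate(lattice_points):
            if potential[i] is None:
                continue
            r = math.dist(p, position)
            if r < min_dist:
                potential[i] = None
            else:
                potential[i] += ion_charge / r
    return ions
```

The initial potential evaluation and each update touch every lattice point independently, which is the part of the method that maps naturally onto a GPU.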

GPU-accelerated ion placement for large bacterial ribosome and STMV structures

The direct summation of the Coulomb potential from all atoms to every lattice point requires computational work that grows quadratically, proportional to the product of the number of atoms and the number of lattice points. An algorithmic enhancement known as multilevel summation uses hierarchical interpolation of softened pairwise potentials from lattices of increasing coarseness to compute an approximation to the Coulomb potential. The amount of computational work for multilevel summation grows linearly, proportional to the sum of the number of atoms and the number of lattice points. Our reported GPU-assisted implementation of this method further reduces the time needed to obtain large ionized structures to just a few minutes on a single desktop computer. The implementation is accurate enough (with an average difference from the direct approach demonstrated to be in the range of 0.025% to 0.037%) to yield the same ion placement as the direct summation approach for small test molecules and nearly identical results for the ribosome.
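The quadratic cost of direct summation is easy to see in a small CPU sketch. The function name, argument layout, and arbitrary units (Coulomb constant omitted) are assumptions for illustration; the counter simply records how many atom/lattice-point pairs are visited.

```python
import math

def direct_coulomb_lattice(atoms, origin, spacing, dims):
    """Direct Coulomb summation on a regular lattice.

    atoms : list of (x, y, z, charge)
    Returns (potentials in x-fastest order, number of pairs visited).
    """
    nx, ny, nz = dims
    ox, oy, oz = origin
    potentials = []
    pair_evaluations = 0
    for iz in range(nz):
        for iy in range(ny):
            for ix in range(nx):
                px = ox + ix * spacing
                py = oy + iy * spacing
                pz = oz + iz * spacing
                value = 0.0
                for (ax, ay, az, q) in atoms:
                    pair_evaluations += 1
                    r = math.sqrt((px - ax) ** 2 + (py - ay) ** 2 + (pz - az) ** 2)
                    if r > 0.0:  # skip a lattice point that coincides with an atom
                        value += q / r
                potentials.append(value)
    return potentials, pair_evaluations
```

The pair count always equals the number of atoms times the number of lattice points, so doubling both quadruples the work. Multilevel summation avoids this by replacing most of the long-range pair terms with interpolation from coarser lattices; that method is considerably more involved and is not sketched here.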

The GPU-accelerated Coulomb potential calculation can be directly applied to calculate time-averaged electrostatic potentials from molecular dynamics simulations. As we reported, a VMD calculation of the electrostatic potential for one frame of a molecular dynamics simulation of the ribosome takes 529 seconds on a single GPU, as opposed to 5.24 hours on a single CPU core, a speedup of roughly 36 times. A multilevel summation calculation for a single frame requires 67 seconds on one GPU.

Multi-GPU Coulomb Summation

Just as scientific computing can be done on clusters composed of a large number of CPU cores, in some cases problems can be decomposed and run in parallel on multiple GPUs within a single host machine, achieving correspondingly higher levels of performance. One of the drawbacks to the use of multi-core CPUs for scientific computing has been the limited amount of memory bandwidth available to each CPU socket, often severely limiting the performance of bandwidth-intensive scientific codes. Recently this problem has been further exacerbated because the memory bandwidth available to each CPU socket has not kept pace with the increasing number of cores in current CPUs. Since GPUs contain their own on-board high performance memory, the memory bandwidth available to computational kernels scales as the number of GPUs is increased. This property can allow single-system multi-GPU codes to scale much better than their multi-core CPU-based counterparts. Highly data-parallel and memory bandwidth intensive problems are often excellent candidates for such multi-GPU performance scaling.

The direct Coulomb summation algorithm implemented in VMD is an exemplary case for multi-GPU acceleration. The scaling efficiency for direct summation across multiple GPUs is nearly perfect: the use of four GPUs delivers an almost exactly fourfold performance increase. A single GPU evaluates up to 39 billion atom potentials per second, performing 290 GFLOPS of floating point arithmetic. With four GPUs, total performance increases to 157 billion atom potentials per second and 1.156 TFLOPS of floating point arithmetic, for a multi-GPU speedup of 3.99 and a scaling efficiency of 99.7%, as recently reported. Matching this level of performance using CPUs would require hundreds of state-of-the-art CPU cores, along with their attendant cabling, power, and cooling requirements. Although this is only one of the first steps in our exploration of multiple GPUs, the result clearly demonstrates that it is possible to harness multiple GPUs in a single system with high efficiency.
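The decomposition behind this kind of scaling can be sketched as follows. Worker threads stand in for GPUs here, and the function names and the choice of contiguous, evenly sized chunks are assumptions for illustration rather than the scheme used in VMD. Because every lattice point is independent, the chunks need no communication until the results are concatenated.

```python
import math
from concurrent.futures import ThreadPoolExecutor

def potential_chunk(atoms, points):
    """Direct Coulomb sum for one chunk of lattice points (arbitrary units).

    Assumes no lattice point coincides with an atom.
    """
    return [sum(q / math.dist(p, (x, y, z)) for (x, y, z, q) in atoms)
            for p in points]

def split_evenly(items, n_parts):
    """Split a list into n_parts contiguous chunks whose sizes differ by at most one."""
    base, extra = divmod(len(items), n_parts)
    chunks, start = [], 0
    for i in range(n_parts):
        size = base + (1 if i < extra else 0)
        chunks.append(items[start:start + size])
        start += size
    return chunks

def multi_device_potential(atoms, points, n_devices=4):
    """Evaluate the lattice in n_devices independent chunks and merge the results."""
    chunks = split_evenly(points, n_devices)
    with ThreadPoolExecutor(max_workers=n_devices) as pool:
        # Each worker plays the role of one device; map preserves chunk order.
        parts = list(pool.map(lambda chunk: potential_chunk(atoms, chunk), chunks))
    return [value for part in parts for value in part]
```

Python threads will not speed up this pure-Python arithmetic; the sketch only shows how the work is partitioned and reassembled.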

Fluorescence Microphotolysis

Fluorescence microphotolysis is a non-invasive method of studying the dynamics of cellular components using optical microscopy. In this method, a small area of a fluorescent specimen is illuminated by a focused laser beam, and the fluorescence of the illuminated spot is recorded. By analyzing the change of the fluorescence signal with time, one can extract diffusion constants of the fluorescent molecules. However, such an analysis of experimental data often requires numerical calculations; namely, a diffusion-reaction equation (a partial differential equation in time and 2D or 3D space) has to be solved. Numerical schemes for solving this equation on a grid feature a significant degree of parallelism; indeed, the scheme can be represented as a vector-matrix multiplication problem, which is common for graphics applications and can easily be computed on a GPU. On the other hand, the computation of the fluorescent molecule concentration at a given point depends on the concentration at other points, introducing interdependencies that limit parallelism. Nevertheless, it has been demonstrated recently that one can achieve a significant speedup with the GPU-accelerated computation of the fluorescence microphotolysis signals, as compared to the CPU computation. A computation that took about 8 minutes on a CPU has been shown to run in 38 seconds on a GPU. Given that experimentalists need to perform multiple computation runs with various parameters to match the observed fluorescence signals, this 12-fold speedup is very welcome. As we reported, the GPU-accelerated computation of fluorescence measurements opens new possibilities for experiments that employ new high-resolution microscopes (such as the so-called 4Pi microscope), because the intricate pattern of light distribution in such microscopes makes a numerical solution necessary to analyze experimental data. Further information on this topic is available here.
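A grid-based scheme of the kind described above can be sketched with one explicit finite-difference time step in 2D. The specific equation form (diffusion plus a first-order, position-dependent bleaching term), the no-flux boundaries, and all names are assumptions for illustration and are not taken from the reported implementation. Each new grid value depends only on values from the previous time step, so all points can be updated in parallel within a step, while successive steps must run in order.

```python
def diffusion_reaction_step(conc, diff_coeff, bleach_rate, dt, dx):
    """Advance dc/dt = D * laplacian(c) - k(x, y) * c by one explicit step.

    conc, bleach_rate : 2-D lists indexed [row][column]
    Stability of this explicit scheme requires D * dt / dx**2 <= 0.25.
    """
    ny, nx = len(conc), len(conc[0])
    a = diff_coeff * dt / (dx * dx)
    new = [[0.0] * nx for _ in range(ny)]
    for j in range(ny):
        for i in range(nx):
            c = conc[j][i]
            # Clamped indices mirror the edge value, giving no-flux boundaries.
            left = conc[j][max(i - 1, 0)]
            right = conc[j][min(i + 1, nx - 1)]
            down = conc[max(j - 1, 0)][i]
            up = conc[min(j + 1, ny - 1)][i]
            laplacian = left + right + down + up - 4.0 * c
            new[j][i] = c + a * laplacian - dt * bleach_rate[j][i] * c
    return new

def fluorescence_signal(conc, illumination):
    """Signal modeled as the illumination-weighted sum of concentration."""
    return sum(w * c
               for row_w, row_c in zip(illumination, conc)
               for w, c in zip(row_w, row_c))
```

Fitting experimental data would repeat such a simulation for many parameter choices, which is why the per-run speedup matters.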


Book Chapters

"Application Case Study — Molecular Visualization and Analysis"
John E. Stone.
In David Kirk and Wen-mei Hwu, Programming Massively Parallel Processors: A Hands-on Approach (Third Edition), Morgan Kaufmann, Chapter 15, pp. 331-344, Cambridge, MA, 2017.
Book home pages: Amazon | Elsevier | Online full text at ScienceDirect
"GPU-Accelerated Molecular Dynamics Clustering Analysis with OpenACC"
John E. Stone, Juan R. Perilla, C. Keith Cassidy, and Klaus Schulten.
In Robert Farber, editor, Parallel Programming with OpenACC, Morgan Kaufmann, pp. 215-240, Cambridge, MA, 2016.
Book home pages: Amazon | Elsevier
"GPU-Accelerated Computation and Interactive Display of Molecular Orbitals"
John E. Stone, David J. Hardy, Jan Saam, Kirby L. Vandivort, and Klaus Schulten.
In Wen-mei Hwu, editor, GPU Computing Gems, Chapter 1, pp. 5-18, 2011.
Book home pages: Amazon | Elsevier
"Fast Molecular Electrostatics Algorithms on GPUs"
David J. Hardy, John E. Stone, Kirby L. Vandivort, David Gohara, Christopher Rodrigues, and Klaus Schulten.
In Wen-mei Hwu, editor, GPU Computing Gems, Chapter 4, pp. 43-58, 2011.
Book home pages: Amazon | Elsevier
"GPU Algorithms for Molecular Modeling"
John E. Stone, David J. Hardy, Barry Isralewitz, and Klaus Schulten.
In Jack Dongarra, David A. Bader, and Jakub Kurzak, editors, Scientific Computing with Multicore and Accelerators, Chapman & Hall / CRC Press, Chapter 16, pp. 351-371, 2010.
Book home pages: Amazon | CRC Press



Class lectures, workshop materials, and sample source code:



Our Research in the News