Enhanced sampling

The basic properties of molecular materials in the solid state are often strongly influenced by the details of their crystal structures and the existence of polymorphs. Experimental determination of these structures is costly and time-consuming, which places increased importance on the role of theory and computation. Similarly, the biochemical function of small oligopeptides, from immunogenicity to inhibition, is affected by their equilibrium conformations in different environments. In developing techniques to study these effects, following a suggestion by Dunitz and Scheraga PNAS 101, 14309 (2004), we have adopted the perspective that these seemingly different phenomena are two aspects of the same underlying rare-event sampling problem. Rare events are characterized by major basins on a potential energy surface that are separated by high energetic or entropic barriers (see accompanying figure).

Techniques designed to treat one type of system can be adapted for the other. This research direction focuses on the development of free energy based enhanced sampling and surface navigation strategies that can be applied to either class of system when appropriately tailored to its particular requirements and conditions. Free energy is emphasized because it is fundamental in thermodynamics and is the appropriate figure of merit in ranking crystal structures and determining favored conformations of peptides. In many studies, one often seeks this quantity as a function of a set of collective variables (CVs) capable of distinguishing the structures of interest in a system. The CVs should also capture slow modes of motion that represent sampling bottlenecks. The number of CVs is typically a few to several tens, hence the dimensionality of the resulting free energy function is sufficiently high that generating and representing it constitute major challenges, which this research direction aims to address.

Consider a system of N atoms with positions r1,…,rN ≡ r and potential energy U(r). Suppose we have decided that a set of n CVs, qα(r), α = 1,…,n, distinguish the various conformational states on the free energy surface. The free energy surface is then defined as follows within classical statistical mechanics: We first define the marginal probability distribution function

where β=1/k_BT. The free energy surface A(s₁,…,s_n,T) is then given by

The expression for the marginal distribution suggests that an n-dimensional grid should be laid down in the space of the coarse-grained variables (CGVs) s1,…,sn and the distribution computed at each point on the grid. Such a computation can be carried out if n = 1 or 2, however, for larger values of n, the dimensional explosion of the grid renders the problem intractable, in which case the better approach is to sample the marginal distribution in much the same way that the Boltzmann distribution is sampled by molecular dynamics or Monte Carlo. The difficulty with sampling the marginal is that, unlike the Boltzmann distribution, the marginal is not known a priori. Thus, we need an approach capable of sampling an unknown distribution, which means we need to construct the marginal “on the fly” as part of the sampling approach. In order to construct such an approach, we first replace the Dirac δ-functions with Gaussian functions as follows:

With this substitution, the exponential in the Gaussian can be incorporated as part of the Boltzmann distribution, where it becomes a harmonic coupling between the CVs and the CGVs. Thus, if the harmonic force constants are taken to be finite (and large), it is possible to sample the Gaussian centers s_α directly via molecular dynamics. This is accomplished by constructing an extended phase space Hamiltonian of the form

where the πα are momenta conjugate to the Gaussian centers and μα are a set of fictitious masses associated with these momenta.

While this Hamiltonian allows us to sample the Gaussian centers, however, it does not guarantee that the sampling will be properly weighted by the marginal probability. In order to achieve the latter, we need to create an adiabatic decoupling between the physical variables (r,p) and the extended variables (s,π). Such an adiabatic decoupling will ensure that the physical variables sample as much of their phase space as possible in a local region of the CGVs, which effectively performs the integration that defines the marginal in this local region of the CGV space. The adiabatic decoupling is achieved by choosing the μα to be sufficiently large that the CGVs move slowly compared to the physical variables. In addition, we need to accelerate the sampling, which we can achieve by choosing the temperature associated with the CGVs, Ts >> T. The equations of motion take the form

where we implicitly include heat bath terms at temperature T and Ts to the physical and CGVs, respectively. Under the adiabatic and high-temperature conditions, it can be proved that distribution generated, which we denote Padb(s1,…,sn,T,Ts), is related to the true marginal by

From this relation, it can be seen that

Another approach for constructing the free energy is to use the adiabatic decoupling to generate locally averaged free energy gradients given by

and then use this information to construct the full free energy surface, which can be achieved by expanding the free energy function in a basis set. The following references contain details of the adiabatic free energy dynamics approach:

Rosso et al. J. Chem. Phys. 116, 4389 (2002).
Maragliano and E. Vanden-Eijnden Chem. Phys. Lett. 426, 168 (2006).
B. Abrams and M. E. Tuckerman J. Phys. Chem. B 112, 15742 (2008).
Chen et al. J. Chem. Phys. 137, 024102 (2012).
Cuendet and M. E. Tuckerman J. Chem. Theor. Comput. 10, 2975 (2014).
T. Tzanov et al. J. Phys. Chem. B 118, 6539 (2014).

Ref. 4 also describes an approach for combining the adiabatic free energy dynamics approach with the popular metadynamics method.

In Ref. 6, we employed adiabatic free energy dynamics to study the four-dimensional free energy surface of the dipeptide N-acetyl-tryptophan-methyl-amide using a range for force fields. The accompanying figure shows the molecule and the corresponding CVs, which are the dihedral angles shown in the figure, the major free energy basins in the gas and solution phases, and the comparison score between each force-field prediction and experiment.

Note that without the benefit of enhanced sampling, it would have been intractable to run simulations for each force field long enough to achieve full convergence of the four-dimensional free energy surface.

We have also applied the adiabatic free energy dynamics scheme to the alanine decamer in the gas phase, again using the 20 backbone Ramachandran angles as CVs. The free energy minimum for this system is an alpha-helix. The accompanying figure shows the RMSD from the alpha-helix along four independent trajectories when standard MD is used and when the adiabatic free energy dynamics approach is employed. The figure clearly shows the dramatic acceleration in sampling achieve using the latter enhanced sampling scheme.

Trajectories of the alanine decamer in the gas phase with standard MD (red) and adiabatic free energy dynamics in the extended phase-space formulation (d-AFED, black) for four independent trajectories.

Navigating high-dimensional free energy surfaces via the stochastic activation-relaxation technique (START): As an alternative to enhanced sampling via adiabatic free energy dynamics, it we have also developed a technique to navigate on the free energy surface and identify free energy minima and index-1 saddle points (which we generally term “landmarks”). The method is described in Chen et al. Proc. Natl. Acad. Sci. U.S.A. 112, 3235 (2015). Briefly, it is possible to use a combination of gentlest ascent dynamics and minimization, e.g., steepest descent, to locate these landmarks. The approach falls into the general category of heterogeneous multiscale modeling, in which fully atomistic restrained molecular dynamics is used to generate local free energy gradients and Hessian matrix elements, and these are then used to advance the CGVs either along an activation trajectory to escape a free energy minimum or a relaxation trajectory to proceed from a saddle to a minimum. We express this via the first-order equations:

for activation and

for relaxation. The local gradients and Hessian elements are computed from restrained simulations via the relations

where the averages are computed at fixed values of the CGVs, and

Since the averages are over simulations of finite length, they have inherent sampling error, which gives causing the optimization (activation and relaxation) to be stochastic in nature. Thus, in order to identify the minima and saddles, a clustering approach is needed. One possibility is the DBSCAN approach, or alternatively, the scheme of Rodriguez and Laio, Science 344, 1492 (2014) can be employed.

As an application of START, we generated the landmarks of the five-residue oligopeptide met-enkephalin (Tyr-Gly-Gly-Phe-Met), shown in the accompanying figure, in the gas phase.

The CVs are the ten backbone Ramachandran dihedral angles. START was able to locate 1081 minima and 1431saddle points. These minima and saddles are collected into a graph with minima represented as blue nodes, saddles as red nodes, and edges connect nodes if a segment of a START trajectory connects them. This graph is shown in the left panel of the accompanying figure. The large number of landmark points makes interpreting this graph difficult. In order to aid in understanding the structure of the free energy surface, we employ the sketch-map technique [Ceriotti et al. Proc. Natl. Acad. Sci. 108, 13023 (2011)] to sort the graph nodes according to a feature space of the configurations obtained. Sketch-map introduces an optimization function χ² = Σ_ij[F(R_ij)-f(r_ij)], where F is a sigmoid function of R_ij, the distance between configurations and f is a sigmoid function of r_ij, the distance between graph nodes. Optimizing this function maps configurations that are far apart in configuration space to nodes that have a larger distance in the graph. The right panel of the figure shows the graph nodes after sketch-map is applied.

Once the landmarks are identified, the adiabatic free energy dynamics method can be applied to the landmarks in order to obtain the free energy differences. These differences can be converged for the ~2500 landmarks obtained within a few hundred nanoseconds much more rapidly than a search on the fully ten-dimensional landscape. The actual free energy differences are shown in the upper left corner of the left panel while a correlation plot of free energy differences obtained between a 200 ns run and a 500 ns run of d-AFED on this system is shown in the lower left corner of the left panel.

An important advantage of START is that the free energies can be obtained both in the gas phase and in solution phase. An illustration of this is provided in the accompanying figure, which shows the free energy surface as a function of the Ramachandran angles of the solvated alanine dipeptide using the OPLS/AA force field generated via d-AFED together with the landmarks generated via a START calculation.