Introduction

Physical system

The Sars-Cov-2 virus is organised in two polyproteins which contain shorter proteins. We focus on one of the two main protease proteins, $M^{pro}$. They play a fundamental role in the replication process of the virus, by cutting the polyproteins in shorter chains. The $M^{pro}$ protein is one of the most important drug targets, due to the fact that it is absent in human body and that it is directly involved in the virus replication process (source).

From the crystallographic structure the $M^{pro}$ results to be divided in two subunits (monomers). Each subunit consists of the same sequence of 306 aminoacids. Each one has an active site, known as the catalytic dyad, which consists in two residues: H41-C145. Structurally, it comprehends three domains: Domain I: residues 8-101, Domain II: residues 102-184, Domain III: residues 201-306. There is also a long loop, residues 185-200, which connects domains II and III.

Molecular Dynamics

As mentioned before, all the trajectories are obtained by means of Molecular dynamics simulations. It is then worth to briefly explain this method. In the system under study there are not dissipative forces, so Hamiltonian formalism can be employed. The equations of motion are obtained by taking the time derivative of the generalized coordinates and momenta $$\dot{q}_{\alpha} = \frac{\partial \mathcal{H}}{\partial p_{\alpha}} \quad \dot{p}_{\alpha} =- \frac{\partial \mathcal{H}}{\partial q_{\alpha}} .$$ These are first order differential equations and the algorithms we are going to introduce are meant to numerically solve them. The positions and the velocity of each atom have to be found as a function of time and by keeping the total energy constant. This means the total derivative of the Hamiltonian is $$\frac{d \mathcal{H}}{dt} = 0 .$$ The average value of a macroscopic observable A can be computed with $$A = \langle a \rangle = \frac{1}{M}\sum_{n=1}^{M} a(x_{n \Delta t}),$$ where $\Delta t$ is the time step and $a(x_{n \Delta t})$ is the macroscopic parameter evaluated at each time step on each element of the ensemble which is visited by the system.

Velocity Verlet algorithm (1982)

This is the algorithm employed in our simulation. The main difference between this algorithm and the Verlet is that the positions and velocities are updated at the same time. The starting point is the Taylor expansion truncated to the second order: $$ \vec{r}_{i} (t + \Delta t) = \vec{r}_{i}(t) + \vec{\dot{r}}_{i}(t) \Delta t + \frac{(\Delta t)^{2}}{2m_{i}}\vec{F}_{i}(t). \label{vv_1}$$ Due to the time reversibility, which is guaranteed by the Hamiltonian formalism, it is possible to write $$ \vec{r}_{i} (t) = \vec{r}_{i}(t+\Delta t) - \vec{\dot{r}}_{i}(t+\Delta t) \Delta t + \frac{(\Delta t)^{2}}{2m_{i}}\vec{F}_{i}(t + \Delta t). \label{vv_2}$$ If $\vec{r}_{i} (t+ \Delta t) $ in \eqref{vv_1} is substituted in \eqref{vv_2} the following relation is obtained: $$\vec{v}_{i} (t + \Delta t) = \vec{v}_{i}(t) + \frac{( \Delta t)^{2}}{2m_{i}}[\vec{F}_{i}(t + \Delta t)+\vec{F}_{i}(t + \Delta t)].$$ To run a Molecular dynamics simulation some parameters are needed.

Time step $\Delta t$
Initial positions, which are provided by crystallographic experiments and collected in the PDB file
Initial velocities, which are randomly taken from a Boltzmann distribution at temperature $T$.

Project structure

The crystallographic structures are available for many mutations of the $M^{pro}$, but the dynamics of the protein has not been explored yet. The aim of this project is to explore the dynamics of the $M^{pro}$, in particular of the 8DFN mutation.

Part 1: Molecular Dynamics

To explore the role of the mutation in the dynamics of the protein, the wild type structure is compared to the mutated one. In this case, the protein from the 7SI9 complex was taken as the wild type reference, after discarding the additional ligand present in the PDB file.

Given that the protein is composed of two monomers, we aimed to investigate how the quaternary interactions of the protein affect the dynamics of the subunits. For achieving this, each full protein evolution is compared to the evolution of its respective single subunit. As a result, a single simulation run consists of 4 trajectories.

During the evolution, the system does not explore all the accessible microstates. Hence, starting from the same initial configurations, different trajectories are obtained. Because of this, we ran two repetitions for each trajectory and compared them. In summary, by following the workflow of this project 8 different trajectories are produced. To avoid confusion, a short nomenclature was employed to name the different trajectories, which is described as follows:

Trajectory name	Protein	Subunits	Repetition
mt1_rep0	mutated-type	1	first
mt1_rep1	mutated-type	1	second
mt2_rep0	mutated-type	2	first
mt2_rep1	mutated-type	2	second
wt1_rep0	wild-type	1	first
wt1_rep1	wild-type	1	second
wt2_rep0	wild-type	2	first
wt2_rep1	wild-type	2	second

Part 2: General Analysis

In the general analysis we focus on the equilibrium configurations and on the conformational changes of all the trajectories. The methods shown are Ramachandran Map (RAMA), Root Mean Square Deviation (RMSD), Root Mean Square Fluctuations (RMSF), Radius of Gyration (RGYR), Contact Map (CMAP) and Block Analysis (BSE).

Part 3: Detailed Analysis

The following part of the analysis focuses mainly on one repetition of the mutated type protein, mt2_rep0. The residues from and near the active site are also explored in more detail for all repetitions. For this section, the usage of RMSD values is expanded by introducing clustering methods, namely Hierarchical Clustering and Principal Components Analysis (PCA). Additionally, an enhanced way of visualizing CMAP values is presented, denominated here Dynamic Contact Map Differences (DCMD). Some observations are noted for the active site of the protein. The Pyinteraph methodology is also introduced.

Part 4: Solvent Analysis

The last part of the analysis focuses on the behaviour of water during the simulation. A new methodology named Water Average Local Densities (WALD) is proposed to calculate local densities of water molecules. The analysis is further complemented by means of the Solvent Accessible Surface Area (SASA) command provided by GROMACS.