Screening of Potential Candidates for SARS-CoV-2 Spike Glycoprotein using Bioinformatics approach

Abstract Background: Covid-19 is a pandemic disease caused by the novel severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2). It deceased more than three lakh people and infected millions of people across the globe till date. The disease is highly contagious, spreading from person to person through direct contact or through the aerosols released by coughing or sneezing. To reduce the chain of infections, social distancing and sanitary measures are followed. Since, drugs and vaccines are under development, screening of potential candidates using bioinformatics approach may provide an instant solution to stop further deaths and spread of the disease. Methods: Orthologous proteins of SARS-CoV-2 were predicted with reciprocal best blast hit to identify the protein with large evolutionary difference. Homologous region of the evolutionary difference in 3D structure of the proteins was detected using multiple sequence alignment (MSA). The candidates from DrugBank exhibiting high binding affinities with the homologous region as well as with the entire protein structure were uncovered through docking analysis. Results: The orthologous proteins showed the surface spike glycoprotein as the protein with large evolutionary difference. The MSA detected GLN96, LEU97, GLN98, ALA99 and LEU100 amino acid sequence in the spike glycoprotein as the homologous region of evolutionary difference. Docking analysis revealed Remdesivir and TMC_310911 to exhibit the highest binding affinities with the homologous region and the entire protein respectively. Conclusions: The present study uncovers Remdesivir and TMC_310911 as the highly potential candidates for Covid-19. Further investigations may lead to prevention of infections and mortality due to the disease.


INTRODUCTION
Coronavirus disease 2019 (Covid-19) is a new coronavirus disease caused by severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2) as highlighted by the World Health Organization (WHO) in the Situation Report -94. The disease was first reported in Wuhan city, China in December 2019 [1]. It became pandemic by spreading within few months to more than 200 countries in the world according to the data updated by Worldometers (https://www.worldometers.info/coronavirus/), a freely accessible public database confirmed by WHO. As on May 22, 2020, it killed 335,203 people and infected more than five million people across the globe (Worldometers). The Centers for Disease Control and Prevention (CDC) states that the fatal disease spreads through direct contact or with the aerosols produced by coughing, sneezing or talking of the infected patients. The symptoms of the disease include fever, dry cough and fatigue and in severe cases, it cause difficult breathing, loss of organs functioning leading to death (WHO). Till now, there is no drug available to treat the disease and most of the vaccines are under development. Therefore, the WHO advised the people to wash hands regularly, maintain social distance of at least 3 feet, avoid touching eyes, nose and mouth and to practice respiratory hygiene. The SARS-CoV-2 contains RNA genetic material and 12 proteins which include surface spike glycoprotein, envelope protein, membrane glycoprotein, nucleocapsid phosphoprotein and other viral proteins [2]. The spike glycoprotein help the virus to attach to cell specific surfaces and mediate the transmission in human hosts [3]. Inhibition of spike mediated entry could prove key in preventing the virus infection and the disease [4]. Since the genomic and proteomic data related to SARS-CoV-2 are deposited in the public repositories and currently the available drugs are being explored, the present study makes an attempt to use the bioinformatics approach to screen some of the potential candidates for spike glycoprotein and identify suitable candidate for the disease.

Prediction of orthologs
Proteome sets of SARS-CoV-2 and related genomes were retrieved from NCBI genome (https://www.ncbi.nlm.nih.gov/genome). The proteome sets were used to predict orthologous proteins of SARS-CoV-2 using reciprocal best BLAST hit (RBBH) [5]. The orthologous proteins were filtered with E-value and sequence coverage of 10e-2 and 30 percent, 10e-3 and 40 percent and 10e-5 and 50 percent respectively. The results were analysed to infer the proteins with significant evolutionary changes.

Prediction of regions of evolutionary differences
A set of spike glycoprotein including proteins from its closely related genomes was used to implement multiple sequence alignment (MSA) using Clustal Omega [6]. The resulting alignment was used to identify the regions of evolutionary differences such as homologous substitutions, insertions and deletions in the spike glycoprotein [7]. Then, the 3D protein structure of the spike glycoprotein (6LZG) [8] was extracted from RCSB Protein Data Bank. Earlier protein set along with the amino acid sequence of the 3D protein structure was used to implement again the MSA. The resulting new alignment was used to identify the homologous regions of the sites of evolutionary differences in the protein structure.

Docking analysis
The list of potential candidates for Covid-19 was obtained from DrugBank [9] white paper updated on March 26, 2020. The chemical names, structures and composition of these candidates is shown in the following Table 1.
The potential candidates in PDB format were downloaded from DrugBank database. Docking analysis was performed through AutoDock Vina [10] using spike glycoprotein as the protein molecule and the potential candidates as the ligand molecules. In the AutoDock software, the protein molecule was prepared by removing water, ions, ligands and non-amino acid atoms from the 3D structure. Polar hydrogens and Kollman charges were added to the protein structure and then saved as PDBQT file. The ligand molecules were prepared by adding Gasteiger charges to the 3D structures. The number of rotable bonds in the ligand structures was identified and then saved as PDBQT files. Size and position of the GridBox covering the regions of evolutionary differences was identified to predict the interaction and binding energy between the ligand molecules and the amino acids in the mutations sites. Similarly, size and position of the GridBox covering the whole protein was identified separately to predict the interaction and binding energy between the ligand molecules and the binding sites of the entire protein molecule. The grid coordinates, PDBQT file of the protein and PDBQT file of each of the ligands was used to implement dock analysis using AutoDock Vina. Docking analysis is implemented separately in the homologous region of the evolutionary differences and in the entire protein.

Orthologs of SARS-CoV-2 proteins
The number of SARS-CoV-2 and related genomes available at NCBI genome were 44. Implementation of RBBH produced 6,989 orthologs of total 12 proteins in SARS-CoV-2. Filtering of the orthologs with different E-values and coverage as mentioned in the methods section, different sets of orthologous proteins were obtained. However, after filtering, only 10 proteins of SARS-CoV-2 had orthologous proteins while the other two proteins viz., ORF7b and ORF10 had no protein orthologs. The comparison of number of orthologs with different E-values and coverage can be visualized from the Fig. 1. The figure clearly shows that the significant evolutionary changes occurred in spike glycoprotein, membrane glycoprotein, orf1a polyprotein and nucleocapsid phosphoprotein. However, the number of orthologous proteins of YP_009724390.1, the surface spike glycoprotein, was drastically reduced when both the E-value cut off reduced from 10e-2 to 10e-3 and the coverage increased from 30 percent to 40 percent. This implies that significant evolutionary changes were occurred in the spike glycoprotein.

Regions of evolutionary difference in spike glycoprotein
It was observed from the orthologs analysis that the amino acid sequence of SARS-CoV-2 spike glycoprotein is closely related to spike glycoprotein of Bat coronavirus BM48-31/BGR/2008 and E2 glycoprotein precursor of severe acute respiratory syndrome-related coronavirus (SARS-CoV). This suggests the probable origin of SARS-CoV-2 from these genomes. MSA of these proteins revealed several homologous substitutions, few insertions and very few deletion regions. However, MSA of these proteins along with the amino acid sequence of the 6LZG protein PDB structure produced only an insertion region in the spike glycoprotein sequence from positions 680 to 684 with the amino acid sequence stretch of SPRRA. The region of insertion in the MSA is represented in the Fig.  2. The homologous substitution amino acids at the insertion regions of the protein structure were found to be GLN96, LEU97, GLN98, ALA99 and LEU100.

Docking results
Docking analysis in the homologous region of insertion produced nine best conformations for each of the ligands except for Azithromycin and TMC_310911 which had only two and four best conformations respectively. Top conformation of each of the ligand was the best conformations with least binding energy and zero root-meansquare deviation (RMSD) value. Similarly, docking analysis in the whole protein region for each of the ligands produced nine best conformations and the top conformation was the best with least binding energy and zero RMSD value. The following Table 2 shows the binding energies of the best conformation of each ligand.   12 Umifenovir -8.0 -6.9 The table depicts that the Remdesivir has the best binding conformation with the least binding energy in the homologous region of insertion while TMC_310911 has the best binding conformation with the least binding energy when the entire protein was consider. The binding sites of Remdesivir in the spike glycoprotein were GLN98, ALA99, GLN102, ASN210 and LYS562 amino acids. Similarly, the binding sites of TMC_310911 in the spike glycoprotein were THR371, SER409 and ILE291 amino acids. It can also be observed from the table that the TMC_310911 too has a better binding conformation in the homologous region of insertion. Therefore, the present study uncovers both the Remdesivir and TMC_310911 as the highly potential candidates for the spike glycoprotein. Docking results of the Remdesivir and TMC_310911 can be visualized in the Fig. 3 produced using PyMOL [11].

CONCLUSION
The drastic evolutionary changes were observed in the spike glycoprotein indicating that it has some special function in the pathogen. The docking analysis revealed Remdesivir as the candidate with the best binding affinity in the homologous region of the insertion site while TMC_310911 as the candidate with the best binding affinity in the entire spike glycoprotein. Thus, the Remdesivir and the TMC_310911 are the highly potential candidates for SARS-CoV-2 spike glycoprotein. Therefore, the present study supports further investigation, experimentation and clinical trials of the Remdesivir and the TMC_310911 to fight against Covid-19 saving people and thus health and wealth of a country.