A combined protein toxin screening based on the transcriptome and proteome of Solenopsis invicta

Species annotation

The S. invicta which have ten antennomeres with two club segments, copper brown in body and head, and dark in abdomen with sting, were growing massively in China and could be roughly divided into two stages (Fig. 1A and B). The period from the beginning of the invasion to 2008 was the initial stage of diffusion with the slow spread rate. After 2008, it entered the period of rapid diffusion (Fig. 1B). The rapid trend of S. invicta invasion has piqued our interest, whose transcriptome and proteome were also successfully constructed in our study. At the transcriptome level, the assembly quality of unigenes was evaluated by GC content and unigene length. The proteins of samples passed the quality inspection through evaluating the sequence coverage (%) distribution, peptide length distribution, peptide counts distribution, and the relative molecular weight distribution (Fig. S1). NR database showed the sequence with 80% ~ 100% similarity was the most abundant, accounting for 87.68% (14,203), followed by the sequence with 60% ~ 80% similarity, accounting for 8.12% (1315) (Fig. 1C). All the annotated sequences had low e-values, with the ranges of 0, 0 ~ 1e-100, 1e-100 ~ 1e-60, 1e-60 ~ 1e-45, 1e-45 ~ 1e-30, 1e-30 ~ 1e-15, and 1e-15 ~ 1e-5 accounting for 14.67%, 33.23%, 12.78%, 7.86%, 12.46%, 10.69%, and 8.30% of all sequences, respectively (Fig. 1D). Four species S. invicta, Acromyrmex.echinatior, Monomorium.pharaonis, and Cyphomyrmex.costatus were the most matched species, with the similarities of 78.48%, 1.86%, 1.62%, and 1.36%, respectively (Fig. 1E).

Fig. 1figure 1

Species annotation. A S. invicta sample. B The increasing dynamics of the number of S. invicta in China. The y-axis was the number of counties with S. invicta occurrence, and the x-axis was the year. The data were obtained from the Ministry of Agriculture and Rural Affairs of the People’s Republic of China (http://www.moa.gov.cn/). C Similarity distribution. D E-value distribution. E NR species distribution

Comparative GO analysis and KEGG pathways

A total of 33,231 unigenes and 721 proteins were obtained from the successfully constructed transcriptome and proteome. Only 9,842 unigenes (29.62%) and 469 proteins (65.05%) were annotated against the GO database, and proteins had 2.20 times as the annotation percentage as unigenes. While only 4,844 unigenes (14.58%) and 717 proteins (99.45%) were annotated against the KEGG database, and proteins had 6.82 times as the annotation percentage of unigenes. Gene analysis of unigenes and proteins was performed by using the default parameters of Blast2GO software [32]. GO was classified into the following three categories: BP, CC, and MF. By selecting the unigene number in the top fifteen for mapping, we found that biological process and oxidation reduction process were the two richest terms in BP, with 451 (1.36%) and 358 (1.08%) in unigenes, and with 31 (4.30%) and 43 (5.96%) in proteins, respectively. Cytoplasm and nucleus were the two most abundant terms in CC, with 1656 (4.98%) and 1605 (4.83%) in unigenes, and with 134 (18.59%) and 71 (9.85%) in proteins, respectively. Extracellular space had the highest ratio in CC, with 6.78 times as the percentage of annotated proteins as unigenes. In MF, protein binding, molecular function and ATP binding were the three most abundant terms, with 711 (2.14%), 628 (1.89%) and 551 (1.66%) in unigenes, and with 66 (9.16%), 44 (6.10%), and 38 (5.27%) in proteins, respectively. Calcium ion binding had the highest ratio in MF, with 6.21 times as the percentage of annotated proteins as unigenes (Fig. 2A).

Fig. 2figure 2

Comparative Gene Ontology (GO) analysis and KEGG pathways. A Comparative Gene Ontology (GO) analysis of unigenes and proteins. “regulation of transcription…”, “positive regulation…”, “regulation of transcription…”, “negative regulation…”, “integral component…”, “integral component of…”, “structural constituent…”, “protein homodimerization…”, and “DNA binding transcription…” in the figure actually represented as “regulation of transcription, DNA-templated”, “positive regulation of transcription by RNA polymerase II”, “regulation of transcription by RNA polymerase II”, “negative regulation of transcription by RNA polymerase II”, “integral component of membrane”, “integral component of plasma membrane”, “structural constituent of ribosome”, “protein homodimerization activity”, and “DNA binding transcription factor activity”, respectively. B Cellular processes. “Transport…” and “Cellular community…” in the figure actually represented as “Transport and catabolism” and “Cellular community—eukaryotes”. C Genetic information processing. “Folding, sorting…” and “Replication…” in the figure actually represented as “Folding, sorting and degradation” and “Replication and repair”. D Environmental information processing. “Signaling molecules…” in the figure actually represented as “Signaling molecules and interaction”. E Metabolism. “Carbohydrate…”, “Amino acid…”, “Nucleotide…”, “Metabolism of co…”, “Glycan…”, “Xenobiotics…”, “Metabolism of other…”, “Metabolism of ter…”, and “Biosynthesis…” in the figure actually represented as “Carbohydrate metabolism”, “Amino acid metabolism”, “Nucleotide metabolism”, “Metabolism of cofactors and vitamins”, “Glycan biosynthesis and metabolism”, “Xenobiotics biodegradation and metabolism”, “Metabolism of other amino acids”, “Metabolism of terpenoids and polyketides”, and “Biosynthesis of other secondary metabolites”, respectively. F Organismal systems. “Environmental…” in the figure actually represented as “Environmental adaptation”. G Human diseases. “Endocrine…”, “Neurodegenerative…”, and “Cardiovascular…” in the figure actually represented as “Endocrine and metabolic diseases”, “Neurodegenerative diseases”, and “Cardiovascular diseases”. H Level 3 KEGG pathways. “Protein processing…”, “Ubiquitin mediated…”, “MAPK signaling…”, “Ribosome biogenesis…”, “Phosphatidylinositol…”, and “Drug metabolism…” in the figure actually represented as “Protein processing in endoplasmic reticulum”, “Ubiquitin mediated proteolysis”, “MAPK signaling pathway—fly”, “Ribosome biogenesis in eukaryotes”, “Phosphatidylinositol signaling system”, and “Drug metabolism—other enzymes”, respectively

Unigenes and proteins were annotated against the KEGG database with the online automated annotation system KAAS (KEGG Automatic Annotation Server) (http://www.genome.jp/tools/kaas/). The pathways of KEGG were divided into six categories at the level 1 including cellular processes, genetic information processing, environmental information processing, metabolism, organismal systems, and human diseases. Among them, metabolism accounted for 10.18% in unigenes and 52.43% in proteins while human diseases accounted for only 1.10% in unigenes and 4.16% in proteins. At the level 2, in cellular processes, transport and catabolism were the most with 734 unigenes (2.21%) and 75 proteins (10.4%) annotated (Fig. 2B). In genetic information processing, 730 translations (2.20%) were annotated in unigenes, and 45 folding, sorting and degradation (6.24%) were annotated most in proteins (Fig. 2C). In environmental information processing, signal transduction was the most in both unigenes and proteins, with 931 (2.80%) and 50 (6.93%), respectively (Fig. 2D). Metabolism annotated the most abundant unigenes and proteins among the six categories (Fig. 2E). In organismal systems, excretory system had the largest ratio, and the ratio of the proteins was 15.36 times of the unigene annotation (Fig. 2F). In human diseases, infectious diseases were the most abundant, with 98 (0.29%) and 12 (1.66%) annotated in unigenes and proteins, respectively (Fig. 2G). The top one pathway with matched unigene was ribosome (218, 0.66%) whereas the most abundant in proteins was oxidative phosphorylation (28, 3.88%) (Fig. 2H).

Potential toxins screening

A total of 316 and 47 toxin-related proteins were screened at the transcriptome and proteome levels through Blast annotation, respectively (Fig. 3A). At the transcriptome level, Latroinsectotoxin (12.03%), Latrocrustotoxin-Lt1a (12.03%), Serine proteinase/ serine protease (10.13%), Calglandulin (6.01%), Thrombin-like enzyme (5.38%), Venom prothrombin activator (4.75%), Neprilysin (4.11%), Putative lysosomal acid lipase/ cholesteryl ester hydrolase (4.11%), Phospholiase (4.11%), Latrotoxin (3.80%), Acetylcholinesterase (3.48%), Venom acid phosphatase Acph-1 (3.16%), Venom carboxylesterase-6 (2.85%), Venom allergen (1.90%), Reticulocalbin (0.95%), and Others (20.89%) were screened out. 47 proteins were further screened from 316 toxin-related uingenes. Among the toxins with the top fifteen highest richness in the transcriptome, only Latroinsectotoxin (2.13%), Serine proteinase/ serine protease (14.89%), Calglandulin (12.77%), Venom prothrombin activator (10.64%), Neprilysin (10.64%), Venom carboxylesterase-6 (4.26%), Venom allergen (4.26%), and Reticulocalbin (4.26%) were screened out in the proteome (Fig. 3B).

Fig. 3figure 3

Potential toxins screening. A Potential toxins and enzymes were identified at the transcriptome and proteome levels of S. invicta. Red color and blue color represented transcriptome and proteome respectively. B Different colors represented different toxins or enzymes, and the size of the area represented the percentage of each toxin in the total. “Venom carboxy…”, “Venom acid…”, “Serine…”, “Venom prothrombin…” and “Putative…” stood for “Venom carboxylesterase-6”, “Venom acid phosphatase Acph-1”, “Serine proteinase/serine protease”, “Venom prothrombin activator”, and “Putative lysosomal acid lipase/cholesteryl ester hydrolase”, respectively. C The MW and PSMs of toxins. D The PI and amino acid coverage (Coverage) distribution of toxins. E The toxins were comparatively analyzed in transcriptome and proteome, all unigenes and proteins related to GAPDH2 were calculated by log2 (Fold change) and displayed by heatmap. “Venom prothrombin activator… (1)”, and “Venom prothrombin activator… (2)” represented as “Venom prothrombin activator omicarin-C non-catalytic subunit (1)”, and “Venom prothrombin activator omicarin-C non-catalytic subunit (2)”

Characteristics of all toxins were analyzed including the molecular weight (MW), the peptide spectrum matches (PSMs), the isoelectric point (PI), and coverage (%). The MW of toxins were distributed between 13 ~ 303 kD, and the PSMs were mainly concentrated in 1 ~ 15 (Fig. 3C). The PI concentrated between 4.0 ~ 6.0 was accounting for 57.45%. The lowest coverage of amino acid in toxins was Venom prothrombin activator pseutarin-C non-catalytic subunit (TRINITY_DN9182_c1_g14, 0.00%), and the highest was Calglandulin (TRINITY_DN9166_c1_g16, 21.00%) (Fig. 3D). After it, we selected the toxins that both expressed at the transcriptome and proteome levels, and calculated their fold change with the expression of GAPDH2 which is a house-keeping gene. Among them, it showed that Venom allergen 3 (1) (TRINITY_DN8992_c2_g4) and Calglandulin (1) (TRINITY_DN9166_c1_g16) were both highly expressed at the transcriptome and proteome levels. In addition, Kunitz-type serine protease inhibitor TCI (TRINITY_DN8482_c2_g14), Phospholipase A2 (Fragment) (TRINITY_DN9050_c0_g3), Venom prothrombin activator omicarin-C non-catalytic subunit (1) (TRINITY_DN9341_c1_g13), and Snake venom serine protease (TRINITY_DN9155_c0_g11) were lowly expressed in the transcriptome while the opposite in the proteome (Fig. 3E).

Calglandulin

The dominate toxic component calglandulin was naturally performed sequence alignment and 3D modeling with a template (PDB ID: 2F2P) which showed the structure of calmodulin bound to a calcineurin peptide. Calglandulin belongs to the calmodulin family. TRINITY_DN9166_c1_g16 with 2F2P showed a similarity of 34.27%. There was a total of 14 differences between TRINITY_DN9166_c1_g16 and the template, which were also showed in the 3D structure, including 8 coils, 4 turns, and 2 helixes (Fig. 4A and B). The construction of phylogenetic tree intuitively showed close genetic relationship with formicidae and the highest similarity was S. invicta (100.00%) (Fig. 4C).

Fig. 4figure 4

Sequence alignment, 3D modeling and phylogenetic analysis of Calglandulin. A Putative sequence TRINITY_DN9166_c1_g16 was aligned with the template (PDB ID: 2F2P). At the bottom of columns, asterisks (*) showed conserved positions, colons (:) showed conserved substitutions and points (.) showed non-conserved substitutions. Grey line, green bend, blue banded arrowhead, and red solenoid represent coil, turn, sheet, and helix, respectively. Different fragments were marked by red framework. B 3D modeling was modeled by SWISS-MODEL and viewed by Discovery Studio 4.5. The color grey, green, blue, and red represents coil, turn, sheet, and helix, respectively. The differences of TRINITY_DN9166_c1_g16 compared with the template (PDB ID: 2F2P) were marked with black font. C The phylogenetic tree including sequence TRINITY_ DN9166_c1_g16 and other ten sequences from different species was constructed by using MEGA 7 with the Neighbor-Joining method. Putative sequence, Myrmicinae, and Ponerinae were marked as red, green, and yellow circles, respectively

Venom allergen 3

The sequence alignment and 3D modeling of dominate toxic component venom allergen 3 were naturally performed with a template (PDB ID: 2VZN). Because the template of TRINITY_DN8992_c2_g4 with 2VZN showed a high similarity of 100.00%, there was no difference in 3D structure (Fig. 5A and B). The construction of phylogenetic tree intuitively showed that the ten species with the highest similarity all belonged to formicidae and may further shed insights on the molecular evolution of venom (Fig. 5C).

Fig. 5figure 5

Sequence alignment, 3D modeling and phylogenetic analysis of Venom allergen 3 toxin. A Putative sequence TRINITY_DN8992_c2_g4 was aligned with the template (PDB ID: 2VZN). At the bottom of columns, asterisks (*) showed conserved positions, colons (:) showed conserved substitutions and points (.) showed non-conserved substitutions. Grey line, green bend, blue banded arrowhead, and red solenoid represent coil, turn, sheet, and helix, respectively. Different fragments were marked by red framework. B 3D modeling was modeled by SWISS-MODEL and viewed by Discovery Studio 4.5. The color grey, green, blue, and red represents coil, turn, sheet, and helix, respectively. The differences of TRINITY_DN8992_c2_g4 compared with the template (PDB ID: 2VZN) were marked with black font. C The phylogenetic tree including sequence TRINITY_DN8992_c2_g4 and other ten sequences from different species was constructed by using MEGA 7 with the Neighbor-Joining method. Putative sequence, Myrmicinae, Ponerinae, Dolichoderinae, and Dorylinae were marked as red, green, yellow, dark blue, and pink circles, respectively

Venom prothrombin activator hopsarin-D

The sequence alignment and 3D modeling of dominate toxic component venom prothrombin activator hopsarin-D was naturally showed with a template (PDB ID: 4BXW). The template is crystal structure of the prothrombinase complex from the venom of Pseudonaja Textilis. TRINITY_DN9654_c1_g6 with 4BXW showed a similarity of 36.28%. The TRINITY_DN9654_c1_g6 had a total of 18 differences with the template, including 10 coils, 4 turns, 4 helixes (Fig. 6A and B). A phylogenetic tree of S. invicta venom (venom prothrombin activator hopsarin-D) was constructed, and showed different distances between sequence (TRINITY_DN9654_c1_g6) and other species (Fig. 6C).

Fig. 6figure 6

Sequence alignment, 3D modeling and phylogenetic analysis of Venom prothrombin activator hopsarin-D. A Putative sequence TRINITY_DN9654_c1_g6 was aligned with the template (PDB ID: 4BXW). At the bottom of columns, asterisks (*) showed conserved positions, colons (:) showed conserved substitutions and points (.) showed non-conserved substitutions. Grey line, green bend, blue banded arrowhead, and red solenoid represent coil, turn, sheet, and helix, respectively. Different fragments were marked by red framework. B 3D modeling was modeled by SWISS-MODEL and viewed by Discovery Studio 4.5. The color grey, green, blue, and red represents coil, turn, sheet, and helix, respectively. The differences of TRINITY_DN9654_c1_g6 compared with the template (PDB ID: 4BXW) were marked with black font. C The phylogenetic tree including sequence TRINITY_DN9654_c1_g6 and ten other sequences from different species was constructed by using MEGA 7 with the Neighbor-Joining method. Putative sequence, Myrmicinae, Dorylinae, and Pseudomyrmecinae were marked as red, green, pink, and light blue circles, respectively

留言 (0)

沒有登入
gif