Only three genes (rfaG, nifH, and selD) show no significant correlation between evolutionary rate and GC content. For the remaining genes, GC content increases with Ka, although the coefficient of determination of the linear fit differs from gene to gene.
Figure 2. Higher GC content is associated with higher Ka values. The red points are genes from MmarS2. The deviation of amino acids is the amino acid frequency of a gene minus the amino acid frequency of its homologous gene in MmarS2. We further observed that the sizes of the genomes containing these homologous genes also correlate positively with the GC contents of these genes (Supplementary Figure S1).
The amino acid compositions that significantly influence the evolutionary rate differ from gene to gene. Large genomes have more proteins to translate, and amino acids differ in the energy and material cost of their synthesis, so some amino acids are much more expensive than others.
Thus, large genomes tend to employ cheaper amino acids, cheap amino acids tend to be GC rich, and genes with higher GC content tend to be highly expressed (Chen et al.). Amino acid composition reflects the action of natural selection to enhance metabolic efficiency, and cheaper amino acids tend to be encoded by codons with high GC content.
Consequently, we observe a positive linear relationship between GC content and Ka, which may result from maximizing metabolic efficiency. Although some of genes from MmarS2 ileS vs. Linear regression models between evolutionary rates (Ka) and the amino acid compositions of LUCA as well as non-LUCA proteins were constructed using principal component regression (Supplementary Table S4) to address the multicollinearity among amino acid compositions.
First, six principal components were extracted from the 20 amino acid composition values, and linear regression was then performed between the scores of the six principal components and the Ka values. For all of these homologous genes, the first principal component correlates very strongly with GC content. The mean of average R is 0. Amino acids encoded by GC-rich codons load heavily on the first principal component, which indicates that this component represents GC content and that GC content is the main factor determining the evolutionary rate.
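The two-step procedure described above (principal components first, then regression on Ka) can be sketched as follows. This is a minimal illustration on synthetic data, not the authors' pipeline: the function name, the matrix layout (genes in rows, 20 amino acid frequencies in columns), and the simulated Ka values are all assumptions.

```python
import numpy as np

def principal_component_regression(X, y, n_components=6):
    """Regress y on the leading principal components of X.

    X: (n_genes, 20) amino acid composition matrix (assumed layout).
    y: (n_genes,) Ka values.
    Returns the component loadings, the fitted coefficients, and R^2.
    """
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    Xc = X - X.mean(axis=0)               # center each amino acid column
    # SVD of the centered matrix: rows of Vt are the principal axes
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
    loadings = Vt[:n_components]          # shape (n_components, 20)
    scores = Xc @ loadings.T              # uncorrelated component scores
    # Ordinary least squares on the scores, with an intercept column
    design = np.column_stack([np.ones(len(y)), scores])
    coef, *_ = np.linalg.lstsq(design, y, rcond=None)
    fitted = design @ coef
    ss_res = np.sum((y - fitted) ** 2)
    ss_tot = np.sum((y - y.mean()) ** 2)
    return loadings, coef, 1.0 - ss_res / ss_tot

# Synthetic example: 60 genes, compositions drawn so each row sums to 1
rng = np.random.default_rng(0)
comp = rng.dirichlet(np.ones(20), size=60)
ka = 0.5 + 2.0 * comp[:, 0] + rng.normal(0, 0.01, 60)   # made-up Ka values
loadings, coef, r2 = principal_component_regression(comp, ka)
```

Because the component scores are mutually uncorrelated, the regression avoids the instability that arises when 20 compositions that sum to one are used directly as predictors. Inspecting which amino acids carry the largest entries in `loadings[0]` is one way to check whether the first component tracks GC-rich residues.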
Table 2. Linear models between evolutionary rates and amino acid compositions. Next, to examine amino acid gain and loss during evolution in relation to GC content, we investigated how the composition of GC-rich and AT-rich amino acids varies among homologous proteins. Compared with the proteins of MmarS2, the content of an amino acid may increase or decrease in the corresponding homologous proteins.
According to the degree of GC content deviation, we classified the homologous genes from different genomes into four groups. The deviation in composition of GC-rich amino acids correlates positively with the deviation of GC content, whereas the deviation in composition of AT-rich amino acids correlates negatively with it (Figure 2C).
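The deviations compared here can be computed as a homolog value minus the reference (MmarS2) value. The sketch below assumes a particular grouping that is not stated in the text: G, A, R, and P as GC-rich amino acids and F, Y, M, I, N, and K as AT-rich ones. The function names and the toy sequences are likewise illustrative.

```python
from collections import Counter

# Assumed grouping (not taken from the source): amino acids encoded mostly by
# GC-rich codons versus those encoded mostly by AT-rich codons.
GC_RICH = set("GARP")
AT_RICH = set("FYMINK")

def gc_content(dna):
    """Fraction of G and C bases in a DNA string."""
    dna = dna.upper()
    return (dna.count("G") + dna.count("C")) / len(dna)

def group_fraction(protein, group):
    """Fraction of residues in a protein that belong to the given group."""
    counts = Counter(protein.upper())
    return sum(counts[aa] for aa in group) / len(protein)

def deviations(ref_dna, ref_protein, hom_dna, hom_protein):
    """Homolog minus reference, for GC content and both amino acid groups."""
    return {
        "gc": gc_content(hom_dna) - gc_content(ref_dna),
        "gc_rich_aa": group_fraction(hom_protein, GC_RICH)
        - group_fraction(ref_protein, GC_RICH),
        "at_rich_aa": group_fraction(hom_protein, AT_RICH)
        - group_fraction(ref_protein, AT_RICH),
    }

# Toy pair: an AT-rich reference gene and a GC-rich homolog
d = deviations("ATGAAATTTATT", "MKFI", "ATGGCGCCGGGC", "MAPG")
```

Collecting these three deviations over many homolog pairs and correlating `gc` with each amino acid group would give the kind of positive and negative relationships described above.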
It can therefore be deduced that genomes with higher GC content tend to have higher proportions of GC-rich amino acids. In conclusion, LUCA proteins evolve under a strong influence of GC content, which is probably selected for the metabolic efficiency of amino acids.
GC-rich amino acids tend to increase as the GC content of protein-coding genes increases. These results indicate that GC content is a strong factor promoting protein evolution and shaping amino acid composition.
However, the recruitment order of amino acids has been considered a main factor determining the direction of amino acid substitution (Jordan et al.). The GC features of each amino acid were determined from its codons (Osawa et al.). Correlation analysis between these features shows that GC content can affect evolutionary rates in the opposite direction to cost, molecular weight, and recruitment order. The energy cost of synthesizing amino acids (Akashi and Gojobori) and the molecular weight of amino acids are also important factors influencing evolutionary rates.
However, their correlation with the gain and loss of amino acids is very weak (not significant). We then investigated the effect of amino acid recruitment order on evolutionary rates. Sixty-five genomes were used in the following analysis. First, we examined whether the ancient amino acids have higher frequencies than the newly recruited amino acids in all of these genomes.
Figure 3. Only in In (C) Boxplots of the deviation of amino acids for ancient amino acids and newly recruited amino acids.
(D) The recruitment order and GC content features of the 20 standard amino acids. GC-rich amino acids have a value of 2, and AT-rich amino acids have a value of 0. The early recruited amino acids are numbered from 1 to 10, and the newly recruited amino acids from 11 onward. These proteins were used in panels A and B.
Although new amino acids are present at lower levels, they may show larger deviations. Thus, the low abundance of new amino acids does not necessarily reduce their chance of being replaced.
Figure 4. Amino acid compositions of possible LUCA organisms and all other microorganisms. Clostridium (48 genomes) and Methano (51 genomes) are possible LUCA organisms, and Others comprises genomes of other bacteria and archaea. The Methano group has fewer Q, H, and W residues than the others.
Amino acids C and M are slightly more frequent in Methano and Clostridium, possibly because these organisms contain more sulfur or live in an environment without higher sulfur levels. Methano here is short for Methanoproducents. Next, we investigated whether the recruitment order influences the gain and loss of amino acids during evolution. The correlation analysis result showed that only in For comparison, it was shown that in The GC content of amino acids has more effect on amino acid composition than the recruitment order of amino acids.
To verify this further, we compared the gain and loss of the four earliest and four latest amino acids in homologous proteins with various degrees of GC variation (Supplementary Table S5). The four oldest amino acids (Asp, Ser, Glu, and Leu) tend to be lost more often than gained, whereas the four newest amino acids (Gln, His, Cys, and Trp) tend to be gained during evolution (Figure 3C).
When GC content changes by 0 to 5 percent, old amino acids tend to be lost and new amino acids tend to be gained. When GC content changes by more than 5 percent, the composition of these amino acids is largely determined by GC content.
The GC content of codons influences the variation in composition of the corresponding amino acids. Each genome has a preferred GC content, which drives the variation of amino acid composition more than the recruitment order of amino acids does. Old residues are used at high levels in all proteins, and new ones at rather low levels. However, lower usage of an amino acid might itself lower its chance of being replaced.
Thus, more investigation is needed. According to the amino acid recruitment order, these amino acids were further divided into two subgroups: group new, which includes Q, H, C, and W, and group old, which includes D, V, S, E, L, and T. Thus, lower usage of an amino acid does not necessarily lower its chance of being replaced.
The conclusion reached here is therefore reliable. Principal component analysis showed that the first six components of the compositions of the 20 amino acids can be used to build significant linear models. The first principal component correlates well with the GC content of genes. Further analysis found that GC content correlates well with the Ka of genes, and that the loss and gain of amino acids change along with GC content.
The frequencies of amino acids encoded by GC-rich codons correlate positively with the deviation of GC content, whereas the frequencies of amino acids encoded by AT-rich codons correlate negatively with it. These results support a strong effect of GC content on protein evolution.
Next, we found that the recruitment order of amino acids affects amino acid composition during evolution, but its effect is weaker than that of GC content. Finally, as a feature of amino acids, GC content has a stronger effect on protein evolution than other important features such as recruitment order and cost.
Previously published research explained the gain and loss of amino acids during evolution with a neutral hypothesis, claiming that the trend in protein evolution was not driven by any simple trend at the DNA level (Jordan et al.).
A second bias is in the composition of the few base pairs around the fragment ends. It has been described before in RNA-seq (9). The relative frequency of nucleotides follows a position-specific pattern, starting roughly four bases before the fragment and ending 8 to 9 bases inside it (Figure 6, left).
Note that G and C are differently preferred, and so is A compared with T. The fragment GC effect described before can also be seen: the small preference of G and C between 20 and reflecting fragment sizes. Rates stratified by dinucleotide counts differ significantly from those stratified by single nucleotides.
In particular, the dinucleotide on which fragment rates depend the most is the pair surrounding the fragment end (the breakpoint), shown on the right. Fragments are much more likely to start within a CpG dinucleotide than within any other dinucleotide.
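One way to examine the breakpoint dinucleotide is to compare how often each pair spans a fragment start with how often it occurs in the genome overall. The sketch below is an illustration under assumptions: fragment starts are taken as 0-based positions, the breakpoint pair is defined as the base before the start plus the first base of the fragment, and the genome string and start positions are toy data.

```python
from collections import Counter

def breakpoint_dinucleotide_enrichment(genome, starts):
    """Observed/background frequency ratio of the dinucleotide spanning
    each fragment start (base before the start + first fragment base)."""
    genome = genome.upper()
    # Background: every overlapping dinucleotide in the sequence
    background = Counter(genome[i:i + 2] for i in range(len(genome) - 1))
    # Observed: the pair straddling each 0-based fragment start
    observed = Counter(
        genome[s - 1:s + 1] for s in starts if 1 <= s < len(genome)
    )
    n_bg = sum(background.values())
    n_obs = sum(observed.values())
    return {
        di: (observed[di] / n_obs) / (background[di] / n_bg)
        for di in observed
    }

# Toy data: three of four fragments start inside a CG pair
genome = "ACGTACGTTTAACGAT"
starts = [2, 6, 13, 9]
enrichment = breakpoint_dinucleotide_enrichment(genome, starts)
```

A ratio above 1 means the pair is over-represented at breakpoints relative to its genomic abundance. On real data the background should be restricted to mappable positions, which this sketch omits.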
Fragmentation effect. A horizontal dotted line marks the relative abundance of the base at mappable positions.
Local effects captured by the fragment model drive the GC curves found at larger scales. For all three bin sizes, the predicted counts (black) trace the observed loess line (blue) and also capture some of the variability around the curve.
Aggregation of single location estimates. (A-C) Estimates based on the fragment GC curve (black) trace similar paths as loess (cyan) estimated on observed counts (blue) on multiple scales.
(D-F) Estimates based on alternative models compared with observed counts on 1 kb bins. See Supplementary methods for details on how the models for E and F were defined and estimated.
In contrast, models based on a smaller portion of the fragment do not trace the observed curves. Figure 7D shows the estimates from the read W 0, The methods of correction used for Figure 7E and F are described in detail in Supplementary data. Corrections based on the fragment and fragment-length models remove most GC-dependent fragment count variation.
The same holds for all bin sizes. Since adding length did not change the results greatly, we use the more parsimonious model for the rest of this work. We visualize the correction in a region of chromosome 1 that has no CN changes. In Figure 8A, uncorrected but scaled 1 kb bin counts display large low-frequency variations, which can be mistaken for CN events. The fragment model removes these variations better than the loess model.
In Figure 8B, a histogram of corrected counts shows that the fragment correction produces a tighter distribution of scaled counts around 1 than the loess model.
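Scaled, corrected counts of this kind can be obtained by dividing each bin's observed count by its predicted count after matching the two totals. The following is a sketch under that assumption; the exact scaling used for Figure 8 is not given in the text, and the function name and numbers are illustrative.

```python
import numpy as np

def scaled_corrected_counts(observed, predicted):
    """Divide observed bin counts by model predictions, rescaled so that
    a bin behaving exactly as predicted gets the value 1."""
    observed = np.asarray(observed, dtype=float)
    predicted = np.asarray(predicted, dtype=float)
    # Global scale factor so predictions sum to the observed total
    scale = observed.sum() / predicted.sum()
    expected = scale * predicted
    corrected = np.full(observed.shape, np.nan)   # NaN where nothing is expected
    valid = expected > 0
    corrected[valid] = observed[valid] / expected[valid]
    return corrected

# Toy bins: counts follow the prediction exactly; the last bin is unmappable
corrected = scaled_corrected_counts([10, 20, 40, 0], [1, 2, 4, 0])
```

In a region without CN changes, a good model leaves these values tightly clustered around 1, which is what the histogram comparison is measuring.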
Corrected counts of normal sample. Each point represents counts from both libraries (forward strand).
A similar correction on the tumor data reveals a hidden CN gain (both libraries, forward strand) in Figure 9. GC curves for both the loess and fragment models were estimated from chromosome 1, and corrected counts for a CN gain on chromosome 2 are shown. The CN gain is hidden in the uncorrected data due to low-frequency count variation driven by GC content. Both the fragment model correction and the loess correction reveal the CN gain.
The fragment correction provides better separation between bands (see histograms in Figure 9B). It also successfully corrects for different binning resolutions (Supplementary Figure S3). Note that chromosome 1 was used for GC estimation because it does not seem to have large CN changes, as seen in Figure 2.
CN gain from tumor sample. Counts and corrected counts at position 29 kb on chromosome 2. GC curves were estimated on chromosome 1, which has no large CN changes. (B) Histogram of normalized counts at 28-30 Mb (underlined on left plots).
The estimated GC effect and mappability explain most of the variation in the fragment coverage of the normal genome, though not all of it. The GC model removes most of the variability in the binned counts, much more so than corrections based only on mappability.
The RV of the fragment model is considerably smaller than that of the loess model. It is still larger than Poisson, though small areas with extremely high coverage cause most of this extra variance. Computed on 1 kb bins from the normal sample (forward strand, library 1), after removing outlier bins. For a comparison more robust to these high-coverage regions, we compare quantiles rather than variances. In Figure 10, we compare the 0. The variation in bins with very low observed counts is largely explained by mappability.
However, mappability cannot explain the variation of higher counts, and the spread between the quantiles is approximately double that of the Poisson. Models taking GC content into account produce much tighter spreads. The fragment-length model (green curve) consistently leaves less variation around the estimated rates than the loess model (blue).
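The quantile comparison can be illustrated by measuring the spread between two quantiles of the observed counts and dividing it by the same spread for a Poisson distribution with the predicted mean. The quantile levels used in the paper are not recoverable from the text, so the 0.1 and 0.9 levels below are placeholders, and all names and data are illustrative.

```python
import math
import numpy as np

def poisson_quantile(mu, q):
    """Smallest k with P(X <= k) >= q for X ~ Poisson(mu), with 0 < q < 1.
    Sums the pmf directly, so it is only suitable for moderate mu."""
    k = 0
    pmf = math.exp(-mu)
    cdf = pmf
    while cdf < q:
        k += 1
        pmf *= mu / k
        cdf += pmf
    return k

def spread_ratio(counts, mu, lo=0.1, hi=0.9):
    """Observed inter-quantile spread relative to the Poisson spread.
    A value near 1 means the counts are about as tight as Poisson noise."""
    obs_lo, obs_hi = np.quantile(counts, [lo, hi])
    pois_spread = poisson_quantile(mu, hi) - poisson_quantile(mu, lo)
    return (obs_hi - obs_lo) / pois_spread

# Synthetic bins sharing a predicted rate of 20 fragments per bin
rng = np.random.default_rng(1)
poisson_like = rng.poisson(20, size=20000)
# Overdispersed counts with the same mean (negative binomial, mean 20)
overdispersed = rng.negative_binomial(5, 0.2, size=20000)
ratio_poisson = spread_ratio(poisson_like, 20)
ratio_over = spread_ratio(overdispersed, 20)
```

A model that predicts well should bring this ratio close to 1 across the range of predicted rates, whereas unexplained GC-driven variation inflates it.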
Comparison to Poisson variation. Models that predict better will have narrower vertical spreads. Variation around the mean of the fragment model (green), the loess (blue) and mappability (black) are compared to variation around a Poisson (red).
In the above analysis, we described a single tumor-normal pair produced by a single lab, but our results are general to many examined samples from multiple labs. In Figure 11, we show four descriptive plots from a different data set based on an HCC cell line; see Table 1 for details.
GC has a strong effect on fragment counts, and this relation is unimodal (Figure 11A). A distinct difference is the lack of length dependence of the fragments (data not shown). The AT preference near fragment ends is also missing, further indicating that it is not the major source of the GC bias.
Two additional sets of data are shown in the Supplementary Data.
GC plots for Dataset 2. (C) GC curve at fragment model W 2,
Large biases in fragment counts related to the GC composition of regions were found in the data sets we examined.
These observed effects have a recurring unimodal shape but vary considerably between samples. We have shown that this GC effect is mostly driven by the GC composition of the full fragment.
Conditioning on the GC of the fragments captures the strongest bias, and removing this effect provides the best correction, compared with alternative GC windows. When single base pair predictions based on the fragment composition are aggregated, the results trace the observed GC dependence.
This cannot be said about local effects that take only the reads into account. This conclusion holds for various data sets, with different fragment length composition, read lengths and GC effect shapes.
That the GC curve is unimodal is key to this analysis. In all data sets shown, the rate of GC-poor or GC-rich fragments is significantly lower than average, in many cases zero. Unimodality was overlooked by Dohm et al. Even in humans, it is hard to spot this effect if counts are binned by GC quantiles instead of GC values. Nevertheless, it is this departure from linearity that allowed pinpointing an optimal scale: the fragment size. In this way, unimodality gives us important clues as to the causes of the GC bias.
While we have described other sequence-related biases, we believe they are not driving the strong coverage GC biases. These include an increased coverage when the ends are AT rich, and location-specific fragmentation biases near the fragment ends.
They are also surprisingly negligible in the context of larger bins. Still, they might locally mitigate the fragment GC effect: the effect of fragment length on GC curve seems to be associated with these biases. Our conclusions seem to complement those of Aird et al.
We have shown this is indeed the case. It should be noted that even these optimized PCR protocols can still display significant biases and may require GC correction. Our refined description of the GC effect is of practical value for GC correction. First of all, the non-linearity of the GC effect is a warning sign regarding two-sample correction methods. In the main example we study, the pair of normal and tumor samples do not have the same GC curves.
We have seen this in additional data sets as well. Using normal counts to correct tumor counts could sometimes produce GC-related artifacts, which might lead to faulty segmentations. The GC effects of samples should be carefully studied before such corrections are made. A single sample correction for GC requires a model, and we demonstrate the importance of choosing the best model. Overlapping windows smaller than the fragment fail to remove the bulk of the GC effect.
Similarly, using read coverage rather than fragment count hurts the correction. Instead, measuring the fragment rate for single base pair positions decouples the GC modeling from the downstream analysis.
Thus, it removes the lower threshold on the scale of analysis, providing single base pair estimates, which can be later smoothed by the researcher as needed or binned into uneven bins if needed. An important benefit of DNA-seq over previous technologies is that simply repeating the experiment can increase the resolution of the analysis.
Our model assures that this increased resolution does not hurt the GC correction. Unlike other bias correction methods, such as BEADS (14), we generate weights (predicted fragment rates) for the genomic location rather than for the observed reads. Mappable genomic positions are stratified according to the GC of a hypothetical fragment, and rates per GC stratum are estimated by counting the fragments at those same positions.
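The stratification described above can be sketched as follows: compute the GC count of a hypothetical fragment starting at every position, group mappable positions by that count, and divide the number of observed fragment starts in each group by the number of positions in it. This is a simplified single-length illustration, not the published implementation; the function name, the fixed fragment length, and the toy inputs are assumptions.

```python
import numpy as np

def gc_stratified_rates(genome, mappable, frag_starts, frag_len):
    """Estimate a fragment rate per GC stratum and a predicted rate per position.

    genome: DNA string; mappable: boolean flag per position;
    frag_starts: 0-based start positions of observed fragments;
    frag_len: length of the hypothetical fragment (assumed fixed here).
    """
    is_gc = np.array([base in "GC" for base in genome.upper()], dtype=int)
    csum = np.concatenate([[0], np.cumsum(is_gc)])
    n_pos = len(genome) - frag_len + 1
    # GC count of the window of length frag_len starting at each position
    gc_count = csum[frag_len:frag_len + n_pos] - csum[:n_pos]
    ok = np.asarray(mappable, dtype=bool)[:n_pos]
    starts = np.asarray(frag_starts, dtype=int)
    starts = starts[starts < n_pos]
    starts = starts[ok[starts]]                   # keep mappable starts only
    n_positions = np.bincount(gc_count[ok], minlength=frag_len + 1)
    n_fragments = np.bincount(gc_count[starts], minlength=frag_len + 1)
    rates = np.full(frag_len + 1, np.nan)         # NaN for empty strata
    seen = n_positions > 0
    rates[seen] = n_fragments[seen] / n_positions[seen]
    # Predicted rate for every position, covered or not
    predicted = np.where(ok, rates[gc_count], np.nan)
    return rates, predicted

# Toy genome with four observed fragments, three of them in GC-rich windows
genome = "GGCCATATGGCCATAT"
mappable = np.ones(len(genome), dtype=bool)
rates, predicted = gc_stratified_rates(genome, mappable, [0, 0, 8, 4], frag_len=4)
```

Because `predicted` is defined for every mappable position rather than only for positions with reads, it can be summed over bins of any size or shape before comparison with observed counts.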
Estimating predicted rates for both covered and uncovered locations can help detect deletions, and these predicted rates form a natural input for downstream analysis using heterogeneous Poisson models. This procedure can be critical when length information is unavailable i.
Genomic DNA base composition (GC content) is predicted to significantly affect genome functioning and species ecology.
The statistical analyses were done using R.
Figure 1.
Relationship between intron GC-content and expression breadth in human. Expression breadth was measured with microarray data. Genes were grouped into 20 equal-sized categories according to their intronic GC-content. The median of expression for each class is drawn in thin lines and the mean in thick lines. The inter-category correlation between GC-content and expression (mean, thick lines) is high.
However, there is a huge intra-category variance (see the size of the boxes).
Review of the correlations between GC-content and expression published in the literature. There is a large variability in the values and in the signs of the correlations.
The different analyses were based on different measures of GC-content, different methods to detect gene expression (SAGE, EST and microarray) and different parameters of expression (breadth, number of tissues where genes are expressed; mean, average level of expression for expressed genes; peak, maximum level of expression).
Correlation between GC-content and expression, for different measures of gene expression and for different estimators of base composition in human and mouse. Expression parameters: breadth, number of tissues where genes are expressed; mean, average level of expression for expressed genes; peak, maximum level of expression. The sign and R²-value of correlations are given, No.
Simulations to assess the impact of grouping data on the measure of correlation coefficient between two linearly correlated variables.
Then, points were grouped into categories according to the value of X, and correlations were computed between X and Y, averaged within each category. Correlation coefficients are indicated for different levels of grouping i.
Correlation between intron GC-content and gene expression breadth, computed on windows of neighboring genes: impact of window size.
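The grouping effect described in the simulation caption is easy to reproduce: when points are sorted by X and averaged within categories, most of the noise in Y cancels out, so the correlation between category means is far higher than the correlation between individual points. The sketch below uses made-up parameters (sample size, slope, noise level, and 20 categories) rather than those of the original simulation.

```python
import numpy as np

def grouped_correlation(x, y, n_groups):
    """Correlation between category means after sorting points by x and
    splitting them into n_groups categories of (nearly) equal size."""
    order = np.argsort(x)
    x_means = np.array([g.mean() for g in np.array_split(x[order], n_groups)])
    y_means = np.array([g.mean() for g in np.array_split(y[order], n_groups)])
    return np.corrcoef(x_means, y_means)[0, 1]

# Two weakly, linearly related variables (illustrative parameters)
rng = np.random.default_rng(42)
n = 10000
x = rng.normal(size=n)
y = 0.2 * x + rng.normal(size=n)

r_raw = np.corrcoef(x, y)[0, 1]            # correlation on individual points
r_grouped = grouped_correlation(x, y, 20)  # correlation on 20 category means
```

The contrast between `r_raw` and `r_grouped` shows why a high inter-category correlation can coexist with a large intra-category variance, and why grouped and ungrouped correlations from different studies are not directly comparable.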