parameter | value |
---|---|
SpectroPipeR_version | 0.2.0 |
output_folder | species_mix_analysis |
ion_q_value_cutoff | 0.01 |
id_drop_cutoff | 0.3 |
normalization_method | median |
normalization_factor_cutoff_outlier | 4 |
filter_oxidized_peptides | TRUE |
protein_intensity_estimation | MaxLFQ |
stat_test | rots |
type_slr | median |
fold_change | 1.5 |
p_value_cutoff | 0.05 |
paired | FALSE |
Spectronaut_report_file | HYE_Exploris480_SN19_Report_SpectroPipeR (Normal).tsv |
log_file_name | species_mix_analysis/2024_07_11__15_06_SpectroPipeR_analysis.log |
skipping_MaxLFQ_median_norm | FALSE |
batch_adjusting | FALSE |
Data analysis description
The whole analysis was performed using R version 4.4.1 (2024-06-14) and SpectroPipeR package 0.2.0.
The raw data output from Spectronaut was utilized for normalization purposes. If the normalization option in Spectronaut was unchecked, a median normalization was performed over the MS2 total peak area intensities. Alternatively, if the normalization method was selected in Spectronaut™, the corresponding normalization approach provided by Spectronaut™ was employed.
Zero intensity values were replaced using the half-minimal intensity value from the whole dataset.
Methionine oxidized peptides were removed from the analysis.
To obtain peptide intensity data, the intensities of all ions corresponding to a given peptide were summed up within each sample.
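The zero replacement and the ion-to-peptide summation described above can be sketched as follows. This is a minimal illustration, not the pipeline's code: the function names, the tuple layout, and the toy values are all assumptions made for the example.

```python
from collections import defaultdict


def replace_zeros_with_half_min(intensities):
    """Replace zeros by half of the smallest non-zero intensity in the whole dataset."""
    half_min = min(v for v in intensities if v > 0) / 2.0
    return [v if v > 0 else half_min for v in intensities]


def sum_ions_per_peptide(rows):
    """rows: (sample, peptide, ion, intensity) tuples -> {(sample, peptide): summed intensity}."""
    values = replace_zeros_with_half_min([row[3] for row in rows])
    peptide_sums = defaultdict(float)
    for (sample, peptide, _ion, _raw), value in zip(rows, values):
        peptide_sums[(sample, peptide)] += value
    return dict(peptide_sums)


# Hypothetical toy data: one peptide measured as two ions (charge states) in two samples.
rows = [
    ("S1", "PEPTIDEK", "PEPTIDEK.2", 100.0),
    ("S1", "PEPTIDEK", "PEPTIDEK.3", 0.0),   # zero, replaced by 40 / 2 = 20
    ("S2", "PEPTIDEK", "PEPTIDEK.2", 80.0),
    ("S2", "PEPTIDEK", "PEPTIDEK.3", 40.0),
]
peptide_intensity = sum_ions_per_peptide(rows)
```

The half-minimum is taken over the whole dataset rather than per sample, matching the wording of the description.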
If a covariate adjustment of the peptide intensity data was performed using the user's input formula, a linear mixed model (LMM) was fitted per peptide based on that formula, and the resulting residuals were added to the mean peptide intensity over the samples. This means that the adjusted peptide intensities retain their intensity level (low-intensity peptides stay low and high-intensity peptides stay high).
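The "residuals plus mean" idea can be illustrated with a deliberately simplified sketch. Note the substitution: instead of a linear mixed model, the example fits an ordinary least-squares line against a single numeric covariate, so it only shows why the adjusted values keep their overall intensity level. The function name and the toy values are assumptions.

```python
from statistics import mean


def adjust_for_covariate(intensities, covariate):
    """Fit intensity ~ covariate by ordinary least squares (a stand-in for the LMM)
    and return the residuals shifted back to the peptide's mean intensity."""
    y_mean = mean(intensities)
    x_mean = mean(covariate)
    sxx = sum((x - x_mean) ** 2 for x in covariate)
    sxy = sum((x - x_mean) * (y - y_mean) for x, y in zip(covariate, intensities))
    slope = sxy / sxx
    intercept = y_mean - slope * x_mean
    residuals = [y - (intercept + slope * x) for x, y in zip(covariate, intensities)]
    return [y_mean + r for r in residuals]


# Hypothetical log2 intensities of one peptide in four samples and a numeric covariate.
log2_intensity = [20.0, 21.2, 21.8, 23.0]
covariate = [0.0, 1.0, 2.0, 3.0]
adjusted = adjust_for_covariate(log2_intensity, covariate)
```

Because least-squares residuals average to zero, the mean of the adjusted values equals the mean of the original values; only the covariate-driven trend is removed.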
The MaxLFQ protein intensity data was generated by using the MaxLFQ algorithm implemented in the iq R-package (version 1.9.12) followed by a global median normalization.
The principal component analysis for peptide and protein level was done using the FactoMineR package (version: 2.11), where data was scaled to unit variance.
The UMAP analysis was performed using the umap package (version 0.2.10.0).
The statistical analysis was performed on peptide level using the PECA package (version: 1.40.0).
Here a ROTS test, as implemented in the PECA package (version 1.40.0), was used for the statistical analysis (Seyednasrollah et al., 2016); this is also known as the ROPECA (Reproducibility-Optimized Peptide Change Averaging) approach. ROPECA is a statistical method for analyzing proteomics data, specifically designed for data-independent acquisition (DIA) mass spectrometry experiments. It aims to maximize the reproducibility of the detections by optimizing the overlap of significant peptides between replicate samples.
The modified t-statistic is calculated using the linear modeling approach in the Bioconductor limma package. An empirical Bayes method is used to squeeze the protein-wise residual variances towards a common value (or towards a global trend) (Smyth, 2004; Phipson et al., 2016). The degrees of freedom for the individual variances are increased to reflect the extra information gained from the empirical Bayes moderation, resulting in increased statistical power to detect differential expression.
ROTS is described in Seyednasrollah, F., Rantanen, K., Jaakkola, P., Elo, L. (2016). ROTS: reproducible RNA-seq biomarker detector-prognostic markers for clear cell renal cell cancer. Nucleic acids research 44(1), e1. https://dx.doi.org/10.1093/nar/gkv806
ROPECA (Reproducibility-Optimized Peptide Change Averaging) utilizes ROTS. The ROPECA approach is described in Suomi, T., Elo, L.L. (2017). Enhanced differential expression statistics for data-independent acquisition proteomics. Sci Rep 7, 5869. https://doi.org/10.1038/s41598-017-05949-y
pipeline parameters used for the analysis
samples overview
protein counts
In total 10753 Protein_groups could be identified.
- 1166 Protein_groups with < 2 peptides.
- 9587 Protein_groups with >= 2 peptides.
ion ID rate (q-value filtered)
ON/OFF analysis
The ON/OFF analysis was conducted by filtering the ions based on the chosen Q-value cutoff (ion Q-value below 0.01). For a protein group to be considered detected in a particular condition, it had to meet the following criteria:
- The protein group must be represented by at least two peptides.
- These peptides must be present in at least 50% of the replicates for that specific condition.
If these requirements were met, the protein group was designated as “DETECTED” (1) in that condition. Otherwise, it was marked as “NOT DETECTED” (0) for that particular condition.
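The detection rule above can be sketched as a small function. The reading used here is an assumption: a protein group counts as detected when at least two of its peptides are each present in at least 50% of the replicates of the condition. The function name, argument layout, and toy data are hypothetical.

```python
def detected(peptide_presence, min_peptides=2, min_replicate_fraction=0.5):
    """peptide_presence: {peptide: [True/False per replicate]} for one protein group
    in one condition, after Q-value filtering. Returns 1 (DETECTED) or 0 (NOT DETECTED)."""
    qualifying_peptides = 0
    for flags in peptide_presence.values():
        # A peptide qualifies when it is present in enough replicates.
        if sum(flags) / len(flags) >= min_replicate_fraction:
            qualifying_peptides += 1
    return 1 if qualifying_peptides >= min_peptides else 0


# Hypothetical protein groups in a condition with four replicates.
protein_a = {
    "PEP1": [True, True, False, False],   # present in 50% of replicates
    "PEP2": [True, True, True, True],
}
protein_b = {
    "PEP1": [True, False, False, False],  # present in only 25% of replicates
    "PEP2": [True, True, True, True],
}
status_a = detected(protein_a)
status_b = detected(protein_b)
```

With these inputs, `protein_a` meets both criteria while `protein_b` has only one qualifying peptide.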
normalization of data
The ion intensity data was normalized either in Spectronaut or in the pipeline using a median-median normalization on ion level, depending on your settings.
No normalization outlier detected.
R.FileName | R.Condition | R.Replicate | MedianNormalizationFactor | normalization_outlier |
---|---|---|---|---|
20230224_Exp…S_OT2_MixA_TR1 | HYE mix A | 1 | 1.0713139 | no |
20230224_Exp…S_OT2_MixA_TR2 | HYE mix A | 2 | 0.9922639 | no |
20230224_Exp…S_OT2_MixA_TR3 | HYE mix A | 3 | 1.0241338 | no |
20230224_Exp…S_OT2_MixA_TR4 | HYE mix A | 4 | 1.0741274 | no |
20230224_Exp…S_OT2_MixB_TR1 | HYE mix B | 1 | 1.0305240 | no |
20230224_Exp…S_OT2_MixB_TR2 | HYE mix B | 2 | 0.9414677 | no |
20230224_Exp…S_OT2_MixB_TR3 | HYE mix B | 3 | 0.9251446 | no |
20230224_Exp…S_OT2_MixB_TR4 | HYE mix B | 4 | 0.8988186 | no |
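
A rough sketch of how per-run median normalization factors like those in the table could be derived is given below. The exact formula and the outlier rule used by the pipeline are not stated in this report, so both are assumptions: the factor is taken as the run median divided by the median of all run medians, and a run is flagged when its factor lies outside the range 1/cutoff to cutoff. All names and values are hypothetical.

```python
from statistics import median


def median_normalization_factors(runs):
    """runs: {run_name: [ion intensities]} -> {run_name: factor}.
    Assumed definition: run median divided by the median of all run medians."""
    run_medians = {name: median(values) for name, values in runs.items()}
    reference = median(run_medians.values())
    return {name: run_median / reference for name, run_median in run_medians.items()}


def flag_outliers(factors, cutoff=4):
    """Assumed rule: flag a run whose factor is above cutoff or below 1 / cutoff."""
    return {name: "yes" if f > cutoff or f < 1 / cutoff else "no"
            for name, f in factors.items()}


# Hypothetical toy ion intensities for three runs.
runs = {
    "run1": [90, 100, 110],
    "run2": [180, 200, 220],
    "run3": [45, 50, 55],
}
factors = median_normalization_factors(runs)
outliers = flag_outliers(factors, cutoff=4)
# Dividing each run by its factor aligns all run medians.
normalized = {name: [v / factors[name] for v in values] for name, values in runs.items()}
```

Under this assumed definition, factors close to 1 (as in the table above) indicate that the runs already had similar median intensities before normalization.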
MaxLFQ intensity distribution
The MaxLFQ protein intensity data was generated by using the MaxLFQ algorithm implemented in the iq R-package (version 1.9.12) followed by a global median normalization. The distribution of the MaxLFQ data is visualized in the plot below.
PCA & correlation & UMAP
A PCA was performed on the normalized data (peptide or protein intensities), with the data scaled to unit variance.
PCA stands for Principal Component Analysis, which is a statistical technique used to reduce the dimensionality of large data sets. It does this by transforming a large set of variables into a smaller one that still contains most of the information in the large set. This is achieved by finding new variables, called principal components, that are linear combinations of the original variables and capture as much of the variation in the data as possible.
The first principal component captures the largest amount of variation in the data, while each subsequent component captures the next largest amount of variation.
A PCA plot is a visual representation of the results of a Principal Component Analysis (PCA). It can help you understand the relationships between the samples in your data by showing how they cluster together based on their similarity.
Samples that are close together on the plot are similar to each other, while samples that are far apart are dissimilar. The direction and length of the arrows on the plot show how each variable contributes to the first and second principal components.
The first and second dimensions of a PCA plot represent the first and second principal components of the data, i.e. the two directions in the data that capture the most variation.
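To make the idea of a first principal component concrete, the following sketch scales each variable to unit variance and finds the leading component by power iteration on the covariance matrix. It is a didactic, standard-library-only illustration and not the FactoMineR implementation; function names and the toy matrix are assumptions.

```python
from statistics import mean, stdev


def scale_columns(matrix):
    """Center each variable (column) and scale it to unit variance."""
    columns = list(zip(*matrix))
    scaled = [[(v - mean(col)) / stdev(col) for v in col] for col in columns]
    return [list(row) for row in zip(*scaled)]


def first_principal_component(matrix, iterations=200):
    """Return (loading vector, fraction of variance explained) of PC1.
    Rows are samples, columns are variables (e.g. proteins)."""
    x = scale_columns(matrix)
    n, p = len(x), len(x[0])
    cov = [[sum(x[k][i] * x[k][j] for k in range(n)) / (n - 1) for j in range(p)]
           for i in range(p)]
    vector = [1.0] + [0.0] * (p - 1)  # simple start vector for the power iteration
    for _ in range(iterations):
        product = [sum(cov[i][j] * vector[j] for j in range(p)) for i in range(p)]
        norm = sum(v * v for v in product) ** 0.5
        vector = [v / norm for v in product]
    # Rayleigh quotient gives the variance captured along the final direction.
    eigenvalue = sum(vector[i] * sum(cov[i][j] * vector[j] for j in range(p))
                     for i in range(p))
    total_variance = sum(cov[i][i] for i in range(p))
    return vector, eigenvalue / total_variance


# Hypothetical log2 intensities: 4 samples (rows) x 3 proteins (columns).
log2_intensities = [
    [20.1, 18.2, 25.0],
    [20.3, 18.0, 25.2],
    [22.0, 16.1, 24.9],
    [22.2, 16.0, 25.1],
]
loadings, explained = first_principal_component(log2_intensities)
```

The loading vector shows how strongly each variable contributes to the first component, and `explained` is the share of the total variance that component captures.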
UMAP stands for Uniform Manifold Approximation and Projection. It is a dimension reduction technique that can be used for visualization similarly to t-SNE, but also for general non-linear dimension reduction. UMAP is a powerful tool for machine learning practitioners to visualize and understand large, high dimensional datasets. It offers a number of advantages over t-SNE, most notably increased speed and better preservation of the data’s global structure.
To perform a UMAP analysis, the first step is to construct a high-dimensional graph representation of the data. This is done by finding the k-nearest neighbors for each data point and connecting them with edges. The weight of each edge is determined by the distance between the two points it connects. Next, a low-dimensional graph is optimized to be as structurally similar as possible to the high-dimensional graph. This is done using a stochastic gradient descent algorithm that minimizes the cross-entropy between the two graphs.
The result of a UMAP analysis is a low-dimensional representation of the data that can be visualized using a scatter plot. Each point on the plot represents a sample, and its position is given by its coordinates in the two- or three-dimensional UMAP embedding (these are not principal components). Samples that are close together on the plot are similar to each other, while samples that are far apart are dissimilar.
In summary, UMAP analysis is performed by constructing a high-dimensional graph representation of the data, then optimizing a low-dimensional graph to be as structurally similar as possible. The result is a low-dimensional representation of the data that can be visualized and used to understand patterns and relationships in the data.
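The first step described above, building a k-nearest-neighbor graph, can be sketched as follows. This covers only the neighbor search with Euclidean distances; the conversion of distances into edge weights and the low-dimensional optimization performed by UMAP are not shown. The function name and toy points are assumptions.

```python
import math


def knn_graph(points, k):
    """Return {index: [(neighbor_index, distance), ...]} holding the k nearest
    neighbors of every point by Euclidean distance."""
    graph = {}
    for i, p in enumerate(points):
        distances = sorted(
            (math.dist(p, q), j) for j, q in enumerate(points) if j != i
        )
        graph[i] = [(j, d) for d, j in distances[:k]]
    return graph


# Hypothetical samples in a 2-dimensional feature space: two tight pairs.
points = [(0.0, 0.0), (0.0, 1.0), (5.0, 5.0), (5.0, 6.0)]
graph = knn_graph(points, k=1)
```

In this toy example each point's nearest neighbor is the other member of its pair, which is the local structure a UMAP layout would try to preserve.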
Spearman’s rank correlation coefficient, also known as Spearman’s ρ (rho), is a nonparametric measure of rank correlation. It is a statistical measure of the strength and direction of the monotonic relationship between two variables. In other words, it assesses how well the relationship between two variables can be described using a monotonic function.
To calculate Spearman’s correlation coefficient, the raw data is first converted into ranks. The Spearman correlation coefficient is then defined as the Pearson correlation coefficient between the rank variables. For a sample of size n, if all n ranks are distinct integers, it can be computed using the formula ρ = 1 - ((6 * Σd_i^2) / (n * (n^2 - 1))), where d_i is the difference between the two ranks of each observation and n is the number of observations.
Spearman’s correlation is appropriate for both continuous and discrete ordinal variables. It is often used as an alternative to Pearson’s correlation when the data does not meet the assumptions of linearity and normality required for Pearson’s correlation. Spearman’s correlation is less sensitive to outliers and can be used for data that follows curvilinear or monotonic relationships.
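The rank-difference formula for Spearman's ρ given above can be written out directly. The sketch assumes there are no tied values, which is the case the formula covers; function names and example data are illustrative.

```python
def ranks(values):
    """Rank values from 1 (smallest) to n (largest); assumes no ties."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0] * len(values)
    for rank, index in enumerate(order, start=1):
        result[index] = rank
    return result


def spearman_rho(x, y):
    """Spearman's rho via 1 - 6 * sum(d_i^2) / (n * (n^2 - 1))."""
    n = len(x)
    d_squared = sum((rx - ry) ** 2 for rx, ry in zip(ranks(x), ranks(y)))
    return 1 - (6 * d_squared) / (n * (n ** 2 - 1))


# Hypothetical examples.
x = [10, 20, 30, 40, 50]
rho_monotonic = spearman_rho(x, [1, 4, 9, 16, 25])   # non-linear but strictly increasing
rho_reversed = spearman_rho(x, [5, 4, 3, 2, 1])      # strictly decreasing
rho_mixed = spearman_rho([1, 2, 3, 4, 5], [2, 1, 4, 3, 5])
```

The first example illustrates the point about monotonic relationships: the relation between the two variables is quadratic rather than linear, yet the ranks agree perfectly.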