SpectroPipeR DIA-MS analysis report

automatically generated report

Author

Stephan Michalik

Published

July 11, 2024

Data analysis description

The whole analysis was performed using R version 4.4.1 (2024-06-14) and the SpectroPipeR package (version 0.2.0).

The raw data output from Spectronaut was used as the basis for normalization. If the normalization option in Spectronaut was unchecked, a median normalization was performed on the MS2 total peak area intensities. If a normalization method was selected in Spectronaut, the corresponding normalization provided by Spectronaut was used instead.
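For illustration, a minimal sketch of such a median normalization on ion level is shown below. The column names (e.g. FG.MS2RawQuantity, R.FileName) are illustrative and not necessarily the exact report headers used by the pipeline.

```r
# Minimal sketch of a median normalization on ion level (illustrative column
# names): each sample is scaled by the ratio of its median MS2 total peak area
# to the global median.
library(dplyr)

median_normalize <- function(ions) {
  global_median <- median(ions$FG.MS2RawQuantity, na.rm = TRUE)
  ions %>%
    group_by(R.FileName) %>%
    mutate(
      MedianNormalizationFactor = median(FG.MS2RawQuantity, na.rm = TRUE) / global_median,
      normalized_intensity      = FG.MS2RawQuantity / MedianNormalizationFactor
    ) %>%
    ungroup()
}
```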

Zero intensity values were replaced with half of the minimal intensity value observed in the whole dataset.

Methionine-oxidized peptides were removed from the analysis.

To obtain peptide intensity data, the intensities of all ions corresponding to a given peptide were summed up within each sample.
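A minimal sketch of this aggregation step, assuming a long-format ion table with illustrative column names:

```r
# Sketch: sum all ion intensities belonging to a peptide within each sample
# to obtain one peptide intensity per sample (illustrative column names).
library(dplyr)

peptide_intensities <- normalized_ions %>%
  group_by(R.FileName, PEP.StrippedSequence, PG.ProteinGroups) %>%
  summarise(peptide_intensity = sum(normalized_intensity, na.rm = TRUE),
            .groups = "drop")
```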

If a covariate adjustment of the peptide intensity data was requested via the user-supplied formula, a linear mixed model (LMM) based on that formula was fitted per peptide, and the resulting residuals were added to the mean peptide intensity across samples. This way the adjusted peptide intensities retain their intensity level (low-intensity peptides keep their low intensity and high-intensity peptides keep their higher intensity).
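The idea can be sketched as follows for a single peptide. This is a simplified illustration using a plain linear model via lm() with a hypothetical covariate formula; the pipeline fits the model defined by the user-supplied formula (e.g. as a mixed model when random effects are included).

```r
# Sketch of the covariate adjustment for one peptide: fit the user-supplied
# covariate model, keep the residuals and add back the mean peptide intensity,
# so that the adjusted values retain the peptide's overall intensity level.
# (Hypothetical column names; lme4::lmer() could replace lm() for formulas
# that contain random effects.)
adjust_one_peptide <- function(peptide_df, covariate_formula = ~ batch + age) {
  fit <- lm(update(covariate_formula, log2_intensity ~ .),
            data = peptide_df, na.action = na.exclude)
  peptide_df$adjusted_log2_intensity <-
    residuals(fit) + mean(peptide_df$log2_intensity, na.rm = TRUE)
  peptide_df
}
```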

The MaxLFQ protein intensity data was generated using the MaxLFQ algorithm implemented in the iq R-package (version 1.9.12), followed by a global median normalization.
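A minimal sketch of this step, assuming iq::maxLFQ() is applied per protein to a matrix of log2 peptide intensities (rows = peptides, columns = samples), followed by a global median normalization of the resulting protein matrix; the actual pipeline call may differ.

```r
# Sketch: MaxLFQ protein intensity for one protein with the iq package,
# followed by a global median normalization across the protein matrix.
library(iq)

# toy log2 peptide intensity matrix for one protein (3 peptides x 4 samples)
X <- rbind(c(20.1, 20.3, 19.8, 20.0),
           c(18.5, 18.9, 18.2, 18.4),
           c(21.0, 21.2, 20.7, 20.9))

protein_log2 <- iq::maxLFQ(X)$estimate   # one log2 intensity per sample

# global median normalization in log2 space: shift every sample (column) of the
# proteins x samples matrix so that all sample medians coincide
median_normalize_log2 <- function(mat) {
  sample_medians <- apply(mat, 2, median, na.rm = TRUE)
  sweep(mat, 2, sample_medians - median(sample_medians))
}
```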

The principal component analysis at peptide and protein level was performed using the FactoMineR package (version 2.11); the data was scaled to unit variance.

The UMAP analysis was performed using the umap package (version 0.2.10.0).

The statistical analysis was performed at the peptide level using the PECA package (version 1.40.0).

Here, the ROTS test implemented in the PECA package was used, i.e. the ROPECA (Reproducibility-Optimized Peptide Change Averaging) approach. ROPECA is a statistical method for analyzing proteomics data, specifically designed for data-independent acquisition (DIA) mass spectrometry experiments. It aims to maximize the reproducibility of the detections by optimizing the overlap of significant peptides between replicate samples.

The moderated t-statistic is calculated using the linear modeling approach in the Bioconductor limma package. An empirical Bayes method is used to squeeze the protein-wise residual variances towards a common value (or towards a global trend) (Smyth, 2004; Phipson et al., 2016). The degrees of freedom for the individual variances are increased to reflect the extra information gained from the empirical Bayes moderation, resulting in increased statistical power to detect differential expression.
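The toy sketch below illustrates this empirical Bayes moderation with limma's lmFit()/eBayes() (used when stat_test = "modt"); it is a generic illustration on simulated data, not the exact pipeline call.

```r
# Sketch of a moderated t-statistic (stat_test = "modt") with limma:
# empirical Bayes moderation of the residual variances (Smyth, 2004).
library(limma)

# toy log2 intensity matrix: rows = peptides/proteins, columns = samples
set.seed(1)
expr   <- matrix(rnorm(8 * 100, mean = 20), nrow = 100)
group  <- factor(rep(c("HYE_mix_A", "HYE_mix_B"), each = 4))
design <- model.matrix(~ group)

fit <- lmFit(expr, design)
fit <- eBayes(fit)                  # squeeze variances towards a common value
topTable(fit, coef = 2, number = 5) # moderated t, p-value, BH-adjusted p-value
```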

The ROTS statistic is described in: Seyednasrollah, F., Rantanen, K., Jaakkola, P., Elo, L.L. (2016). ROTS: reproducible RNA-seq biomarker detector - prognostic markers for clear cell renal cell cancer. Nucleic Acids Research 44(1), e1. https://dx.doi.org/10.1093/nar/gkv806

ROPECA (Reproducibility-Optimized Peptide Change Averaging) utilizes the ROTS statistic. The ROPECA approach is described in: Suomi, T., Elo, L.L. Enhanced differential expression statistics for data-independent acquisition proteomics. Sci Rep 7, 5869 (2017). https://doi.org/10.1038/s41598-017-05949-y
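A hedged sketch of how such a peptide-level ROTS test can be run with PECA::PECA_df(); the data frame, column names and argument names shown here are illustrative, so consult ?PECA_df for the exact interface.

```r
# Sketch of a peptide-level ROTS test with the PECA package (ROPECA approach).
# 'peptides' is assumed to be a data frame with a protein group identifier
# column and one log2 peptide intensity column per sample (illustrative names).
library(PECA)

group1_cols <- c("MixA_1", "MixA_2", "MixA_3", "MixA_4")
group2_cols <- c("MixB_1", "MixB_2", "MixB_3", "MixB_4")

results <- PECA_df(df           = peptides,
                   id           = "PG.ProteinGroups",
                   samplenames1 = group1_cols,
                   samplenames2 = group2_cols,
                   test         = "rots",    # reproducibility-optimized test statistic
                   type         = "median")  # median of peptide-level ratios per protein
```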

pipeline parameters used for the analysis

parameter value
SpectroPipeR_version 0.2.0
output_folder species_mix_analysis
ion_q_value_cutoff 0.01
id_drop_cutoff 0.3
normalization_method median
normalization_factor_cutoff_outlier 4
filter_oxidized_peptides TRUE
protein_intensity_estimation MaxLFQ
stat_test rots
type_slr median
fold_change 1.5
p_value_cutoff 0.05
paired FALSE
Spectronaut_report_file HYE_Exploris480_SN19_Report_SpectroPipeR (Normal).tsv
log_file_name species_mix_analysis/2024_07_11__15_06_SpectroPipeR_analysis.log
skipping_MaxLFQ_median_norm FALSE
batch_adjusting FALSE

samples overview

protein counts

In total, 10753 protein groups were identified.

  • 1166 protein groups with < 2 peptides.
  • 9587 protein groups with >= 2 peptides.

ion ID rate (q-value filtered)

ON/OFF analysis

The ON/OFF analysis was conducted by filtering the ions based on the chosen Q-value cutoff (ion Q-value below 0.01). For a protein group to be considered detected in a particular condition, it had to meet the following criteria:

  • The protein group must be represented by at least two peptides.
  • These peptides must be present in at least 50% of the replicates for that specific condition.

If these requirements were met, the protein group was designated as “DETECTED” (1) in that condition. Otherwise, it was marked as “NOT DETECTED” (0) for that particular condition.
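A sketch of this decision rule on a long-format ion table (illustrative column names; the Q-value column and replicate handling may differ in the actual pipeline):

```r
# Sketch of the ON/OFF rule: a protein group counts as "DETECTED" (1) in a
# condition if at least 2 of its peptides pass the ion Q-value cutoff in at
# least 50% of the replicates of that condition.
library(dplyr)

replicates_per_condition <- ions %>%
  distinct(R.Condition, R.Replicate) %>%
  count(R.Condition, name = "n_rep_total")

on_off <- ions %>%
  filter(EG.Qvalue < 0.01) %>%
  distinct(R.Condition, R.Replicate, PG.ProteinGroups, PEP.StrippedSequence) %>%
  count(R.Condition, PG.ProteinGroups, PEP.StrippedSequence,
        name = "n_rep_detected") %>%
  left_join(replicates_per_condition, by = "R.Condition") %>%
  group_by(R.Condition, PG.ProteinGroups) %>%
  summarise(n_peptides_passing = sum(n_rep_detected >= 0.5 * n_rep_total),
            detected = as.integer(n_peptides_passing >= 2),
            .groups = "drop")
```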

normalization of data

Depending on the chosen settings, the ion intensity data was normalized either in Spectronaut or in the pipeline using a median normalization on ion level.

No normalization outlier detected!

R.FileName R.Condition R.Replicate MedianNormalizationFactor normalization_outlier
20230224_Exp…S_OT2_MixA_TR1 HYE mix A 1 1.0713139 no
20230224_Exp…S_OT2_MixA_TR2 HYE mix A 2 0.9922639 no
20230224_Exp…S_OT2_MixA_TR3 HYE mix A 3 1.0241338 no
20230224_Exp…S_OT2_MixA_TR4 HYE mix A 4 1.0741274 no
20230224_Exp…S_OT2_MixB_TR1 HYE mix B 1 1.0305240 no
20230224_Exp…S_OT2_MixB_TR2 HYE mix B 2 0.9414677 no
20230224_Exp…S_OT2_MixB_TR3 HYE mix B 3 0.9251446 no
20230224_Exp…S_OT2_MixB_TR4 HYE mix B 4 0.8988186 no


MaxLFQ intensity distribution

The MaxLFQ protein intensity data was generated using the MaxLFQ algorithm implemented in the iq R-package (version 1.9.12), followed by a global median normalization. The distribution of the MaxLFQ data is visualized in the plot below.

PCA & correlation & UMAP

A PCA was performed on the normalized data (peptide or protein intensities); the data was scaled to unit variance.

PCA stands for Principal Component Analysis, which is a statistical technique used to reduce the dimensionality of large data sets. It does this by transforming a large set of variables into a smaller one that still contains most of the information in the large set. This is achieved by finding new variables, called principal components, that are linear combinations of the original variables and capture as much of the variation in the data as possible.

The first principal component captures the largest amount of variation in the data, while each subsequent component captures the next largest amount of variation.

A PCA plot is a visual representation of the results of a Principal Component Analysis (PCA). It can help you understand the relationships between the samples in your data by showing how they cluster together based on their similarity.

Samples that are close together on the plot are similar to each other, while samples that are far apart are dissimilar. The direction and length of the arrows on the plot show how each variable contributes to the first and second principal components.

The first and second dimensions of a PCA plot represent the first and second principal components of the data, respectively. These are the two directions in the data that capture the most variation. The first principal component captures the largest amount of variation in the data, while the second principal component captures the second largest amount of variation.
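A minimal sketch of the PCA step with FactoMineR, assuming protein_matrix is a samples x proteins matrix of normalized log2 intensities (a hypothetical object, not a pipeline export):

```r
# Sketch of the PCA step: scale to unit variance and inspect the first two
# principal components (sample scores).
library(FactoMineR)

pca_res <- PCA(protein_matrix, scale.unit = TRUE, ncp = 5, graph = FALSE)

pca_res$eig[1:2, ]             # variance captured by PC1 and PC2
head(pca_res$ind$coord[, 1:2]) # sample coordinates on PC1/PC2 for plotting
```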

UMAP stands for Uniform Manifold Approximation and Projection. It is a dimension reduction technique that can be used for visualization similarly to t-SNE, but also for general non-linear dimension reduction. UMAP is a powerful tool for machine learning practitioners to visualize and understand large, high dimensional datasets. It offers a number of advantages over t-SNE, most notably increased speed and better preservation of the data’s global structure.

To perform a UMAP analysis, the first step is to construct a high-dimensional graph representation of the data. This is done by finding the k-nearest neighbors for each data point and connecting them with edges. The weight of each edge is determined by the distance between the two points it connects. Next, a low-dimensional graph is optimized to be as structurally similar as possible to the high-dimensional graph. This is done using a stochastic gradient descent algorithm that minimizes the cross-entropy between the two graphs.

The result of a UMAP analysis is a low-dimensional representation of the data that can be visualized using a scatter plot. Each point on the plot represents a sample, and its position reflects its coordinates in the first two or three dimensions of the embedding. Samples that are close together on the plot are similar to each other, while samples that are far apart are dissimilar.

In summary, UMAP analysis is performed by constructing a high-dimensional graph representation of the data, then optimizing a low-dimensional graph to be as structurally similar as possible. The result is a low-dimensional representation of the data that can be visualized and used to understand patterns and relationships in the data.
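A minimal sketch of the UMAP step with the umap package, again assuming protein_matrix is a samples x proteins matrix (hypothetical object); n_neighbors is reduced here because only a few samples are present.

```r
# Sketch of the UMAP step: build the k-nearest-neighbor graph and optimize a
# two-dimensional embedding of the samples.
library(umap)

config <- umap.defaults
config$n_neighbors  <- 5     # small value because only 8 samples are present
config$random_state <- 42    # reproducible embedding

umap_res <- umap(protein_matrix, config = config)

head(umap_res$layout)        # two-dimensional embedding, one row per sample
```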

Spearman’s rank correlation coefficient, also known as Spearman’s ρ (rho), is a nonparametric measure of rank correlation. It is a statistical measure of the strength and direction of the monotonic relationship between two variables. In other words, it assesses how well the relationship between two variables can be described using a monotonic function.

To calculate Spearman’s correlation coefficient, the raw data is first converted into ranks. The Spearman correlation coefficient is then defined as the Pearson correlation coefficient between the rank variables. For a sample of size n, if all n ranks are distinct integers, it can be computed using the formula ρ = 1 - ((6 * Σd_i^2) / (n * (n^2 - 1))), where d_i is the difference between the two ranks of each observation and n is the number of observations.
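A small worked example showing that the built-in Spearman correlation, the Pearson correlation of the ranks, and the closed-form expression above agree (all yield 0.7 here):

```r
# Spearman's rho is the Pearson correlation of the ranks; with distinct ranks
# it matches the closed form rho = 1 - 6 * sum(d^2) / (n * (n^2 - 1)).
x <- c(12.1, 15.3, 9.8, 20.4, 17.6)
y <- c(11.0, 18.1, 10.5, 16.2, 19.9)

cor(x, y, method = "spearman")            # built-in Spearman correlation
cor(rank(x), rank(y), method = "pearson") # Pearson correlation of the ranks

d <- rank(x) - rank(y)
n <- length(x)
1 - (6 * sum(d^2)) / (n * (n^2 - 1))      # closed-form expression from above
```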

Spearman’s correlation is appropriate for both continuous and discrete ordinal variables. It is often used as an alternative to Pearson’s correlation when the data does not meet the assumptions of linearity and normality required for Pearson’s correlation. Spearman’s correlation is less sensitive to outliers and can be used for data that follows curvilinear or monotonic relationships.

coefficient of variation (CV)

sample CV plot

The CV, or coefficient of variation, was calculated based on the protein intensity level. The bold pink percentage indicates the percent of the data per condition that is below a CV of 0.1, while the regular pink percentage indicates the data that is below a CV of 0.2.
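A sketch of the underlying calculation (illustrative column names), computing the CV per protein and condition and the fraction of proteins below the 0.1 and 0.2 thresholds:

```r
# Sketch of the CV calculation per protein and condition (CV = sd / mean on the
# linear intensity scale) and the share of proteins below CV 0.1 and 0.2.
library(dplyr)

cv_table <- protein_intensities %>%
  group_by(R.Condition, PG.ProteinGroups) %>%
  summarise(cv = sd(intensity) / mean(intensity), .groups = "drop")

cv_table %>%
  group_by(R.Condition) %>%
  summarise(below_cv_0.1 = mean(cv < 0.1, na.rm = TRUE),
            below_cv_0.2 = mean(cv < 0.2, na.rm = TRUE))
```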

cumulative frequency of CV

The cumulative frequency plot graphically represents the cumulative frequency of the coefficient of variation (CV) at both the peptide and protein levels. The CV is plotted on the x-axis, while the y-axis displays the cumulative frequency. The lines are colored by condition. This enables the user to assess and evaluate the reproducibility of measurements across the different conditions in the analysis.

statistics

pairwise comparisons that were performed

group1 group2 slr_ratio_meta
HYE mix A HYE mix B HYE mix A/HYE mix B

raw statistics table

  • get RAW statistical table: 06_statistics/8_sample_analysis/statistical_analysis.csv
  • statistical table with 2 peptide hits (wide format): 06_statistics/8_sample_analysis/statistical_analysis_WIDE_FORMAT_2_more_peptides_per_protein.xlsx
  • or use the files filtered for ≥ 2 peptides in the same folder

The bar chart provides a visual representation of the number of proteins that have either under-estimated or over-estimated protein intensity ratios relative to the peptide-centric ratios. This comparison uses a 2-fold difference as the threshold value for determining under-estimation or over-estimation of protein intensity.

example
  • peptide ratio of comparison = 2
  • protein intensity ratio of comparison = 6

In this example the protein intensity ratio is three times greater than the peptide-centric (e.g. ROPECA) ratio. This suggests that the protein intensity estimation for this particular protein/condition comparison may be susceptible to errors. Further examination is warranted, particularly if the ratios diverge in opposite directions.

You may further investigate this in the Protein_intensity_benchmark__table*.csv files in the 05_processed_data folder.

statistics table (≥ 2 peptides)

  • slr (signal-log2-ratio) colors: gradient from ≤ -2 through 0 to ≥ 2
  • p-value colors: ≤0.05 & >0.05
  • iBAQ intensity values quantiles (N=10 per group) - comparison:

  • slr: signal log2-ratio
  • t: t-statistics
  • score: t-statistics
  • p: raw p-value
  • p.fdr: adjusted p-value; false-discovery rates; method: Benjamini-Hochberg
  • PG.ProteinGroups: protein group identifier
  • group1: condition group 1 of pairwise comparison
  • group2: condition group 2 of pairwise comparison
  • slr_ratio_meta: how the ratio was calculated
  • test: which test was used (rots … reproducibility-optimized test statistics, modt … moderated t-test)
  • significant: if there is a significant change (cutoffs e.g.: FC = 1.5 & adjusted-p-value = 0.05)
  • significant_changed_raw_p: if there is a significant change (cutoffs e.g.: FC = 1.5 & raw-p-value = 0.05)
  • significant_changed_fc: fold-change cutoff used
  • significant_changed_p_value: p-value cutoff used
  • fold_change_absolute: absolute fold-change
  • fold_change_direction: fold-change direction
  • fold_change (FC): fold-change
  • iBAQ_quants: iBAQ intensity quantiles (N=10 per group); Q1 = lowest intensity quantile / Q10 = highest intensity quantile

Pipeline description

parameter description
output_folder character - output folder path (abs.)
ion_q_value_cutoff numeric - Q-value used in Spectronaut analysis: Biognosys default is 0.01 = 1% error rate
id_drop_cutoff numeric - value between 0-1 (1 = 100%); xx percent lower than median of ion ID rate => outlier
normalization_method character - "median" or "spectronaut"; auto-detection is ON per default, meaning that if normalization was performed in Spectronaut this is detected and preferred over the parameter setting here; median normalization is the fallback option
normalization_factor_cutoff_outlier numeric - allowed deviation of the sample median from the global median (4 means an absolute 4-fold deviation)
filter_oxidized_peptides logical - if oxidized peptides should be removed from peptide quantification: TRUE or FALSE
protein_intensity_estimation character - Hi3 = Hi3 protein intensity estimation, MaxLFQ = MaxLFQ protein intensity estimation
stat_test character - choose statistical test: "rots" = reproducibility-optimized test statistics, "modt" = moderated t-test (lmFit, eBayes), "t" = t-test
type_slr character - ratio aggregation method used when calculating protein values: "median" or "tukey"
fold_change numeric - fold-change used as cutoff for fold-change filtering e.g. 1.5
p_value_cutoff numeric - p-value used as cutoff for p/q-value filtering e.g. 0.05
paired logical - should a paired statistical analysis be performed: TRUE or FALSE

protein intensity calculation

Hi3

Hi3 uses the mean over the two to three most intense peptides per protein; these peptides are selected by their median intensity over the whole dataset.
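A sketch of this Hi3 calculation on a long-format peptide table (illustrative column names):

```r
# Sketch of Hi3: pick the (up to) three peptides per protein with the highest
# median intensity across the whole dataset (more in case of ties) and average
# them per sample.
library(dplyr)

hi3 <- peptide_intensities %>%
  group_by(PG.ProteinGroups, PEP.StrippedSequence) %>%
  mutate(peptide_median = median(peptide_intensity, na.rm = TRUE)) %>%
  group_by(PG.ProteinGroups) %>%
  filter(dense_rank(desc(peptide_median)) <= 3) %>%
  group_by(PG.ProteinGroups, R.FileName) %>%
  summarise(Hi3_intensity = mean(peptide_intensity, na.rm = TRUE),
            .groups = "drop")
```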

iBAQ

iBAQ stands for Intensity-Based Absolute Quantification. It is a method used to estimate the relative abundance of proteins within a sample. In the iBAQ algorithm, the summed intensities of the precursor peptides that map to each protein are divided by the number of theoretically observable peptides, which is considered to be all tryptic peptides between 6 and 30 amino acids in length. This operation converts a measure that is expected to be proportional to the mass of the protein into one that is proportional to its molar quantity (Quantitative Mass Spectrometry-Based Proteomics: An Overview).
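A sketch of the iBAQ calculation, assuming a hypothetical lookup table theoretical_peptides with the number of theoretically observable tryptic peptides (6-30 amino acids) per protein group, e.g. derived from an in-silico digest of the FASTA database:

```r
# Sketch of iBAQ: summed precursor/peptide intensities per protein divided by
# the number of theoretically observable tryptic peptides (6-30 amino acids).
library(dplyr)

ibaq <- peptide_intensities %>%
  group_by(PG.ProteinGroups, R.FileName) %>%
  summarise(summed_intensity = sum(peptide_intensity, na.rm = TRUE),
            .groups = "drop") %>%
  left_join(theoretical_peptides, by = "PG.ProteinGroups") %>%  # hypothetical lookup table
  mutate(iBAQ = summed_intensity / n_theoretical)
```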

MaxLFQ

MaxLFQ stands for Maximal Peptide Ratio Extraction and Label-Free Quantification. It is an algorithm used to estimate protein abundances in mass spectrometry-based proteomics by aiming to maintain the fragment intensity ratios between samples. The MaxLFQ algorithm calculates protein intensities by taking the maximum peptide ratio of all peptides that map to a protein and normalizing it across all samples.

The MaxLFQ algorithm was developed by Cox et al. in 2014 and is widely used in label-free quantitative proteomics. It is considered to be an accurate method for proteome-wide label-free quantification.

In more technical terms, the MaxLFQ algorithm calculates the ratio between any two samples using the peptide species that are present in both. The pair-wise protein ratio is then defined as the median of the peptide ratios to protect against outliers (a minimum of two peptide ratios is required for a given protein ratio to be considered valid). At this point the algorithm has constructed a triangular matrix containing all pair-wise protein ratios between any two samples, which is the maximal possible quantification information. The algorithm then performs a least-squares analysis to reconstruct the abundance profile that optimally satisfies the individual protein ratios in the matrix, based on the sum of squared differences. Finally, the profile is rescaled to the cumulative intensity across samples, thereby preserving the total summed intensity for a protein over all samples. This procedure is repeated for all proteins, resulting in an accurate abundance profile for each protein across the samples.
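A toy sketch of the two core steps in log2 space (median pairwise ratios, then a least-squares reconstruction of the protein profile); the full algorithm additionally handles missing values, requires at least two peptide ratios per sample pair, and rescales to the summed raw intensity.

```r
# Toy sketch of the MaxLFQ core in log2 space for one protein:
# (1) median pairwise peptide log-ratios between samples,
# (2) least-squares fit of a protein profile p with p_i - p_j ~ r_ij.
X <- rbind(c(20.1, 20.3, 19.8, 20.0),   # log2 peptide intensities
           c(18.5, 18.9, 18.2, 18.4),   # rows = peptides, columns = samples
           c(21.0, 21.2, 20.7, 20.9))
n <- ncol(X)

# (1) pairwise protein log-ratio = median over the peptide log-ratios
R <- outer(seq_len(n), seq_len(n),
           Vectorize(function(i, j) median(X[, i] - X[, j], na.rm = TRUE)))

# (2) build the linear system p_i - p_j = R[i, j] and anchor the overall level
# at the mean observed log2 intensity to make the solution unique
A <- NULL; b <- NULL
for (i in 1:(n - 1)) {
  for (j in (i + 1):n) {
    row <- rep(0, n); row[i] <- 1; row[j] <- -1
    A <- rbind(A, row); b <- c(b, R[i, j])
  }
}
A <- rbind(A, rep(1, n)); b <- c(b, sum(colMeans(X)))
p <- qr.solve(A, b)   # log2 protein abundance profile across the four samples
```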

All plots and tables are saved in subfolders; the subfolders are generated based on the number of raw files in the Spectronaut output.