class: title-slide, middle, center # Analysis of proteomics data .pull-left[ ## Robert Castelo [robert.castelo@upf.edu](mailto:robert.castelo@upf.edu) ### Dept. of Medicine and ### Life Sciences ### Universitat Pompeu Fabra ] .pull-right[ ## Eduard Sabidó [eduard.sabido@crg.eu](mailto:eduard.sabido@crg.cat) ### CRG/UPF Proteomics ### Core Facility ] <br> ### MSc on Bioinformatics for the Health Sciences ### UPF School of Health and Life Sciences ### Academic Year 2024-2025 --- ## Workflow of protein expression quantification  .footer[ Figure 1, Tyanova et al. The MaxQuant computational platform for mass spectrometry-based shotgun proteomics. [_Nature Protocols_, 11:2301-2319, 2016](https://doi.org/10.1038/nprot.2016.136). ] --- ## Workflow of protein expression quantification  .footer[ Figure 1, Tyanova et al. The MaxQuant computational platform for mass spectrometry-based shotgun proteomics. [_Nature Protocols_, 11:2301-2319, 2016](https://doi.org/10.1038/nprot.2016.136). ] --- class: center, middle, inverse # Peptide search --- ## Search for peptides in raw mass-spec data * To illustrate the analysis of mass-spectrometry data we are going to use the proteomics data by [Costa et al. (2021)](https://doi.org/10.1111/febs.15578), where gene and protein expression profiles from peripheral blood were compared between extremely preterm newborns (< 28kw gestational age) affected and unaffected by a fetal inflammatory response (FIR).  * Proteins from 20 dried blood spot samples were quantified using an [Orbitrap Fusion Lumos](https://www.thermofisher.com/es/es/home/industrial/mass-spectrometry/liquid-chromatography-mass-spectrometry-lc-ms/lc-ms-systems/orbitrap-lc-ms/orbitrap-tribrid-mass-spectrometers/orbitrap-fusion-lumos-mass-spectrometer.html) instrument and the [MaxQuant](https://www.maxquant.org) software. --- ## Search for peptides in raw mass-spec data * The raw proteomics data is available at the PRIDE repository under accession [PXD011626](https://www.ebi.ac.uk/pride/archive/projects/PXD011626). * For convenience, instead of [MaxQuant](https://www.maxquant.org), here we are going to illustrate how to perform protein quantification using another software, [MS-GF+](https://doi.org/10.1038/ncomms6277), available at [https://github.com/MSGFPlus/msgfplus](https://github.com/MSGFPlus/msgfplus). * To use it you need a system-wide [Java](https://www.java.com/en/download/help/index_installing.html) installation and the `MSGFPlus.jar` file from the MS-GF+ software available through its GitHub [repository](https://github.com/MSGFPlus/msgfplus). --- ## Search for peptides in raw mass-spec data * Before we use the MS-GF+ software, we need to convert the files produced by the Orbitrap instrument in the proprietary Thermo Fisher RAW format into the open XML-based [mzML](https://doi.org/10.1007%2F978-1-60761-444-9_22) format. * This conversion can be performed with the [ThermoRawFileParser](https://doi.org/10.1021/acs.jproteome.9b00328) software in the following three steps: 1. First, if you are in a Unix or macOS system, install the [mono](https://www.mono-project.com) framework. 2. Second, download the latest release of the ThermoRawFileParser from its [GitHub repository](https://github.com/compomics/ThermoRawFileParser). 3. Third, if you are in a Unix or macOS system, use the following command line: ``` $ mono ThermoRawFileParser.exe -i=file.raw -b=file.mzML ``` If you are in Windows system, you can use the previous command line without calling the `mono` executable. --- ## Search for peptides in raw mass-spec data * Once we have the `.mzML` files we can proceed to search for peptides in our mass spectrometry data as follows: ``` $ java -Xmx3500M -jar MSGFPlus.jar -s inputfile.mzML \ -d proteins.fasta -o outputfile.mzid \ -inst 1 -t 20ppm -ti -1,2 -ntt 2 -tda 1 ``` * The parameters other than input and output files are set for the mass-spec data from Costa et al. (2021). Try to run that search with the file [180506_S_ROCA_01_01_BS03.mzML](180506_S_ROCA_01_01_BS03.mzML.zip) (ZIP file of about 180M) and the database of human proteins [sp_human_2018_04.fasta](sp_human_2018_04.fasta.zip). Calculations should take about 20 minutes in a modern laptop. --- class: small-code ## Search for peptides in raw mass-spec data * Once you have obtained a peptide identification file such as [180506_S_ROCA_01_01_BS03.mzid](180506_S_ROCA_01_01_BS03.mzid.zip), we can import the results into R using the Bioconductor package [mzID](https://bioconductor.org/packages/mzID) ``` r > library(mzID) > > res <- mzID("180506_S_ROCA_01_01_BS03.mzid") reading 180506_S_ROCA_01_01_BS03.mzid... DONE! > class(res) [1] "mzID" attr(,"package") [1] "mzID" ``` * To inspect the data we should transform the previous object into a `data.frame` with the function `flatten()` as follows. ``` r > resdf <- flatten(res) > dim(resdf) [1] 17710 31 ``` --- class: small-code ## Search for peptides in raw mass-spec data * We can select the proteins with one or more detected peptides at a FDR < 1% as follows. ``` r > mask <- resdf[["ms-gf:qvalue"]] < 0.01 > sum(mask) [1] 640 > head(resdf[mask, ], n=2) spectrumid scan number(s) scan start time 1 controllerType=0 controllerNumber=1 scan=25737 25737 92.08893 2 controllerType=0 controllerNumber=1 scan=24799 24799 89.24185 acquisitionnum chargestate experimentalmasstocharge calculatedmasstocharge 1 25737 2 1417.591 1417.591 2 24799 3 1051.784 1051.784 rank passthreshold ms-gf:denovoscore ms-gf:evalue ms-gf:pepqvalue 1 1 TRUE 153 2.901908e-25 0 2 1 TRUE 178 3.113228e-24 0 ms-gf:qvalue ms-gf:rawscore ms-gf:specevalue assumeddissociationmethod 1 0 151 1.310874e-32 HCD 2 0 158 1.404265e-31 HCD isotopeerror start end pre post isdecoy length accession 1 0 302 328 R F FALSE 453 sp|P02679|FIBG_HUMAN 2 0 300 328 K F FALSE 453 sp|P02679|FIBG_HUMAN description 1 Fibrinogen gamma chain OS=Homo sapiens OX=9606 GN=FGG PE=1 SV=3 2 Fibrinogen gamma chain OS=Homo sapiens OX=9606 GN=FGG PE=1 SV=3 pepseq modified modification 1 LTYAYFAGGDAGDAFDGFDFGDDPSDK FALSE <NA> 2 YRLTYAYFAGGDAGDAFDGFDFGDDPSDK FALSE <NA> idFile spectrumFile databaseFile 1 180506_S_ROCA_01_01_BS03.mzid test.mzML sp_human_2018_04.fasta 2 180506_S_ROCA_01_01_BS03.mzid test.mzML sp_human_2018_04.fasta ``` --- ## Search for peptides in raw mass-spec data * We can visually verify the hits of our mass-spec data to a given peptide by inspecting the corresponding RAW file with the R packages [rawrr](https://github.com/fgcz/rawrr) and [protViz](https://cran.r-project.org/package=protViz). * Because the [rawrr](https://github.com/fgcz/rawrr) package needs a proprietary software component from Thermo Fisher, if you have installed yourself the package, you need to run the following command before start using it, to install the additional components that cannot be distributed with the package: ``` r > library(rawrr) > > installRawrrExe() ``` * This function will a return zero value if this component has been correctly downloaded and installed. --- ## Search for peptides in raw mass-spec data * For instance, let's say we found the peptide `AAVGQEEIQLR` and want to see its hits in the RAW file [180506_S_ROCA_01_01_BS03.raw](180506_S_ROCA_01_01_BS03.raw.zip) (ZIP file of about 450M). * First, read the index of scans from the RAW file. ``` r > library(rawrr) > > rawfile <- "180506_S_ROCA_01_01_BS03.raw" > idx <- readIndex(rawfile) ``` * The first time you call `readIndex()` you will be shown a license of use of the software from Thermo Fisher, which you should accept answering `Y` if you want to proceed using the software. --- ## Search for peptides in raw mass-spec data * Second, calculate the parent ion mass of the peptide using the function `parentIonMass()` from the [protViz](https://cran.r-project.org/package=protViz) package. ``` r > library(protViz) > > peptide <- "AAVGQEEIQLR" > mass2Hplus <- (parentIonMass(peptide) + 1.008) / 2 ``` * Third, find the spectra with a difference in mass under a tolerance of 0.01. ``` r > hits <- which(abs(idx$precursorMass - mass2Hplus) < 0.01) > hits [1] 9340 10203 22819 ``` --- class: small-code ## Search for peptides in raw mass-spec data * Fourth, read the actual spectra corresponding to the found hits. ``` r > S <- readSpectrum(rawfile, hits) > class(S) [1] "rawrrSpectrumSet" > length(S) [1] 3 > S[[1]] Total Ion Current: 60525.68 Scan Low Mass: 140 Scan High Mass: 1225 Scan Start Time (min): 40.86975 Scan Number: 9340 Base Peak Intensity: 2028 Base Peak Mass: 972.496 Scan Mode: ITMS + c NSI r d Full ms2 607.3312@hcd28.00 [140.0000-1225.0000] ======= Instrument data ===== : NULL Multi Inject Info: AGC: Predicted Micro Scan Count: 1 Charge State: 2 Monoisotopic M/Z: 607.3312 Ion Injection Time (ms): 200.000 MS2 Isolation Width: 1.60 HCD Energy: -31.08 === Mass Calibration: NULL Conversion Parameter B: 0.000000 Conversion Parameter C: 0.000000 Temperature Comp. (ppm): 0.00 RF Comp. (ppm): 0.00 Space Charge Comp. (ppm): 0.00 Resolution Comp. (ppm): 0.00 Number of LM Found: 0 === Ion Optics Settings: NULL ==== Diagnostic Data: NULL RawOvFtT: 0.0 Injection t0: 0.000 ``` --- class: small-code ## Search for peptides in raw mass-spec data * Finally, visualize the hits, the first one for instance. ``` r yIons <- fragmentIon(peptide)[[1]]$y plot(S[[1]], SN=TRUE, diagnostic=TRUE) names(yIons) <- paste0("y", seq(1, length(yIons))) abline(v=yIons, col="#DDDDDD88", lwd=5) axis(3, yIons, names(yIons)) ``` <img src="data:image/png;base64,#/Users/robert/Documents/docencia/master/curs2425/IEO23-24_proteomics2/docs/index_files/figure-html/rawhit1-1.png" width="550px" style="display: block; margin: auto;" /> --- class: center, middle, inverse # Protein expression data analysis --- ## Workflow of protein expression data analysis * Import data. * Explore distribution of unique peptides. * Explore pattern of missing quantifications. * Filtering of lowly-expressed proteins. * Filtering of unreliably-expressed proteins. * Normalization. * Imputation of missing quantifications. * Exploration of protein expression profiles. * Differential protein expression analysis. --- ## Import data * Download the following data files from [Costa et al. (2021)](https://doi.org/10.1111/febs.15578): 1. Protein quantifications ([proteinQuantifications.csv](proteinQuantifications.csv)). A matrix with values of abundance of proteins in samples, which we may regard as raw protein expression profiles. 2. Protein feature data ([featureData.csv](featureData.csv)). Metadata about the proteins, including the number of peptides on which quantifications are based. 3. Sample phenotype data ([phenoData.csv](phenoData.csv)). Metadata about the samples, including a main outcome of interest (fetal inflammatory response -FIR) and a covariate (sex). --- class: small-code ## Import data * Load the protein quantifications into R, which we may regard as raw protein expression profiles. ``` r > protQuantData <- read.csv(file="proteinQuantifications.csv", row.names=1) > protQuantData <- as.matrix(protQuantData) > dim(protQuantData) [1] 649 20 > head(protQuantData) Sample1 Sample2 Sample3 Sample4 Sample5 Sample6 Sample7 Sample8 A0A075B6K5 0 580960 0 0 300090 0 0 829400 A2NJV5 0 1163300 0 1867200 1164800 1147700 0 1056500 A0A0C4DH68 0 0 0 0 0 0 0 0 A0A1B0GWG4 1269600 379010 0 0 0 0 812560 0 A0JNW5 48361000 3987500 0 36275000 6213400 9468100 0 0 Q3MJ40 0 0 0 0 0 0 0 0 Sample9 Sample10 Sample11 Sample12 Sample13 Sample14 Sample15 A0A075B6K5 0 0 0 1461600 802610 270530 0 A2NJV5 0 0 0 1326700 1553300 422040 0 A0A0C4DH68 1536000 0 0 0 373960 0 0 A0A1B0GWG4 0 0 0 0 0 0 0 A0JNW5 0 24263000 0 19831000 0 27741000 0 Q3MJ40 0 0 0 0 0 0 0 Sample16 Sample17 Sample18 Sample19 Sample20 A0A075B6K5 0 486930 378470 0 0 A2NJV5 901020 1132000 657880 1158200 753220 A0A0C4DH68 0 0 0 0 0 A0A1B0GWG4 464060 0 0 0 0 A0JNW5 0 14330000 0 0 0 Q3MJ40 2237100 0 0 0 0 ``` --- class: small-code ## Import data * Proteins quantified with zero value are in fact not quantified, i.e., their quantification is missing. For this reason, we set zero values to `NA`. ``` r > protQuantData[protQuantData == 0] <- NA > head(protQuantData) Sample1 Sample2 Sample3 Sample4 Sample5 Sample6 Sample7 Sample8 A0A075B6K5 NA 580960 NA NA 300090 NA NA 829400 A2NJV5 NA 1163300 NA 1867200 1164800 1147700 NA 1056500 A0A0C4DH68 NA NA NA NA NA NA NA NA A0A1B0GWG4 1269600 379010 NA NA NA NA 812560 NA A0JNW5 48361000 3987500 NA 36275000 6213400 9468100 NA NA Q3MJ40 NA NA NA NA NA NA NA NA Sample9 Sample10 Sample11 Sample12 Sample13 Sample14 Sample15 A0A075B6K5 NA NA NA 1461600 802610 270530 NA A2NJV5 NA NA NA 1326700 1553300 422040 NA A0A0C4DH68 1536000 NA NA NA 373960 NA NA A0A1B0GWG4 NA NA NA NA NA NA NA A0JNW5 NA 24263000 NA 19831000 NA 27741000 NA Q3MJ40 NA NA NA NA NA NA NA Sample16 Sample17 Sample18 Sample19 Sample20 A0A075B6K5 NA 486930 378470 NA NA A2NJV5 901020 1132000 657880 1158200 753220 A0A0C4DH68 NA NA NA NA NA A0A1B0GWG4 464060 NA NA NA NA A0JNW5 NA 14330000 NA NA NA Q3MJ40 2237100 NA NA NA NA ``` --- class: small-code ## Import data * Load feature data on the number of peptides on which the quantification of each protein is based and the corresponding gene symbol. ``` r > featureData <- read.csv(file="featureData.csv", row.names=1) > head(featureData) UniqPeptides GeneSymbol A0A075B6K5 1 IGLV3-9 A2NJV5 1 IGKV2-29 A0A0C4DH68 1 IGKV2-24 A0A1B0GWG4 1 SERTM2 A0JNW5 1 UHRF1BP1L Q3MJ40 1 CCDC144B ``` * Proteins in the feature data must match exactly those in the row names of the matrix of protein quantifications. ``` r > identical(rownames(featureData), rownames(protQuantData)) [1] TRUE ``` --- class: small-code ## Import data * Load phenotype data with the main outcome of interest (FIR) and a covariate (sex) for the proteomics experiment. ``` r > phenotypeData <- read.csv("phenoData.csv", row.names=1, stringsAsFactors=TRUE) > head(phenotypeData, n=3) FIR Sex Sample1 yes female Sample2 yes male Sample3 no female > table(phenotypeData) Sex FIR female male no 4 7 yes 3 6 ``` * The sample names in the phenotype data must match exactly those in the columns names of the matrix of protein quantifications. ``` r > identical(rownames(phenotypeData), colnames(protQuantData)) [1] TRUE ``` --- ## Explore distribution of unique peptides * Explore the empirical cumulative distribution of the number of unique peptides throughout all the 649 proteins. ``` r par(mar=c(4, 5, 1, 1)) plot.ecdf(featureData$UniqPeptides, panel.first=grid(), xlab="Nr. unique peptides", las=1, main="") ``` <img src="data:image/png;base64,#/Users/robert/Documents/docencia/master/curs2425/IEO23-24_proteomics2/docs/index_files/figure-html/uniqPeptidesFn-1.png" width="450px" style="display: block; margin: auto;" /> --- class: small-code ## Explore distribution of unique peptides * Explore the protein quantification values as function of the number of unique peptides. ``` r f <- cut(featureData$UniqPeptides, breaks=c(0, 1, 10, 20, 80)) logQ <- rowMeans(log2(protQuantData)) plot(logQ ~ f, xlab="# Uniq. Pept.", ylab="Mean log2 Quant.", xaxt="n", las=1) axis(1, at=1:nlevels(f), labels=sprintf("%s\n(n=%d)", levels(f), table(f)), padj=0.5) ``` <img src="data:image/png;base64,#/Users/robert/Documents/docencia/master/curs2425/IEO23-24_proteomics2/docs/index_files/figure-html/uniqPeptidesByProtQ-1.png" width="450px" style="display: block; margin: auto;" /> --- class: small-code ## Explore pattern of missing quantifications * Explore the pattern of missing quantifications by first calculating the number of samples where each protein has been quantified and then examining its distribution. ``` r qsamplesbyprot <- rowSums(!is.na(protQuantData)) nqsamplesbyprot <- table(qsamplesbyprot) par(mfrow=c(1,2), mar=c(4, 5, 3, 2)) barplot(nqsamplesbyprot, xlab="# samples in which proteins are expressed", ylab="# Proteins", las=1) plot.ecdf(ncol(protQuantData)-qsamplesbyprot, xlab="Min. # samples in which proteins are expressed", las=1, xaxt="n", panel.first=grid()) axis(1, at=seq(0, 20, by=2), labels=seq(20, 0, by=-2)) ``` <img src="data:image/png;base64,#/Users/robert/Documents/docencia/master/curs2425/IEO23-24_proteomics2/docs/index_files/figure-html/nQsamplesByProt-1.png" width="800px" style="display: block; margin: auto;" /> --- ## Filtering of lowly-expressed proteins * We filter _lowly_-expressed proteins by discarding those quantified in fewer samples than those forming the smaller group in our outcome of interest. ``` r > mask <- qsamplesbyprot >= min(table(phenotypeData$FIR)) > sum(mask) [1] 257 > protQuantData <- protQuantData[mask , ] > dim(protQuantData) [1] 257 20 > featureData <- featureData[mask, ] > dim(featureData) [1] 257 2 ``` --- ## Filtering of unreliably-expressed proteins * We also filter out proteins that have been detected by only one unique peptide, thus selecting proteins whose quantification is more robust because is based on two or more unique peptides. ``` r > mask <- featureData$UniqPeptides > 1 > sum(mask) [1] 208 > protQuantData <- protQuantData[mask, ] > dim(protQuantData) [1] 208 20 > featureData <- featureData[mask, ] > dim(featureData) [1] 208 2 ``` --- class: small-code ## Filtering of unreliably-expressed proteins * The previous filtering steps lead to the following number of reliably-quantified proteins per sample. ``` r nQprotsBySample <- colSums(!is.na(protQuantData)) names(nQprotsBySample) <- sub("Sample", "", names(nQprotsBySample)) ord <- order(nQprotsBySample) barplot(nQprotsBySample[ord], xlab="Samples", ylab="# proteins", las=1, col=c("skyblue", "darkred")[as.integer(phenotypeData$FIR[ord])]) legend("topleft", levels(phenotypeData$FIR), fill=c("skyblue", "darkred"), title="FIR", inset=0.01) ``` <img src="data:image/png;base64,#/Users/robert/Documents/docencia/master/curs2425/IEO23-24_proteomics2/docs/index_files/figure-html/nQprotBySample-1.png" width="800px" style="display: block; margin: auto;" /> --- class: small-code ## Normalization * We normalize the raw protein quantification values using the Bioconductor package [vsn](https://bioconductor.org/packages/vsn). You can find a review on normalization methods for quantitative label-free proteomics in [(Välikangas et al., 2018)](https://doi.org/10.1093/bib/bbw095). ``` r > library(vsn) > > vsn.fit <- vsnMatrix(protQuantData) > normProtQuantData <- predict(vsn.fit, protQuantData) > normProtQuantData[1:8, 1:8] Sample1 Sample2 Sample3 Sample4 Sample5 Sample6 Sample7 Sample8 O00233 20.26333 20.97396 NA 20.37712 19.85400 NA NA 21.73933 O14818 19.94017 NA NA NA NA NA 20.43246 NA P19105 NA 21.41161 NA 20.33488 NA 20.87179 NA 20.62828 O43432 21.82440 21.59959 22.43131 22.11325 20.35374 21.91358 24.05840 NA Q99880 24.23519 25.03429 22.93407 23.48319 23.39306 20.74483 21.60923 21.68305 O75347 NA 19.65295 19.88065 NA NA NA 19.64799 19.47023 O75368 21.20920 20.99475 NA 21.64698 21.47215 21.47737 21.98717 21.94319 P00167 19.85649 NA NA NA 19.91599 20.68977 NA 19.99607 ``` --- class: small-code ## Normalization * Compare raw quantification values in log2-scale with normalized protein quantifications. ``` r par(mfrow=c(2, 1), mar=c(4, 5, 1, 1)) boxplot(log2(protQuantData), xlab="Samples", ylab="log2 Quant.", las=1) boxplot(normProtQuantData, xlab="Samples", ylab="log2 Norm. Quant.", las=1) ``` <img src="data:image/png;base64,#/Users/robert/Documents/docencia/master/curs2425/IEO23-24_proteomics2/docs/index_files/figure-html/unnormVsNormProtExp-1.png" width="800px" style="display: block; margin: auto;" /> --- class: small-code ## Imputation of missing quantifications * Examine the pattern of missing quantifications in proteins with at least one missing value. ``` r misspat <- is.na(normProtQuantData) ; misspat <- misspat[rowSums(misspat) > 0, ] sampleClustering <- hclust(dist(t(misspat+0)), method="complete") protClustering <- hclust(dist(misspat+0), method="complete") heatmap(misspat+0, Colv=as.dendrogram(sampleClustering), labRow="", scale="none", Rowv=as.dendrogram(protClustering), col=c("black", "white"), cexCol=0.8) ``` <img src="data:image/png;base64,#/Users/robert/Documents/docencia/master/curs2425/IEO23-24_proteomics2/docs/index_files/figure-html/missingQpattern-1.png" width="800px" style="display: block; margin: auto;" /> --- ## Imputation of missing quantifications * How many proteins do not have any missing value. ``` r > mask <- rowSums(is.na(normProtQuantData)) == 0 > sum(mask) [1] 46 ``` * How many proteins have one or more missing values. ``` r > sum(!mask) [1] 162 ``` --- class: small-code ## Imputation of missing quantifications * Compare expression profile distributions between complete protein expression profiles and protein expression profiles with at least one missing quantification. ``` r library(geneplotter) par(mar=c(4, 5, 1, 1)) multidensity(list("No missing"=as.vector(normProtQuantData[mask, ]), "Some missing"=as.vector(normProtQuantData[!mask, ])), main="", las=1, xlab="Protein expression") ``` <img src="data:image/png;base64,#/Users/robert/Documents/docencia/master/curs2425/IEO23-24_proteomics2/docs/index_files/figure-html/expDistComp-1.png" width="500px" style="display: block; margin: auto;" /> --- class: small-code ## Imputation of missing quantifications * Clearly, proteins with complete expression profiles have higher expression levels than proteins with missing expression values and we may conclude that we missing values arise from proteins whose expression is close to the detection limit. * We use the function `impute.wrapper.MLE()` from the R/CRAN package [imputeLCMD](https://cran.r-project.org/web/packages/imputeLCMD) to impute missing values following a _missing at random_ (MAR) pattern with a method for multivariate normal data by Schafer (1997). ``` r > library(imputeLCMD) > > normProtQuantDataImp <- impute.wrapper.MLE(normProtQuantData) > normProtQuantDataImp[1:5, 1:8] Sample1 Sample2 Sample3 Sample4 Sample5 Sample6 Sample7 Sample8 O00233 20.26333 20.97396 20.33908 20.37712 19.85400 20.19773 19.50513 21.73933 O14818 19.94017 21.49988 21.14678 20.09215 19.79553 20.61105 20.43246 20.32351 P19105 20.86321 21.41161 21.22577 20.33488 21.03228 20.87179 20.17302 20.62828 O43432 21.82440 21.59959 22.43131 22.11325 20.35374 21.91358 24.05840 22.68997 Q99880 24.23519 25.03429 22.93407 23.48319 23.39306 20.74483 21.60923 21.68305 ``` --- class: small-code ## Imputation of missing quantifications * Compare protein expression values before and after normalizing and imputing. ``` r par(mfrow=c(2, 1), mar=c(4, 5, 1, 1)) boxplot(log2(protQuantData), xlab="Samples", ylab="log2 Quant.", las=1) boxplot(normProtQuantDataImp, xlab="Samples", ylab="log2 Norm. Quant.", las=1) ``` <img src="data:image/png;base64,#/Users/robert/Documents/docencia/master/curs2425/IEO23-24_proteomics2/docs/index_files/figure-html/unnormVsNormImpProtExp-1.png" width="800px" style="display: block; margin: auto;" /> --- class: small-code ## Exploration of protein expression values * Examine the similarity between sample expression profiles, in relationship with the main outcome of interest (FIR) and the covariate (sex), where the latter does not seem to drive any variability in this proteomics data set. ``` r library(edgeR) ; par(mar=c(4, 5, 1, 1)) plotMDS(normProtQuantDataImp, labels=NULL, las=1, col=c("skyblue", "darkred")[phenotypeData$FIR], pch=c(female=19, male=15)[phenotypeData$Sex]) legend("topright", levels(phenotypeData$FIR), pch=19, col=c("skyblue", "darkred"), title="FIR", inset=0.05) ``` <img src="data:image/png;base64,#/Users/robert/Documents/docencia/master/curs2425/IEO23-24_proteomics2/docs/index_files/figure-html/protMDS-1.png" width="400px" style="display: block; margin: auto;" /> --- class: small-code ## Differential protein expression analysis * We conduct a differential expression analysis using the `limma-trend` pipeline. ``` r > library(limma) > mod <- model.matrix(~ FIR, phenotypeData) > fit <- lmFit(normProtQuantDataImp, mod) > fit$genes <- featureData > fit <- eBayes(fit, trend=TRUE) > summary(decideTests(fit, p=0.05)) (Intercept) FIRyes Down 0 0 NotSig 0 196 Up 208 12 > tt <- topTable(fit, coef=2, n=Inf) > dim(tt) [1] 208 8 > head(tt) UniqPeptides GeneSymbol logFC AveExpr t P.Value P06702 7 S100A9 2.6606369 23.25566 5.272890 2.808093e-05 P05164 13 MPO 1.7068542 21.22826 4.775473 9.298233e-05 P27105 4 STOM 0.8435713 20.63693 4.740952 1.010925e-04 P05109 4 S100A8 2.4192540 22.91681 4.645568 1.274036e-04 Q99880 4 HIST1H2BL 2.1351920 22.95113 4.550444 1.605120e-04 P59666 2 DEFA3 1.7969710 20.66881 4.406364 2.278509e-04 adj.P.Val B P06702 0.005840834 2.3632026 P05164 0.006624987 1.1809028 P27105 0.006624987 1.0985069 P05109 0.006624987 0.8707355 Q99880 0.006677300 0.6435092 P59666 0.007815732 0.2994167 ``` --- ## Differential protein expression analysis * Examine differential expression diagnostics: p-value distribution. ``` r par(mar=c(4, 5, 1, 1)) hist(tt$P.Value, main="", xlab="Raw P-value", col="grey", las=1) ``` <img src="data:image/png;base64,#/Users/robert/Documents/docencia/master/curs2425/IEO23-24_proteomics2/docs/index_files/figure-html/DEpvalDist-1.png" width="800px" style="display: block; margin: auto;" /> --- class: small-code ## Differential protein expression analysis * Examine differential expression diagnostics: volcano plot. ``` r plot(tt$logFC, -log10(tt$P.Value), pch=".", cex=4, las=1, xlab="log2FC", ylab="log10 P-value") mask <- abs(tt$logFC) > log2(1.5) & tt$adj.P.Val < 0.05 text(tt$logFC[mask], -log10(tt$P.Value[mask]), tt$GeneSymbol[mask], pos=1, col="red", cex=0.6) ``` <img src="data:image/png;base64,#/Users/robert/Documents/docencia/master/curs2425/IEO23-24_proteomics2/docs/index_files/figure-html/DEvolcano-1.png" width="600px" style="display: block; margin: auto;" /> --- ## Concluding remarks * Mass-spectrometry proteomics allows for a wide exploration of protein abundance. * Searching for peptides in mass-spec data cannot be peformed solely on the basis of visual hits to spectra and requires robust statistical methods that adjust for multiple testing. * It is important to understand the specific features of this type of data, such as the missing quantifications and the number of peptides used to quantify the proteins. * Missing quantifications require the use of data imputation techniques. * Different choices of filtering lowly and unreliably expressed proteins, data imputation techniques and normalization procedures may lead to different results. * Once data is normalized and imputed, we can use the `limma-trend` pipeline for a differential protein expression analysis. --- class: small-code ## Session information ``` r > sessionInfo() R version 4.5.0 (2025-04-11) Platform: x86_64-apple-darwin20 Running under: macOS Sonoma 14.7.3 Matrix products: default BLAS: /Library/Frameworks/R.framework/Versions/4.5-x86_64/Resources/lib/libRblas.0.dylib LAPACK: /Library/Frameworks/R.framework/Versions/4.5-x86_64/Resources/lib/libRlapack.dylib; LAPACK version 3.12.1 locale: [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8 time zone: Europe/Madrid tzcode source: internal attached base packages: [1] stats4 stats graphics grDevices utils datasets methods [8] base other attached packages: [1] edgeR_4.6.1 limma_3.64.0 imputeLCMD_2.1 [4] impute_1.82.0 pcaMethods_2.0.0 norm_1.0-11.1 [7] tmvtnorm_1.6 gmm_1.8 sandwich_3.1-1 [10] Matrix_1.7-3 mvtnorm_1.3-3 geneplotter_1.86.0 [13] annotate_1.86.0 XML_3.99-0.18 AnnotationDbi_1.70.0 [16] IRanges_2.42.0 S4Vectors_0.46.0 lattice_0.22-7 [19] vsn_3.76.0 Biobase_2.68.0 BiocGenerics_0.54.0 [22] generics_0.1.3 protViz_0.7.9 rawrr_1.16.0 [25] mzID_1.46.0 loaded via a namespace (and not attached): [1] tidyselect_1.2.1 dplyr_1.1.4 farver_2.1.2 [4] blob_1.2.4 Biostrings_2.76.0 fastmap_1.2.0 [7] xaringan_0.30 digest_0.6.37 lifecycle_1.0.4 [10] ProtGenerics_1.40.0 statmod_1.5.0 KEGGREST_1.48.0 [13] RSQLite_2.3.11 magrittr_2.0.3 compiler_4.5.0 [16] rlang_1.1.6 sass_0.4.10 tools_4.5.0 [19] yaml_2.3.10 knitr_1.50 bit_4.6.0 [22] plyr_1.8.9 RColorBrewer_1.1-3 grid_4.5.0 [25] preprocessCore_1.70.0 xtable_1.8-4 ggplot2_3.5.2 [28] scales_1.4.0 iterators_1.0.14 cli_3.6.5 [31] rmarkdown_2.29 crayon_1.5.3 httr_1.4.7 [34] DBI_1.2.3 cachem_1.1.0 affy_1.86.0 [37] splines_4.5.0 parallel_4.5.0 BiocManager_1.30.25 [40] XVector_0.48.0 vctrs_0.6.5 jsonlite_2.0.0 [43] bit64_4.6.0-1 locfit_1.5-9.12 foreach_1.5.2 [46] jquerylib_0.1.4 affyio_1.78.0 glue_1.8.0 [49] codetools_0.2-20 gtable_0.3.6 GenomeInfoDb_1.44.0 [52] UCSC.utils_1.4.0 tibble_3.2.1 pillar_1.10.2 [55] htmltools_0.5.8.1 GenomeInfoDbData_1.2.14 R6_2.6.1 [58] doParallel_1.0.17 evaluate_1.0.3 png_0.1-8 [61] memoise_2.0.1 bslib_0.9.0 Rcpp_1.0.14 [64] xfun_0.52 zoo_1.8-14 pkgconfig_2.0.3 ```