Analysis of proteomics data

class: title-slide, middle, center

# Analysis of proteomics data

.pull-left[
## Robert Castelo
[robert.castelo@upf.edu](mailto:robert.castelo@upf.edu)
### Dept. of Medicine and
### Life Sciences
### Universitat Pompeu Fabra
]

.pull-right[
## Eduard Sabidó
[eduard.sabido@crg.eu](mailto:eduard.sabido@crg.cat)
### CRG/UPF Proteomics
### Core Facility
]

<br>

### MSc on Bioinformatics for the Health Sciences
### UPF School of Health and Life Sciences
### Academic Year 2024-2025

---

## Workflow of protein expression quantification

![](data:image/png;base64,#img/TyanovaCox16_MaxQuantFig1a.png)

.footer[
Figure 1, Tyanova et al. The MaxQuant computational platform for mass spectrometry-based
shotgun proteomics.  
[_Nature Protocols_, 11:2301-2319, 2016](https://doi.org/10.1038/nprot.2016.136).
]

---

## Workflow of protein expression quantification

![:scale 90%](data:image/png;base64,#img/TyanovaCox16_MaxQuantFig1b.png)

.footer[
Figure 1, Tyanova et al. The MaxQuant computational platform for mass spectrometry-based
shotgun proteomics.  
[_Nature Protocols_, 11:2301-2319, 2016](https://doi.org/10.1038/nprot.2016.136).
]

---
class: center, middle, inverse

# Peptide search

---

## Search for peptides in raw mass-spec data

* To illustrate the analysis of mass-spectrometry data we are going to use the
  proteomics data by [Costa et al. (2021)](https://doi.org/10.1111/febs.15578),
  where gene and protein expression profiles from peripheral blood were
  compared between extremely preterm newborns (< 28kw gestational age) affected
  and unaffected by a fetal inflammatory response (FIR).
  ![abstract](data:image/png;base64,#img/Costa21FEBS.png)
* Proteins from 20 dried blood spot samples were quantified using an
  [Orbitrap Fusion Lumos](https://www.thermofisher.com/es/es/home/industrial/mass-spectrometry/liquid-chromatography-mass-spectrometry-lc-ms/lc-ms-systems/orbitrap-lc-ms/orbitrap-tribrid-mass-spectrometers/orbitrap-fusion-lumos-mass-spectrometer.html)
  instrument and the [MaxQuant](https://www.maxquant.org) software.

---

## Search for peptides in raw mass-spec data

* The raw proteomics data is available at the PRIDE repository under accession
  [PXD011626](https://www.ebi.ac.uk/pride/archive/projects/PXD011626).  
  &nbsp;&nbsp;
* For convenience, instead of [MaxQuant](https://www.maxquant.org), here we are
  going to illustrate how to perform protein quantification using another
  software,
  [MS-GF+](https://doi.org/10.1038/ncomms6277), available at
  [https://github.com/MSGFPlus/msgfplus](https://github.com/MSGFPlus/msgfplus).  
  &nbsp;&nbsp;
* To use it you need a system-wide
  [Java](https://www.java.com/en/download/help/index_installing.html)
  installation and the `MSGFPlus.jar` file from the MS-GF+ software available
  through its GitHub [repository](https://github.com/MSGFPlus/msgfplus).

---

## Search for peptides in raw mass-spec data

* Before we use the MS-GF+ software, we need to convert the files produced by
  the Orbitrap instrument in the proprietary Thermo Fisher RAW format into the
  open XML-based
  [mzML](https://doi.org/10.1007%2F978-1-60761-444-9_22) format.  
  &nbsp;&nbsp;
* This conversion can be performed with the
  [ThermoRawFileParser](https://doi.org/10.1021/acs.jproteome.9b00328)
  software in the following three steps:
  1. First, if you are in a Unix or macOS system, install the
     [mono](https://www.mono-project.com) framework.  
     &nbsp;&nbsp;
  2. Second, download the latest release of the ThermoRawFileParser from its
     [GitHub repository](https://github.com/compomics/ThermoRawFileParser).  
     &nbsp;&nbsp;
  3. Third, if you are in a Unix or macOS system, use the following command
     line:  
     &nbsp;&nbsp;
     ```
     $ mono ThermoRawFileParser.exe -i=file.raw -b=file.mzML
     ```
     If you are in Windows system, you can use the previous command line
     without calling the `mono` executable.

---

## Search for peptides in raw mass-spec data

* Once we have the `.mzML` files we can proceed to search for peptides in our
  mass spectrometry data as follows:  
  &nbsp;&nbsp;
  ```
  $ java -Xmx3500M -jar MSGFPlus.jar -s inputfile.mzML \
    -d proteins.fasta -o outputfile.mzid \
    -inst 1 -t 20ppm -ti -1,2 -ntt 2 -tda 1
  ```
  &nbsp;&nbsp;
* The parameters other than input and output files are set for the mass-spec
  data from Costa et al. (2021). Try to run that search with the file
  [180506_S_ROCA_01_01_BS03.mzML](180506_S_ROCA_01_01_BS03.mzML.zip) (ZIP file
  of about 180M) and the database of human proteins
  [sp_human_2018_04.fasta](sp_human_2018_04.fasta.zip). Calculations should
  take about 20 minutes in a modern laptop.

---
class: small-code

## Search for peptides in raw mass-spec data

* Once you have obtained a peptide identification file such as
  [180506_S_ROCA_01_01_BS03.mzid](180506_S_ROCA_01_01_BS03.mzid.zip), we can
  import the results into R using the Bioconductor package
  [mzID](https://bioconductor.org/packages/mzID)

``` r
  > library(mzID)
  > 
  > res <- mzID("180506_S_ROCA_01_01_BS03.mzid")
  reading 180506_S_ROCA_01_01_BS03.mzid...
   DONE!
  > class(res)
  [1] "mzID"
  attr(,"package")
  [1] "mzID"
  ```
* To inspect the data we should transform the previous object into a
  `data.frame` with the function `flatten()` as follows.

``` r
  > resdf <- flatten(res)
  > dim(resdf)
  [1] 17710    31
  ```

---
class: small-code

## Search for peptides in raw mass-spec data

* We can select the proteins with one or more detected peptides at
  a FDR < 1% as follows.

``` r
  > mask <- resdf[["ms-gf:qvalue"]] < 0.01
  > sum(mask)
  [1] 640
  > head(resdf[mask, ], n=2)
                                        spectrumid scan number(s) scan start time
  1 controllerType=0 controllerNumber=1 scan=25737          25737        92.08893
  2 controllerType=0 controllerNumber=1 scan=24799          24799        89.24185
    acquisitionnum chargestate experimentalmasstocharge calculatedmasstocharge
  1          25737           2                 1417.591               1417.591
  2          24799           3                 1051.784               1051.784
    rank passthreshold ms-gf:denovoscore ms-gf:evalue ms-gf:pepqvalue
  1    1          TRUE               153 2.901908e-25               0
  2    1          TRUE               178 3.113228e-24               0
    ms-gf:qvalue ms-gf:rawscore ms-gf:specevalue assumeddissociationmethod
  1            0            151     1.310874e-32                       HCD
  2            0            158     1.404265e-31                       HCD
    isotopeerror start end pre post isdecoy length            accession
  1            0   302 328   R    F   FALSE    453 sp|P02679|FIBG_HUMAN
  2            0   300 328   K    F   FALSE    453 sp|P02679|FIBG_HUMAN
                                                        description
  1 Fibrinogen gamma chain OS=Homo sapiens OX=9606 GN=FGG PE=1 SV=3
  2 Fibrinogen gamma chain OS=Homo sapiens OX=9606 GN=FGG PE=1 SV=3
                           pepseq modified modification
  1   LTYAYFAGGDAGDAFDGFDFGDDPSDK    FALSE         <NA>
  2 YRLTYAYFAGGDAGDAFDGFDFGDDPSDK    FALSE         <NA>
                           idFile spectrumFile           databaseFile
  1 180506_S_ROCA_01_01_BS03.mzid    test.mzML sp_human_2018_04.fasta
  2 180506_S_ROCA_01_01_BS03.mzid    test.mzML sp_human_2018_04.fasta
  ```

---

## Search for peptides in raw mass-spec data

* We can visually verify the hits of our mass-spec data to a given
  peptide by inspecting the corresponding RAW file with the R packages
  [rawrr](https://github.com/fgcz/rawrr) and
  [protViz](https://cran.r-project.org/package=protViz).  
  &nbsp;&nbsp;
* Because the [rawrr](https://github.com/fgcz/rawrr) package needs
  a proprietary software component from Thermo Fisher, if you have installed
  yourself the package, you need to run the following command before start
  using it, to install the additional components that cannot be distributed
  with the package:

``` r
  > library(rawrr)
  > 
  > installRawrrExe()
  ```
* This function will a return zero value if this component has been correctly
  downloaded and installed.

---

## Search for peptides in raw mass-spec data

* For instance, let's say we found the peptide `AAVGQEEIQLR` and want
  to see its hits in the RAW file
  [180506_S_ROCA_01_01_BS03.raw](180506_S_ROCA_01_01_BS03.raw.zip)
  (ZIP file of about 450M).  
  &nbsp;&nbsp;
* First, read the index of scans from the RAW file.

``` r
  > library(rawrr)
  > 
  > rawfile <- "180506_S_ROCA_01_01_BS03.raw"
  > idx <- readIndex(rawfile)
  ```
* The first time you call `readIndex()` you will be shown a license of
  use of the software from Thermo Fisher, which you should accept answering
  `Y` if you want to proceed using the software.

---

## Search for peptides in raw mass-spec data

* Second, calculate the parent ion mass of the peptide using
  the function `parentIonMass()` from the
  [protViz](https://cran.r-project.org/package=protViz) package.

``` r
  > library(protViz)
  > 
  > peptide <- "AAVGQEEIQLR"
  > mass2Hplus <- (parentIonMass(peptide) + 1.008) / 2
  ```
* Third, find the spectra with a difference in mass under a
  tolerance of 0.01.

``` r
  > hits <- which(abs(idx$precursorMass - mass2Hplus) < 0.01)
  > hits
  [1]  9340 10203 22819
  ```

---
class: small-code

## Search for peptides in raw mass-spec data

* Fourth, read the actual spectra corresponding to the
  found hits.

``` r
  > S <- readSpectrum(rawfile, hits)
  > class(S)
  [1] "rawrrSpectrumSet"
  > length(S)
  [1] 3
  > S[[1]]
  Total Ion Current:	 60525.68
  Scan Low Mass:	 140
  Scan High Mass:	 1225
  Scan Start Time (min):	 40.86975
  Scan Number:	 9340
  Base Peak Intensity:	 2028
  Base Peak Mass:	 972.496
  Scan Mode:	 ITMS + c NSI r d Full ms2 607.3312@hcd28.00 [140.0000-1225.0000]
  ======= Instrument data =====   :	NULL
  Multi Inject Info:	                                
  AGC:	Predicted   
  Micro Scan Count:	1
  Charge State:	2
  Monoisotopic M/Z:	607.3312
  Ion Injection Time (ms):	200.000
  MS2 Isolation Width:	1.60
  HCD Energy:	-31.08
  === Mass Calibration:	NULL
  Conversion Parameter B:	0.000000
  Conversion Parameter C:	0.000000
  Temperature Comp. (ppm):	0.00
  RF Comp. (ppm):	0.00
  Space Charge Comp. (ppm):	0.00
  Resolution Comp. (ppm):	0.00
  Number of LM Found:	0
  === Ion Optics Settings:	NULL
  ====  Diagnostic Data:	NULL
  RawOvFtT:	0.0
  Injection t0:	0.000
  ```

---
class: small-code

## Search for peptides in raw mass-spec data

* Finally, visualize the hits, the first one for instance.

``` r
  yIons <- fragmentIon(peptide)[[1]]$y
  plot(S[[1]], SN=TRUE, diagnostic=TRUE)
  names(yIons) <- paste0("y", seq(1, length(yIons)))
  abline(v=yIons, col="#DDDDDD88", lwd=5)
  axis(3, yIons, names(yIons))
  ```
  
  <img src="data:image/png;base64,#/Users/robert/Documents/docencia/master/curs2425/IEO23-24_proteomics2/docs/index_files/figure-html/rawhit1-1.png" width="550px" style="display: block; margin: auto;" />

---
class: center, middle, inverse

# Protein expression data analysis

---

## Workflow of protein expression data analysis

* Import data.  
  &nbsp;&nbsp;
* Explore distribution of unique peptides.  
  &nbsp;&nbsp;
* Explore pattern of missing quantifications.  
  &nbsp;&nbsp;
* Filtering of lowly-expressed proteins.  
  &nbsp;&nbsp;
* Filtering of unreliably-expressed proteins.  
  &nbsp;&nbsp;
* Normalization.  
  &nbsp;&nbsp;
* Imputation of missing quantifications.  
  &nbsp;&nbsp;
* Exploration of protein expression profiles.  
  &nbsp;&nbsp;
* Differential protein expression analysis.

---

## Import data

* Download the following data files from
  [Costa et al. (2021)](https://doi.org/10.1111/febs.15578):

1. Protein quantifications
     ([proteinQuantifications.csv](proteinQuantifications.csv)).
     A matrix with values of abundance of proteins in samples, which we may
     regard as raw protein expression profiles.  
     &nbsp;&nbsp;
  2. Protein feature data ([featureData.csv](featureData.csv)). Metadata about
     the proteins, including the number of peptides on which quantifications
     are based.  
     &nbsp;&nbsp;
  3. Sample phenotype data ([phenoData.csv](phenoData.csv)). Metadata about the
     samples, including a main outcome of interest (fetal inflammatory response
     -FIR) and a covariate (sex).

---
class: small-code

## Import data

* Load the protein quantifications into R, which we may regard as raw protein
  expression profiles.

``` r
  > protQuantData <- read.csv(file="proteinQuantifications.csv", row.names=1)
  > protQuantData <- as.matrix(protQuantData)
  > dim(protQuantData)
  [1] 649  20
  > head(protQuantData)
              Sample1 Sample2 Sample3  Sample4 Sample5 Sample6 Sample7 Sample8
  A0A075B6K5        0  580960       0        0  300090       0       0  829400
  A2NJV5            0 1163300       0  1867200 1164800 1147700       0 1056500
  A0A0C4DH68        0       0       0        0       0       0       0       0
  A0A1B0GWG4  1269600  379010       0        0       0       0  812560       0
  A0JNW5     48361000 3987500       0 36275000 6213400 9468100       0       0
  Q3MJ40            0       0       0        0       0       0       0       0
             Sample9 Sample10 Sample11 Sample12 Sample13 Sample14 Sample15
  A0A075B6K5       0        0        0  1461600   802610   270530        0
  A2NJV5           0        0        0  1326700  1553300   422040        0
  A0A0C4DH68 1536000        0        0        0   373960        0        0
  A0A1B0GWG4       0        0        0        0        0        0        0
  A0JNW5           0 24263000        0 19831000        0 27741000        0
  Q3MJ40           0        0        0        0        0        0        0
             Sample16 Sample17 Sample18 Sample19 Sample20
  A0A075B6K5        0   486930   378470        0        0
  A2NJV5       901020  1132000   657880  1158200   753220
  A0A0C4DH68        0        0        0        0        0
  A0A1B0GWG4   464060        0        0        0        0
  A0JNW5            0 14330000        0        0        0
  Q3MJ40      2237100        0        0        0        0
  ```

---
class: small-code

## Import data

* Proteins quantified with zero value are in fact not quantified, i.e.,
  their quantification is missing. For this reason, we set zero values to
  `NA`.

``` r
  > protQuantData[protQuantData == 0] <- NA
  > head(protQuantData)
              Sample1 Sample2 Sample3  Sample4 Sample5 Sample6 Sample7 Sample8
  A0A075B6K5       NA  580960      NA       NA  300090      NA      NA  829400
  A2NJV5           NA 1163300      NA  1867200 1164800 1147700      NA 1056500
  A0A0C4DH68       NA      NA      NA       NA      NA      NA      NA      NA
  A0A1B0GWG4  1269600  379010      NA       NA      NA      NA  812560      NA
  A0JNW5     48361000 3987500      NA 36275000 6213400 9468100      NA      NA
  Q3MJ40           NA      NA      NA       NA      NA      NA      NA      NA
             Sample9 Sample10 Sample11 Sample12 Sample13 Sample14 Sample15
  A0A075B6K5      NA       NA       NA  1461600   802610   270530       NA
  A2NJV5          NA       NA       NA  1326700  1553300   422040       NA
  A0A0C4DH68 1536000       NA       NA       NA   373960       NA       NA
  A0A1B0GWG4      NA       NA       NA       NA       NA       NA       NA
  A0JNW5          NA 24263000       NA 19831000       NA 27741000       NA
  Q3MJ40          NA       NA       NA       NA       NA       NA       NA
             Sample16 Sample17 Sample18 Sample19 Sample20
  A0A075B6K5       NA   486930   378470       NA       NA
  A2NJV5       901020  1132000   657880  1158200   753220
  A0A0C4DH68       NA       NA       NA       NA       NA
  A0A1B0GWG4   464060       NA       NA       NA       NA
  A0JNW5           NA 14330000       NA       NA       NA
  Q3MJ40      2237100       NA       NA       NA       NA
  ```

---
class: small-code

## Import data

* Load feature data on the number of peptides on which the quantification
  of each protein is based and the corresponding gene symbol.

``` r
  > featureData <- read.csv(file="featureData.csv", row.names=1)
  > head(featureData)
             UniqPeptides GeneSymbol
  A0A075B6K5            1    IGLV3-9
  A2NJV5                1   IGKV2-29
  A0A0C4DH68            1   IGKV2-24
  A0A1B0GWG4            1     SERTM2
  A0JNW5                1  UHRF1BP1L
  Q3MJ40                1   CCDC144B
  ```
* Proteins in the feature data must match exactly those in the
  row names of the matrix of protein quantifications.

``` r
  > identical(rownames(featureData), rownames(protQuantData))
  [1] TRUE
  ```

---
class: small-code

## Import data

* Load phenotype data with the main outcome of interest (FIR) and a covariate
  (sex) for the proteomics experiment.

``` r
  > phenotypeData <- read.csv("phenoData.csv", row.names=1, stringsAsFactors=TRUE)
  > head(phenotypeData, n=3)
          FIR    Sex
  Sample1 yes female
  Sample2 yes   male
  Sample3  no female
  > table(phenotypeData)
       Sex
  FIR   female male
    no       4    7
    yes      3    6
  ```
* The sample names in the phenotype data must match exactly those
  in the columns names of the matrix of protein quantifications.

``` r
  > identical(rownames(phenotypeData), colnames(protQuantData))
  [1] TRUE
  ```

---

## Explore distribution of unique peptides

* Explore the empirical cumulative distribution of the number of unique
  peptides throughout all the 649 proteins.

``` r
  par(mar=c(4, 5, 1, 1))
  plot.ecdf(featureData$UniqPeptides, panel.first=grid(),
            xlab="Nr. unique peptides", las=1, main="")
  ```
  
  <img src="data:image/png;base64,#/Users/robert/Documents/docencia/master/curs2425/IEO23-24_proteomics2/docs/index_files/figure-html/uniqPeptidesFn-1.png" width="450px" style="display: block; margin: auto;" />

---
class: small-code

## Explore distribution of unique peptides

* Explore the protein quantification values as function of the
  number of unique peptides.

``` r
  f <- cut(featureData$UniqPeptides, breaks=c(0, 1, 10, 20, 80))
  logQ <- rowMeans(log2(protQuantData))
  plot(logQ ~ f, xlab="# Uniq. Pept.", ylab="Mean log2 Quant.", xaxt="n", las=1)
  axis(1, at=1:nlevels(f), labels=sprintf("%s\n(n=%d)", levels(f), table(f)), padj=0.5)
  ```
  
  <img src="data:image/png;base64,#/Users/robert/Documents/docencia/master/curs2425/IEO23-24_proteomics2/docs/index_files/figure-html/uniqPeptidesByProtQ-1.png" width="450px" style="display: block; margin: auto;" />

---
class: small-code

## Explore pattern of missing quantifications

* Explore the pattern of missing quantifications by first calculating
  the number of samples where each protein has been quantified and then
  examining its distribution.

``` r
  qsamplesbyprot <- rowSums(!is.na(protQuantData))
  nqsamplesbyprot <- table(qsamplesbyprot)
  par(mfrow=c(1,2), mar=c(4, 5, 3, 2))
  barplot(nqsamplesbyprot, xlab="# samples in which proteins are expressed", ylab="# Proteins", las=1)
  plot.ecdf(ncol(protQuantData)-qsamplesbyprot, xlab="Min. # samples in which proteins are expressed",
            las=1, xaxt="n", panel.first=grid())
  axis(1, at=seq(0, 20, by=2), labels=seq(20, 0, by=-2))
  ```
  
  <img src="data:image/png;base64,#/Users/robert/Documents/docencia/master/curs2425/IEO23-24_proteomics2/docs/index_files/figure-html/nQsamplesByProt-1.png" width="800px" style="display: block; margin: auto;" />

---

## Filtering of lowly-expressed proteins

* We filter _lowly_-expressed proteins by discarding those quantified in fewer
  samples than those forming the smaller group in our outcome of interest.

``` r
  > mask <- qsamplesbyprot >= min(table(phenotypeData$FIR))
  > sum(mask)
  [1] 257
  > protQuantData <- protQuantData[mask , ]
  > dim(protQuantData)
  [1] 257  20
  > featureData <- featureData[mask, ]
  > dim(featureData)
  [1] 257   2
  ```

---

## Filtering of unreliably-expressed proteins

* We also filter out proteins that have been detected by only one unique
  peptide, thus selecting proteins whose quantification is more robust because
  is based on two or more unique peptides.

``` r
  > mask <- featureData$UniqPeptides > 1
  > sum(mask)
  [1] 208
  > protQuantData <- protQuantData[mask, ]
  > dim(protQuantData)
  [1] 208  20
  > featureData <- featureData[mask, ]
  > dim(featureData)
  [1] 208   2
  ```

---
class: small-code

## Filtering of unreliably-expressed proteins

* The previous filtering steps lead to the following number of
  reliably-quantified proteins per sample.

``` r
  nQprotsBySample <- colSums(!is.na(protQuantData))
  names(nQprotsBySample) <- sub("Sample", "", names(nQprotsBySample))
  ord <- order(nQprotsBySample)
  barplot(nQprotsBySample[ord], xlab="Samples", ylab="# proteins", las=1, col=c("skyblue", "darkred")[as.integer(phenotypeData$FIR[ord])])
  legend("topleft", levels(phenotypeData$FIR), fill=c("skyblue", "darkred"), title="FIR", inset=0.01)
  ```
  
  <img src="data:image/png;base64,#/Users/robert/Documents/docencia/master/curs2425/IEO23-24_proteomics2/docs/index_files/figure-html/nQprotBySample-1.png" width="800px" style="display: block; margin: auto;" />

---
class: small-code

## Normalization

* We normalize the raw protein quantification values using the Bioconductor
  package [vsn](https://bioconductor.org/packages/vsn). You can find a review
  on normalization methods for quantitative label-free proteomics in
  [(Välikangas et al., 2018)](https://doi.org/10.1093/bib/bbw095).

``` r
  > library(vsn)
  > 
  > vsn.fit <- vsnMatrix(protQuantData)
  > normProtQuantData <- predict(vsn.fit, protQuantData)
  > normProtQuantData[1:8, 1:8]
          Sample1  Sample2  Sample3  Sample4  Sample5  Sample6  Sample7  Sample8
  O00233 20.26333 20.97396       NA 20.37712 19.85400       NA       NA 21.73933
  O14818 19.94017       NA       NA       NA       NA       NA 20.43246       NA
  P19105       NA 21.41161       NA 20.33488       NA 20.87179       NA 20.62828
  O43432 21.82440 21.59959 22.43131 22.11325 20.35374 21.91358 24.05840       NA
  Q99880 24.23519 25.03429 22.93407 23.48319 23.39306 20.74483 21.60923 21.68305
  O75347       NA 19.65295 19.88065       NA       NA       NA 19.64799 19.47023
  O75368 21.20920 20.99475       NA 21.64698 21.47215 21.47737 21.98717 21.94319
  P00167 19.85649       NA       NA       NA 19.91599 20.68977       NA 19.99607
  ```

---
class: small-code

## Normalization

* Compare raw quantification values in log2-scale with normalized protein
  quantifications.

``` r
  par(mfrow=c(2, 1), mar=c(4, 5, 1, 1))
  boxplot(log2(protQuantData), xlab="Samples", ylab="log2 Quant.", las=1)
  boxplot(normProtQuantData, xlab="Samples", ylab="log2 Norm. Quant.", las=1)
  ```
  
  <img src="data:image/png;base64,#/Users/robert/Documents/docencia/master/curs2425/IEO23-24_proteomics2/docs/index_files/figure-html/unnormVsNormProtExp-1.png" width="800px" style="display: block; margin: auto;" />

---
class: small-code

## Imputation of missing quantifications

* Examine the pattern of missing quantifications in proteins with at least one
  missing value.

``` r
  misspat <- is.na(normProtQuantData) ; misspat <- misspat[rowSums(misspat) > 0, ]
  sampleClustering <- hclust(dist(t(misspat+0)), method="complete")
  protClustering <- hclust(dist(misspat+0), method="complete")
  heatmap(misspat+0, Colv=as.dendrogram(sampleClustering), labRow="", scale="none",
          Rowv=as.dendrogram(protClustering), col=c("black", "white"), cexCol=0.8)
  ```
  
  <img src="data:image/png;base64,#/Users/robert/Documents/docencia/master/curs2425/IEO23-24_proteomics2/docs/index_files/figure-html/missingQpattern-1.png" width="800px" style="display: block; margin: auto;" />

---

## Imputation of missing quantifications

* How many proteins do not have any missing value.

``` r
  > mask <- rowSums(is.na(normProtQuantData)) == 0
  > sum(mask)
  [1] 46
  ```
* How many proteins have one or more missing values.

``` r
  > sum(!mask)
  [1] 162
  ```

---
class: small-code

## Imputation of missing quantifications

* Compare expression profile distributions between complete protein expression
  profiles and protein expression profiles with at least one missing
  quantification.

``` r
  library(geneplotter)
  par(mar=c(4, 5, 1, 1))
  multidensity(list("No missing"=as.vector(normProtQuantData[mask, ]),
                    "Some missing"=as.vector(normProtQuantData[!mask, ])),
               main="", las=1, xlab="Protein expression")
  ```
  
  <img src="data:image/png;base64,#/Users/robert/Documents/docencia/master/curs2425/IEO23-24_proteomics2/docs/index_files/figure-html/expDistComp-1.png" width="500px" style="display: block; margin: auto;" />

---
class: small-code

## Imputation of missing quantifications

* Clearly, proteins with complete expression profiles have higher expression
  levels than proteins with missing expression values and we may conclude that
  we missing values arise from proteins whose expression is close to the
  detection limit.  
  &nbsp;&nbsp;
* We use the function `impute.wrapper.MLE()` from the R/CRAN package
  [imputeLCMD](https://cran.r-project.org/web/packages/imputeLCMD) to
  impute missing values following a _missing at random_ (MAR) pattern
  with a method for multivariate normal data by Schafer (1997).

``` r
  > library(imputeLCMD)
  > 
  > normProtQuantDataImp <- impute.wrapper.MLE(normProtQuantData)
  > normProtQuantDataImp[1:5, 1:8]
          Sample1  Sample2  Sample3  Sample4  Sample5  Sample6  Sample7  Sample8
  O00233 20.26333 20.97396 20.33908 20.37712 19.85400 20.19773 19.50513 21.73933
  O14818 19.94017 21.49988 21.14678 20.09215 19.79553 20.61105 20.43246 20.32351
  P19105 20.86321 21.41161 21.22577 20.33488 21.03228 20.87179 20.17302 20.62828
  O43432 21.82440 21.59959 22.43131 22.11325 20.35374 21.91358 24.05840 22.68997
  Q99880 24.23519 25.03429 22.93407 23.48319 23.39306 20.74483 21.60923 21.68305
  ```

---
class: small-code

## Imputation of missing quantifications

* Compare protein expression values before and after normalizing and imputing.

``` r
  par(mfrow=c(2, 1), mar=c(4, 5, 1, 1))
  boxplot(log2(protQuantData), xlab="Samples", ylab="log2 Quant.", las=1)
  boxplot(normProtQuantDataImp, xlab="Samples", ylab="log2 Norm. Quant.", las=1)
  ```
  
  <img src="data:image/png;base64,#/Users/robert/Documents/docencia/master/curs2425/IEO23-24_proteomics2/docs/index_files/figure-html/unnormVsNormImpProtExp-1.png" width="800px" style="display: block; margin: auto;" />

---
class: small-code

## Exploration of protein expression values

* Examine the similarity between sample expression profiles, in relationship
  with the main outcome of interest (FIR) and the covariate (sex), where the
  latter does not seem to drive any variability in this proteomics data set.

``` r
  library(edgeR) ; par(mar=c(4, 5, 1, 1))
  plotMDS(normProtQuantDataImp, labels=NULL, las=1,
          col=c("skyblue", "darkred")[phenotypeData$FIR],
          pch=c(female=19, male=15)[phenotypeData$Sex])
  legend("topright", levels(phenotypeData$FIR), pch=19,
         col=c("skyblue", "darkred"), title="FIR", inset=0.05)
  ```
  
  <img src="data:image/png;base64,#/Users/robert/Documents/docencia/master/curs2425/IEO23-24_proteomics2/docs/index_files/figure-html/protMDS-1.png" width="400px" style="display: block; margin: auto;" />

---
class: small-code

## Differential protein expression analysis

* We conduct a differential expression analysis using the `limma-trend`
  pipeline.

``` r
  > library(limma)
  > mod <- model.matrix(~ FIR, phenotypeData)
  > fit <- lmFit(normProtQuantDataImp, mod)
  > fit$genes <- featureData
  > fit <- eBayes(fit, trend=TRUE)
  > summary(decideTests(fit, p=0.05))
         (Intercept) FIRyes
  Down             0      0
  NotSig           0    196
  Up             208     12
  > tt <- topTable(fit, coef=2, n=Inf)
  > dim(tt)
  [1] 208   8
  > head(tt)
         UniqPeptides GeneSymbol     logFC  AveExpr        t      P.Value
  P06702            7     S100A9 2.6606369 23.25566 5.272890 2.808093e-05
  P05164           13        MPO 1.7068542 21.22826 4.775473 9.298233e-05
  P27105            4       STOM 0.8435713 20.63693 4.740952 1.010925e-04
  P05109            4     S100A8 2.4192540 22.91681 4.645568 1.274036e-04
  Q99880            4  HIST1H2BL 2.1351920 22.95113 4.550444 1.605120e-04
  P59666            2      DEFA3 1.7969710 20.66881 4.406364 2.278509e-04
           adj.P.Val         B
  P06702 0.005840834 2.3632026
  P05164 0.006624987 1.1809028
  P27105 0.006624987 1.0985069
  P05109 0.006624987 0.8707355
  Q99880 0.006677300 0.6435092
  P59666 0.007815732 0.2994167
  ```

---

## Differential protein expression analysis

* Examine differential expression diagnostics: p-value distribution.

``` r
  par(mar=c(4, 5, 1, 1))
  hist(tt$P.Value, main="", xlab="Raw P-value", col="grey", las=1)
  ```
  
  <img src="data:image/png;base64,#/Users/robert/Documents/docencia/master/curs2425/IEO23-24_proteomics2/docs/index_files/figure-html/DEpvalDist-1.png" width="800px" style="display: block; margin: auto;" />

---
class: small-code

## Differential protein expression analysis

* Examine differential expression diagnostics: volcano plot.

``` r
  plot(tt$logFC, -log10(tt$P.Value), pch=".", cex=4, las=1, xlab="log2FC",
       ylab="log10 P-value")
  mask <- abs(tt$logFC) > log2(1.5) & tt$adj.P.Val < 0.05
  text(tt$logFC[mask], -log10(tt$P.Value[mask]), tt$GeneSymbol[mask], pos=1,
       col="red", cex=0.6)
  ```
  
  <img src="data:image/png;base64,#/Users/robert/Documents/docencia/master/curs2425/IEO23-24_proteomics2/docs/index_files/figure-html/DEvolcano-1.png" width="600px" style="display: block; margin: auto;" />

---

## Concluding remarks

* Mass-spectrometry proteomics allows for a wide exploration of protein
  abundance.  
  &nbsp;&nbsp;
* Searching for peptides in mass-spec data cannot be peformed solely on the
  basis of visual hits to spectra and requires robust statistical methods that
  adjust for multiple testing.  
  &nbsp;&nbsp;
* It is important to understand the specific features of this type of data,
  such as the missing quantifications and the number of peptides used to
  quantify the proteins.  
  &nbsp;&nbsp;
* Missing quantifications require the use of data imputation techniques.  
  &nbsp;&nbsp;
* Different choices of filtering lowly and unreliably expressed proteins, data
  imputation techniques and normalization procedures may lead to different
  results.  
  &nbsp;&nbsp;
* Once data is normalized and imputed, we can use the `limma-trend` pipeline
  for a differential protein expression analysis.

---
class: small-code

## Session information

``` r
> sessionInfo()
R version 4.5.0 (2025-04-11)
Platform: x86_64-apple-darwin20
Running under: macOS Sonoma 14.7.3

Matrix products: default
BLAS:   /Library/Frameworks/R.framework/Versions/4.5-x86_64/Resources/lib/libRblas.0.dylib 
LAPACK: /Library/Frameworks/R.framework/Versions/4.5-x86_64/Resources/lib/libRlapack.dylib;  LAPACK version 3.12.1

locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8

time zone: Europe/Madrid
tzcode source: internal

attached base packages:
[1] stats4    stats     graphics  grDevices utils     datasets  methods  
[8] base

other attached packages:
 [1] edgeR_4.6.1          limma_3.64.0         imputeLCMD_2.1      
 [4] impute_1.82.0        pcaMethods_2.0.0     norm_1.0-11.1       
 [7] tmvtnorm_1.6         gmm_1.8              sandwich_3.1-1      
[10] Matrix_1.7-3         mvtnorm_1.3-3        geneplotter_1.86.0  
[13] annotate_1.86.0      XML_3.99-0.18        AnnotationDbi_1.70.0
[16] IRanges_2.42.0       S4Vectors_0.46.0     lattice_0.22-7      
[19] vsn_3.76.0           Biobase_2.68.0       BiocGenerics_0.54.0 
[22] generics_0.1.3       protViz_0.7.9        rawrr_1.16.0        
[25] mzID_1.46.0

loaded via a namespace (and not attached):
 [1] tidyselect_1.2.1        dplyr_1.1.4             farver_2.1.2           
 [4] blob_1.2.4              Biostrings_2.76.0       fastmap_1.2.0          
 [7] xaringan_0.30           digest_0.6.37           lifecycle_1.0.4        
[10] ProtGenerics_1.40.0     statmod_1.5.0           KEGGREST_1.48.0        
[13] RSQLite_2.3.11          magrittr_2.0.3          compiler_4.5.0         
[16] rlang_1.1.6             sass_0.4.10             tools_4.5.0            
[19] yaml_2.3.10             knitr_1.50              bit_4.6.0              
[22] plyr_1.8.9              RColorBrewer_1.1-3      grid_4.5.0             
[25] preprocessCore_1.70.0   xtable_1.8-4            ggplot2_3.5.2          
[28] scales_1.4.0            iterators_1.0.14        cli_3.6.5              
[31] rmarkdown_2.29          crayon_1.5.3            httr_1.4.7             
[34] DBI_1.2.3               cachem_1.1.0            affy_1.86.0            
[37] splines_4.5.0           parallel_4.5.0          BiocManager_1.30.25    
[40] XVector_0.48.0          vctrs_0.6.5             jsonlite_2.0.0         
[43] bit64_4.6.0-1           locfit_1.5-9.12         foreach_1.5.2          
[46] jquerylib_0.1.4         affyio_1.78.0           glue_1.8.0             
[49] codetools_0.2-20        gtable_0.3.6            GenomeInfoDb_1.44.0    
[52] UCSC.utils_1.4.0        tibble_3.2.1            pillar_1.10.2          
[55] htmltools_0.5.8.1       GenomeInfoDbData_1.2.14 R6_2.6.1               
[58] doParallel_1.0.17       evaluate_1.0.3          png_0.1-8              
[61] memoise_2.0.1           bslib_0.9.0             Rcpp_1.0.14            
[64] xfun_0.52               zoo_1.8-14              pkgconfig_2.0.3        
```