Project example
Anton Klåvus, Vilhelm Suksi
2025-06-23
Source: vignettes/project_example.Rmd
In this project example, the core functionality of notame is demonstrated in tandem with the associated protocol article (Klåvus et al. 2020). Since substantive insight is not the focus of this project example, results visualizations are omitted.
Project setup
Set up path and logging
Let’s set up a path and start logging our project. Many functions of the package log information when they finish. This helps in monitoring the analysis process in case something goes wrong, and in reviewing the results later. The log is also printed to the console, although we’ll exclude console output from the rest of the document for brevity.
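For example (a minimal sketch using a temporary directory; in a real project, ppath would point to the project folder):

library(notame)
# Project path, used as a prefix for figures and results throughout this example
ppath <- paste0(tempdir(), "/")
# Create the figures directory used by the visualization calls below
dir.create(paste0(ppath, "figures"), showWarnings = FALSE)
# Start the log; finish_log() closes it at the end of the document
init_log(log_file = paste0(ppath, "log.txt"))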
Read data
The data is mock data with two balanced groups and two balanced time points. The time points also correspond to two batches. There are 50 samples with ten regularly interspersed QC samples and 80 features divided into four analytical modes.
data <- read_from_excel(
  file = system.file("extdata", "example_set.xlsx", package = "notame"),
  sheet = 1, corner_row = 11, corner_column = "H",
  split_by = c("Column", "Ion_mode"))
The function read_from_excel() returns a list holding the three parts of the data:

- exprs: feature abundances across the samples
- pheno_data: sample information
- feature_data: feature information
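We can confirm this structure by inspecting the names of the returned list:

# The list holds the three parts by name
names(data)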
Construct SummarizedExperiment object
These three parts can be used to construct a SummarizedExperiment object using the native SummarizedExperiment constructor, or a MetaboSet object using construct_metabosets(). The example data contains four analytical modes, so we separate them using fix_object(), which also conveniently checks the object for compatibility with all of notame.
data <- SummarizedExperiment(assays = data$exprs,
                             rowData = data$feature_data,
                             colData = data$pheno_data)
modes <- fix_object(data, split_data = TRUE)
Now we have a list of objects, one per mode (LC column x ionization mode), or whatever split was supplied to read_from_excel().
Tabular data preprocessing
Preprocessing by mode
Tabular data preprocessing completes the dataset by reducing unwanted variation and by preparing the data for the downstream methods. Steps performed separately for each mode include marking missing values as NA, flagging features with a low detection rate (Broadhurst et al. 2018), drift correction (Kirwan et al. 2013), and flagging of low-quality features (Broadhurst et al. 2018). Visualizations are drawn to monitor the process and to explore the data globally. The visualizations() wrapper does not include feature-wise plots, as it may not be feasible to inspect them at this stage of the analysis. For example, drift correction visualizations are best drawn for a subset of interesting features after feature selection.
# Initialize empty list for processed objects
processed <- list()
for (i in seq_along(modes)) {
  name <- names(modes)[i]
  mode <- modes[[i]]
  # Set all zero abundances to NA
  mode <- mark_nas(mode, value = 0)
  # Flag features with low detection rate
  mode <- flag_detection(mode, group = "Group")
  # Visualize data before drift correction
  visualizations(mode, prefix = paste0(ppath, "figures/", name, "_ORIG"),
                 perplexity = 5, group = "Group", time = "Time",
                 id = "Subject_ID", color = "Group")
  # Correct drift
  corrected <- correct_drift(mode)
  # Visualize data after drift correction
  visualizations(corrected, prefix = paste0(ppath, "figures/", name, "_DRIFT"),
                 perplexity = 5, group = "Group", time = "Time",
                 id = "Subject_ID", color = "Group")
  # Flag low-quality features
  corrected <- corrected %>% assess_quality() %>% flag_quality()
  # Visualize data after removal of low-quality features
  visualizations(corrected, prefix = paste0(ppath, "figures/", name, "_CLEAN"),
                 perplexity = 5, group = "Group", time = "Time",
                 id = "Subject_ID", color = "Group")
  # Save result of iteration
  processed[[i]] <- corrected
}
Feature-wise flagging information, quality metrics, and brief drift correction notes are included in the feature information after the above steps. By default, the correct_drift() function performs drift correction on all features with sufficient detection in QC samples. If check_quality = TRUE, correct_drift() retains the corrected values only for features whose quality metrics improve after drift correction, along with a note documenting this action in the feature information.
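For example (a sketch; mode refers to a single mode, as inside the loop above):

corrected <- correct_drift(mode, check_quality = TRUE)

The drift correction notes can be inspected from the feature data: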
rowData(processed[[1]])$DC_note
## [1] "Drift_corrected" "Drift_corrected" "Drift_corrected" "Drift_corrected"
## [5] "Drift_corrected" "Drift_corrected" "Drift_corrected" "Drift_corrected"
## [9] "Drift_corrected" "Drift_corrected" "Drift_corrected" "Drift_corrected"
## [13] "Drift_corrected" "Drift_corrected" "Drift_corrected" "Drift_corrected"
## [17] "Drift_corrected" "Drift_corrected" "Drift_corrected" "Drift_corrected"
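Flagging information can be inspected similarly; a sketch, assuming the flags live in the Flag column of the feature data, with NA marking unflagged, good-quality features:

# Tabulate quality flags; NA means the feature passed all checks
table(rowData(processed[[1]])$Flag, useNA = "ifany")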
Preprocessing for the complete dataset
Next, it is time to merge the modes together and visualize the complete dataset.
merged <- merge_objects(processed)
visualizations(merged, prefix = paste0(ppath, "figures/_FULL"),
               group = "Group", time = "Time",
               id = "Subject_ID", color = "Group")
Then, QC samples can (and should) be removed, as they are no longer needed, and the dataset is visualized anew.
merged_no_qc <- drop_qcs(merged)
visualizations(merged_no_qc, prefix = paste0(ppath, "figures/FULL_NO_QC"),
               group = "Group", time = "Time",
               id = "Subject_ID", color = "Group")
If there are any missing values in the data, they need to be imputed. Let’s use random forest imputation to impute these values and visualize the dataset one last time before statistical analysis. A seed should be set before random forest imputation to guarantee reproducible results.
set.seed(2024)
imputed <- impute_rf(merged_no_qc)
visualizations(imputed, prefix = paste0(ppath, "figures/FULL_IMPUTED"),
               group = "Group", time = "Time",
               id = "Subject_ID", color = "Group")
By default, the imputation procedure only operates on good-quality features, i.e. those that have not been flagged. To use flagged features in statistical analysis, they should be imputed as well. This can be achieved through a second round of imputation, now with all features included. This two-step imputation ensures that low-quality features don’t affect the imputation of good-quality features. Imputation could also be performed separately for each mode to reduce execution time, provided the number of features per mode still allows for good imputation results.
base <- impute_rf(imputed, all_features = TRUE)
It is a good idea to save the merged and processed data so that experimenting with different statistical analyses becomes easier.
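For example, using base R serialization (the file name here is just an example):

# Save the preprocessed object so later sessions can skip preprocessing
saveRDS(base, file = paste0(ppath, "full_data.RDS"))
# Restore later with: base <- readRDS(paste0(ppath, "full_data.RDS"))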
Feature selection
Univariate analysis
First, we’ll use a linear model to evaluate the features in terms of the probability of obtaining the observed abundances given the null hypothesis, namely that there is no difference in feature abundance across study groups.
lm_results <- perform_lm(log(base), formula_char = "Feature ~ Group")
base <- join_rowData(base, lm_results)
Supervised learning
Univariate hypothesis testing does not take into account the multivariate, interconnected nature of the data. Parametric tests such as the one above also make assumptions about the data that are likely to be violated in untargeted LC-MS datasets, which may undermine the reliability of the results. Let’s use supervised learning to see which features best predict group membership, to get another perspective. A random forest model doesn’t require further preprocessing and makes minimal assumptions.
rf_results <- fit_rf(base, y = "Group")
rf_importance <- importance_rf(rf_results)
base <- join_rowData(base,
                     rf_importance[, c("Feature_ID", "MeanDecreaseAccuracy")])
Next, the univariate and supervised rankings of the features could be combined into a final ranking that combines the qualities of both approaches. Such a ranking is more practical than inferential: it helps limit the number of features undergoing labour-intensive scrutiny for biological meaning, for example through identification and pathway analysis. This may ultimately guide further efforts in productive directions; in biomarker discovery, for instance, the results may prompt validation using targeted LC-MS.
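As an illustration, a simple rank-sum combination of the two rankings could look like the sketch below; the p-value column name is hypothetical and depends on the perform_lm() output, while MeanDecreaseAccuracy was joined above.

rd <- rowData(base)
# Rank features by p-value (smaller is better) and importance (larger is better);
# "Group_P" is a hypothetical column name, adjust to match the perform_lm() output
p_rank <- rank(rd$Group_P)
rf_rank <- rank(-rd$MeanDecreaseAccuracy)
# Features strong in both analyses rise to the top of the combined ranking
rd$Combined_rank <- rank(p_rank + rf_rank)
rowData(base) <- rd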
That concludes our project example. The last thing to do is to write the results to an Excel file and finish the logging. Thanks!
write_to_excel(base, file = paste0(ppath, "results.xlsx"))
finish_log()
Session information
## R version 4.5.1 (2025-06-13)
## Platform: x86_64-pc-linux-gnu
## Running under: Ubuntu 24.04.2 LTS
##
## Matrix products: default
## BLAS: /usr/lib/x86_64-linux-gnu/openblas-pthread/libblas.so.3
## LAPACK: /usr/lib/x86_64-linux-gnu/openblas-pthread/libopenblasp-r0.3.26.so; LAPACK version 3.12.0
##
## locale:
## [1] LC_CTYPE=C.UTF-8 LC_NUMERIC=C LC_TIME=C.UTF-8
## [4] LC_COLLATE=C.UTF-8 LC_MONETARY=C.UTF-8 LC_MESSAGES=C.UTF-8
## [7] LC_PAPER=C.UTF-8 LC_NAME=C LC_ADDRESS=C
## [10] LC_TELEPHONE=C LC_MEASUREMENT=C.UTF-8 LC_IDENTIFICATION=C
##
## time zone: UTC
## tzcode source: system (glibc)
##
## attached base packages:
## [1] stats4 stats graphics grDevices utils datasets methods
## [8] base
##
## other attached packages:
## [1] notame_0.99.2 SummarizedExperiment_1.38.1
## [3] GenomicRanges_1.60.0 GenomeInfoDb_1.44.0
## [5] IRanges_2.42.0 S4Vectors_0.46.0
## [7] MatrixGenerics_1.20.0 matrixStats_1.5.0
## [9] magrittr_2.0.3 ggplot2_3.5.2
## [11] Biobase_2.68.0 BiocGenerics_0.54.0
## [13] generics_0.1.4 BiocStyle_2.36.0
##
## loaded via a namespace (and not attached):
## [1] tidyselect_1.2.1 viridisLite_0.4.2 dplyr_1.1.4
## [4] farver_2.1.2 fastmap_1.2.0 digest_0.6.37
## [7] lifecycle_1.0.4 compiler_4.5.1 rngtools_1.5.2
## [10] rlang_1.1.6 sass_0.4.10 tools_4.5.1
## [13] yaml_2.3.10 knitr_1.50 lambda.r_1.2.4
## [16] doRNG_1.8.6.2 S4Arrays_1.8.1 labeling_0.4.3
## [19] htmlwidgets_1.6.4 DelayedArray_0.34.1 RColorBrewer_1.1-3
## [22] abind_1.4-8 BiocParallel_1.42.1 Rtsne_0.17
## [25] withr_3.0.2 purrr_1.0.4 itertools_0.1-3
## [28] desc_1.4.3 grid_4.5.1 scales_1.4.0
## [31] iterators_1.0.14 MASS_7.3-65 cli_3.6.5
## [34] rmarkdown_2.29 crayon_1.5.3 ragg_1.4.0
## [37] httr_1.4.7 cachem_1.1.0 parallel_4.5.1
## [40] formatR_1.14 BiocManager_1.30.26 XVector_0.48.0
## [43] vctrs_0.6.5 Matrix_1.7-3 jsonlite_2.0.0
## [46] bookdown_0.43 systemfonts_1.2.3 foreach_1.5.2
## [49] jquerylib_0.1.4 tidyr_1.3.1 ggdendro_0.2.0
## [52] missForest_1.5 glue_1.8.0 pkgdown_2.1.3
## [55] codetools_0.2-20 cowplot_1.1.3 stringi_1.8.7
## [58] gtable_0.3.6 futile.logger_1.4.3 UCSC.utils_1.4.0
## [61] tibble_3.3.0 pillar_1.10.2 pcaMethods_2.0.0
## [64] htmltools_0.5.8.1 randomForest_4.7-1.2 GenomeInfoDbData_1.2.14
## [67] R6_2.6.1 textshaping_1.0.1 evaluate_1.0.4
## [70] lattice_0.22-7 futile.options_1.0.1 openxlsx_4.2.8
## [73] bslib_0.9.0 Rcpp_1.0.14 zip_2.3.3
## [76] SparseArray_1.8.0 xfun_0.52 fs_1.6.6
## [79] pkgconfig_2.0.3