Project example
Anton Klåvus, Vilhelm Suksi
2025-06-23
Source: vignettes/project_example.Rmd
In this project example, the core functionality of notame is demonstrated in tandem with the associated protocol article (Klåvus et al. 2020). Since substantive insight is not the focus of this project example, results visualizations are omitted.
Project setup
Set up path and logging
Let’s set up a path and start logging our project. Many functions of the package log information when they finish. This helps in monitoring the analysis process in case something goes wrong, and in reviewing the results later. The log is also printed to the console, although we’ll exclude console output from the rest of the document for brevity.
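For example (a minimal sketch using a temporary directory; in a real project, ppath would point to the project folder):

library(notame)
# Project path, used as a prefix for figures and results throughout this example
ppath <- paste0(tempdir(), "/")
# Create the figures directory used by the visualization calls below
dir.create(paste0(ppath, "figures"), showWarnings = FALSE)
# Start the log; finish_log() closes it at the end of the document
init_log(log_file = paste0(ppath, "log.txt"))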
Read data
The data is mock data with two balanced groups and two balanced time points. The time points also correspond to two batches. There are 50 samples with ten regularly interspersed QC samples and 80 features divided into four analytical modes.
data <- read_from_excel(
  file = system.file("extdata", "example_set.xlsx", package = "notame"),
  sheet = 1, corner_row = 11, corner_column = "H",
  split_by = c("Column", "Ion_mode"))
The function read_from_excel() returns a list holding the three parts of the data:

- exprs: feature abundances across the samples
- pheno_data: sample information
- feature_data: feature information
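We can confirm this structure by inspecting the names of the returned list:

# The list holds the three parts by name
names(data)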
Construct SummarizedExperiment object
These three parts can be used to construct a SummarizedExperiment object using the native SummarizedExperiment constructor, or a MetaboSet object using construct_metabosets(). The example data contains four analytical modes, so we separate them using fix_object(), which also conveniently checks the object for compatibility with all of notame.
data <- SummarizedExperiment(assays = data$exprs,
                             rowData = data$feature_data,
                             colData = data$pheno_data)
modes <- fix_object(data, split_data = TRUE)
Now we have a list of objects, one per mode (LC column x ionization mode), or whatever split was supplied to read_from_excel().
Tabular data preprocessing
Preprocessing by mode
Tabular data preprocessing completes the dataset by reducing unwanted variation and by preparing the data for the downstream methods. Steps performed separately for each mode include marking missing values as NA, flagging features with a low detection rate (Broadhurst et al. 2018), drift correction (Kirwan et al. 2013), and flagging of low-quality features (Broadhurst et al. 2018). Visualizations are drawn to monitor the process and to explore the data globally. The visualizations() wrapper does not include feature-wise plots, as it may not be feasible to inspect them at this stage of the analysis. For example, drift correction visualizations are best drawn for a subset of interesting features after feature selection.
# Initialize empty list for processed objects
processed <- list()
for (i in seq_along(modes)) {
  name <- names(modes)[i]
  mode <- modes[[i]]
  # Set all zero abundances to NA
  mode <- mark_nas(mode, value = 0)
  # Flag features with low detection rate
  mode <- flag_detection(mode, group = "Group")
  # Visualize data before drift correction
  visualizations(mode, prefix = paste0(ppath, "figures/", name, "_ORIG"),
                 perplexity = 5, group = "Group", time = "Time",
                 id = "Subject_ID", color = "Group")
  # Correct drift
  corrected <- correct_drift(mode)
  # Visualize data after drift correction
  visualizations(corrected, prefix = paste0(ppath, "figures/", name, "_DRIFT"),
                 perplexity = 5, group = "Group", time = "Time",
                 id = "Subject_ID", color = "Group")
  # Flag low-quality features
  corrected <- corrected %>% assess_quality() %>% flag_quality()
  # Visualize data after removal of low-quality features
  visualizations(corrected, prefix = paste0(ppath, "figures/", name, "_CLEAN"),
                 perplexity = 5, group = "Group", time = "Time",
                 id = "Subject_ID", color = "Group")
  # Save result of iteration
  processed[[i]] <- corrected
}
Feature-wise flagging information, quality metrics, and brief drift correction notes are included in the feature information after the above steps. By default, the correct_drift() function performs drift correction on all features with sufficient detection in QC samples. If check_quality = TRUE, correct_drift() retains the corrected values only for features whose quality metrics improve after drift correction, along with a note documenting this action in the feature information.
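For example (a sketch; mode refers to a single mode, as inside the loop above):

corrected <- correct_drift(mode, check_quality = TRUE)

The drift correction notes can be inspected from the feature data: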
rowData(processed[[1]])$DC_note
## [1] "Drift_corrected" "Drift_corrected" "Drift_corrected" "Drift_corrected"
## [5] "Drift_corrected" "Drift_corrected" "Drift_corrected" "Drift_corrected"
## [9] "Drift_corrected" "Drift_corrected" "Drift_corrected" "Drift_corrected"
## [13] "Drift_corrected" "Drift_corrected" "Drift_corrected" "Drift_corrected"
## [17] "Drift_corrected" "Drift_corrected" "Drift_corrected" "Drift_corrected"
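Flagging information can be inspected similarly; a sketch, assuming the flags live in the Flag column of the feature data, with NA marking unflagged, good-quality features:

# Tabulate quality flags; NA means the feature passed all checks
table(rowData(processed[[1]])$Flag, useNA = "ifany")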
Preprocessing for the complete dataset
Next, it is time to merge the modes together and visualize the complete dataset.
merged <- merge_objects(processed)
visualizations(merged, prefix = paste0(ppath, "figures/_FULL"),
               group = "Group", time = "Time",
               id = "Subject_ID", color = "Group")
Then, QC samples can (and should) be removed, as they are no longer needed, and the dataset is visualized anew.
merged_no_qc <- drop_qcs(merged)
visualizations(merged_no_qc, prefix = paste0(ppath, "figures/FULL_NO_QC"),
               group = "Group", time = "Time",
               id = "Subject_ID", color = "Group")
If there are any missing values in the data, they need to be imputed. Let’s use random forest imputation to impute these values and visualize the dataset one last time before statistical analysis. A seed should be set before random forest imputation to guarantee reproducible results.
set.seed(2024)
imputed <- impute_rf(merged_no_qc)
visualizations(imputed, prefix = paste0(ppath, "figures/FULL_IMPUTED"),
               group = "Group", time = "Time",
               id = "Subject_ID", color = "Group")
By default, the imputation procedure only operates on good-quality features, i.e. those that have not been flagged. To use flagged features in statistical analysis, they should be imputed as well. This can be achieved through a second round of imputation, now with all features included. This two-step imputation ensures that low-quality features don’t affect the imputation of good-quality features. Imputation could also be performed separately for each mode to reduce execution time, provided the number of features per mode still allows for good imputation results.
base <- impute_rf(imputed, all_features = TRUE)
It is a good idea to save the merged and processed data so that experimenting with different statistical analyses becomes easier.
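For example, using base R serialization (the file name here is just an example):

# Save the preprocessed object so later sessions can skip preprocessing
saveRDS(base, file = paste0(ppath, "full_data.RDS"))
# Restore later with: base <- readRDS(paste0(ppath, "full_data.RDS"))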
Feature selection
Univariate analysis
First, we’ll use a linear model to evaluate the features in terms of the probability of obtaining the observed abundances given the null hypothesis, namely that there is no difference in feature abundance across study groups.
lm_results <- perform_lm(log(base), formula_char = "Feature ~ Group")
base <- join_rowData(base, lm_results)
Supervised learning
Univariate hypothesis testing does not take into account the multivariate, interconnected nature of the data. Parametric tests such as the one above also make assumptions about the data that are likely to be violated in untargeted LC-MS datasets, which may undermine the reliability of the results. Let’s use supervised learning to see which features best predict group membership, to get another perspective. A random forest model doesn’t require further preprocessing and makes minimal assumptions.
rf_results <- fit_rf(base, y = "Group")
rf_importance <- importance_rf(rf_results)
base <- join_rowData(base,
                     rf_importance[, c("Feature_ID", "MeanDecreaseAccuracy")])
Next, the univariate and supervised rankings of the features could be combined into a final ranking that combines the qualities of both approaches. Such a ranking is more practical than inferential: it helps limit the number of features undergoing labour-intensive scrutiny for biological meaning, for example through identification and pathway analysis. This may ultimately guide further efforts in productive directions; in biomarker discovery, for instance, the results may prompt validation using targeted LC-MS.
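As an illustration, a simple rank-sum combination of the two rankings could look like the sketch below; the p-value column name is hypothetical and depends on the perform_lm() output, while MeanDecreaseAccuracy was joined above.

rd <- rowData(base)
# Rank features by p-value (smaller is better) and importance (larger is better);
# "Group_P" is a hypothetical column name, adjust to match the perform_lm() output
p_rank <- rank(rd$Group_P)
rf_rank <- rank(-rd$MeanDecreaseAccuracy)
# Features strong in both analyses rise to the top of the combined ranking
rd$Combined_rank <- rank(p_rank + rf_rank)
rowData(base) <- rd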
That concludes our project example. The last thing to do is to write the results to an Excel file and finish the logging. Thanks!
write_to_excel(base, file = paste0(ppath, "results.xlsx"))
finish_log()
Session information
## R version 4.5.1 (2025-06-13)
## Platform: x86_64-pc-linux-gnu
## Running under: Ubuntu 24.04.2 LTS
##
## Matrix products: default
## BLAS: /usr/lib/x86_64-linux-gnu/openblas-pthread/libblas.so.3
## LAPACK: /usr/lib/x86_64-linux-gnu/openblas-pthread/libopenblasp-r0.3.26.so; LAPACK version 3.12.0
##
## locale:
## [1] LC_CTYPE=C.UTF-8 LC_NUMERIC=C LC_TIME=C.UTF-8
## [4] LC_COLLATE=C.UTF-8 LC_MONETARY=C.UTF-8 LC_MESSAGES=C.UTF-8
## [7] LC_PAPER=C.UTF-8 LC_NAME=C LC_ADDRESS=C
## [10] LC_TELEPHONE=C LC_MEASUREMENT=C.UTF-8 LC_IDENTIFICATION=C
##
## time zone: UTC
## tzcode source: system (glibc)
##
## attached base packages:
## [1] stats4 stats graphics grDevices utils datasets methods
## [8] base
##
## other attached packages:
## [1] notame_0.99.2 SummarizedExperiment_1.38.1
## [3] GenomicRanges_1.60.0 GenomeInfoDb_1.44.0
## [5] IRanges_2.42.0 S4Vectors_0.46.0
## [7] MatrixGenerics_1.20.0 matrixStats_1.5.0
## [9] magrittr_2.0.3 ggplot2_3.5.2
## [11] Biobase_2.68.0 BiocGenerics_0.54.0
## [13] generics_0.1.4 BiocStyle_2.36.0
##
## loaded via a namespace (and not attached):
## [1] tidyselect_1.2.1 viridisLite_0.4.2 dplyr_1.1.4
## [4] farver_2.1.2 fastmap_1.2.0 digest_0.6.37
## [7] lifecycle_1.0.4 compiler_4.5.1 rngtools_1.5.2
## [10] rlang_1.1.6 sass_0.4.10 tools_4.5.1
## [13] yaml_2.3.10 knitr_1.50 lambda.r_1.2.4
## [16] doRNG_1.8.6.2 S4Arrays_1.8.1 labeling_0.4.3
## [19] htmlwidgets_1.6.4 DelayedArray_0.34.1 RColorBrewer_1.1-3
## [22] abind_1.4-8 BiocParallel_1.42.1 Rtsne_0.17
## [25] withr_3.0.2 purrr_1.0.4 itertools_0.1-3
## [28] desc_1.4.3 grid_4.5.1 scales_1.4.0
## [31] iterators_1.0.14 MASS_7.3-65 cli_3.6.5
## [34] rmarkdown_2.29 crayon_1.5.3 ragg_1.4.0
## [37] httr_1.4.7 cachem_1.1.0 parallel_4.5.1
## [40] formatR_1.14 BiocManager_1.30.26 XVector_0.48.0
## [43] vctrs_0.6.5 Matrix_1.7-3 jsonlite_2.0.0
## [46] bookdown_0.43 systemfonts_1.2.3 foreach_1.5.2
## [49] jquerylib_0.1.4 tidyr_1.3.1 ggdendro_0.2.0
## [52] missForest_1.5 glue_1.8.0 pkgdown_2.1.3
## [55] codetools_0.2-20 cowplot_1.1.3 stringi_1.8.7
## [58] gtable_0.3.6 futile.logger_1.4.3 UCSC.utils_1.4.0
## [61] tibble_3.3.0 pillar_1.10.2 pcaMethods_2.0.0
## [64] htmltools_0.5.8.1 randomForest_4.7-1.2 GenomeInfoDbData_1.2.14
## [67] R6_2.6.1 textshaping_1.0.1 evaluate_1.0.4
## [70] lattice_0.22-7 futile.options_1.0.1 openxlsx_4.2.8
## [73] bslib_0.9.0 Rcpp_1.0.14 zip_2.3.3
## [76] SparseArray_1.8.0 xfun_0.52 fs_1.6.6
## [79] pkgconfig_2.0.3