General introduction
Anton Klåvus, Vilhelm Suksi
2025-06-23
Source: vignettes/introduction.Rmd
Motivation
Viewed as the continuation of the central dogma of biology, metabolites provide the closest link to many phenotypes of interest. This makes metabolomics a promising tool for teasing apart the complexities of living systems and has attracted many new practitioners.
The notame R package was developed in parallel with an associated protocol article as a general guideline for data analysis in untargeted metabolomics studies (Klåvus et al. 2020). The main outcome is the identification of interesting features for laborious downstream steps relating to biological context, such as metabolite identification and pathway analysis, which fall outside the purview of notame. Bioconductor packages with complementary functionality include pmp, phenomis and qmtools; notame offers both overlapping and new functionality. There are also Bioconductor packages for preprocessing, metabolite identification and pathway analysis. Together, notame, Bioconductor’s dependency management and other Bioconductor functionality allow for high-quality, reproducible metabolomics research.
Installation
To install notame, first install BiocManager if it is not already installed, then use the install() function from BiocManager.
if (!requireNamespace("BiocManager", quietly = TRUE))
    install.packages("BiocManager")
BiocManager::install("notame")
library(notame)
How it works
SummarizedExperiment is the primary data structure of this package, but MetaboSet is still supported for users who prefer it. As with MetaboSet, a single peak table can be used throughout the analysis; with SummarizedExperiment, multiple peak tables can also be used, using the assay.type and name arguments.
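As a rough illustration of the multi-table idea, the sketch below stores two peak tables as named assays in one SummarizedExperiment. The data are simulated and the assay names raw and log2 are made up for this example. It uses only the SummarizedExperiment package itself, and it assumes (without showing it) that an assay name like these is what the assay.type argument of notame functions refers to.

```r
# Simulated abundances: 4 features (rows) x 6 samples (columns)
set.seed(1)
raw <- matrix(rlnorm(24, meanlog = 10), nrow = 4,
              dimnames = list(paste0("Feature_", 1:4),
                              paste0("Sample_", 1:6)))

# Hypothetical sample and feature annotations
sample_info <- data.frame(Injection_order = 1:6,
                          QC = rep(c("QC", "Sample"), times = 3),
                          row.names = colnames(raw))
feature_info <- data.frame(Mass = c(120.1, 250.2, 310.3, 455.4),
                           rt = c(0.8, 2.1, 3.5, 6.0),
                           row.names = rownames(raw))

if (requireNamespace("SummarizedExperiment", quietly = TRUE)) {
  library(SummarizedExperiment)
  # Two peak tables (assays) share the same features and samples
  se <- SummarizedExperiment(
    assays = list(raw = raw, log2 = log2(raw)),
    rowData = DataFrame(feature_info),
    colData = DataFrame(sample_info)
  )
  print(assayNames(se))           # names of the stored peak tables
  print(head(assay(se, "log2")))  # retrieve one peak table by name
}
```

Keeping each processing stage as its own named assay means earlier versions of the data remain available in the same object.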
The functionality of notame can be broadly divided into tabular data preprocessing and feature selection, excluding sample preprocessing and functionality related to biological context (Figure 1). Tabular data preprocessing involves reducing unwanted variation and preparing the data as required by downstream methods. The many visualizations used for inspecting the process also serve as exploratory data analysis. Feature selection aims to select a subset of interesting features across study groups before laborious steps relating to biological context. Please see the documentation for an overview of functionality (?notame), the Project Example vignette for usage and the associated protocol article for more information (Klåvus et al. 2020).

Overview of untargeted LC-MS metabolomics data analysis.
Input
Data can be read with read_from_excel(), which includes checks and preparation of metadata. To accommodate typical output from peak-picking software such as Agilent’s MassHunter or MS-DIAL, the output is first transformed into a spreadsheet for read_from_excel(). Alternatively, data in R can be wrangled and passed to the construct_metabosets() or SummarizedExperiment() constructor.

Structure of spreadsheet for read_from_excel().
There are a few obligatory fields for read_from_excel(), including “Injection_order” in sample information, “Mass” or “Average mz” in feature data and “Retention time”, “RetentionTime”, “Average rt(min)” or “rt” in feature data (not case sensitive). There are further optional fields, including “Sample_ID” and “QC” in sample information as well as “Feature_ID” in feature data, which are automatically generated if unavailable. One or more fields in feature data can be used to split the data into parts, usually LC column × ionization mode, by supplying them to the split_by parameter. If the file only contains one mode, specify the name of the mode, e.g. “HILIC_pos”, with the name parameter.
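The field requirements above can be checked by hand before building the spreadsheet. The following base R sketch is only an illustration and not how read_from_excel() is implemented: the helper has_field(), the toy tables and the column names Column and Ion_mode are all hypothetical.

```r
# Hypothetical helper: does any accepted column name occur in the table?
# Matching ignores case.
has_field <- function(data, accepted) {
  any(tolower(accepted) %in% tolower(colnames(data)))
}

# Toy sample information and feature data (all values made up)
sample_info <- data.frame(Sample_ID = c("S1", "S2", "QC1"),
                          Injection_order = 1:3)
feature_data <- data.frame(
  Column = c("HILIC", "HILIC", "RP"),
  Ion_mode = c("pos", "neg", "pos"),
  "Average mz" = c(120.1, 250.2, 310.3),
  RT = c(0.8, 2.1, 3.5),
  check.names = FALSE  # keep the space in "Average mz"
)

# One logical per obligatory field
checks <- c(
  injection_order = has_field(sample_info, "Injection_order"),
  mass = has_field(feature_data, c("Mass", "Average mz")),
  retention_time = has_field(
    feature_data,
    c("Retention time", "RetentionTime", "Average rt(min)", "rt")
  )
)
checks

# Splitting the feature data into parts by LC column and ionization mode
modes <- split(feature_data,
               paste(feature_data$Column, feature_data$Ion_mode, sep = "_"))
names(modes)
```

Here the split yields one table per combination of the two columns, which corresponds to the idea of passing both column names to split_by.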
Tabular data preprocessing
The main functions return modified objects and are largely based on pooled QC samples (Broadhurst et al. 2018). Tabular data preprocessing is generally performed separately for each mode. The visualizations used to monitor tabular data preprocessing are saved to file by default, but can also be returned as ggplot objects. The visualizations() wrapper can be used for saving visualizations at different stages of processing.
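To give a feel for what a QC-based quality metric looks like, the base R sketch below computes the relative standard deviation (RSD) of each feature across pooled QC injections. It is a simplified illustration on simulated data, not notame's implementation; the QC placement and the threshold of 0.2 are assumptions made for the example.

```r
set.seed(42)
# Simulated abundances: 5 features (rows) x 12 samples (columns)
abundances <- matrix(rlnorm(60, meanlog = 8, sdlog = 0.3), nrow = 5,
                     dimnames = list(paste0("Feature_", 1:5),
                                     paste0("S", 1:12)))

# Hypothetical design: every third injection is a pooled QC sample
is_qc <- rep(c(TRUE, FALSE, FALSE), times = 4)

# RSD of each feature across the QC injections only
qc <- abundances[, is_qc, drop = FALSE]
rsd_qc <- apply(qc, 1, sd) / rowMeans(qc)

# Flag features whose QC RSD exceeds the illustrative threshold
flagged <- rsd_qc > 0.2
data.frame(RSD_QC = round(rsd_qc, 3), flagged = flagged)
```

Because pooled QC samples are repeated injections of the same material, variation among them reflects analytical rather than biological variability, which is why they are a natural basis for quality metrics.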
Feature selection
Univariate statistics functions return a data.frame, to be manually filtered before inclusion into the feature data of the instance. Supervised learning functions return various data structures.
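The workflow of computing feature-wise statistics, keeping the columns of interest and adding them to the feature data can be sketched in base R as follows. This is an illustration on simulated data under assumed column names (t_test_P, t_test_P_FDR); it does not use or reproduce notame's own statistics functions.

```r
set.seed(7)
# Simulated abundances: 6 features (rows) x 10 samples (columns)
abundances <- matrix(rnorm(60, mean = 10), nrow = 6,
                     dimnames = list(paste0("Feature_", 1:6),
                                     paste0("S", 1:10)))
group <- factor(rep(c("A", "B"), each = 5))
# Give the first feature a real difference between groups
abundances[1, group == "B"] <- abundances[1, group == "B"] + 5

# One Welch t-test per feature, collected into a data.frame
results <- data.frame(
  Feature_ID = rownames(abundances),
  t_test_P = apply(abundances, 1, function(x) t.test(x ~ group)$p.value)
)
# Benjamini-Hochberg correction for multiple testing
results$t_test_P_FDR <- p.adjust(results$t_test_P, method = "BH")

# Keep only the columns of interest, then join them to the feature data
feature_data <- data.frame(Feature_ID = rownames(abundances),
                           Mass = seq(100, 600, by = 100))
feature_data <- merge(feature_data,
                      results[, c("Feature_ID", "t_test_P_FDR")],
                      by = "Feature_ID")
feature_data
```

Filtering the results before the join keeps the feature data compact, since only the statistics needed for later ranking or plotting are carried along.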
Comprehensive results visualizations are returned as ggplot objects and can be saved to file using save_plot(). Interesting features can be inspected with feature-wise visualizations which are saved to file by default but can be returned as a list.
Utilities
General utilities include combined_data() for representing the instance in a data.frame suitable for plotting and various functions for data wrangling. For keeping track of the analysis, notame offers a logging system operated using init_log(), log_text() and finish_log(). notame also keeps track of all the external packages used, offering you references for each. To see and log a list of references, use citations().
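The idea of a combined data.frame for plotting can be illustrated in base R: one row per sample, with the sample information followed by one column per feature. The sketch below uses toy data and is assumed to resemble, not replicate, what combined_data() returns.

```r
# Toy inputs: 3 features (rows) x 4 samples (columns)
abundances <- matrix(c(10, 12, 11, 13,
                       20, 19, 22, 21,
                       5, 6, 5, 7),
                     nrow = 3, byrow = TRUE,
                     dimnames = list(paste0("Feature_", 1:3),
                                     paste0("S", 1:4)))
sample_info <- data.frame(Sample_ID = colnames(abundances),
                          Group = c("A", "A", "B", "B"))

# Transpose the abundances so samples become rows, then bind the columns
combined <- cbind(sample_info, t(abundances))
combined

# The wide layout works directly with formula interfaces
aggregate(Feature_1 ~ Group, data = combined, FUN = mean)
```

With this layout, any feature can be plotted or summarized against any sample-level variable without further reshaping.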
Parallelization is used in many feature-wise calculations and is provided by the BiocParallel package. BiocParallel defaults to a parallel backend. For small-scale testing on Windows, it can be quicker to use serial execution:
BiocParallel::register(BiocParallel::SerialParam())
Authors & Acknowledgements
The first version of notame was written by Anton Klåvus for his master’s thesis in Bioinformatics at Aalto University (published under former name Anton Mattsson), while working for the University of Eastern Finland and Afekta Technologies. The package is inspired by analysis scripts written by Jussi Paananen and Oskari Timonen. The algorithm for clustering molecular features originating from the same compound is based on MATLAB code written by David Broadhurst, Professor of Data Science & Biostatistics in the School of Science, and director of the Centre for Integrative Metabolomics & Computational Biology at Edith Cowan University.
If you find any bugs or other things to fix, please submit an issue on GitHub! All contributions to the package are welcome!
Session information
## R version 4.5.1 (2025-06-13)
## Platform: x86_64-pc-linux-gnu
## Running under: Ubuntu 24.04.2 LTS
##
## Matrix products: default
## BLAS: /usr/lib/x86_64-linux-gnu/openblas-pthread/libblas.so.3
## LAPACK: /usr/lib/x86_64-linux-gnu/openblas-pthread/libopenblasp-r0.3.26.so; LAPACK version 3.12.0
##
## locale:
## [1] LC_CTYPE=C.UTF-8 LC_NUMERIC=C LC_TIME=C.UTF-8
## [4] LC_COLLATE=C.UTF-8 LC_MONETARY=C.UTF-8 LC_MESSAGES=C.UTF-8
## [7] LC_PAPER=C.UTF-8 LC_NAME=C LC_ADDRESS=C
## [10] LC_TELEPHONE=C LC_MEASUREMENT=C.UTF-8 LC_IDENTIFICATION=C
##
## time zone: UTC
## tzcode source: system (glibc)
##
## attached base packages:
## [1] stats graphics grDevices utils datasets methods base
##
## other attached packages:
## [1] BiocStyle_2.36.0
##
## loaded via a namespace (and not attached):
## [1] digest_0.6.37 desc_1.4.3 R6_2.6.1
## [4] bookdown_0.43 fastmap_1.2.0 xfun_0.52
## [7] cachem_1.1.0 knitr_1.50 htmltools_0.5.8.1
## [10] png_0.1-8 rmarkdown_2.29 lifecycle_1.0.4
## [13] cli_3.6.5 sass_0.4.10 pkgdown_2.1.3
## [16] textshaping_1.0.1 jquerylib_0.1.4 systemfonts_1.2.3
## [19] compiler_4.5.1 tools_4.5.1 ragg_1.4.0
## [22] evaluate_1.0.4 bslib_0.9.0 yaml_2.3.10
## [25] BiocManager_1.30.26 jsonlite_2.0.0 rlang_1.1.6
## [28] fs_1.6.6 htmlwidgets_1.6.4