Mass spectrometry has long stood at the heart of molecular discovery. Yet interpreting instrument data to extract biological insights can take weeks of expert analysis – and many findings remain hidden in complex biological data.
Matterworks’ Large Spectral Model (LSM) – the first self-supervised machine intelligence model built for multi-omics data – aims to change that. Trained on billions of spectra, it captures chemical and biological relationships directly from raw signals in minutes. Combined with Pyxis, an AI copilot for analytical chemists that streamlines method development and compound identification, and Amy, its counterpart for translational biologists, the platform connects every stage of the workflow – from data acquisition to biological interpretation.
Here, we speak with Niall O’Connor, CTO of Matterworks, about how the LSM is transforming LC-MS analysis, enabling faster discoveries, and helping scientists unlock biology at machine scale.
What impact might machine learning have on the life sciences – especially in a multi-omics context?
Machine learning in life sciences has traditionally been limited by data sets that are too small for training stand-alone models. Often, it’s simply too expensive or impractical to generate enough labeled samples to train a model properly. Self-supervised learning, on the other hand – a machine learning method that doesn’t rely on human-provided labels – removes that dependence. It enables us to make use of the billions of unlabeled spectra already available across platforms, organisms, and analytical methods – from small molecules and lipids to peptides and proteins.
This provides a general foundation in biochemistry from which we can fine-tune models for specific objectives in a much more data-efficient way. So instead of needing thousands of labeled samples, we might only need tens – far more realistic for most life science applications. In a multi-omic context, it means we’re not restricting ourselves to just one type of analysis or molecule. The model learns broadly across many types of biochemistry to develop a richer understanding of how biology works at different levels.
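To make that concrete, here is a minimal sketch of the pattern described above: a pretrained encoder is kept frozen and a lightweight classifier is fit on only a few dozen labeled samples. The `lsm_encode` function, the embedding size, and the data are placeholders for illustration, not the Matterworks API.

```python
# Minimal sketch: fitting a small classifier head on frozen, pretrained spectral embeddings.
# `lsm_encode` is a hypothetical stand-in for a self-supervised encoder; any model that maps
# a raw spectrum to a fixed-length vector would play this role.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

def lsm_encode(spectra: np.ndarray) -> np.ndarray:
    """Placeholder: project raw spectra (n_samples, n_mz_bins) to 256-dim embeddings."""
    rng = np.random.default_rng(0)
    projection = rng.normal(size=(spectra.shape[1], 256))  # stand-in for learned weights
    return spectra @ projection

# Tens of labeled samples, rather than thousands, is the point of the exercise.
raw_spectra = np.random.rand(40, 2000)      # 40 labeled spectra, 2000 m/z bins
labels = np.random.randint(0, 2, size=40)   # e.g. treatment vs. control

embeddings = lsm_encode(raw_spectra)        # encoder stays frozen; no gradient updates
head = LogisticRegression(max_iter=1000)    # small, data-efficient classifier head
scores = cross_val_score(head, embeddings, labels, cv=5)
print(f"mean cross-validated accuracy: {scores.mean():.2f}")
```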
How does the Large Spectral Model (LSM) build on that concept – and what makes it distinct from the large language models (LLMs) we hear so much about?
Large language models can summarize what metabolites mean in a biological context, such as cancer or the microbiome, but that depends on the groundwork being solid – accurate identification, correct concentrations, and well-filtered data.
We wanted to focus further upstream, improving that analytical foundation rather than just interpreting results. Therefore, we trained the Large Spectral Model (LSM) using a self-supervised approach, exposing it to a huge diversity of biomolecules across many platforms and instrument settings. The model learns directly from raw spectral data, encoding it into a lower-dimensional but highly informative representation.
When we use those encodings to train classifiers, they’re remarkably effective at distinguishing compound identities, biochemical properties, and even stratifying disease states. It’s clear the model has learned something genuinely meaningful about the underlying chemistry. That’s the key distinction: unlike a language model that takes text as input, the LSM takes spectral data.
It’s designed as a companion for analytical chemists – supporting better method selection, instrument tuning, identification, and quantification.
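As a rough illustration of that kind of self-supervised training, the toy sketch below assumes the objective is something like masked-peak reconstruction (an assumption; the actual LSM objective and architecture are not described here): random m/z bins are hidden, and an encoder-decoder pair learns to recover them, which forces the low-dimensional encoding to capture real spectral structure rather than noise.

```python
# Toy sketch of self-supervised pretraining on unlabeled spectra via masked reconstruction.
# The objective and architecture are assumptions for illustration, not the actual LSM.
import torch
import torch.nn as nn

N_BINS, LATENT_DIM = 2000, 128

encoder = nn.Sequential(nn.Linear(N_BINS, 512), nn.ReLU(), nn.Linear(512, LATENT_DIM))
decoder = nn.Sequential(nn.Linear(LATENT_DIM, 512), nn.ReLU(), nn.Linear(512, N_BINS))
optimizer = torch.optim.Adam(list(encoder.parameters()) + list(decoder.parameters()), lr=1e-3)

for step in range(100):                        # in practice: billions of spectra, many epochs
    spectra = torch.rand(64, N_BINS)           # stand-in batch of unlabeled raw spectra
    mask = torch.rand_like(spectra) < 0.3      # hide 30% of m/z bins
    corrupted = spectra.masked_fill(mask, 0.0)

    latent = encoder(corrupted)                # lower-dimensional, information-dense encoding
    reconstructed = decoder(latent)
    loss = ((reconstructed - spectra)[mask] ** 2).mean()  # score only the hidden bins

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```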
In practical terms, what kinds of challenges in LC-MS data interpretation does the LSM help to overcome?
One of the biggest impacts comes from how the model learns to recognize and separate signal from noise. Trained on over a billion spectra across many epochs, it’s seen all kinds of noise, artifacts, and instrument-specific quirks that usually require an experienced analyst to interpret correctly. As the model trains, it learns to distinguish between true chemical signal and instrument artifacts or background noise, effectively “deconvolving” the spectra.
Another area where it helps is compound identification. Traditional methods often rely on noise-sensitive metrics like cosine similarity, leading to false positives and negatives. The LSM instead works in a learned latent space, matching features more reliably and handling noisy data gracefully to produce more valid IDs. And for batch effects – a constant LC-MS headache – the model’s latent space reveals and corrects them directly. Altogether, the LSM refines the entire workflow: improving signal detection, stabilizing quantitation, and reducing noise-related errors from the start.
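The contrast with cosine-based library matching can be sketched in a few lines. Here `embed` is a hypothetical stand-in for a learned encoder and the spectra are random placeholders; the point is only that matching happens in a latent space rather than directly on noisy peak intensities.

```python
# Sketch: library matching by cosine similarity on raw peaks vs. distance in a learned latent space.
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def embed(spectrum: np.ndarray) -> np.ndarray:
    """Hypothetical learned encoder; a trained model would go here."""
    rng = np.random.default_rng(42)             # fixed seed so every call uses the same projection
    projection = rng.normal(size=(spectrum.shape[0], 128))
    return spectrum @ projection

library = {"glucose": np.random.rand(2000), "fructose": np.random.rand(2000)}
query = np.random.rand(2000)                    # noisy acquired spectrum

# Traditional matching: cosine similarity directly on peak intensities (noise-sensitive).
best_cosine = max(library, key=lambda name: cosine_similarity(query, library[name]))

# Latent-space matching: nearest neighbor among embedded library entries.
query_vec = embed(query)
best_latent = min(library, key=lambda name: np.linalg.norm(query_vec - embed(library[name])))
print(best_cosine, best_latent)
```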
Could you walk us through how the model turns raw spectral data into biological insight – and how do tools like Pyxis and Amy fit into that workflow?
We see Pyxis as the copilot for the analytical chemist, and Amy as the copilot for the translational biologist – together forming a connected workflow from method selection to biological insight.
As an example, we recently worked with a commercial partner on mechanistic interpretation of a clinical study with about 100 subjects. Pyxis recommended a reverse-phase LC-MS2 method with an optimized sample preparation to maximize biological contrast between treatment and control groups. Once that method was established, MS2 data were generated, allowing Pyxis’s untargeted identification model to assign molecular IDs across the entire sample set. The model then confirmed, post-acquisition, that because the samples were randomized and acquired together using a rapid method, there was no observable batch effect – an important validation step. With those concentration values in hand, Amy could then perform downstream biological analysis – in this case, running a differential metabolite analysis, generating a volcano plot, and identifying metabolomic and lipidomic species used to evaluate multiple hypothesized mechanisms of action.
What’s powerful here is that the user eliminated in-laboratory method development, peak picking, and weeks or months of downstream analysis. They identified meaningful biological differences in one continuous workflow. The system also actively helps users make better experimental decisions – ensuring instruments are properly calibrated, methods are optimized, and data quality is high. This enables researchers to move from experimental design to actionable insight in days, rather than the weeks or months that traditional LC-MS workflows usually require.
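The downstream step Amy handled in that example, a differential metabolite analysis with a volcano plot, follows a standard statistical pattern. The sketch below is a generic version with stand-in concentration tables, not Matterworks code: per-metabolite fold changes and Welch's t-tests between groups, plotted as effect size against significance.

```python
# Generic sketch of a differential metabolite analysis and volcano plot,
# using stand-in concentration tables (rows = subjects, columns = metabolites).
import numpy as np
import matplotlib.pyplot as plt
from scipy import stats

rng = np.random.default_rng(1)
n_metabolites = 500
treated = rng.lognormal(mean=0.0, sigma=0.5, size=(50, n_metabolites))
control = rng.lognormal(mean=0.0, sigma=0.5, size=(50, n_metabolites))

# Per-metabolite fold change and Welch's t-test between groups.
log2_fc = np.log2(treated.mean(axis=0) / control.mean(axis=0))
_, p_values = stats.ttest_ind(treated, control, axis=0, equal_var=False)

# Volcano plot: effect size vs. significance, with simple thresholds highlighted.
significant = (p_values < 0.05) & (np.abs(log2_fc) > 1.0)
plt.scatter(log2_fc[~significant], -np.log10(p_values[~significant]), c="grey", s=8)
plt.scatter(log2_fc[significant], -np.log10(p_values[significant]), c="red", s=8)
plt.xlabel("log2 fold change (treated / control)")
plt.ylabel("-log10 p-value")
plt.title("Differential metabolite analysis")
plt.show()
```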
Can you give any additional examples where the LSM has been applied and the impact it’s had?
We recently worked with a research group on a study aiming to distinguish ovarian cancer from cervical cancer and benign tumors using a blood assay. In their published work, the team identified a panel of ten lipid biomarkers that could differentiate between disease states. We took their raw spectral data and processed it through the LSM, which encodes the spectra into vector representations.
The traditional ten-lipid model achieved around 76 percent mean accuracy across validation folds, but using the LSM encodings, our model achieved about 98 percent accuracy on the same data splits. That improvement makes sense when you realize that the LSM is capturing far more of the biological signal. It’s not limited to a handful of biomarkers; it’s effectively using the entire biochemical fingerprint captured by the instrument.
While those ten lipids were definitely relevant, the model was able to detect other subtle, meaningful signals in the raw data missed by traditional workflows.
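That style of comparison, a fixed biomarker panel against full-spectrum embeddings evaluated on identical cross-validation splits, can be expressed generically as below. The data and feature dimensions here are placeholders; the accuracy figures quoted above come from the study itself, not from this sketch.

```python
# Sketch: comparing a small biomarker panel vs. full-spectrum embeddings on identical CV splits.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score

rng = np.random.default_rng(7)
n_samples = 120
panel_features = rng.normal(size=(n_samples, 10))        # stand-in for a ten-lipid panel
embedding_features = rng.normal(size=(n_samples, 256))   # stand-in for LSM spectral encodings
labels = rng.integers(0, 3, size=n_samples)              # e.g. ovarian / cervical / benign

cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)  # same splits for both feature sets
clf = LogisticRegression(max_iter=1000)

panel_acc = cross_val_score(clf, panel_features, labels, cv=cv).mean()
embedding_acc = cross_val_score(clf, embedding_features, labels, cv=cv).mean()
print(f"panel: {panel_acc:.2f}  embeddings: {embedding_acc:.2f}")
```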
How do you approach validation and benchmarking to ensure consistent performance across different instruments and workflows?
Designing benchmarks that genuinely reflect real-world analytical environments is one of the toughest challenges in machine learning, and is something we’ve put a lot of work into as a result.
First, we benchmark the model across multiple instruments and platforms to test performance under varied real-world conditions. We also evaluate how well it generalizes to unseen analytes, withholding certain compounds during training before reintroducing them to assess recognition accuracy. Additionally, we look at matrix effects – measuring the same analytes in different sample matrices to evaluate how well the model can predict and adapt to those changes. To probe robustness further, we design “hard” benchmarks focused on known analytical challenges such as isomer separation and chimeric spectra.
This isn’t just a computational exercise – our chemists are deeply involved, working with customers every day to assess whether the model has truly generalized to their instruments, analytes, and experimental designs. We take all of that feedback and feed it directly into our benchmarking loop – constantly validating that the model behaves as expected in the messy, variable real world of analytical science.
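The unseen-analyte test mentioned above maps onto a simple evaluation pattern: split by compound identity rather than by spectrum, so no analyte in the test set ever appears during training. A generic sketch with placeholder data:

```python
# Sketch: benchmarking generalization to unseen analytes by splitting on compound identity,
# so every spectrum of a held-out compound is excluded from training.
import numpy as np
from sklearn.model_selection import GroupShuffleSplit

rng = np.random.default_rng(3)
n_spectra = 1000
spectra = rng.random((n_spectra, 2000))              # placeholder spectra
compound_ids = rng.integers(0, 200, size=n_spectra)  # which analyte each spectrum belongs to

splitter = GroupShuffleSplit(n_splits=1, test_size=0.2, random_state=0)
train_idx, test_idx = next(splitter.split(spectra, groups=compound_ids))

# No compound appears on both sides of the split.
assert set(compound_ids[train_idx]).isdisjoint(compound_ids[test_idx])
print(f"train spectra: {len(train_idx)}, held-out analyte spectra: {len(test_idx)}")
```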
What do you see as the biggest opportunities for AI-driven data interpretation in analytical science over the next few years?
One of the most exciting areas we’re working on right now is de novo identification. Traditionally, when you acquire a spectrum, you compare it against a spectral library to confirm the compound it represents. However, even the most comprehensive libraries only cover around 10 to 15 percent of what’s typically found in a sample, leaving an enormous amount of “dark matter” in the data uncharacterized.
The real opportunity now is to use these AI models to generate plausible molecular structures directly from spectral data to predict what those uncharacterized compounds could be. And it’s not just about curiosity – those unknowns are there for a reason, driving some function or phenotype that matters. Unlocking that with AI-driven interpretation could dramatically expand our understanding of complex systems and reveal entirely new classes of molecules that traditional workflows have missed.
Any final thoughts on machine learning and the future of mass spectrometry?
My belief is that by applying modern machine learning, we can move toward machine-scale insight in healthcare and life sciences. There’s an incredible amount of new biology sitting there in the signal, just waiting to be discovered – and the only way to truly unlock that is by using a new computational approach.
Mass spectrometry is an incredibly powerful tool, but to leverage it to its full potential, we need a fundamentally new way to analyze the data it produces. The LSM provides machine-scale intelligence to unlock that potential and make better use of the data already available.
