In a major step toward decoding cellular architecture, researchers from UC San Diego, Harvard, and Stanford have built a comprehensive map of human cell organization by integrating proteome-scale interaction and imaging data. Using self-supervised machine learning, the team combined confocal microscopy and affinity purification-mass spectrometry to define 275 molecular assemblies spanning from protein complexes to organelles – laying the groundwork for new structural and functional discoveries.
We spoke with Leah Schaffer, co-lead author of the study, to discuss the development of the approach, as well as the role of AI in interpreting complex data and the broader implications for biomedical research.
What was the initial spark or challenge that led you to pursue a multimodal approach to mapping the cell?
Different data modalities capture information on proteins at different scales. For example, many mass spectrometry-based biophysical approaches reveal interactions between proteins, including protein pairs in close proximity within complexes. Imaging approaches typically reveal information on a larger scale, identifying where proteins are most localized within a cell, for instance. By integrating multiple modalities, the structure of cells can be mapped out across these scales more effectively to identify complexes within larger cell compartments. Furthermore, multimodal evidence can be generated for new discoveries, such as previously undocumented complexes.
Can you explain how you used machine learning to integrate the proteomics and imaging data?
Our approach uses deep machine learning to create a lower-dimensional representation for each protein, capturing information from the two original high-dimensional datasets. Once these representations have been obtained, downstream analyses – such as clustering at different resolutions – can be performed to determine protein assemblies of various sizes. These range from small protein complexes to larger structures, such as organelles.
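As a rough illustration of the idea described above, the following Python sketch fuses two toy feature sets into one vector per protein and then clusters those vectors at two resolutions. It is not the study's method: the self-supervised model is replaced here by simple normalize-and-concatenate fusion, the clustering is plain similarity thresholding with connected components, and the protein names (`P1` to `P5`), feature values, thresholds, and function names are all hypothetical.

```python
import math
from itertools import combinations

# Hypothetical toy features (made-up values, not data from the study).
# Imaging-like features: coarse localization, shared by P1-P4.
IMAGING = {
    "P1": [1.0, 0.0], "P2": [1.0, 0.0],
    "P3": [1.0, 0.0], "P4": [1.0, 0.0],
    "P5": [0.0, 1.0],
}
# Interaction-like features: finer grouping into candidate complexes.
INTERACTION = {
    "P1": [1.0, 0.0, 0.0], "P2": [1.0, 0.0, 0.0],
    "P3": [0.0, 1.0, 0.0], "P4": [0.0, 1.0, 0.0],
    "P5": [0.0, 0.0, 1.0],
}


def unit(v):
    """Scale a vector to unit length so both modalities carry equal weight."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def fuse(imaging, interaction):
    """Stand-in for a learned embedding: normalize each modality, then concatenate."""
    return {p: unit(imaging[p]) + unit(interaction[p]) for p in imaging}


def cosine(u, v):
    """Cosine similarity between two vectors of equal length."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))


def cluster(embeddings, threshold):
    """Link proteins whose similarity meets the threshold; return connected components."""
    parent = {p: p for p in embeddings}

    def find(p):
        # Follow parent links to the root, shortening the path along the way.
        while parent[p] != p:
            parent[p] = parent[parent[p]]
            p = parent[p]
        return p

    for p, q in combinations(embeddings, 2):
        if cosine(embeddings[p], embeddings[q]) >= threshold:
            parent[find(p)] = find(q)

    groups = {}
    for p in embeddings:
        groups.setdefault(find(p), []).append(p)
    return sorted(sorted(g) for g in groups.values())


def multiscale(embeddings, thresholds):
    """Cluster the same embeddings at several resolutions (one per threshold)."""
    return {t: cluster(embeddings, t) for t in thresholds}


if __name__ == "__main__":
    fused = fuse(IMAGING, INTERACTION)
    for threshold, groups in multiscale(fused, [0.9, 0.4]).items():
        print(threshold, groups)
```

In this toy example, the strict threshold (0.9) separates P1/P2 and P3/P4 into two small, complex-like groups, while the looser threshold (0.4) merges all four into one larger, compartment-like group and leaves P5 on its own. The nesting of small groups inside larger ones is the property that multi-resolution clustering is meant to expose.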
What were the major technical or analytical challenges you encountered during the fusion of your datasets?
Without a comprehensive “gold standard” in biology, it can be difficult to evaluate different approaches. To overcome this, we used the available “gold standards” – the Gene Ontology, for example – as well as other proteome-wide orthogonal datasets. As new cell mapping datasets are generated, I anticipate that we will be able to evaluate protein assemblies more effectively, with more robust evidence.
Could you speak about the wider potential of using AI to glean insights from analytical data?
There are many exciting areas in which we’re seeing the impact of AI right now. One way we used LLMs in this project was to annotate protein assemblies and hypothesize their function, based on the available literature. In the future, I believe AI models will continue to enhance our understanding of the ways new data fits into our existing bases of biological knowledge, while also supporting new biological discoveries.
What impact could your approach have across fields?
There are still plenty of unknown protein complexes and pathways, many of which are disease-relevant. By mapping out cell structures with these global proteome-wide datasets, we can better understand the complexes or pathways each protein is involved in without the biases associated with how well-studied the proteins involved are. In this way, this approach could implicate proteins in disease-relevant pathways that could be strong drug targets. Additionally, cell maps could improve our understanding of disease, such as how mutations in cancer converge at the protein assembly level.
What are your team’s next steps?
Our next steps include the expansion of our cell mapping approach to include additional cellular contexts, such as different cell types or dynamic states like drug exposure. We’re also interested in integrating additional data modalities, such as functional data or high-throughput approaches that map out cell structure. We envision the community using the Multiscale Integrated Cell portal to browse our cell maps for specific proteins of interest. Additionally, we hope our tools will continue to be used to construct cell maps from new datasets, enabling scientists to browse multimodal data and uncover new biological insights.