https://blog.stephenturner.us/p/weekly-recap-dec-2024-part-1 This week’s recap highlights the WorkflowHub registry for computational workflows, building a virtual cell with AI, a review on bioinformatics methods for prioritizing causal genetic variants in candidate regions, a benchmarking study showing deep learning methods are best for variant calling in bacterial nanopore sequencing, and a new ML model from researchers at Genentech for predicting cell-type- and condition-specific gene expression across diverse biological contexts. Others that caught my attention include a new tool for applying rearrangement distances to enable plasmid epidemiology (pling), a commentary on ethical governance for genomic data science in the cloud, a method for filtering genomic deletions using CNNs (sv-channels), in silico generation of synthetic cancer genomes using generative AI, a new tool for evaluating how close an assembly is to T2T, a long context RNA foundation model for predicting transcriptome architecture, open-source USEARCH 12, and the Dog10K database summarizing canine multi-omics. Audio generated with NotebookLM. Subscribe to Paired Ends (free) to get summaries like this delivered to your e-mail. Deep dive WorkflowHub: a registry for computational workflows Paper: Ove Johan Ragnar Gustafsson et al., "WorkflowHub: a registry for computational workflows", 2024. arXiv. https://arxiv.org/abs/2410.06941. Workflows in the life sciences are fragmented across multiple ecosystems including nf-core (Nextflow), Intergalactic Workflow Commission (Galaxy), the Snakemake catalog, Dockstore (mostly WDL), or in generalist repositories like Zenodo, DataVerse, GitHub, etc. Seqera pipelines is a curated collection of high-quality, open-source pipelines, and ONT maintains a collection of curated pipelines at EPI2ME Workflows (both are Nextflow only). WorkflowHub is an attempt to make workflows more FAIR, and supports Snakemake, Galaxy, Nextflow, WDL, CWL, and “kindaworkflows” in Bash, Jupyter, Python, etc. TL;DR: This paper introduces WorkflowHub.eu, a platform for registering, sharing, and citing computational workflows across multiple scientific disciplines. It promotes reproducibility and FAIR (Findable, Accessible, Interoperable, Reusable) workflows, supporting collaborations and workflow lifecycle management. Summary: The paper presents WorkflowHub.eu, a community-driven platform designed to address the challenge of finding, sharing, and reusing computational workflows. The registry supports workflows from diverse domains and integrates with various platforms, enabling workflows to become more discoverable, reproducible, and accessible for both humans and machines. The importance of WorkflowHub lies in its ability to promote collaboration, assign credit, and make workflows scholarly artifacts through FAIR principles. The platform has already registered over 700 workflows from numerous research organizations, demonstrating its global impact. Applications of WorkflowHub extend across different fields, including bioinformatics, astronomy, and particle physics, where computational workflows are essential for processing large-scale data. Highlights: * Provides a unified registry that links to community repositories, enabling workflow lifecycle support and promoting FAIRness. * Integrates with diverse workflow management systems (e.g., Nextflow, Snakemake) and services, enabling seamless sharing, citation, and execution of workflows. * Offers metadata support, making workflows easily searchable and findable across different domains and workflow languages. * WorkflowHub is connected to the LifeMonitor service (lifemonitor.eu), which allows workflow function and status to be reported to maintainers and users through regular automated tests driven by continuous integration. According to the docs, “LifeMonitor supports the application of workflow sustainability best practices. Much of this revolves around following community-accepted conventions for your specific workflow type and implementing periodic workflow testing. Following conventions allows existing computational tools to understand your workflow and its metadata.” * WorkflowHub: https://workflowhub.eu/ How to Build the Virtual Cell with Artificial Intelligence: Priorities and Opportunities Paper: Bunne et al., "How to Build the Virtual Cell with Artificial Intelligence: Priorities and Opportunities," arXiv, 2024. https://arxiv.org/abs/2409.11654. This paper from Stanford, Genentech, CZI, Arc Institute, Microsoft, Google, Calico, EMBL, Harvard, EvolutionaryScale, and many others, proposes an ambitious vision of using AI to construct high-fidelity simulations of cells and systems that are directly learned from biological data. TL;DR: This paper discusses the creation of an AI Virtual Cell, a comprehensive, data-driven model designed to simulate and predict cellular behaviors across different scales (molecular to multicellular) and contexts. It highlights the role of AI in building accurate cellular representations and guiding in silico experiments. AI virtual cells could be used to identify new drug targets, predict cellular responses to purturbations, and scale hypothesis exploration. Summary: The paper outlines a vision for developing AI-powered Virtual Cells, models capable of simulating cellular functions and interactions across molecular, cellular, and multicellular scales. Leveraging advances in AI and omics technologies, the Virtual Cell could serve as a foundation model to predict cell behavior, simulate responses to perturbations, and guide biological research. The importance of this work lies in its potential to revolutionize biological research by enabling high-fidelity, in silico experimentation, providing deeper insights into cellular mechanisms, and facilitating drug discovery and cell engineering. Applications range from identifying drug targets to predicting disease progression, offering a versatile tool for biologists and clinicians. The approach emphasizes collaboration across academia, AI, and biopharma industries to build this resource and ensure it becomes widely accessible. Methodological highlights: * Multi-scale modeling: The AI Virtual Cell captures interactions from molecules to tissues using AI techniques such as graph neural networks and transformers. * In silico experimentation: Enables testing of cellular responses to various perturbations without the need for costly physical experiments. * Universal Representations (UR): A framework that integrates multi-modal biological data to predict cellular behaviors across different contexts and species. A bioinformatics toolbox to prioritize causal genetic variants in candidate regions Paper: Martin Šimon et al., "A bioinformatics toolbox to prioritize causal genetic variants in candidate regions," Trends in Genetics, 2024. DOI: 10.1016/j.tig.2024.09.007. TL;DR: Complex polygenic traits are influenced by quantitative trait loci (QTLs) with small effect sizes, and proving a specific gene is causal and identifying the exact genetic variant(s) responsible for the QTL effect is difficult. This review introduces a bioinformatics toolbox for prioritizing causal genetic variants within quantitative trait loci (QTLs) using multiomics approaches. Summary: The paper addresses the challenge of pinpointing causal genetic variants in complex traits and diseases, especially when dealing with polygenic traits controlled by multiple quantitative trait loci (QTLs). Despite advances in mapping these loci, determining causality remains complex. The review proposes integrating bioinformatics and multiomics techniques to streamline the process of identifying and prioritizing candidate variants within QTLs. Using a case study of the Pla2g4e gene in mice, the authors illustrate how SNPs in regulatory elements can be filtered to focus on the most likely causal candidates. This approach reduces the experimental workload by narrowing down potential variants for functional validation. The work emphasizes the need for hierarchical filtering of SNPs and prioritizing those within known regulatory regions, which can accelerate genetic research and therapeutic discoveries. Highlights: * Introduces a hierarchical bioinformatics strategy to prioritize SNP-containing regulatory elements for identifying causal variants in QTLs. * Multiomics integration (transcriptomics, epigenomics) is used to map regulatory elements and their associated SNPs, refining candidate lists for experimental validation. * Emphasizes the use of SNPs in regulatory elements (promoters, enhancers, open chromatin) as primary filtering tools for narrowing down candidate variants. Benchmarking reveals superiority of deep learning variant callers on bacterial nanopore sequence data Paper: Hall et al., "Benchmarking reveals superiority of deep learning variant callers on bacterial nanopore sequence data," eLife, 2024. https://doi.org/10.7554/eLife.98300. TL;DR: This study benchmarks the performance of deep learning and traditional variant callers on bacterial genomes sequenced using Oxford Nanopore Technologies (ONT) nanopore sequencing. It finds that deep learning-based tools, particularly Clair3 and DeepVariant, outperform conventional methods, even surpassing Illumina sequencing in some cases. Summary: This paper evaluates the accuracy of SNP and indel variant callers on long-read bacterial genome data generated using Oxford Nanopore Technologies (ONT). The authors compare deep learning-based variant callers (Clair3, DeepVariant) with traditional ones (BCFtools, FreeBayes, etc.) using bacterial samples sequenced with high-accuracy ONT models (R10.4.1). The results show that Clair3 and DeepVariant consistently provide higher precision and recall than traditional methods, even outperforming Illumina sequencing in some regions, especially with ONT's super-accuracy basecalling model. The paper emphasizes the practical application of ONT sequencing and deep lea