Paired Ends Podcast

Stephen Turner

Bioinformatics, computational biology, and data science updates from the field. Occasional posts on programming. blog.stephenturner.us

Episodes

  1. JAN 15

    AI in data science education

    https://blog.stephenturner.us/p/ai-in-data-science-education Something a little different for this week’s recap. I’ve been thinking a lot lately about the practice of data science education in this era of widely available (and really good!) LLMs for code. I provide some commentary at the top based on my own data science teaching experience, with a deep dive into a few recent papers below. Audio generated with NotebookLM. AI in the CS / data science classroom I was a professor of public health at the University of Virginia School of Medicine for about 8 years, where I taught various flavors of an introductory biomedical data science graduate course and workshop series. I taught these courses using the pedagogical practices I picked up while becoming a Software Carpentry instructor — lots of live coding, interspersed with hands-on exercises, with homework requiring students to write and submit code as part of their assignments. I’ve seen firsthand what every experienced software developer deeply understands — how hands-on coding practice, with all its frustrations and breakthrough moments, builds deep understanding. This method of teaching is effective precisely because it embraces the productive struggle of learning. Students develop robust mental models through the cycle of attempting, failing, debugging, and finally succeeding. I think we are well into a pivotal moment in data science and computer science education. As Large Language Models (LLMs) like ChatGPT, Claude, and GitHub Copilot demonstrate increasingly sophisticated code generation abilities, educators are going to face extraordinary opportunities and profound challenges in teaching the next generation of data scientists. The transformative potential of LLMs as learning tools can’t be overstated. I recently posted about my experience using Claude to help me write code in a language I don’t write (JavaScript), for a framework I wasn’t familiar with (Chrome extensions). 
In education, these AI assistants can provide instant, contextual help when students are stuck, offer clear explanations of complex code snippets, and even suggest alternative approaches to problems. They act as always-available teaching assistants, ready to engage in detailed discussions about programming concepts and implementation details. The ability to ask open-ended questions about code and receive thoughtful explanations represents an unprecedented resource for learners. However, this easy access to AI-generated solutions raises important questions about the nature of learning itself. Will students develop the same depth of understanding if they can simply request working code rather than struggle through implementation challenges? When I used to teach, my grad students and workshop participants were practitioners — they were taking my classes because they needed to use R/Python/etc. in their daily work. If I were them now, I’d turn on Copilot or have Claude/ChatGPT pulled up alongside my VSCode/RStudio. Do we really expect students to turn Copilot off just for their homework, and leave it on for their daily work? How do we balance the undeniable benefits of AI assistance with the essential learning that comes from wrestling with difficult problems without a 90% correct completion suggestion? The risk is that we might produce data scientists who can leverage AI tools effectively but lack the foundational knowledge to reason about code and data from first principles. As I try to follow the recent research on LLMs in data science education, including some of the papers below, these questions become increasingly pressing to me. The challenge before us is not whether to incorporate these powerful tools into our teaching, but how to do so in a way that enhances rather than short-circuits the learning process. 
The goal must be to harness LLMs’ capabilities while preserving the deep understanding that comes from genuine engagement with computational thinking and problem-solving. For another perspective and great insight into this topic, I suggest following Ethan Mollick’s newsletter, One Useful Thing. His recent essay on “Post-apocalyptic education” is a worthwhile read on this topic. Are you a CS or data science educator? How does this resonate with your thoughts? I’d love to chat more on Bluesky (@stephenturner.us). Papers (and a talk): Deep dive ChatGPT for Teaching and Learning: An Experience from Data Science Education Paper: Zheng, Y. “ChatGPT for Teaching and Learning: An Experience from Data Science Education” in SIGCSE 2024: Proceedings of the 55th ACM Technical Symposium on Computer Science Education. https://doi.org/10.1145/3585059.3611431. TL;DR: A practical examination of ChatGPT's use in data science education, studying how it impacts teaching effectiveness and student learning outcomes. The study gathered perspectives from students and instructors in data science courses, revealing both opportunities and challenges specific to the field. Summary: This study evaluates ChatGPT's integration into data science education through real-world practice and user studies with graduate students. The research investigates how effectively ChatGPT assists with tasks ranging from coding explanations to concept understanding. The findings suggest ChatGPT excels in explaining programming concepts and providing detailed API explanations but faces challenges in problem-solving scenarios. The study underscores the importance of balancing AI assistance with fundamental learning, particularly noting that ChatGPT's effectiveness varies across different aspects of data science education. Highlights: * User studies conducted with 28 students across multiple scenarios testing different aspects of ChatGPT's capabilities in data science education. 
* Quantitative evaluation using a 1-5 scale questionnaire measuring student perceptions across various learning tasks. * Mixed-methods approach combining student feedback with instructor observations and practical implementation tests. Implications of ChatGPT for Data Science Education Paper: Shen, Y. et al., “Implications of ChatGPT for Data Science Education” in SIGCSE 2024: Proceedings of the 55th ACM Technical Symposium on Computer Science Education. https://doi.org/10.1145/3626252.3630874. TL;DR: A systematic study evaluating ChatGPT's performance on data science assignments across different course levels. The research demonstrates how prompt engineering can significantly improve ChatGPT's effectiveness in solving complex data science problems. Summary: The study evaluates ChatGPT's capabilities across three different levels of data science courses, from introductory to advanced. The researchers found that ChatGPT's performance varies significantly based on problem complexity and prompt quality, achieving high success rates with well-engineered prompts. The work provides practical insights into integrating ChatGPT into data science curricula while maintaining educational integrity. Highlights: * Comparative analysis across three course levels using standardized evaluation metrics. * Development and testing of prompt engineering techniques specific to data science education. * Cross-validation of results using multiple assignment types and difficulty levels. * Evaluation framework for assessing LLM performance in data science education. * Collection of example assignments and their corresponding engineered prompts provided in supplementary materials. What Should Data Science Education Do With Large Language Models? Paper: Tu, X., Zou, J., Su, W., & Zhang, L. “What Should Data Science Education Do With Large Language Models?” in Harvard Data Science Review, 2024. https://doi.org/10.1162/99608f92.bff007ab. 
TL;DR: A comprehensive analysis of how LLMs are reshaping data science education and the necessary adaptations in teaching methods. The paper argues for a fundamental shift in how we approach data science education in the era of LLMs, focusing on developing higher-order thinking skills. Summary: The paper examines the transformative impact of LLMs on data science education, suggesting a paradigm shift from hands-on coding to strategic planning and project management. It emphasizes the need for curriculum adaptation to balance LLM integration while maintaining core competencies. The authors propose viewing data scientists more as product managers than software engineers, focusing on strategic planning and resource coordination rather than just technical implementation. Highlights: * Theoretical framework development for integrating LLMs into data science education. * Case study analysis using a heart disease dataset to demonstrate LLM capabilities. * Critical analysis of LLM limitations and educational implications. * Prompts and responses with ChatGPT provided in supplementary materials. Talk: Teaching and learning data science in the era of AI Andrew Gard gave a talk at the 2024 Posit conference on teaching and learning data science in the era of AI. The talk’s premise is that everyone learning data science these days (1) has ready access to AI, and (2) is strongly incentivized to use it. It’s a short, ~5-minute lightning talk. It’s worth watching, and it provides a few points of advice for data science learners. 
Other papers of note * Cognitive Apprenticeship and Artificial Intelligence Coding Assistants https://www.igi-global.com/gateway/chapter/340133 * It's Weird That it Knows What I Want: Usability and Interactions with Copilot for Novice Programmers https://dl.acm.org/doi/10.1145/3617367 * Computing Education in the Era of Generative AI https://dl.acm.org/doi/10.1145/3624720 * The Robots Are Here: Navigating the Generative AI Revolution in Computing Education https://dl.acm.org/doi/10.1145/3623762.3633499 * From "Ban It Till We Understand It" to "Resistance is Futile": How University Programming Instructors Plan to Adapt as More Students Use AI Code Generation and Explanation Tools such as

    11 min
  2. 12/23/2024

    The Enlightenment Conservatory

    https://blog.stephenturner.us/p/enlightenment-conservatory I had good intentions to give NaNoWriMo a try this year but didn’t get very far. Instead I gave OpenAI’s Creative Writing Coach GPT a try for a (very) short story I had in mind, inspired by my frustration trying to access closed-access research articles for a review article I’m preparing. I found it to be an excellent writing coach with specific advice for refining the role of the curators, expanding the perspective of the cultivators, deepening the emotional stakes, clarifying the catalyst for change, polishing the resolution, adding complexity, making the revolt more dramatic, and fine-tuning the language. Image created with DALL-E. Voiceover with ElevenLabs. In a world not so different from our own, there existed a fabled garden called the Enlightenment Conservatory. Here, ideas took root as seeds of thought, blooming into radiant flowers of discovery and wisdom. Each blossom held the promise of transformation - groundbreaking theories, profound insights, and untold wonders capable of reshaping the world. It was said that no other garden in existence could rival its beauty or its mystery. The Conservatory was tended by a diverse group of dedicated cultivators. These scholars came from all corners of the world, driven by an insatiable curiosity and a passion for nurturing new ideas. They spent their days and nights planting seeds of thought, carefully tending to them, and watching in awe as their conceptual flowers blossomed into vibrant displays of intellectual beauty. Each bloom was unique, representing the culmination of the cultivators' hard work, creativity, and brilliance. However, the Enlightenment Conservatory was not open to all. Surrounding it stood a tall, impenetrable wall, erected long ago by a powerful guild known as the Curators. 
Through a series of cunning maneuvers and ruthless acquisitions, the Curators had gained control over all the smaller intellectual gardens that once existed independently. Now, they ruled the Enlightenment Conservatory with an iron fist. The Curators enforced one unyielding rule: entry to the Conservatory came at an outrageous price. Even the cultivators - those who had poured their hearts and minds into planting and nurturing each idea - were not spared. To gaze upon their own intellectual blooms, they too had to pay the Curators' steep toll. Many could only catch fleeting glimpses of their creations from outside the towering walls, denied the chance to savor the fruits of their labor. Their brilliance was trapped behind gates they could never afford to open. The Enlightenment Conservatory was meant to be a place where people from all walks of life could come to marvel at the wonders of human thought and insight, where ideas could be shared freely and openly. But under the Curators' rule, it became a bastion of exclusivity. Only the wealthiest patrons and members of the most prestigious institutions could afford to enter and enjoy the intellectual bounty within. These privileged few would stroll through the Conservatory, plucking ideas at will, while the majority remained outside the walls, unable to access the knowledge and insights that had been so carefully cultivated. The Curators defended their dominion by calling themselves the stewards of the Enlightenment Conservatory. They claimed their strict oversight was essential to protect the garden from mediocrity, ensuring only the most refined and worthy ideas took root. Without their watchful gaze, they warned, the Conservatory would drown in a sea of weeds, its beauty choked by chaos. But the cultivators saw through the façade. They knew the Curators tended nothing; they merely harvested the fruits of others’ labor while the true blooms of genius often went unnoticed, left to wither in the shadows. 
They knew that the Curators did little to actually care for the Conservatory. The intellectual blooms within its walls were almost always unchanged from the moment they had been planted. The Curators did not prune, water, or tend to the flowers of thought; they simply collected fees and claimed ownership of every bloom. Worse still, they often overlooked some of the most extraordinary ideas, leaving them to wither and die, while promoting others simply because they had been paid to do so. But for all the Curators' lofty claims, the Enlightenment Conservatory began to wither. Its once-thriving ecosystem of ideas grew barren, choked by exclusion. Young cultivators from distant lands - those with the boldest, freshest seeds of thought - were turned away, unable to pay the Curators' crushing fees. Some gave up entirely, their unplanted ideas fading like dreams forgotten at dawn. Others tried to nurture their seeds in secret, but without the support of the Conservatory, their efforts bore no fruit. The world would never know the brilliance that had been lost, and the cultivators could only watch as the garden they loved fell into quiet decline. The cultivators' frustration grew into a quiet despair. They had poured their souls into planting seeds of thought, nurturing them with endless care, only to see their work imprisoned behind walls they could not afford to scale. What use was a garden of wisdom if it bloomed in the dark, unseen and unshared? They began to speak out, calling for change. They envisioned a Conservatory where all could enter freely, where the flowers of knowledge and insight could be shared by everyone, regardless of wealth or status. They dreamed of an intellectual paradise that truly reflected the diversity and richness of the world's ideas, unencumbered by the greed and control of the Curators. As the cries for change swelled, a few bold cultivators decided they could wait no longer. 
They slipped beyond the Conservatory's walls and began planting their seeds of thought in the wild, in open fields where anyone - rich or poor, learned or curious - could come and marvel. These free gardens burst into dazzling bloom, spilling over with ideas as vibrant and diverse as the people who tended them. The movement spread like wildfire, and as more cultivators turned away from the Conservatory, the Curators panicked. They scrambled to suppress the rebellion, but it was too late. The walls that had stood for centuries began to crack. In time, the Enlightenment Conservatory was no longer the sole sanctuary of wisdom. The walls that had once loomed so high crumbled to dust as people discovered they didn't need gates to access the beauty of knowledge. All around, new gardens flourished, each more diverse and vibrant than the last. The Conservatory itself, no longer shrouded in exclusivity, was reborn as a shared space for all. Its paths teemed with visitors, its flowers of insight blooming brighter than ever in the sunlight of collaboration and open exchange. At last, the cultivators' dream had come to life: a world where ideas could roam free, taking root wherever they were needed most. And so, the Enlightenment Conservatory was transformed. No longer a place of exclusion, it became a symbol of what could be achieved when knowledge, discovery, and insight were shared freely and openly, for the benefit of all. The cultivators continued their work, more inspired than ever, knowing that their intellectual blooms would flourish in a world where everyone could enjoy them, without barriers, without walls. This is a public episode. If you would like to discuss this with other subscribers or get access to bonus episodes, visit blog.stephenturner.us

    8 min
  3. 12/20/2024

    Weekly Recap (Dec 2024, part 3)

    https://blog.stephenturner.us/p/weekly-recap-dec-2024-part-3 This week’s recap highlights the Evo model for sequence modeling and design, biomedical discovery with AI agents, improving bioinformatics software quality through teamwork, a new tool from Brent Pedersen and Aaron Quinlan (vcfexpress) for filtering and formatting VCFs with Lua expressions, a new paper about the NHGRI-EBI GWAS Catalog, and a review paper on designing and engineering synthetic genomes. Others that caught my attention include a new foundation model for scRNA-seq, a web-based platform for reference-based analysis of single-cell datasets, an AI system for learning how to run transcript assemblers, ATAC-seq QC and filtering, metagenome binning using bi-modal variational autoencoders, analyses of outbreak genomic data using split k-mer analysis, a review on Denisovan introgression events in modern humans, T2T assembly by preserving contained reads, and a commentary on AI readiness in biomedical data. Audio generated with NotebookLM. Subscribe to Paired Ends (free) to get summaries like this delivered to your e-mail. Deep dive Sequence modeling and design from molecular to genome scale with Evo Paper: Nguyen et al., "Sequence modeling and design from molecular to genome scale with Evo," Science, 2024. https://doi.org/10.1126/science.ado9336. Before getting to this new paper from the Arc Institute, there’s also a Perspective paper published in the same issue, providing a very short introduction that’s also worth reading (“Learning the language of DNA”). TL;DR: This study introduces Evo, a groundbreaking genomic foundation model that learns complex biological interactions at single-nucleotide resolution across DNA, RNA, and protein levels. This model, trained on a massive set of prokaryotic and phage genomes, can predict how variations in DNA affect functions across regulatory, coding, and noncoding RNA regions. 
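The zero-shot variant-effect idea can be illustrated with a toy sketch: score the wild-type and mutant sequences under a sequence model and take the log-likelihood ratio. The independent per-base "model" below is my own stand-in for illustration only, not Evo's architecture or API — a real genomic language model conditions each nucleotide on long-range context.

```python
import math

# Toy stand-in for a genomic language model: independent per-base
# probabilities. (A real model like Evo conditions each nucleotide on
# long-range context; this only shows the scoring idea.)
BACKGROUND = {"A": 0.3, "C": 0.2, "G": 0.2, "T": 0.3}

def loglik(seq, probs=BACKGROUND):
    """Log-likelihood of a sequence as the sum of per-base log-probabilities."""
    return sum(math.log(probs[base]) for base in seq)

def zero_shot_effect(wildtype, mutant, probs=BACKGROUND):
    """Predicted mutation effect as a log-likelihood ratio.

    Negative values mean the mutant looks less 'fluent' to the model,
    which zero-shot fitness prediction reads as likely deleterious."""
    return loglik(mutant, probs) - loglik(wildtype, probs)

wt = "ATGACCGT"
mut = "ATGACCGG"  # T->G at the last position
print(round(zero_shot_effect(wt, mut), 3))  # log(0.2/0.3), about -0.405
```

No labeled training data is needed for this kind of prediction, which is what makes the zero-shot framing attractive: the model's sense of sequence plausibility does all the work.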
The paper and the Perspective paper above emphasize the innovative use of the StripedHyena architecture, which enables Evo to handle large-scale genomic contexts efficiently, setting a precedent for future advancements in genome-scale predictive modeling and synthetic biology. Summary: The authors present Evo, a 7-billion-parameter language model trained on prokaryotic and phage genomes, designed to capture information at nucleotide-level resolution over long genomic sequences. Evo surpasses previous models by excelling in zero-shot function prediction tasks across different biological modalities (DNA, RNA, protein) and can generate functional biological systems like CRISPR-Cas complexes and transposons. It leverages advanced deep signal processing with the StripedHyena architecture to address limitations faced by transformer-based models, allowing it to learn dependencies across vast genomic regions. Evo's training was validated through experimental testing, including the synthesis of novel functional proteins and genome-scale sequence design, underscoring its potential to transform genetic engineering and synthetic biology. Methodological highlights: * Architecture: Utilized the StripedHyena, a hybrid attention-convolutional architecture, for efficient long-sequence processing at nucleotide-level resolution. * Training: Conducted on 2.7 million genomes, with a maximum context length of 131 kilobases. * Applications: Zero-shot predictions in mutation effects on fitness, and generation of operons, CRISPR systems, and large genomic sequences. * Code and models: Open source (Apache license) and available at https://github.com/evo-design/evo. Empowering biomedical discovery with AI agents Paper: Gao et al., "Empowering biomedical discovery with AI agents," Cell, 2024. https://doi.org/10.1016/j.cell.2024.09.022. I’ve covered AI agents for bioinformatics in the highlights sections of previous weekly recaps (e.g., BioMANIA and AutoBA). 
This is an interesting, if speculative, look into the present and future of agentic AI in life sciences research. TL;DR: This perspective paper discusses the potential of AI agents to transform biomedical research by acting as "AI scientists," capable of hypothesis generation, planning, and iterative learning, thus bridging human expertise and machine capabilities. Summary: The authors outline a future in which AI agents, equipped with advanced reasoning, memory, and perception capabilities, assist in biomedical discovery by combining large language models (LLMs) with specialized machine learning tools and experimental platforms. Unlike traditional models, these AI agents could break down complex scientific problems, run experiments autonomously, and propose novel hypotheses, while incorporating feedback to improve over time. This vision extends AI's role from mere data analysis to active participation in hypothesis-driven research, promising advances in areas such as virtual cell simulation, genetic editing, and new drug development. Key ideas: * Modular AI system: Integration of LLMs, ML tools, and experimental platforms to function as collaborative systems capable of reasoning and learning. * Adaptive learning: Agents dynamically incorporate new biological data, enhancing their predictive and hypothesis-generating capabilities. * Skeptical learning: AI agents analyze and identify gaps in their own knowledge to refine their approaches, mimicking human scientific inquiry. Improving bioinformatics software quality through teamwork Paper: Ferenc et al., "Improving bioinformatics software quality through teamwork," Bioinformatics, 2024. https://doi.org/10.1093/bioinformatics/btae632. One of the things this paper argues for is implementing code review. I used to work at a consulting firm, and I started a weekly code review session with me and my two teammates. 
In addition to improving code quality, it also increased the bus factor on a critical piece of software to n>1. I had a hard time scaling this. As my team grew from two to ~12, our weekly code review session turned into more of a regular standup-style what are you doing, what are you struggling with, etc., with less emphasis on code. I think the better approach would have been to make the larger meeting less frequent or async while holding smaller focused code review sessions with fewer people. On the other hand, I recently attended the nf-core hackathon in Barcelona where >140 developers came together to work on Nextflow pipelines, and I thought it was wildly successful. TL;DR: This paper argues that the quality of bioinformatics software can be greatly enhanced through collaborative efforts within research groups, proposing the adoption of software engineering practices such as regular code reviews, resource sharing, and seminars. Summary: This paper argues that bioinformatics software often suffers from inadequate quality standards due to individualistic development practices prevalent in academia. To bridge this gap, they recommend fostering teamwork and collective learning through structured activities such as code reviews and software quality seminars. The paper provides examples from the authors’ own experience at the Centre for Molecular Medicine Norway, where a community-driven approach led to improved coding skills, better code maintainability, and enhanced collaborative potential. This approach ensures researchers maintain ownership of their projects while leveraging the benefits of shared knowledge and collective feedback. Highlights: * Structured teamwork: Adoption of collaborative practices like code review sessions and quality seminars to improve software development culture. * Knowledge sharing: Emphasis on resource sharing to minimize redundant efforts and increase efficiency. 
* Community building: Cultivating a supportive environment that allows for skill-building across the team. * Resource website: The authors provide practical guidance and tools for fostering collaborative software development, accessible at https://ferenckata.github.io/ImprovingSoftwareTogether.github.io/. Vcfexpress: flexible, rapid user-expressions to filter and format VCFs Paper: Brent Pedersen and Aaron Quinlan, "Vcfexpress: flexible, rapid user-expressions to filter and format VCFs," Bioinformatics, 2024. https://doi.org/10.1101/2024.11.05.622129. On GitHub I “follow” both Brent and Aaron so I get notifications whenever either of them publishes a new repo. Brent has published many little utilities that improve a bioinformatician’s quality of life (mosdepth, somalier, vcfanno, smoove, to name a few). TL;DR: Vcfexpress is a new tool for filtering and formatting Variant Call Format (VCF) files that offers high performance and flexibility through user-defined expressions in the Lua programming language, rivaling BCFTools in speed but with extended functionality. Summary: The paper introduces vcfexpress, a powerful new tool designed to efficiently filter and format VCF and BCF files using Lua-based user expressions. Implemented in the Rust programming language, vcfexpress supports advanced customization that enables users to apply detailed filtering logic, add new annotations, and format output in various file types like BED and BEDGRAPH. It stands out by balancing speed and versatility, matching BCFTools in performance while surpassing it in analytical customization. Vcfexpress can handle complex tasks such as parsing fields from SnpEff and VEP annotations, providing significant utility for high-throughput genomic analysis. Methodological highlights: * Lua integration: Unique support for Lua scripting enables precise filtering and flexible output formatting. * High performance: Comparable in speed to BCFTools, yet offers additional, customizable logic. 
* Template output: Allows specification of output formats beyond VCF/BCF, such as BED files. * Code availability: On GitHub at https://github.com/brentp/vcfexpress. Permissively licensed (MIT). The NHGRI-EBI GWAS Catalog: standards for reusability, sustainability and diversity Paper: Cerezo M et al., “The NHG

    11 min
  4. 12/13/2024

    Weekly Recap (Dec 2024, part 2)

    https://blog.stephenturner.us/p/weekly-recap-dec-2024-part-2 This week’s recap highlights a new way to turn Nextflow pipelines into web apps, DRAGEN for fast and accurate variant calling, machine-guided design of cell-type-targeting cis-regulatory elements, a Nextflow pipeline for identifying and classifying protein kinases, a new language model for single cell perturbations that integrates knowledge from literature, GeneCards, etc., and a new method for scalable protein design in a relaxed sequence space. Others that caught my attention include commentary on improving bioinformatics software quality through teamwork, targeted nanopore sequencing for mitochondrial variant analysis, a review on plant conservation in the era of genome engineering, a de novo assembly tool for complex plant organelle genomes, learning to call copy number variants on low coverage ancient genomes, a near telomere-to-telomere phased reference assembly for the male mountain gorilla, a method for optimized germline and somatic variant detection across genome builds, a searchable large-scale web repository for bacterial genomes, and an integer programming framework for pangenome-based genome inference. Audio generated with NotebookLM. (The hosts were very excited about this issue!) Subscribe to Paired Ends (free) to get summaries like this delivered to your e-mail. Deep dive Cloudgene 3: Transforming Nextflow Pipelines into Powerful Web Services Paper: Lukas Forer and Sebastian Schönherr. Cloudgene 3: Transforming Nextflow Pipelines into Powerful Web Services. bioRxiv, 2024. DOI: 10.1101/2024.10.27.620456. I got to meet both Lukas and Sebastian in person at the Nextflow Summit. Lukas gave a talk on nf-test, while Sebastian gave a talk on the Michigan Imputation Server (MIS). MIS is implemented in Nextflow and driven using Cloudgene, and has helped over 12,000 researchers worldwide impute over 100 million samples. 
This paper describes Cloudgene for turning a Nextflow pipeline into a web service. TL;DR: Cloudgene 3 provides a user-friendly platform to convert Nextflow pipelines into scalable web services, allowing scientists to deploy and run complex bioinformatics workflows without requiring web development expertise. Summary: Cloudgene 3 addresses the challenge of deploying Nextflow pipelines as scalable web services, allowing researchers to leverage computational workflows without the need for technical setup or coding. The platform simplifies the transformation of Nextflow pipelines into “Cloudgene apps,” which include user-friendly interfaces and allow for seamless dataset management, job monitoring, and data security. By supporting features like workflow chaining and dataset integration, Cloudgene 3 enables collaborative and flexible use of pipelines across various scientific domains, from genomics to proteomics. This tool expands accessibility to complex analyses, facilitating data sharing and enhancing reproducibility, and has already been implemented in large-scale services like the Michigan Imputation Server. Its open accessibility and adaptable deployment model (cloud or local infrastructure) highlight its utility for bioinformatics workflows. Methodological highlights: * Converts Nextflow pipelines into web services with a few simple steps, creating portable “apps” that include metadata, input/output parameters, and multi-step workflows. * Integrates real-time status updates and error handling for Nextflow tasks, leveraging a unique secret URL for each task to monitor progress. * Supports cloud platforms and local installations, providing compatibility with engines like Slurm and AWS Batch and storage options like AWS S3. New tools, data, and resources: * Cloudgene 3 platform: Free platform available at cloudgene.io. * Cloudgene 3 source code: https://github.com/genepi/cloudgene3. 
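As a rough illustration of the bookkeeping pattern a platform like Cloudgene automates — submit a run, hand back an unguessable token (the "secret URL" idea), poll for status — here is a minimal sketch. The class and method names are hypothetical inventions for this example; Cloudgene's actual implementation and API differ.

```python
import subprocess
import sys
import uuid

class JobRegistry:
    """Minimal sketch of pipeline-as-a-service bookkeeping: each submitted
    run gets an unguessable token, which a web front end could expose as a
    secret status URL. (Class and method names here are hypothetical, not
    Cloudgene's API.)"""

    def __init__(self):
        self.jobs = {}

    def submit(self, cmd):
        """Launch a pipeline command and return its secret token."""
        token = uuid.uuid4().hex
        self.jobs[token] = subprocess.Popen(cmd)
        return token

    def status(self, token):
        """Report run state: 'running', 'succeeded', or 'failed'."""
        returncode = self.jobs[token].poll()
        if returncode is None:
            return "running"
        return "succeeded" if returncode == 0 else "failed"

registry = JobRegistry()
# A real deployment would launch something like a `nextflow run` command;
# a trivial Python one-liner stands in for the pipeline here.
token = registry.submit([sys.executable, "-c", "print('pipeline done')"])
registry.jobs[token].wait()
print(registry.status(token))  # prints "succeeded"
```

The token doubles as an access control mechanism: anyone holding the URL can check the run, but the URL itself is infeasible to guess, which is what makes per-task secret URLs practical for lightweight job monitoring.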
Comprehensive genome analysis and variant detection at scale using DRAGEN Paper: Behera, S., et al. Comprehensive genome analysis and variant detection at scale using DRAGEN. Nature Biotechnology, 2024. DOI: 10.1038/s41587-024-02382-1. DRAGEN was a godsend in a previous job. I needed a turnkey variant calling solution that was fast. I bought an on-prem DRAGEN FPGA server, which was capable of taking you from FASTQ files to VCF in ~30 minutes for a 30X human whole genome. Illumina has previously published white papers on DRAGEN’s speed and accuracy. The publication in Nature Biotechnology engendered some interesting discussion online. On one hand, the paper was a pleasure to read, and the benchmarks are compelling and well done. On the other, the method isn’t available to explore, reproduce, understand in detail, or build upon. Which raises the question — should this have been a peer-reviewed publication in the scientific record? Or should this just have been another white paper? At some point “papers” hawking some new and improved closed source method are thinly veiled advertisements stamped with the approval of peer review. I think there should be some place in the scientific literature for papers like this describing a closed-source method, but where benchmarks are independently evaluated by a team of peer reviewers. I just don’t know what that looks like in the current landscape of peer reviewed papers versus a vendor’s white paper. TL;DR: DRAGEN is a high-speed, highly accurate genomic analysis platform for variant detection, leveraging hardware acceleration, pangenome references, and machine learning. It outperforms traditional tools across variant types (SNVs, indels, SVs, CNVs, STRs) and is designed for large-scale, clinical genomics applications. Summary: This study presents DRAGEN, a platform that uses accelerated hardware and sophisticated algorithms to enable comprehensive variant detection at unprecedented speed and accuracy. 
By integrating pangenome references and optimizing for all major variant classes, DRAGEN achieves high concordance in identifying complex and diverse genomic variants, even in challenging regions. Benchmarking across 3,202 genomes from the 1000 Genomes Project highlights DRAGEN’s scalability and its advantages over traditional methods like GATK and DeepVariant, especially for clinically relevant genes. The platform’s robust performance across SNVs, SVs, CNVs, and STRs allows for large-cohort analyses critical for population-scale genomics and clinical diagnostics, facilitating variant discovery in diseases with both common and rare genetic underpinnings.

Methodological highlights:

* Uses pangenome references to enhance alignment accuracy and variant detection across diverse populations.
* Optimized for rapid, parallel processing of SNVs, indels, CNVs, and STRs with an average processing time of ~30 minutes per genome.
* Employs machine learning-based filtering to reduce false positives and improve accuracy in variant calling.
* Integration of ExpansionHunter for STR analysis and specialized callers for pharmacogenomic variants (e.g., CYP2D6, SMN) ensures reliable detection in medically significant genes.

Machine-guided design of cell-type-targeting cis-regulatory elements

Paper: Gosai, S. J., et al. Machine-guided design of cell-type-targeting cis-regulatory elements. Nature, 2024. DOI: 10.1038/s41586-024-08070-z.

TL;DR: This paper introduces a platform for designing synthetic cis-regulatory elements (CREs) with programmed cell-type specificity using a deep-learning-based model called Malinois, combined with a computational design tool, CODA, and massively parallel reporter assays (MPRAs) for validation.

Summary: This study presents a framework for designing synthetic CREs that drive gene expression specifically in desired cell types.
Using Malinois, a deep convolutional neural network trained on MPRA data from human cells, the researchers predict CRE activity and design synthetic elements targeting specific cell lines. The CODA (Computational Optimization of DNA Activity) platform then iteratively refines these designs to achieve high specificity, which is validated in vitro across multiple cell types and in vivo in mice and zebrafish. By outperforming natural CREs in specificity and robustness, these synthetic elements could significantly enhance targeted gene therapy approaches, especially by providing tools for precise gene expression control in therapeutic and research applications. The framework expands our capacity to engineer regulatory DNA for complex tissue-specific requirements, advancing possibilities for both biomedical research and gene therapy.

Methodological highlights:

* Malinois CNN model predicts cell-type-specific CRE activity directly from DNA sequences, validated with MPRA-based data in K562, HepG2, and SK-N-SH cells.
* CODA optimization platform iteratively adjusts CRE sequences to increase cell-type specificity, employing algorithms such as Fast SeqProp for efficient sequence design.
* High-throughput MPRA validates the activity of 77,157 synthetic and natural CRE sequences across cell types, showing superior specificity in synthetic CREs.

New tools, data, and resources:

* Code availability: https://github.com/sjgosai/boda2 (yes, this is the CODA repo, which is named “boda2” for “legacy reasons”).
* Data availability: All the data used in the study is described in the data availability section of the paper.

KiNext: a portable and scalable workflow for the identification and classification of protein kinases

Paper: Hellec, E., et al. KiNext: A Portable and Scalable Workflow for the Identification and Classification of Protein Kinases. BMC Bioinformatics, 2024. DOI: 10.1186/s12859-024-05953-w.
TL;DR: KiNext is a Nextflow-based pipeline for identifying and classifying protein kinases (kinome) from annotated genomes, enabling reproducible analysis and classification of kinase families across species.

Summary: Protein kinases are crucial for cellular signaling and adaptation, and identifying the full kinome of an organism can reveal insights into its physi

    13 min
  5. 12/05/2024

    Weekly Recap (Dec 2024, part 1)

https://blog.stephenturner.us/p/weekly-recap-dec-2024-part-1

This week’s recap highlights the WorkflowHub registry for computational workflows, building a virtual cell with AI, a review on bioinformatics methods for prioritizing causal genetic variants in candidate regions, a benchmarking study showing deep learning methods are best for variant calling in bacterial nanopore sequencing, and a new ML model from researchers at Genentech for predicting cell-type- and condition-specific gene expression across diverse biological contexts. Others that caught my attention include a new tool for applying rearrangement distances to enable plasmid epidemiology (pling), a commentary on ethical governance for genomic data science in the cloud, a method for filtering genomic deletions using CNNs (sv-channels), in silico generation of synthetic cancer genomes using generative AI, a new tool for evaluating how close an assembly is to T2T, a long-context RNA foundation model for predicting transcriptome architecture, open-source USEARCH 12, and the Dog10K database summarizing canine multi-omics. Audio generated with NotebookLM. Subscribe to Paired Ends (free) to get summaries like this delivered to your e-mail.

Deep dive

WorkflowHub: a registry for computational workflows

Paper: Ove Johan Ragnar Gustafsson et al., "WorkflowHub: a registry for computational workflows", 2024. arXiv. https://arxiv.org/abs/2410.06941.

Workflows in the life sciences are fragmented across multiple ecosystems, including nf-core (Nextflow), the Intergalactic Workflow Commission (Galaxy), the Snakemake catalog, Dockstore (mostly WDL), and generalist repositories like Zenodo, DataVerse, and GitHub. Seqera Pipelines is a curated collection of high-quality, open-source pipelines, and ONT maintains a collection of curated pipelines at EPI2ME Workflows (both are Nextflow only).
WorkflowHub is an attempt to make workflows more FAIR, and supports Snakemake, Galaxy, Nextflow, WDL, CWL, and “kindaworkflows” in Bash, Jupyter, Python, etc.

TL;DR: This paper introduces WorkflowHub.eu, a platform for registering, sharing, and citing computational workflows across multiple scientific disciplines. It promotes reproducibility and FAIR (Findable, Accessible, Interoperable, Reusable) workflows, supporting collaborations and workflow lifecycle management.

Summary: The paper presents WorkflowHub.eu, a community-driven platform designed to address the challenge of finding, sharing, and reusing computational workflows. The registry supports workflows from diverse domains and integrates with various platforms, enabling workflows to become more discoverable, reproducible, and accessible for both humans and machines. The importance of WorkflowHub lies in its ability to promote collaboration, assign credit, and make workflows scholarly artifacts through FAIR principles. The platform has already registered over 700 workflows from numerous research organizations, demonstrating its global impact. Applications of WorkflowHub extend across different fields, including bioinformatics, astronomy, and particle physics, where computational workflows are essential for processing large-scale data.

Highlights:

* Provides a unified registry that links to community repositories, enabling workflow lifecycle support and promoting FAIRness.
* Integrates with diverse workflow management systems (e.g., Nextflow, Snakemake) and services, enabling seamless sharing, citation, and execution of workflows.
* Offers metadata support, making workflows easily searchable and findable across different domains and workflow languages.
* WorkflowHub is connected to the LifeMonitor service (lifemonitor.eu), which allows workflow function and status to be reported to maintainers and users through regular automated tests driven by continuous integration.
According to the docs, “LifeMonitor supports the application of workflow sustainability best practices. Much of this revolves around following community-accepted conventions for your specific workflow type and implementing periodic workflow testing. Following conventions allows existing computational tools to understand your workflow and its metadata.”

* WorkflowHub: https://workflowhub.eu/

How to Build the Virtual Cell with Artificial Intelligence: Priorities and Opportunities

Paper: Bunne et al., "How to Build the Virtual Cell with Artificial Intelligence: Priorities and Opportunities," arXiv, 2024. https://arxiv.org/abs/2409.11654.

This paper from Stanford, Genentech, CZI, Arc Institute, Microsoft, Google, Calico, EMBL, Harvard, EvolutionaryScale, and many others proposes an ambitious vision of using AI to construct high-fidelity simulations of cells and systems that are directly learned from biological data.

TL;DR: This paper discusses the creation of an AI Virtual Cell, a comprehensive, data-driven model designed to simulate and predict cellular behaviors across different scales (molecular to multicellular) and contexts. It highlights the role of AI in building accurate cellular representations and guiding in silico experiments. AI virtual cells could be used to identify new drug targets, predict cellular responses to perturbations, and scale hypothesis exploration.

Summary: The paper outlines a vision for developing AI-powered Virtual Cells, models capable of simulating cellular functions and interactions across molecular, cellular, and multicellular scales. Leveraging advances in AI and omics technologies, the Virtual Cell could serve as a foundation model to predict cell behavior, simulate responses to perturbations, and guide biological research.
The importance of this work lies in its potential to revolutionize biological research by enabling high-fidelity, in silico experimentation, providing deeper insights into cellular mechanisms, and facilitating drug discovery and cell engineering. Applications range from identifying drug targets to predicting disease progression, offering a versatile tool for biologists and clinicians. The approach emphasizes collaboration across academia, AI, and biopharma industries to build this resource and ensure it becomes widely accessible.

Methodological highlights:

* Multi-scale modeling: The AI Virtual Cell captures interactions from molecules to tissues using AI techniques such as graph neural networks and transformers.
* In silico experimentation: Enables testing of cellular responses to various perturbations without the need for costly physical experiments.
* Universal Representations (UR): A framework that integrates multi-modal biological data to predict cellular behaviors across different contexts and species.

A bioinformatics toolbox to prioritize causal genetic variants in candidate regions

Paper: Martin Šimon et al., "A bioinformatics toolbox to prioritize causal genetic variants in candidate regions," Trends in Genetics, 2024. DOI: 10.1016/j.tig.2024.09.007.

TL;DR: Complex polygenic traits are influenced by quantitative trait loci (QTLs) with small effect sizes, and proving that a specific gene is causal and identifying the exact genetic variant(s) responsible for the QTL effect is difficult. This review introduces a bioinformatics toolbox for prioritizing causal genetic variants within QTLs using multiomics approaches.

Summary: The paper addresses the challenge of pinpointing causal genetic variants in complex traits and diseases, especially when dealing with polygenic traits controlled by multiple quantitative trait loci (QTLs). Despite advances in mapping these loci, determining causality remains complex.
The review proposes integrating bioinformatics and multiomics techniques to streamline the process of identifying and prioritizing candidate variants within QTLs. Using a case study of the Pla2g4e gene in mice, the authors illustrate how SNPs in regulatory elements can be filtered to focus on the most likely causal candidates. This approach reduces the experimental workload by narrowing down potential variants for functional validation. The work emphasizes the need for hierarchical filtering of SNPs and prioritizing those within known regulatory regions, which can accelerate genetic research and therapeutic discoveries.

Highlights:

* Introduces a hierarchical bioinformatics strategy to prioritize SNP-containing regulatory elements for identifying causal variants in QTLs.
* Multiomics integration (transcriptomics, epigenomics) is used to map regulatory elements and their associated SNPs, refining candidate lists for experimental validation.
* Emphasizes the use of SNPs in regulatory elements (promoters, enhancers, open chromatin) as primary filtering tools for narrowing down candidate variants.

Benchmarking reveals superiority of deep learning variant callers on bacterial nanopore sequence data

Paper: Hall et al., "Benchmarking reveals superiority of deep learning variant callers on bacterial nanopore sequence data," eLife, 2024. https://doi.org/10.7554/eLife.98300.

TL;DR: This study benchmarks the performance of deep learning and traditional variant callers on bacterial genomes sequenced using Oxford Nanopore Technologies (ONT) nanopore sequencing. It finds that deep learning-based tools, particularly Clair3 and DeepVariant, outperform conventional methods, even surpassing Illumina sequencing in some cases.

Summary: This paper evaluates the accuracy of SNP and indel variant callers on long-read bacterial genome data generated using Oxford Nanopore Technologies (ONT).
The authors compare deep learning-based variant callers (Clair3, DeepVariant) with traditional ones (BCFtools, FreeBayes, etc.) using bacterial samples sequenced with high-accuracy ONT models (R10.4.1). The results show that Clair3 and DeepVariant consistently provide higher precision and recall than traditional methods, even outperforming Illumina sequencing in some regions, especially with ONT's super-accuracy basecalling model. The paper emphasizes the practical application of ONT sequencing and deep lea

    28 min
  6. 11/22/2024

    Weekly Recap (Nov 2024, part 3)

Full recap: https://blog.stephenturner.us/p/weekly-recap-nov-2024-part-3

This week’s recap highlights pangenome graph construction with nf-core/pangenome, building pangenome graphs with PGGB, benchmarking algorithms for single-cell multi-omics prediction and integration, RNA foundation models, and a Nextflow pipeline for characterizing B cell receptor repertoires from non-targeted bulk RNA-seq data. Others that caught my attention include benchmarking generative models for antibody design, improved detection of methylation in ancient DNA, differential transcript expression with edgeR, a pipeline for processing xenograft reads from spatial transcriptomics (Xenomake), public RNA-seq datasets and human genetic diversity, a review on bioinformatics approaches to prioritizing causal genetic variants in candidate regions, quantifying constraint in the human mitochondrial genome, a review on sketching with minimizers in genomics, and analysis of outbreak genomic data using split k-mer analysis.

Deep dive

Cluster-efficient pangenome graph construction with nf-core/pangenome

Paper: Heumos, S. et al. Cluster-efficient pangenome graph construction with nf-core/pangenome. Bioinformatics, 2024. DOI: 10.1093/bioinformatics/btae609.

Benchmarking in bioinformatics typically involves some measure of accuracy (precision, recall, F1 score, MCC, ROC, etc.) and compute requirements (CPU time, peak RAM usage, etc.). A metric I’ve been seeing more recently is the carbon footprint of a particular bioinformatics analysis. In the benchmarks performed here (detailed below), the authors calculated the CO2 equivalent (CO2e) emissions for running both nf-core/pangenome and another commonly used tool, showing that nf-core/pangenome took half the time for an analysis without increasing CO2e. It looks like the authors are using the nf-co2footprint plugin to do this.
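The accuracy metrics mentioned above all fall out of a confusion matrix. As a quick refresher (illustrative only, not code from any of the papers discussed here), here's how precision, recall, F1, and MCC are computed from true/false positive and negative counts:

```python
# Illustrative sketch: common benchmarking accuracy metrics computed
# from confusion-matrix counts (TP, FP, TN, FN).
import math

def precision(tp, fp):
    # Of the calls made, what fraction were correct?
    return tp / (tp + fp)

def recall(tp, fn):
    # Of the true events, what fraction were found?
    return tp / (tp + fn)

def f1(tp, fp, fn):
    # Harmonic mean of precision and recall
    p, r = precision(tp, fp), recall(tp, fn)
    return 2 * p * r / (p + r)

def mcc(tp, fp, tn, fn):
    # Matthews correlation coefficient: balanced even when classes are skewed
    num = tp * tn - fp * fn
    den = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return num / den if den else 0.0

# Toy variant-calling example: 90 true calls, 10 false positives,
# 5 missed variants, 895 true negatives
tp, fp, tn, fn = 90, 10, 895, 5
```

Note that precision and recall ignore true negatives entirely, which is why MCC is often preferred when the negative class dominates (as it does in variant calling, where most positions are non-variant).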
TL;DR: This paper introduces nf-core/pangenome, a Nextflow-based pipeline for constructing reference-unbiased pangenome graphs, offering improved scalability and computational efficiency compared to existing tools like the PanGenome Graph Builder (PGGB is highlighted next in this post!).

Summary: The nf-core/pangenome pipeline offers a scalable and efficient method for building pangenome graphs by distributing computations across multiple cluster nodes, overcoming the limitations of PGGB, a widely used tool in the field. Pangenome graphs model the collective genomic content across populations, reducing biases associated with traditional reference-based approaches. This work showcases the pipeline’s power by constructing a graph for 1000 chromosome 19 human haplotypes in just three days and processing over 2000 E. coli sequences in ten days—tasks that would take PGGB much longer or fail due to computational limitations. The nf-core/pangenome pipeline emphasizes portability and seamless deployment in high-performance computing (HPC) environments using biocontainers. With these features, it enables population-scale genomic analyses for various organisms, supporting biodiversity and personalized genomics research.

Methodological highlights:

* Uses Nextflow for efficient workflow management and resource distribution, ensuring parallel processing and modular flexibility.
* Avoids reference biases by aligning each sequence against all others with WFMASH, followed by graph induction with SEQWISH and graph simplification with SMOOTHXG.
* The pipeline integrates ODGI for quality control and MultiQC for generating summary reports, ensuring comprehensive analyses.

New tools, data, and resources:

* GitHub repository: https://github.com/nf-core/pangenome.
* Documentation/tests: https://nf-co.re/pangenome.
* Code for the paper: https://github.com/subwaystation/pangenome-paper.
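The all-against-all alignment step is why these pipelines need careful engineering: the number of pairwise comparisons grows quadratically with the number of input sequences, and random sparsification (which PGGB also uses, as discussed below) keeps only a fraction of the pairs. A toy Python sketch of that scaling argument, not the actual WFMASH implementation:

```python
# Toy sketch: all-vs-all alignment requires n*(n-1)/2 pairwise
# comparisons; random sparsification keeps only a fraction of them.
# Illustrative only -- not how WFMASH actually sparsifies mappings.
import itertools
import random

def all_pairs(n):
    # Every unordered pair of sequence indices
    return list(itertools.combinations(range(n), 2))

def sparsify(pairs, keep_fraction, seed=42):
    # Keep each pair independently with probability keep_fraction
    rng = random.Random(seed)
    return [p for p in pairs if rng.random() < keep_fraction]

pairs = all_pairs(1000)        # 499,500 alignments for 1,000 haplotypes
subset = sparsify(pairs, 0.1)  # roughly a tenth of the work
```

Doubling the number of haplotypes roughly quadruples the alignment work, which is why distributing those pairs across cluster nodes (nf-core/pangenome) or thinning them (sparsification) matters at population scale.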
Here’s a talk from last year’s Nextflow summit where Simon Heumos (lead author on this paper) talks about the workflow in detail.

Building pangenome graphs

Paper: Garrison, E. et al. Building pangenome graphs. Nature Methods, 2024. DOI: 10.1038/s41592-024-02430-3 (read free: https://rdcu.be/dXDTo)

The benchmarking paper above discusses nf-core/pangenome in contrast to PGGB, the subject of this paper. This paper was originally published as a preprint in April 2023, and this updated version contains new experimental data. The authors of this paper and the previous nf-core/pangenome paper overlap substantially.

TL;DR: This paper introduces PanGenome Graph Builder (PGGB), a reference-free tool that constructs unbiased pangenome graphs to capture both small and large-scale genetic variations. It avoids reference bias by using all-to-all alignments and provides scalable, lossless representations of genomic data.

Summary: PGGB addresses limitations in traditional genome graph tools, which often rely on a single reference genome, leading to biases and loss of complex variation. The pipeline performs unbiased, reference-free alignments of multiple genomes using the WFMASH tool, followed by graph construction with SEQWISH and graph simplification with SMOOTHXG. This modular approach captures SNPs, structural variants, and large sequence differences across multiple genomes in a unified framework. The study demonstrates PGGB’s ability to scale efficiently, building complex pangenome graphs for datasets such as human chromosome 6 and primate assemblies. PGGB is validated against existing tools, showing superior performance in accurately representing small and structural variants. Its output facilitates downstream analyses such as phylogenetics, population genetics, and comparative genomics, supporting large-scale projects like the Human Pangenome Reference Consortium (HPRC).
Methodological highlights:

* Reference-free alignment: Uses WFMASH for all-to-all sequence alignment, enabling unbiased graph construction.
* Graph induction and normalization: Constructs graphs with SEQWISH and smooths complex motifs with SMOOTHXG, improving downstream compatibility.
* Sparsified alignment approach: Implements random sparsification to reduce computational costs while maintaining accurate genome relationships.

New tools, data, and resources:

* GitHub repository: https://github.com/pangenome/pggb.
* Data: Example pangenomes and validation datasets available at https://doi.org/10.5281/zenodo.7937947.
* Documentation: https://pggb.readthedocs.io.

Benchmarking algorithms for single-cell multi-omics prediction and integration

Paper: Hu, Y. et al. Benchmarking algorithms for single-cell multi-omics prediction and integration. Nature Methods, 2024. DOI: 10.1038/s41592-024-02429-w. (Read free: https://rdcu.be/dW01n).

The idea behind integration of single-cell data is to combine multiple types of single-cell omics data (genomics, transcriptomics, epigenomics, etc.) to get a more complete understanding of individual cell states. An example: maybe you use Seurat to map scRNA-seq data onto something like scATAC-seq obtained from the same tissue to identify nearest-neighbor cells for a given cell across data types, and use the mapping to predict protein abundance or chromatin accessibility. This paper benchmarks many different integration approaches, making the distinction between vertical integration (different modalities), horizontal integration (batch correction across datasets), and mosaic integration (multi-omic datasets sharing one type of omics data).

TL;DR: This study benchmarks 14 prediction algorithms and 18 integration algorithms for single-cell multi-omics, highlighting top performers such as totalVI, scArches, LS_Lab, and UINMF. It also provides a framework for selecting optimal algorithms based on specific prediction and integration tasks.
Summary: Single-cell multi-omics technologies enable simultaneous profiling of RNA expression, protein abundance, and chromatin accessibility. This paper evaluates 14 algorithms that predict protein abundance or chromatin accessibility from scRNA-seq data and 18 algorithms for multi-omics integration. totalVI and scArches consistently excel in protein abundance prediction, while LS_Lab demonstrates superior performance in predicting chromatin accessibility. For multi-omics integration tasks, Seurat and MOJITOO lead in vertical integration, while UINMF and totalVI excel in horizontal and mosaic integration scenarios, respectively. The study uses 47 datasets across diverse tissues and experimental setups to provide robust recommendations for algorithm selection in different single-cell workflows.

Methodological highlights:

* Two evaluation scenarios: Intra-dataset (same dataset split into train/test) and inter-dataset (train on one dataset, test on another).
* Six performance metrics: cell–cell Pearson correlation coefficient (PCC), protein–protein PCC, cell–cell correlation matrix distance (CMD), protein–protein CMD, AUROC, and RMSE.
* Evaluation framework: Compares prediction and integration algorithms across single-cell RNA + protein and RNA + chromatin datasets.
* Top-performing tools: No single algorithm consistently outperformed the others in every metric and dataset. Generally, performance was better for:
  * totalVI and scArches for protein abundance prediction.
  * LS_Lab for chromatin accessibility prediction.
  * Seurat and MOJITOO for merging RNA expression with protein abundance.
  * UINMF for integrating batches of scRNA+scATAC data.

New tools, data, and resources:

* Benchmarking pipeline: Available on GitHub at https://github.com/QuKunLab/MultiomeBenchmarking.
* Datasets: 47 multi-omics datasets (CITE-seq, REAP-seq, SNARE-seq, 10x Multiome) were used to benchmark algorithm performance. Links to relevant resources provided in the Data availability section.
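Two of the six metrics above, Pearson correlation (PCC) and RMSE, are simple enough to sketch in a few lines. This is purely illustrative, in pure Python rather than the study's own benchmarking pipeline:

```python
# Illustrative sketch: Pearson correlation (PCC) and RMSE between
# predicted and measured values, as used in prediction benchmarks.
import math

def pcc(x, y):
    # Pearson correlation: covariance normalized by both standard deviations
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

def rmse(x, y):
    # Root-mean-square error between paired observations
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)) / len(x))

# Toy example: measured vs. predicted protein abundance for five cells
measured  = [1.0, 2.0, 3.0, 4.0, 5.0]
predicted = [1.1, 1.9, 3.2, 3.8, 5.1]
```

The two metrics answer different questions: PCC rewards getting the rank/trend right regardless of scale, while RMSE penalizes absolute error, so a predictor can score well on one and poorly on the other.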
Orthrus: Towards Evolutionary and Functional RNA Foundation Models

Paper: Fradkin, P. et al. Orthrus: Towards Evolutionary and Functional RNA Foundation Models. bioRxiv, 2024. DOI: 10.1101/2024.10.10.617658.

I discovered this from the Bits in Bio newsletter (which I highly recommend subscribing to). There are plenty of protein

    17 min
  7. 10/14/2024

    Inciteful+Zotero to find relevant literature

https://blog.stephenturner.us/p/inciteful-zotero-biologpt-semantic-scholar

I am in the middle of writing a review / perspectives paper, one that I’m confident will be exciting once we get it published. Some sections of the review cover subject matter at the outer periphery of my expertise. These are areas where I don’t have as strong a command of the relevant literature as my collaborators do. In my Zotero library I have a collection of a few relevant papers in these areas published by my co-authors, but I needed a way to quickly find other relevant papers in this field based on the small collection of papers I already have. Inciteful + Zotero was a perfect combination. I also found BioloGPT and Semantic Scholar to be useful for other related tasks.

Inciteful

A colleague introduced me to Inciteful (https://inciteful.xyz/), and at first glance it seemed to fit the bill perfectly. And, it’s free (really free, not freemium free, not free for now free — truly free, with no sign-up required). What does Inciteful do and why would you use it? From the general documentation:

[Inciteful] builds a network of papers from citations, uses network analysis algorithms to analyze the network, and gives you the information you need to quickly get up to speed on that topic. You can find the most similar papers, important papers as well as prolific authors and institutions.

And from the Use Cases documentation:

Getting Familiar With a Body of Literature

The first and most basic is familiarizing yourself with a body of literature. This happens all the time: you become interested in a topic that is not directly related to your current work, and it’s tough to get a handle on the current state of that topic. Inciteful makes that easy.

Finding Literature for a Paper in Progress

As you are writing a paper it’s good to periodically check to be sure that what you are writing addresses the most recent literature in the topic.
Rounding Out a Literature Review

Literature reviews tend to start with keyword-based searches using your academic search engine of choice. You often end up using complicated search strategies… Even after all these complicated searches, you have no way of knowing if anything slipped through the cracks. This is where Inciteful shines.

You can go to inciteful.xyz and start building out a collection of papers by entering DOIs, PMIDs, arXiv URLs, etc., but I found the most effective way to do this was to seed my Inciteful search from a collection of papers I already have in my Zotero library.

Using Inciteful with Zotero

I’ve been using Zotero for reference management since the 2000s, when it was initially only a Firefox browser extension. I used Mendeley for a period until it was acquired by Elsevier, then switched back to Zotero. Zotero is the only reference manager I’m aware of that works with MS Word and Google Docs, and integrates seamlessly with RStudio to insert BibTeX citations in RMarkdown/Quarto. The Inciteful plugin for Zotero was clutch here. It allows you to highlight papers in your Zotero library, right-click, and start a graph search, right from Zotero. Let’s take a look.

Demo

The literature review I’m writing isn’t a review on gene editing, but I work with a lot of brilliant genome engineers at Colossal. When I first got started here I read all the literature in this area I could get my hands on, because I came to Colossal with little background in synthetic biology. This demo uses a subset of papers in my CRISPR / genome engineering collection in Zotero. First, highlight all the papers you want to seed your search with. With the Inciteful plugin installed, right-click, and start a graph search. This opens Inciteful in your browser, and the first thing you’ll see is a citation graph. It’s a nice visual, but I personally never find these kinds of graphs all that useful. The real benefit comes from the tables below.
The first table is a list of similar papers, which tend to cite the papers I used as input from my Zotero library. Next you can see a list of the most important papers, by PageRank. Some of these I already have in my library. Others I don’t, but I can add these to the existing search with the “+” sign. An obvious missing paper here was the landmark 2012 paper from Emmanuelle Charpentier’s lab. Next, I can get a list of papers that cite the largest number of papers I have in my collection, which are likely review papers. Finally, there are four sections that aren’t immediately useful for expanding my literature search. I’m glad the developer included these, because it’s interesting to see things like the top authors, institutions, journals, etc. The “Upcoming Authors” section was interesting to me. How are these found? See the little “SQL” button at the bottom of each panel? Clicking it, you can actually see the SQL code that’s running behind the scenes. And you can modify and re-execute it! Here’s the SQL for the Upcoming Authors section.

Other tools

Inciteful isn’t the only tool that occupies this space. I spent a little time with BioloGPT and Semantic Scholar. Both of these are more AI-forward than Inciteful.

BioloGPT

BioloGPT (biologpt.com) is an interesting one. It’s less of a research discovery tool and more of a biology-focused AI research tool. That is, you start from research questions rather than from a stack of papers you already have. From the documentation:

BioloGPT is engineered to be a highly-detailed, evidence-based, and skeptical AI committed to truth-seeking and answering biology questions as accurately as possible. It rigorously cites all used papers to ensure reliability, and can even generate novel hypotheses, code, art, and experiments.
By citing relevant data and maintaining a critical, empirical stance, BioloGPT directly counters potential research biases such as positive result bias, framing bias, ideologies, censorship, scientific corruption, and industry influence.

I asked about current best practices for analyzing single-cell ATAC-seq data (link). You get back a short summary answer, followed by a longer answer. At first this might seem like something you can get out of ChatGPT or other tools with a recent knowledge cutoff. The first place BioloGPT differs from a generic AI chatbot is that assertions are backed by citations, and hovering over them gives you a preview of the paper, a very short summary, citation counts, and an evidence assertion. BioloGPT then provides a code snippet. I’ve never actually used scanpy so I can’t verify the accuracy of this code, but it at least points you in the right direction of tools to take a look at, if nothing else. Here’s where BioloGPT helps with literature discovery to some degree. Next you’ll see a list of top search results, showing you recent literature relevant to your query. The next section presents potential hypotheses, a hypothesis graveyard, and potential experiments. This might be a little speculative, but I think the idea could help guide your research into areas you might not have explored, especially if it’s an area you’re not already intimately familiar with. Another interesting feature of BioloGPT is its ability to create plots based on an input query. The example query “Graph of CD4 expression across all immune cell types” produces the following result. The interactive graphic it produces is made with Plotly, and BioloGPT provides the sources it used to create the plot. Finally, BioloGPT saves your queries, and in your account page you can fill in additional areas of interest. With this you’ll get a weekly roundup email pointing out new literature relevant to the questions you’ve asked and the topics you claim interest in.
Semantic Scholar

Semantic Scholar (semanticscholar.org) is a free AI-driven search and discovery tool. You start by searching for a subject or paper. Each paper has an AI-generated TLDR, with some citation information available on the side. Once you find a paper, you can save it to your Semantic Scholar library, organizing papers into different collections. Once you do, you can opt in to regular emails with newly published research that's highly related to papers in one or more of your collections. This is the feature I love most about Semantic Scholar. Where Inciteful has no login or session persistence, Semantic Scholar helps you stay on top of recently published literature related to the literature you've already collected.

Others I didn't try

There are plenty of other tools that occupy this space. I haven't had a chance yet to use Elicit (elicit.com) or Research Rabbit (researchrabbit.ai). There's also PaperQA2 ("superhuman scientific literature search"); see the blog post and GitHub page (Apache license). And there's Google's NotebookLM (https://notebooklm.google/), which is more of a research assistant tool than a literature discovery tool. I imagine we'll see many more AI-driven research support tools like this in the near future.

This is a public episode. If you would like to discuss this with other subscribers or get access to bonus episodes, visit blog.stephenturner.us

    12 min
  8. 09/16/2024

    Illuminate preprints with an AI-generated podcast discussion

Full post at https://blog.stephenturner.us/p/illuminate-preprints-with-an-ai-generated-podcast

Google has a new experimental tool called Illuminate (illuminate.google.com) that takes a link to a preprint and creates a podcast discussing the paper. When I tested this with a few preprints, the podcasts it generated were about 6-8 minutes long, featuring a male and a female voice discussing the key points of the paper in a conversational style.

There are some obvious shortcomings. It doesn't know how to pronounce words that aren't real words (for example, bioRxiv, or Heng Li's new Ropebwt3), and like many text-based genAI tools, it overuses the word "delve." And when I gave it my recent paper describing biorecap (blog post, paper), it delved into a discussion of generative AI ethics that I never wrote about in the paper. But aside from these few quirks, I actually enjoyed listening to the audio it produced.

I used Illuminate to generate podcasts discussing a few preprints on arXiv quantitative biology that caught my attention lately, or in the case of biorecap and pracpac, those that I authored. The full podcast at the top of this post has all six of these preprints together, timestamped with chapters if you're listening in a podcast app. Alternatively, you can listen to each individual paper below.

biorecap: an R package for summarizing bioRxiv preprints with a local LLM (https://arxiv.org/abs/2408.11707)
BWT construction and search at the terabase scale (https://arxiv.org/abs/2409.00613)
Genomic Language Models: Opportunities and Challenges (https://arxiv.org/abs/2407.11435)
Near to Mid-term Risks and Opportunities of Open-Source Generative AI (https://arxiv.org/abs/2404.17047)
Guidelines for releasing a variant effect predictor (https://arxiv.org/abs/2404.10807)
pracpac: Practical R Packaging with Docker (https://arxiv.org/abs/2303.07876)

    43 min
