Genomics Revolution

Brad Goodner

Podcast associated with Hiram College Genetics course. Focus is on the history of genomics and how a genomic view of life has impacted basic science as well as applied fields such as medicine and agriculture.

  1. EPISODE 1

    An Introduction and Some Definitions

    Welcome to the podcast Genomics Revolution. I am Brad Goodner, Professor of Biology at Hiram College. This is a podcast about the biggest explosion in biological knowledge in human history, and it has been happening all around us over the last 30 years. The Genomics Revolution is based on 1) some basic knowledge about DNA – how it is structured and faithfully replicated, 2) an ambitious goal to fully understand the complete genetic basis of human biology, 3) new ways to store, collate, and compare incredibly large data sets, and 4) lots of determined biologists, chemists, computer scientists, and statisticians working together in new collaborations that have smashed holes in academic disciplines and forged new interdisciplinary/multidisciplinary academic departments and biotechnology companies. Genomics has transformed, and will continue to transform, all aspects of biological understanding from individual cells to organisms to communities and ecosystems. Genomics has also begun to dramatically change medicine, forensics, agriculture, and even anthropology and paleontology. Yet what is genomics? It is, simply put, the study of a genome, meaning all the DNA of a cell or organism of interest. You and I have grown up knowing what DNA is, both as the hereditary material passed from one generation to the next and as a specific biological polymer – a double helix of strands each made up of units we know by their shorthand names – A’s, C’s, G’s, and T’s, linked together in a strand by strong bonds while the two strands interact to form the double helix using relatively weak bonds. You and I have also grown up knowing what a gene is – a stretch of DNA sequence that tells a cell to do something. By “do something”, I mean that the DNA sequence of a gene tells a cell to make a copy of that DNA as RNA, a shorter-lived single-stranded polymer where T’s are replaced by U’s.
In most cases, the RNA made from a gene carries a code for making a very different polymer – a linear protein sequence made up of amino acid residues linked together. Proteins do most of the work of cells, but genes tell a cell what proteins to make. A genome is composed of all the genes of a given cell or organism along with some DNA that does not act as genes. That extra DNA is also important in other ways that we will discuss in a later episode. So DNA makes up a gene and all the genes in a cell plus some extra DNA equals a genome. That is a lot of information, usually 400,000 base pairs of DNA or more in the genomes of cellular organisms, and it can range up into the billions of base pairs for organisms such as we humans or corn plants. In future episodes, we will see why scientists wanted all of that information, how they figured out the most efficient ways to obtain the information, and how they came up with different ways to analyze the information. We will also start to talk about specific genomes, why we care about them, what they teach us about how genomes and organisms work, and how we might use that information to solve problems in healthcare, agriculture, or bio-energy. We will also see how we can add layers of additional information onto a given genome. For example, all of the RNAs made from a genome through the cellular process of transcription are called the transcriptome, while all of the proteins made based on the genes in a genome are called the proteome. There are many other “-omes” out there to be explored. Finally, we will consider the combined genomes of many organisms, typically microbes, that live in a particular habitat or community. Such metagenomes, literally meaning “above one genome”, have revealed new organisms we could not identify earlier because we could never grow them. All of this sounds complex, but we will break it down into understandable 10-20 minute segments.
I hope you keep listening to future episodes of Genomics Revolution. Talk to you again soon.

    5 min
  2. EPISODE 2

    Genetics Before Genomes

    I am Brad Goodner. Welcome back to Genomics Revolution. To fully understand the impact of having an organism’s complete DNA sequence, its genome, we need to put it into the proper context set by the previous 150 years. Genetics as an experimental science got its start in the middle of the 19th century with Mendel’s inheritance trials on pea plant phenotypes and with Miescher’s biochemical isolation of nuclein, what we now call DNA. Mendel’s ideas on the rules of inheritance in sexually reproducing eukaryotes were generalized into the concept of a gene as a definable unit of genetic information controlling a particular phenotype in the early 20th century, before DNA was confirmed as the genetic material. The work of Beadle and Tatum cemented this concept into “one gene encodes one protein which catalyzes one particular biochemical reaction, typically one step in a biochemical pathway.” For most of the 20th century, scientists studied one gene at a time. Their typical approach involved isolating mutants – individual organisms with one or more mutations in a gene of interest that had a noticeable impact on a particular organismal phenotype. Mutations are nothing more than changes in a DNA sequence, but we didn’t have ways to determine a DNA sequence until the 1960’s. Scientists figured out that changes in a DNA sequence can potentially change the sequence of amino acid residues in a protein encoded by that DNA sequence. By the time I was in high school in the late 1970’s, two groups had worked out methods that allowed labs all over the world to sequence DNA routinely. One method, the Maxam-Gilbert chemical method, started with a DNA strand labeled at one end with a radioactive phosphorus in the 5’ phosphate group. Four tubes containing large amounts of the labeled DNA strand are each exposed to different chemical conditions that lead to breaks in a DNA strand at specific nucleotide residues. In one tube, breaks occurred at purine nucleotide residues.
Remember that A and G are the bases in purine nucleotides. In another tube, breaks occurred only at G residues. In a 3rd tube, breaks occurred at pyrimidine, C or T, nucleotide residues. In a 4th tube, breaks occurred only at C residues. Imagine a DNA strand 24 nucleotide residues long with A, C, G, and T residues alternating, ACGTACGTACGT… and so on. In the first tube, breaks will be induced at A or G purines. Some of the DNA strands will be broken at position 1, others at position 3, others at position 5, others at position 7, and so on. In the second tube where breaks only occur at G residues, some of the strands will be broken at position 3, others at 7, and so on at a 4 base interval. In the 3rd tube, breaks will occur at C or T pyrimidines. Some DNA strands will be broken at position 2, others at position 4, and so on. In the 4th tube, breaks will occur at C residues – some at position 2, others at position 6, and so on at a 4 base interval. If we run the contents of each tube through a jello-like sieving matrix that separates DNA molecules on the basis of size, the smallest DNA fragments will run fastest. Remember our starting DNA strand, a 24-mer ACGTACGTACGT… The fragment breaking at position 1 will run the fastest and will only show up in the tube 1 lane. The fragment breaking at position 2 will be next but it will show up in both the tube 3 lane and the tube 4 lane. Reading up the gel from the fastest fragment to the slowest, and noting which lanes each fragment shows up in, gives the sequence one base at a time. You could usually read 100-200 bases of sequence from one gel run. The downsides were that there were several reactions to carry out and lots of radioactivity involved that no one wanted to be exposed to. In 1979, Sutcliffe published the complete DNA sequence of one of the earliest recombinant DNA molecules, the cloning plasmid pBR322. Plasmids are nonessential extra DNA molecules, usually circles, found in Bacteria, Archaea, and some Eucarya. The plasmid pBR322 is a man-made recombinant molecule, built from several natural DNA pieces. Sutcliffe sequenced pBR322 using the Maxam-Gilbert chemical method.
Its 4362-base-pair sequence was one of the first DNA sequences I worked with when I started graduate school in 1983. The next year, I went to my first research conference where I heard Richard Barker give a talk about the sequence of a key piece of DNA. The T-DNA or transferred DNA is a piece of bacterial DNA involved in the plant disease crown gall. Barker had almost single-handedly used the Maxam-Gilbert chemical method to determine a sequence of 24,595 nucleotide residues that encoded 14 or more proteins. This was a tremendous feat at that time. My fellow grad students and I were awed, but we also fearfully joked that we hoped Barker had already sired his children because of all the radioactivity and nasty chemicals involved. Because of these risks, the Maxam-Gilbert chemical method lost out over time to an enzymatic method of DNA sequencing perfected by Fred Sanger and colleagues. Sanger’s group built their enzymatic method around the way that cells naturally make new DNA strands by using the enzyme DNA Polymerase. This enzyme needs three components to build a new DNA strand. One, an old single strand of DNA is needed as a template. The template strand is complementary to the new strand that will be made. By that I mean, A’s on the template strand will interact with T’s on the new strand and vice versa. G’s on the template strand will interact with C’s on the new strand and vice versa. Two, DNA Polymerase cannot start a new DNA strand from scratch; rather, it has to add onto a pre-existing piece of single-stranded RNA in the cell or a piece of single-stranded DNA in the test tube. This starting piece is called a sequencing primer. By choosing the right sequencing primer, one can determine the sequence at different places along a large DNA strand. Three, DNA Polymerase catalyzes the formation of the new DNA strand using deoxyribonucleotides, the monomer subunits in a DNA strand polymer.
In the Sanger enzymatic method, 4 tubes are set up with the same template DNA strand, the same starting complementary sequencing primer, and all 4 deoxyribonucleotides. The sequencing primer carries a radioactive label on one end. In the first tube, a little bit of a modified A nucleotide is added. The modified A is different in that once it is added onto a growing DNA strand, no more nucleotides can be added to it. So in this tube, the new DNA strands will each end with a modified A residue, but since the modified A is rare, the termination of DNA strand growth will be rare and random in terms of which A is the termination point. In the second tube, a little bit of a modified C nucleotide is added, with a little bit of modified G in tube 3 and a little bit of modified T in tube 4. Similar to the Maxam-Gilbert chemical method, the final labeled DNA strands in each tube are separated by size by running them through a jello-like sieving matrix called a sequencing gel. If we consider the same 24-base long DNA strand as before, ACGTACGTACGT…, then the lane on the sequencing gel using the results from tube 1 will show labeled fragments of the sequencing primer plus 1, plus 5, plus 9, plus 13, plus 17, and plus 21 in size. The lane for results from tube 2 will show fragments of the sequencing primer plus 2, plus 6, plus 10, plus 14, plus 18, and plus 22 in size. Eventually, the Sanger method became the method of choice for automated DNA sequencing machines and the radioactivity involved was replaced with four different fluorescent tags added to the four modified DNA terminating nucleotides. A laser at the end of the sequencing gel excites the fluorescent tag on each DNA fragment as it exits and the resulting fluorescence color tells us which nucleotide was at the end of the fragment. In 1978, Sanger and coworkers published a paper reporting the complete sequence of the DNA virus phiX174, the first viral genome sequenced. The viral DNA of 5386 nucleotide residues encodes 10 proteins.
In 1982, the U.S. National Institutes of Health, NIH, established a ...
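
The lane patterns described in this episode for the 24-base example strand can be reproduced with a short sketch. This is only an illustration under simplifying assumptions: every eligible position is assumed to show up as a band, the Maxam-Gilbert output lists break positions rather than exact fragment lengths, and for the Sanger case the ACGT… strand is treated as the newly made strand so that sizes are "primer plus n". The names `lane_positions` and `read_sanger_gel` are hypothetical helpers, not part of any sequencing software.

```python
# Illustrative sketch only: reproduces the lane patterns for the 24-mer example.
STRAND = "ACGT" * 6  # ACGTACGTACGT... 24 bases long

# Maxam-Gilbert: bases at which each tube's chemistry breaks the strand.
MAXAM_GILBERT_TUBES = {"tube 1": "AG", "tube 2": "G", "tube 3": "CT", "tube 4": "C"}

# Sanger: the single chain-terminating (modified) nucleotide added to each tube.
SANGER_TUBES = {"tube 1": "A", "tube 2": "C", "tube 3": "G", "tube 4": "T"}


def lane_positions(strand, tubes):
    """Return, for each tube, the 1-based positions that would give a band."""
    return {
        tube: [pos for pos, base in enumerate(strand, start=1) if base in bases]
        for tube, bases in tubes.items()
    }


def read_sanger_gel(lanes, tubes):
    """Read a Sanger gel from the smallest fragment to the largest.

    Each fragment size appears in exactly one lane, and that lane's
    terminating nucleotide is the base at that position.
    """
    calls = {}
    for tube, positions in lanes.items():
        for pos in positions:
            calls[pos] = tubes[tube]
    return "".join(calls[pos] for pos in sorted(calls))


if __name__ == "__main__":
    for tube, positions in lane_positions(STRAND, MAXAM_GILBERT_TUBES).items():
        print("Maxam-Gilbert", tube, positions)
    sanger_lanes = lane_positions(STRAND, SANGER_TUBES)
    for tube, positions in sanger_lanes.items():
        print("Sanger", tube, ["primer+%d" % p for p in positions])
    print("Sequence read from the Sanger gel:", read_sanger_gel(sanger_lanes, SANGER_TUBES))
```

Reading the Sanger lanes back in order of fragment size recovers the starting sequence, which is the whole point of running the four tubes side by side.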

    11 min
  3. EPISODE 3

    The Power of an Idea - The Human Genome Project

    Welcome back. I am Brad Goodner. While genetics as a scientific discipline did not need DNA sequences to get started, it certainly progressed at a much faster clip once one could see what genes actually looked like and determine how genes change due to different mutations. We discussed the Maxam-Gilbert chemical and Sanger enzymatic methods for sequencing DNA strands in our last episode. Their impact was so immediate that Gilbert and Sanger shared ½ of a Nobel Prize just a few short years later. Amazingly, it was Sanger’s second! The other ½ of that Nobel went to Paul Berg who led efforts in the early 1970’s to develop the methods that we now call recombinant DNA technology or DNA cloning. These techniques involved cutting DNA sequences at specific sites using bacterial enzymes called restriction endonucleases. The name restriction comes from the way these enzymes restrict the growth of viruses in bacteria by cutting up the invading viral DNA, and endonucleases are enzymes that cut nucleic acids, in this case double-stranded DNA, within the molecule as opposed to at a free end. For example, the restriction endonuclease BamHI always cuts the DNA sequence 5’-GGATCC-3’. Notice that the complementary strand of DNA is also 5’-GGATCC-3’ just running right to left instead of left to right. Such sequences within a double-stranded DNA molecule are called palindromic sequences. By cutting DNA molecules with different restriction endonucleases and figuring out the sizes of the resulting DNA fragments, scientists could come up with physical maps of a DNA molecule. While they didn’t know from this data alone the complete sequence of the DNA molecule, it was a starting point based on some of the sequence information. By analogy, it was like knowing the layout of streets in a town without knowing every house on every street. In addition to enzymes that cut DNA, recombinant DNA cloning also involved enzymes that could sew DNA fragments back together. The medical term for sewing back up is ligation and these enzymes are called DNA ligases.
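
The BamHI example can be checked with a few lines of code. The sketch below tests whether a site reads the same on both strands and lists the fragment sizes from cutting a linear DNA molecule, which is the raw data behind a physical map. The test sequence is made up for illustration, the helper names are hypothetical, and the cut position one base into the site (between the two G’s) is passed in as a parameter.

```python
# Illustrative sketch only: palindrome check and fragment sizes for a digest.
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Return the complementary strand, written 5' to 3'."""
    return seq.translate(COMPLEMENT)[::-1]


def is_palindromic(site):
    """A site is palindromic if both strands read the same 5' to 3'."""
    return site == reverse_complement(site)


def digest_linear(dna, site, cut_offset):
    """Cut a linear DNA at every copy of `site`; return fragment lengths.

    `cut_offset` is how many bases into the site the cut falls
    (1 for a cut between the first and second base of the site).
    """
    cut_points = []
    start = dna.find(site)
    while start != -1:
        cut_points.append(start + cut_offset)
        start = dna.find(site, start + 1)
    edges = [0] + cut_points + [len(dna)]
    return [right - left for left, right in zip(edges, edges[1:])]


if __name__ == "__main__":
    bamhi_site = "GGATCC"
    # Made-up 21-base test sequence with two BamHI sites.
    test_dna = "AAA" + bamhi_site + "TTTT" + bamhi_site + "AA"
    print(is_palindromic(bamhi_site))
    print(digest_linear(test_dna, bamhi_site, cut_offset=1))
```

Comparing the fragment sizes from single and combined digests with different enzymes is what lets one place the cut sites relative to each other on a map.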
This growing physical mapping information about DNA molecules and the initial efforts to sequence fairly small pieces of DNA strands were building on a much older history of genetic maps in different model genetic organisms such as fruit flies, baker’s/brewer’s yeast, and maize. By following the inheritance of different mutations through crosses, geneticists could start to arrange mutations and the genes they were in along linear maps of chromosomes. Through these efforts, they figured out that the number of genetic maps in a given organism usually equalled the number of different types of chromosomes in that organism. In bacteria such as E. coli that lack sexual reproduction, geneticists came up with modifications to their genetic mapping strategies. In most bacteria, the genetic map formed a circle which later matched the true circular nature of the chromosome. These genetic maps were cruder than the physical maps in terms of scale – putting towns in spatial reference to each other rather than individual streets and houses, but genetic maps had a real advantage. They were linked to traits, measurable phenotypes seen in an organism. All of these tools, old and new, were in the hands of geneticists and other scientists interested in DNA by the year 1980. Over the next 15 years, advances in recombinant DNA technology and DNA sequencing along with sociological changes in the way scientists and governments approached scientific challenges brought forth the Human Genome Project. Here are some of those changes. Scientists came up with ways to use restriction endonucleases to physically map the human genome. By 1995, the physical map had over 15,000 markers on it. The genetic map of the human genome had 400 mapped traits by 1987. 
Scientists came up with ways to handle and clone into plasmids bigger and bigger chunks of DNA – moving from a few thousand base pairs up to over 100 thousand base pairs. Kary Mullis and colleagues at Cetus Corporation developed a strategy for using DNA Polymerase to replicate user-defined short stretches of DNA over and over and over again to amplify the amount of the user-defined sequence. Their strategy, called Polymerase Chain Reaction, made it easy to obtain workable amounts of specific DNA sequences from a tiny amount of starting material. Scientists, both at universities and connected to business interests, made Sanger replication-based DNA sequencing into an automated technology. DNA sequencing became more of a standardized service that universities and research institutions provided to their researchers than an individual lab art form. Big name scientists wrote opinion pieces in the top scientific journals making a case for an all-out effort to sequence the human genome. Discussions about such an effort took place at several research conferences. In the United States, the National Institutes of Health, NIH for short, and the Department of Energy, DoE for short, independently started plans for sequencing the human genome. NIH makes sense given its mandate to promote human health, but DoE had two good reasons as well. Its governmental charge is to safeguard and promote energy supplies of all types in the US. One energy supply, nuclear power, has clear safety concerns when it comes to exposure to nuclear radiation and subsequent DNA damage. DoE wanted to better understand the impact of radiation on the human genome. DoE had a longer-term energy interest as well – bioenergy in the form of organic carbon polymers stored in algae, crops, and trees. Not as scientifically sexy a topic as the human genome, but very important in its own right. In the end, NIH took the lead but both government agencies were heavily involved in the Human Genome Project.
Jim Watson of Watson and Crick fame was picked to head up the new effort. He stayed in the job for 5 years and was replaced by Francis Collins who led the US government-based efforts until they reached their original goal of a complete human genome sequence. Likewise, government-based scientific agencies in Europe and in Japan made similar decisions to be part of the Human Genome Project. Early on, several groups of scientists, government agencies and business-based efforts realized that the smaller genomes of other model organisms would be good starting points for developing better experimental strategies, sequencing technologies, and data analysis tools. There was lots of basic biology interest in these smaller genomes as well. By the time the first drafts of a human genome sequence were published in 2001, there were already over 50 completed genome sequences of other cellular organisms – many Bacteria, a handful of Archaea, a fungus, a plant, a nematode, and a fruit fly. Next time, we will consider the two initially competing approaches that were taken to sequence the human genome and why one of those approaches became the standard for all subsequent genomes and metagenomes. Talk to you again soon.

    7 min
  4. EPISODE 4

    Two Ways to Solve a Genomics Jigsaw Puzzle

    Once the goal of obtaining a human genome sequence had been set by research scientists and several government agencies around the world, the big question was how to organize the effort. Any genome of a cellular organism, but especially the human genome, is a massive amount of information. How do you gather the information and how do you piece it all back together at the end? There was no technology available in the late 1980’s and early 1990’s, and there is still none to this day, that allows one to jump onto a giant DNA strand and determine its sequence. You have to break the genome into lots of pieces, figure out the sequence of all the pieces and put all the sequences back together in the right order so that the virtual genome equals the real physical genome. Two approaches ended up in a race with each other to sequence the human genome. The larger group was a public consortium of government-funded labs around the world, but mainly in the U.S., the U.K., and Japan. This effort was first led by James Watson of Watson & Crick fame, then by Francis Collins who saw it through to completion. The public effort focused on separating the human genome into individual chromosomes and sub-chromosome pieces to organize the sequencing and simultaneously developing really fine-scale physical maps of each chromosome to help assemble the sequence reads back in the right order. Now there was quite a bit of mapping information known for the human genome already, but much more detail was needed for this map-based strategy. The second, smaller effort was a private affair led by the for-profit company Celera Genomics and several big corporate donors. Celera Genomics and its sister non-profit research organization called The Institute for Genomic Research, TIGR for short, were founded by Craig Venter, a very successful biochemist turned entrepreneur who had once worked at NIH.
Venter and his team felt that they had a better strategy – faster, cheaper, and more applicable to any genome of interest. Why wait to develop fine-scale physical maps of a genome? Why not just break the genome into random pieces and sequence them? But here is the rub. You don’t know which random pieces you are sequencing until you have sequenced them. How many random pieces do you have to sequence in order to get virtually all of them? In other words, how hard do you have to work to achieve your goal? This is actually a problem we have all dealt with on more than one occasion since we were little kids. Think about a really big bag of M&Ms of your favorite flavor. You know that there are seven colors represented in the bag. If you randomly pour out seven M&Ms into your hand, the probability that each color will be represented once is not one. There is an element of random chance in terms of which M&Ms fall out of the bag or in the case of a genome, which DNA fragments you randomly sequence. Craig Venter and his colleagues knew this was true with their so-called shotgun strategy for genome sequencing. In fact, they made use of a statistical distribution, called the Poisson distribution, that models such random events. The Poisson distribution can be used to understand random events through an equation that allows us to calculate the probability of a particular outcome. For example, if the average number of any particular M&M color in your sample is one, what is the probability that any given M&M color was not seen at all? Using m to represent the average and x to represent the number of interest, the Poisson distribution equation is: $P(x; m) = \frac{m^{x} e^{-m}}{x!}$. For the M&M question, $P(0; 1) = \frac{1^{0} e^{-1}}{0!} = \frac{1 \times 0.37}{1} = 0.37$. This means that there is a 37% probability that if we pour only 7 M&Ms out of the bag that a given color will not be represented.
That is not good enough, whether our goal is getting one of each color of M&M or getting every piece of a genome represented. We can use the Poisson distribution equation to determine how hard we would need to look. We can try different values for m, the average number of times we have seen a particular M&M color or genome fragment, in order to determine the probability of getting no hits. As we have already seen, the probability of getting no hits for a particular M&M color or genome fragment given an average number of hits of 1 is 0.37. For an average number of hits of 2, the probability of getting no hits for a particular genome fragment is 0.14. For an average number of hits of 3, the probability of no hits for a particular genome fragment is 0.05. Now we are getting somewhere. If we sequence enough genome fragments to represent 3 times the number needed to cover the entire genome, we should have 95% of it done. If we go up to an average number of hits of 5, the probability of no hits for a particular genome fragment is about 0.01. Now we have 99% of it done. No need to map a genome first. Just sequence enough pieces to cover the genome 5 times or more. We will see the shotgun strategy in action in our next episode – the first genome sequenced from a cellular organism. See you then.
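
The coverage numbers quoted in this episode follow directly from the Poisson equation with x = 0, which reduces to e to the power of minus m. A minimal sketch of that arithmetic, with hypothetical function names and assuming that reads land on the genome completely at random:

```python
# Illustrative sketch only: Poisson arithmetic behind shotgun coverage.
import math


def poisson_probability(x, m):
    """Probability of seeing exactly x hits when the average number of hits is m."""
    return (m ** x) * math.exp(-m) / math.factorial(x)


def fraction_missed(coverage):
    """Expected fraction of the genome with no hits at a given average coverage."""
    return poisson_probability(0, coverage)


def coverage_needed(fraction_covered):
    """Average coverage at which the expected covered fraction reaches the target."""
    return -math.log(1.0 - fraction_covered)


if __name__ == "__main__":
    for m in (1, 2, 3, 5):
        print("coverage %dX: fraction missed = %.2f" % (m, fraction_missed(m)))
    print("coverage for 99%% of the genome: %.1fX" % coverage_needed(0.99))
```

Real libraries are never perfectly random, so these values are best read as a planning estimate rather than a guarantee.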

    7 min
  5. EPISODE 5

    The First Cellular Genome, Part 1

    Welcome back. I am Brad Goodner, Professor of Biology at Hiram College. We have reached the point of the first genome sequence from a cellular organism, published in July of 1995 in the journal SCIENCE. As we discussed earlier in Episode 4, Craig Venter and his colleagues at TIGR, The Institute for Genomic Research, came up with a probability-based approach, a shotgun approach, to sequencing a genome. Break it up into pieces and sequence enough pieces to cover the genome at least 5 times to hopefully obtain 99% of the genome sequence. The 1995 SCIENCE article by Robert Fleischmann and 39 coauthors, including Craig Venter, focused on the genome of Haemophilus influenzae strain Rd. This strain is a nonpathogenic sister of strains that can cause inner ear infections, respiratory infections and even bacterial meningitis. Many of you have been vaccinated against several pathogenic strains of Haemophilus influenzae. The genome of H. influenzae strain Rd is 1.83 million base pairs present as a single circular chromosome. This genome was chosen for its small size and because its G+C content of 38% was very close to that of humans. Fleischmann and coworkers grew up a culture of the bacterial strain and isolated DNA. They then randomly sheared the DNA into fragments using sonication and separated the fragments using gel electrophoresis. DNA fragments in two size ranges were purified from the gel – 1500 to 2000 bp and 15,000 to 20,000 bp. The purified DNA fragments were then treated with DNA polymerases and exonucleases to generate blunt ends with phosphorylated 5’ ends. The blunt-ended fragments were ligated into a plasmid vector to make two libraries – small insert and large insert. From the small insert library, the researchers sequenced both ends of over 7000 plasmid clones and one end of over 9000 more clones. The average size of the sequence reads was around 450 bases. Overall, this resulted in over 11.6 million bases of sequence, just over 6X the size of the genome.
The shotgun approach in action! Now the work was turned over to computer algorithms that looked for overlaps between the sequence reads that met a set sequence identity criterion. In this way, the initial sequence reads were assembled into 140 larger fragments called contigs. The researchers estimated that the remaining gaps between the contigs averaged about 100 bases in size. Some of the gaps were due to the randomness of the shotgun cloning methods while other gaps were due to the fact that certain genome fragments were somehow lethal to the E. coli host cells carrying the library plasmid clones. To close the gaps required human ingenuity. For example, the researchers used the ends of each contig to see if any of them encoded parts of the same protein. If so, they designed PCR primers from each potential adjoining end and used those primers with H. influenzae genomic DNA as the PCR template. In addition, the researchers also used the contig ends as hybridization probes on Southern blots of DNA from the large insert library clones. If the ends of two different contigs hybridized to the same large insert library clone, then the same PCR strategy could be used, and the two ends of the large insert were sequenced as well. Using these strategies and a few others, the researchers were able to close all of the gaps. In this way, Fleischmann and coworkers figured out the first complete genome sequence of a cellular organism. The shotgun strategy was proven a success and became the model for virtually all subsequent genome projects. The cost of this project turned out to be 48 cents per finished base pair or just under $900K. Since then, the cost of genome projects has dropped precipitously to the point that today the same size genome could be sequenced for about $500. The genome era truly came alive with this publication, but there are biological implications beyond the technological ones. We will deal with the biological implications next time.
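
The idea of assembling reads into contigs by looking for overlaps can be illustrated with a toy example. The sketch below is not the assembler used for the H. influenzae project; it is a simple greedy merge that only accepts exact overlaps of a minimum length, whereas real assemblers tolerate sequencing errors, handle both strands, and deal with repeats. The reads and function names are made up for illustration.

```python
# Illustrative sketch only: toy greedy assembly of reads into contigs.

def overlap(a, b, min_len):
    """Length of the longest suffix of a that exactly matches a prefix of b."""
    for k in range(min(len(a), len(b)), min_len - 1, -1):
        if a.endswith(b[:k]):
            return k
    return 0


def greedy_assemble(reads, min_len=4):
    """Repeatedly merge the pair of sequences with the longest overlap.

    Stops when no pair overlaps by at least min_len bases; whatever is
    left is the list of contigs, with gaps between them.
    """
    contigs = list(dict.fromkeys(reads))  # drop exact duplicate reads
    while True:
        best_len, best_i, best_j = 0, None, None
        for i, a in enumerate(contigs):
            for j, b in enumerate(contigs):
                if i != j:
                    k = overlap(a, b, min_len)
                    if k > best_len:
                        best_len, best_i, best_j = k, i, j
        if best_len == 0:
            return contigs
        merged = contigs[best_i] + contigs[best_j][best_len:]
        contigs = [c for idx, c in enumerate(contigs) if idx not in (best_i, best_j)]
        contigs.append(merged)


if __name__ == "__main__":
    # Made-up 27-base "genome" and three overlapping reads taken from it.
    genome = "ATGGCGTACGTTAGCCTAGGATCCGTA"
    reads = [genome[0:12], genome[8:20], genome[16:27]]
    print(greedy_assemble(reads))
    # A read that overlaps nothing stays as its own contig, leaving a gap.
    print(greedy_assemble([genome[0:12], genome[8:20], "CCCCAAAA"]))
```

When reads that bridge a region are missing, the merge stops and separate contigs remain, which is the situation the gap-closing strategies described above were designed to resolve.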

    6 min
  6. EPISODE 6

    The First Cellular Genome - Part 2

    Welcome back to Genomics Revolution. I am Brad Goodner. Last time we were together, we walked through the strategy used by Craig Venter’s team at TIGR, The Institute for Genomic Research, to sequence the first genome of a cellular organism, Haemophilus influenzae strain Rd. Today, we will finish up our analysis of the July 1995 SCIENCE article by focusing on the biological implications of knowing the complete sequence of an organismal genome. The H. influenzae strain Rd genome is a single circular chromosome of 1,830,137 base pairs. Previous to this work, the sequence of 122 protein-coding genes and their surrounding noncoding regions had been deposited in GenBank, the world’s foremost database of gene data. The authors of the genome paper, Robert Fleischmann and 39 coworkers, used a published computer algorithm and the previously known coding and noncoding sequences from H. influenzae to construct a model of how the coding sequences differed from the noncoding sequences. This may sound odd, but it turns out that the parts of any given genome that code for proteins, regardless of the specific proteins involved, share key characteristics such as certain dinucleotides, trinucleotides, tetranucleotides, etc. that are more or less abundant than predicted by the single nucleotide base composition of the genome. These characteristics are unique to each species. Once the computer algorithm had been “trained” to distinguish coding from noncoding regions, Fleischmann and coworkers put the entire genome sequence through the algorithm to predict putative protein-coding genes. For the H. influenzae Rd genome, the algorithm predicted 1743 protein-coding genes or about 1 protein-coding gene per 1000 base pairs. This rough estimate has held up remarkably well since then across the entire Bacteria and Archaea domains, but the number of base pairs per gene is much smaller than that seen in the Eucarya domain.
Of the 1743 predicted protein-coding genes, 1354 of them had 30% or greater protein sequence identity to genes previously sequenced in other organisms. Evolution keeps what works! However, that does not mean that we know what all of these proteins actually do. 1007 of these genes were similar to known genes that encode proteins of known function. 347 of them were similar to genes encoding “hypothetical” or “conserved hypothetical” proteins. That leaves 389 protein-coding genes with no similarity to previously sequenced genes. Some of these genes turned out to be shared with other organisms but just hadn’t been sequenced yet. However, some of them appear to be unique to the genus Haemophilus. This point appears to be true for all sequenced genomes. Evolution is also eternally creative. Fleischmann and coauthors found many other interesting biological features from their analysis of the H. influenzae Rd genome. Remember that the Rd strain is a nonpathogenic relative of known pathogenic strains. The Rd genome shows evidence of its pathogenic heritage as some virulence genes and regulatory sequences remain, but it also shows several losses of key virulence genes. Every genome sequenced since this 1995 breakthrough answers some longstanding questions, illuminates some previously unknown biological capacities, and brings up even more questions and hypotheses for future work. In future episodes, we will learn more about both well-known organisms and recently discovered ones through their genomes. See you next time.
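
The idea mentioned earlier in this episode, that some dinucleotides are more or less abundant than the single-base composition predicts, can be made concrete with a small calculation. The sketch below compares the observed frequency of each dinucleotide with the frequency expected if bases were placed independently. It is not the gene-finding algorithm the authors used, only an illustration of the kind of signal such algorithms learn; the function name and the test sequence are made up.

```python
# Illustrative sketch only: observed/expected ratios for dinucleotides.
from collections import Counter
from itertools import product


def dinucleotide_ratios(seq):
    """Return observed/expected frequency for each of the 16 dinucleotides.

    Expected frequency is the product of the two single-base frequencies.
    A ratio above 1 means the pair is more common than base composition
    alone predicts; below 1 means it is rarer. Needs at least 2 bases.
    """
    seq = seq.upper()
    base_freq = {base: seq.count(base) / len(seq) for base in "ACGT"}
    pair_counts = Counter(seq[i:i + 2] for i in range(len(seq) - 1))
    total_pairs = len(seq) - 1
    ratios = {}
    for first, second in product("ACGT", repeat=2):
        expected = base_freq[first] * base_freq[second]
        observed = pair_counts[first + second] / total_pairs
        ratios[first + second] = observed / expected if expected > 0 else float("nan")
    return ratios


if __name__ == "__main__":
    # Made-up sequence: the strictly alternating 24-mer from Episode 2.
    ratios = dinucleotide_ratios("ACGT" * 6)
    for pair in ("AC", "CG", "GT", "TA", "AA"):
        print(pair, round(ratios[pair], 2))
```

Running the same calculation separately on known coding and known noncoding stretches, and comparing the two tables, gives a feel for how a trained model can tell the two kinds of sequence apart.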

    6 min
  7. EPISODE 7

    Introduction to a Survey of Genomes - Agrobacterium tumefaciens C58

    This is Brad Goodner.  Welcome back to Genomics Revolution.  In our first 6 episodes, we have introduced the terms genome and genomics, talked about how the field of genomics got its start, and looked at the steps of a genome project using the first ever sequenced cellular genome as an example. Now we will tour through a survey of some sequenced genomes.  All three domains of life will be represented, but the Bacteria and Archaea will get the lion’s share.  For each genome, we will learn why scientists are interested in the organism, some basic data about the genome, its genes and encoded proteins, a few surprises from the genome sequence, and an example of how scientists took the next step past having the genome sequence.  Each genome will be presented by a different student in the 2019 Hiram College Genetics course and they will put their own unique spins on their assigned subjects. To get us started, I will talk about my favorite genome, #50 in terms of getting published according to my counting.  This is the genome of Agrobacterium tumefaciens strain C58, an organism I have worked with on and off since 1983.  Agrobacterium, or Agro for short, is a genus from the Bacteria division alpha-Proteobacteria found in soils all over the world and is best known because some strains are plant pathogens.  These pathogenic strains contain a plasmid that allows them to do something no other bacterial pathogens can do – transfer a piece of their own DNA into their eukaryotic host cell where the expression of genes on the transferred DNA causes the cells of the plant host to act very very differently.  The “transformed” plant cells grow out of control because they make their own growth-stimulating hormones and they produce and secrete some strange compounds that Agro can use as C and N sources.  It turns out that Agro has been genetically engineering plants on its own for a long time before any humans thought about the possibility. 
I came back to work on Agro in 1996 when I read a paper by Allardet-Servent and coworkers (1), who showed that in strain C58 there were two DNA molecules greater than 1 Mbp that contained rRNA genes.  The presence of rRNA genes is usually indicative of a chromosome, but this would mean that strain C58 has 2 chromosomes and there was no previous evidence of this.  A group of 8 undergraduates at University of Richmond worked with me to generate and map a large collection of transposon insertions in essential genes of Agro C58.  Our paper (2), published in 1999, proved that there are 2 chromosomes in strain C58.  The larger chromosome is a circle of roughly 3 Mbp, but the smaller 2.1 Mbp chromosome is a linear DNA molecule.  Agro's closest relatives in the genus Rhizobium only show 1 circular chromosome of roughly 3.6 Mbp, so we wondered where the smaller linear chromosome came from.  We imagined 3 possibilities.  One, the smaller chromosome originated from a breakage event in the original circular chromosome.  Two, the smaller chromosome came in from the outside, such as through a viral infection.  Three, some combination of the first two hypotheses.  It was this question that drove my lab to start sequencing the Agro C58 genome in 1999.  We started with $3000 to build a genomic library and start sequencing library clones, but we knew the full cost would be closer to a half million dollars.  In the fall of 1999, we presented some of our initial findings at a small research conference that focuses on the biology of Agrobacterium.  At the end of the conference, a gentleman approached me.  “My name is Steve Slater”, he said, “and I work for a small company called Cereon Genomics.  We need to talk but not here.  I will call you tomorrow.”  On the flight back home, I told my wife Asha that I thought Steve Slater was going to tell me that his group had already sequenced the C58 genome.
However, the next day, Steve told me that Cereon Genomics was just beginning to sequence the genome and that they wanted to collaborate with my research team because of our genome map of transposon insertions.  Steve rightly saw the value of using our map to orient and join up the sequenced pieces of the genome.  We reached an agreement between Cereon Genomics, its parent company Monsanto Corporation, and my research lab.  The agreement required all partners to agree as to when and how to publish the finished work.  If any one partner didn’t want to publish, the collaboration would stop.  It was a fun but odd collaboration.  My students got to work with a lot more sequence information, but we had to use a dial-in modem connection on one computer to access the company database.  This restriction slowed us down, but we made consistent progress and were basically finished with sequencing and assembling the genome sequence by the end of 2000.  Around that time, we became aware of another collaboration between an academic lab at University of Washington and DuPont Corporation that was also sequencing the Agro C58 genome.  It was a race, but luckily in the end both collaborations agreed to publish back-to-back articles (3,4) in the journal SCIENCE that came out in December of 2001.  So what did we learn from the Agro C58 genome sequence?  First and foremost, the 2.1 Mbp linear chromosome was evolutionarily derived from a plasmid!  The origin of replication on this chromosome is clearly a member of the repABC plasmid family, very similar to those found on two large plasmids in strain C58.  Second, the linearity of the second chromosome is due to hairpin loops on each end where the top and bottom strands are connected through a stem-loop structure.  In a later paper, we obtained the full sequence of the hairpin loops and showed that the linear chromosome is found only in one subset of Agrobacterium and Rhizobium strains called biovar 1 (5).
During replication, the two “old” strands are still connected at their ends.  Once the hairpin loops are replicated, an enzyme called protelomerase recognizes the double-stranded hairpin sequences and makes staggered cuts to allow the two new daughter double-stranded DNA molecules to separate and reform hairpin loops on each end.  We don’t know yet the evolutionary origin of the hairpin loops and the gene encoding protelomerase.  My personal hypothesis is that they came in as part of a linear bacteriophage.  Third, comparison of the two chromosomes of Agro C58 with the sequenced single chromosome of Sinorhizobium strain 1021 showed clear evidence that several large chunks of the ancestral circular chromosome moved to the plasmid that became the second chromosome.  I had several Hiram College students continue studying this phenomenon and this became part of another follow-up publication (6).  There were a lot more insights gleaned from the Agro C58 genome sequence and we continue to link genes to functions using functional genomics experiments such as the Mariner-type transposon mutagenesis screen going on in the 2019 Hiram College Genetics course.  However, I will leave those details for another time.  The Agro C58 genome shows us how complex genomes can arise in the Bacteria domain and how genomes can rearrange over time.  Now let us see what we can learn from other genomes.  Stay tuned to Genomics Revolution.  For More Information on Agrobacterium strain C58 & its genome: (1) Allardet-Servent et al., 1993.  Journal of Bacteriology 175:7869-75. (2) Goodner et al., 1999.  Journal of Bacteriology 181:5160-6. (3) Goodner et al., 2001.  Science 294:2323-8. (4) Wood et al., 2001.  Science 294:2317-23. (5) Slater et al., 2013.  Applied & Environmental Microbiology 79:1414-7. (6) Slater et al., 2009.  Jour...

    9 min
  8. EPISODE 8

    Survey of Genomes - E. coli O157:H7

    Welcome to Genomics Revolution. I’m Taylor Yamamoto from the 2019 Hiram College Genetics course hosting this episode on the genome of the bacterium Escherichia coli O157:H7. I will be calling it E. coli from now on. This strain of E. coli is the most harmful strain to humans because it produces a toxin, called Shiga toxin, that causes bloody diarrhea and hemolytic-uremic syndrome, which is when red blood cells get damaged and then cause a blockage in the kidneys. This can lead to life-threatening kidney failure. By comparison, nonpathogenic E. coli often inhabit the human gut without any adverse effects. E. coli O157:H7 is normally spread by the fecal-oral route, and has caused major gastrointestinal illness outbreaks in both North America and Asia. In this podcast, we will be focusing on the genomic sequence of the E. coli O157:H7 strain that caused an outbreak in Sakai City, Osaka, Japan in 1996. The complete sequence of the chromosome is 5,498,450 base pairs in length. Additionally, the strain has a large virulence plasmid that is 92,721 base pairs, and a cryptic plasmid that is 3,306 base pairs. In total, the genome is 5,594,477 base pairs long. The Sakai chromosome is about 859,000 base pairs longer than that of the nonpathogenic strain of E. coli, but it’s not like the extra DNA is just tacked onto the end of the sequence. A lot of the sequence is conserved between the two strains, which probably represents the chromosome backbone that most E. coli strains share, but there are also regions that are unique to the Sakai strain. In order to cause an infection, bacteria first need to stick to the tissues they are hoping to infect. This is done via fimbriae, which act kind of like Velcro and help the bacteria stick to their target tissue. On the Sakai chromosome, fourteen regions were identified as being associated with the production of these fimbriae.
Five were conserved in the nonpathogenic strain, five were partially conserved in the nonpathogenic strain, and four were unique to the Sakai strain. One of the genes found in these regions was actually found to be similar to a gene that codes for fimbriae in Salmonella. The genomic sequence of the Sakai strain can be used in the future identification and study of this harmful E. coli strain. For example, it was used by Marouani-Gadri et al. (1) when their lab was investigating the effects that the presence of other bacteria can have on the proliferation rates of various strains of E. coli O157:H7. Because there have been multiple gastrointestinal illness outbreaks worldwide, Manning et al. (3) have developed a system to identify SNPs in various strains of E. coli and relate them to the severity of the infection. So, while this is definitely an international issue, you listening to this podcast is the first step in furthering education and coming to a better solution. So thanks for listening. References: (1) Marouani-Gadri, N., Augier, G., and Carpentier, B. (2009). Characterization of bacterial strains isolated from a beef-processing plant following cleaning and disinfection – Influence of isolated strains on biofilm formation by Sakai and EDL 933 E. coli O157:H7. Retrieved from https://www.sciencedirect.com/science/article/pii/S0168160509002499?via%3Dihub. (2) Hayashi, T., Makino, K., Ohnishi, M., Kurokawa, K., Ishii, K., Yokoyama, K., Han, C.G., Ohtsubo, E., Nakayama, K., Murata, T., Tanaka, M., Tobe, T., Iida, T., Takami, H., Honda, T., Sasakawa, C., Ogasawara, N., Yasunaga, T., Kuhara, S., Shiba, T., Hattori, M., and Shinagawa, H. (2001). Complete Genome Sequence of Enterohemorrhagic Escherichia coli O157:H7 and Genomic Comparison with a Laboratory Strain K-12. Retrieved from https://www.ncbi.nlm.nih.gov/pubmed/11258796.
(3) Manning, S., Motiwala, A., Springman, A., Qi, W., Lacher, D., Ouellette, L., Mladonicky, J., Somsel, P., Rudrik, J., Dietrich, S., Zhang, W., Swaminathan, B., Alland, D., and Whittam, T. (2008). Variation in virulence among clades of Escherichia coli O157:H7 associated with disease outbreaks. Retrieved from https://www.pnas.org/content/105/12/4868.
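The replicon sizes quoted in this episode can be checked with a few lines of Python. This is only a bookkeeping sketch: the dictionary and function names are made up for illustration, and the reference chromosome size it derives is simply what the episode's own numbers imply, not a figure taken from a database.

```python
# Replicon sizes quoted in the episode for E. coli O157:H7 Sakai (base pairs).
SAKAI_REPLICONS = {
    "chromosome": 5_498_450,
    "virulence plasmid": 92_721,
    "cryptic plasmid": 3_306,
}

# Approximate extra length relative to the nonpathogenic strain, as quoted.
EXTRA_VS_NONPATHOGENIC_BP = 859_000


def genome_total(replicons):
    """Total genome size: the sum of all replicons."""
    return sum(replicons.values())


def replicon_shares(replicons):
    """Fraction of the whole genome carried by each replicon."""
    total = genome_total(replicons)
    return {name: size / total for name, size in replicons.items()}


def implied_reference_size(chromosome_bp, extra_bp):
    """Chromosome size of the comparison strain implied by the quoted difference."""
    return chromosome_bp - extra_bp
```

`genome_total(SAKAI_REPLICONS)` returns 5,594,477, matching the total given above, and `replicon_shares` shows that the two plasmids together make up less than 2% of the genome. `implied_reference_size(5_498_450, 859_000)` gives roughly 4.64 million base pairs for the nonpathogenic strain's chromosome.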

    4 min
