How association genetics can find genes that help Joshua trees beat the heat

by JTGadmin

Joshua trees in Tikaboo Valley, Nevada (jby)

This post is by JTGP collaborator Jeremy Yoder, an Assistant Professor in the Department of Biology at California State University, Northridge, who studies ecological and evolutionary genetics.

Since we first launched the Joshua Tree Genome Project, we’ve told you that one big reason we want to sequence a Joshua tree genome is to find genes that are important for adaptation to climate, which could help us make sure that Joshua trees survive and thrive in a climate-changed future. But we haven’t discussed in detail how we’ll find those climate-adapted genes. The key to that part of the project is association genetics, a method for sifting through the genome to find the parts that contribute to traits we care about.

Figure 1. An example of a “candidate gene” experiment testing whether different diploid genotypes are associated with different trait values, or phenotypes. Points are the phenotypes of individual trees, grouped by their diploid genotypes at a candidate gene; overlaid box plots show that trees with different genotypes differ strongly in their phenotypes. (jby)

To understand how association genetics works, first consider a case in which we already know of a gene that might be important for a particular Joshua tree trait, like height or flower shape or physiological performance — or even just growing in places that are hotter or cooler. To figure out whether different variants in the sequence of our “candidate gene” are related to differences in the trait, we could measure the trait in many trees and then sequence that gene in all of those trees. We would test the hypothesis that the gene shapes the trait by comparing the trait values of trees carrying different variants of the gene sequence. Figure 1 gives an example of what this might look like in a hypothetical case, with tree phenotypes plotted against diploid genotypes at our candidate gene — “homozygous” trees carrying two copies of the G variant have higher phenotype values than homozygous trees carrying two copies of the A variant, and “heterozygous” trees with one copy of each have intermediate phenotypes. In this case, we’d probably conclude that the candidate gene has some effect on the phenotype we decided to measure, because different variants of the gene are associated with significantly different phenotype values.
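To make the candidate-gene test concrete, here is a minimal sketch in Python. It assumes the simplest possible model, in which each copy of the G variant shifts the phenotype by the same amount, and it judges the association with a permutation test. The genotype counts, phenotype values, and function names are all invented for illustration; they are not Joshua tree data or the project's actual analysis.

```python
import random

def slope_and_r(x, y):
    """Least-squares slope and Pearson correlation of y on x."""
    n = len(x)
    mx = sum(x) / n
    my = sum(y) / n
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    return sxy / sxx, sxy / (sxx * syy) ** 0.5

def permutation_p(x, y, n_perm=2000, seed=1):
    """Share of shuffled datasets whose |r| is at least the observed |r|."""
    rng = random.Random(seed)
    observed = abs(slope_and_r(x, y)[1])
    shuffled = list(y)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(shuffled)  # break any real genotype-phenotype link
        if abs(slope_and_r(x, shuffled)[1]) >= observed:
            hits += 1
    return (hits + 1) / (n_perm + 1)

# Hypothetical data: number of G copies per tree (AA=0, AG=1, GG=2)
# and a made-up phenotype measurement for the same trees.
g_count = [0, 0, 0, 0, 1, 1, 1, 1, 2, 2, 2, 2]
phenotype = [2.1, 2.4, 1.9, 2.6, 3.0, 3.4, 2.8, 3.3, 4.1, 3.9, 4.4, 4.0]

slope, r = slope_and_r(g_count, phenotype)
p_value = permutation_p(g_count, phenotype)
print(f"effect per G copy: {slope:.3f}, r = {r:.3f}, permutation p = {p_value:.4f}")
```

The slope estimates how much the phenotype changes with each additional G copy, which matches the pattern in Figure 1 where heterozygous trees fall between the two homozygous groups. A small permutation p-value means that shuffling phenotypes among trees rarely produces an association as strong as the observed one.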

But what about when we don’t have a candidate gene in mind? This is a much more common situation, especially when we’re studying organisms like Joshua tree, which haven’t had much in-depth genetic analysis yet. In that case, we can conduct the same kind of test at many places across the whole genome, and the more places, the better. Genome-wide association (GWA) studies work by collecting DNA sequence data from thousands or millions of variable loci in the genome, and comparing the variants at each of those loci to the phenotypes of the individuals carrying them. The handful of loci that show the strongest associations with the phenotype are the ones most likely to be within or close to genes that contribute to the measured phenotype differences.
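The genome-wide version of the test can be sketched the same way: repeat one association test per locus, then ask which loci stand out after correcting for the number of tests. The example below simulates everything, so the number of trees, the number of loci, the position of the one "causal" locus, and the effect size are assumptions chosen for illustration. The p-value uses a normal approximation to keep the code dependency-free, and the Bonferroni threshold is just one simple way to correct for multiple testing, not necessarily what a real GWA analysis would use.

```python
import math
import random

def assoc_p(genotypes, phenotypes):
    """Approximate two-sided p-value for a genotype-phenotype correlation."""
    n = len(genotypes)
    mg = sum(genotypes) / n
    mp = sum(phenotypes) / n
    sgg = sum((g - mg) ** 2 for g in genotypes)
    spp = sum((p - mp) ** 2 for p in phenotypes)
    if sgg == 0 or spp == 0:
        return 1.0  # a locus with no variation carries no information
    sgp = sum((g - mg) * (p - mp) for g, p in zip(genotypes, phenotypes))
    r = sgp / math.sqrt(sgg * spp)
    r = max(-0.999999, min(0.999999, r))
    z = r * math.sqrt((n - 2) / (1 - r * r))
    return math.erfc(abs(z) / math.sqrt(2))  # normal approximation

# Simulated study (all values hypothetical).
rng = random.Random(42)
N_TREES, N_LOCI, CAUSAL = 200, 500, 123

# Each locus gets its own variant frequency; each tree gets 0, 1, or 2 copies.
freqs = [rng.uniform(0.2, 0.8) for _ in range(N_LOCI)]
genos = [[(rng.random() < f) + (rng.random() < f) for _ in range(N_TREES)]
         for f in freqs]

# Only the causal locus affects the phenotype; everything else is noise.
phenos = [1.0 * genos[CAUSAL][i] + rng.gauss(0, 1) for i in range(N_TREES)]

p_values = [assoc_p(locus, phenos) for locus in genos]
threshold = 0.05 / N_LOCI  # Bonferroni correction for N_LOCI tests
best = min(range(N_LOCI), key=p_values.__getitem__)
significant = [i for i, p in enumerate(p_values) if p < threshold]

print(f"strongest association: locus {best}, p = {p_values[best]:.2e}")
print(f"loci passing the corrected threshold: {significant}")
```

In this simulation the locus that truly affects the phenotype should come out on top, which is the logic behind reading the tallest peaks in a plot like Figure 2 as the best leads.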

Figure 2. An example of the output from a GWA study testing more than 6 million markers for associations with flowering time in Medicago truncatula, a close relative of alfalfa. Each point represents a single marker, positioned on the x-axis according to its place in the M. truncatula genome. Points with greater values on the y-axis are more strongly associated with differences in flowering time — note the cluster of strongly associated points on chromosome 7, which contains a gene previously known to affect flowering in other plant species. (Stanton-Geddes et al., 2013)

Modern genome sequencing makes it easier than ever to collect the genetic data necessary for a GWA project, but it’s still not simple to do one rigorously. First, you can’t simply sequence and measure a bunch of Joshua trees and perform the kind of simple test I sketched in Figure 1. In natural populations, we have to contend with a phenomenon called isolation-by-distance (IBD): trees from populations that are farther apart tend to be more genetically different, simply because they rarely exchange genes. When you test millions of places in the genome, you’ll likely find some differences that have everything to do with IBD and nothing to do with the phenotype, so it’s necessary to use a statistical test that accounts for IBD. Second, even when you account for confounding population genetic effects, an association test is fundamentally correlational: it doesn’t directly demonstrate that the different genetic variants at an associated site actually create the phenotype differences you’ve measured.
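A small simulation shows how population differences alone can create a misleading association. In the sketch below, two made-up populations ("north" and "south") differ in both the frequency of a variant and their average phenotype, but the variant has no effect on the phenotype at all. Treating IBD as two discrete populations is a deliberate simplification, and centering values within each population is only a stand-in for the methods real studies typically use, such as mixed models or principal-component covariates. All names and numbers are hypothetical.

```python
import random

def pearson_r(x, y):
    """Pearson correlation between two equal-length lists."""
    n = len(x)
    mx = sum(x) / n
    my = sum(y) / n
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    return sxy / (sxx * syy) ** 0.5

def center_within(values, groups):
    """Subtract each group's mean from the values of its members."""
    totals, counts = {}, {}
    for v, g in zip(values, groups):
        totals[g] = totals.get(g, 0.0) + v
        counts[g] = counts.get(g, 0) + 1
    return [v - totals[g] / counts[g] for v, g in zip(values, groups)]

rng = random.Random(7)
pops, genos, phenos = [], [], []

# (population, variant frequency, baseline phenotype): both differ by
# population, but the genotype never enters the phenotype.
for pop, freq, baseline in (("north", 0.2, 0.0), ("south", 0.8, 2.0)):
    for _ in range(150):
        pops.append(pop)
        genos.append((rng.random() < freq) + (rng.random() < freq))
        phenos.append(baseline + rng.gauss(0, 1))

naive_r = pearson_r(genos, phenos)
adjusted_r = pearson_r(center_within(genos, pops), center_within(phenos, pops))

print(f"naive correlation:    {naive_r:.3f}")
print(f"adjusted correlation: {adjusted_r:.3f}")
```

The naive correlation is driven entirely by which population a tree comes from. Once population membership is accounted for, the apparent association should shrink toward zero, which is the kind of correction a rigorous GWA test has to build in.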

Figure 3. Obligatory correlation-is-not-causation joke. (xkcd)

So a single GWA study needs to be connected to other results, from different kinds of experiments, to confirm that genes showing associations to a phenotype in one context show associations, or even direct effects, in other conditions. Ecological geneticists call this process of comparing different kinds of evidence for a gene’s effects “triangulation”.

Finally, to be as useful as possible, a GWA study needs a reference genome to provide context to its results — whether associated loci lie in genes, and what those genes might do. It’s possible to do association testing without a reference, collecting sequence data in such a way that you know individuals’ genotypes at many loci, but don’t know where those loci are with respect to each other in the genome. Sometimes you can still use this approach to determine that an associated locus is similar to a stretch of genetic sequence known to be a particular kind of gene in another, closely related species. More often, though, GWA without a genome results in a list of associated loci about which very little is known beyond the fact that they’re associated with the phenotype you measured.
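Here is a rough sketch of what "context" from a reference genome means in practice: given the position of an associated locus, look up whether it falls inside an annotated gene, or how far it is from the nearest one. The chromosome names, coordinates, and gene names below are invented placeholders, not real annotations, and a simple scan over a short list stands in for the indexed lookups that real annotation tools use.

```python
# Hypothetical annotation: chromosome -> list of (start, end, gene name).
GENES = {
    "chr1": [(1000, 5000, "geneA"), (12000, 18000, "geneB")],
    "chr2": [(300, 2500, "geneC")],
}

def nearest_gene(chrom, pos, genes=GENES):
    """Return (gene name, distance) for the gene closest to a locus.

    A distance of 0 means the locus lies inside the gene. Returns None
    when the chromosome has no annotated genes.
    """
    best = None
    for start, end, name in genes.get(chrom, []):
        if start <= pos <= end:
            dist = 0
        else:
            dist = min(abs(pos - start), abs(pos - end))
        if best is None or dist < best[1]:
            best = (name, dist)
    return best

# Hypothetical associated loci from an association scan.
for chrom, pos in [("chr1", 3000), ("chr1", 11000), ("chr3", 5)]:
    print(chrom, pos, nearest_gene(chrom, pos))
```

Without a reference genome there is nothing to fill the annotation table with, so the same list of associated loci would come back with no gene names attached.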

Building up the genomic resources and experimental knowledge base required to support good GWA can take decades, but the Joshua Tree Genome Project’s collaborators bring together the range of expertise necessary to do it in the course of a four-year NSF-sponsored project. We’ve carefully planned our sampling design and statistical analysis to control for confounding population genetic effects. We’ll perform controlled experiments in Joshua tree physiology and gene expression to help “triangulate” the importance of climate- and growth-associated loci. And, first and foremost, we’re building a carefully annotated reference genome to provide context for GWA results. It’s going to be a lot of work, but it’s what we need to do to confidently identify the genes that help Joshua tree cope with extreme climates.