Whole Genome Approaches to Complex Kidney Disease
February 11-12, 2012 Conference Videos

Lessons From Genetic Analysis Workshop 17: Aggregation, Machine Learning, and Data Mining Approaches
Joan Bailey-Wilson, NHGRI

Video Transcript

00:00:02,100 --> 00:00:10,333
ANDREY SHAW: Our next speaker is Joan Bailey-Wilson. She is the head of the Statistical Genetics Section in the Inherited Disease Branch

00:00:10,333 --> 00:00:19,966
of the NHGRI. She’s a statistical geneticist interested in the genetics of complex disease and she is going to tell us about the lessons from

00:00:19,966 --> 00:00:28,766
Genetic Analysis Workshop 17. JOAN BAILEY-WILSON: I’m actually only going to

00:00:28,766 --> 00:00:38,832
tell you about a few of those lessons because there were a lot of them, but I am going to talk about several that I think are important and that

00:00:38,833 --> 00:00:53,833
really flow very well from Steve’s and Suzanne’s talks. So, I am going to cover some issues about aggregation; multiple different kinds of

00:00:53,833 --> 00:01:02,333
aggregation. A brief overview of what is machine learning for those of you who may not know much about it and how might machine learning

00:01:02,333 --> 00:01:13,766
and data mining approaches be useful and we’re going to talk about linkage analysis as a type of aggregation that can be very powerful for helping

00:01:13,766 --> 00:01:23,899
to interpret the results of sequence data. So, one of the things you’ve been hearing from all of us is that complex trait genetics is probably

00:01:23,900 --> 00:01:34,966
really pretty complex, so I am not going to spend a lot of time on this, but I studied lung cancer and this really hits what we have been saying is, if I

00:01:34,966 --> 00:01:44,966
just filter variants out because they occur in a database, for something like lung cancer there are probably tons of non-smokers in those databases

00:01:44,966 --> 00:01:54,966
who might have risk variants for lung cancers, but because the environmental effect is so huge, they’re not going to be penetrant. So, we always

00:01:54,966 --> 00:02:06,699
have to not ignore the environment and, therefore, non-penetrance. So, we know that there are genes with rare alleles of large effect.

00:02:06,700 --> 00:02:16,066
We’ve known about them for a long time in cancer and we also, as I said, know that major environmental risk factors can be common, but

00:02:16,066 --> 00:02:26,499
major genetic risk alleles for serious diseases tend to be rare in the population and we think that’s due to selection over time against these

00:02:26,500 --> 00:02:41,366
things. BRCA1, a major risk locus for breast cancer, has over a thousand different individually really, really rare high penetrance risk alleles. So,

00:02:41,366 --> 00:02:53,999
it’s sort of the model for what we are all starting to go after these days, but then there are also really common risk factors that increase risk as

00:02:54,000 --> 00:03:07,900
well. You all have probably seen this many times from the GWAS story but it is really important for sequencing as well. Remember, it is easy to find

00:03:07,900 --> 00:03:22,500
the really high penetrant things with family studies but not so easy to find rare lower penetrance things in family studies. Common

00:03:22,500 --> 00:03:33,033
variants that have low effect sizes are what we’ve tended to find in GWAS studies because that’s what they have power to find. And for common

00:03:33,033 --> 00:03:45,333
diseases, common risk alleles of large effect, so far, have seemed to be pretty rare across diseases—we don’t tend to see a lot of

00:03:45,333 --> 00:03:53,533
those—which is why GWAS haven’t found them. What we are hoping to find is some of the things we haven’t been able to find to date with our

00:03:53,533 --> 00:04:02,866
existing tools. Now, linkage is something that when Steve was giving his talk Suzanne and I were both going, “Yeah, yeah, linkage and

00:04:02,866 --> 00:04:14,832
pedigrees,” and that’s because linkage is looking at co-segregation of the disease or trait with a genetic variant within a family. Now, if you have a

00:04:14,833 --> 00:04:27,266
really rare variant that is segregating through a family, if it has a big effect on the disease or trait, then you get familial aggregation; exactly the kind

00:04:27,266 --> 00:04:38,032
of pedigrees that Steve was talking about. If you can collect them you may have good power to detect very rare, important causal variants.

00:04:38,033 --> 00:04:47,533
Association, on the other hand, is looking to see whether a specific genetic variant is more common, say, among cases than among controls,

00:04:47,533 --> 00:04:59,299
or among people with extremes of a phenotypic distribution; there’s lots of methods available to test both those things. A really important thing to

00:04:59,300 --> 00:05:10,066
just get in your minds is that power to detect any sequence variant depends on the size of the effect, at what level you’re measuring the effect,

00:05:10,066 --> 00:05:21,099
and its frequency in the sample that you’re studying. So, the variant for familial hypercholesterolemia, which has been known

00:05:21,100 --> 00:05:34,400
forever, because I’m old, it has a very large effect in individuals with two copies of this variant on that individual’s trait value. It has a moderate

00:05:34,400 --> 00:05:45,100
effect in relatives of those people with two copies who generally have one copy, but it has virtually no effect in the population if you start

00:05:45,100 --> 00:06:00,200
talking about population heritability of that variant. It’s tiny, not because the variant does not have a big effect on the trait, but because it is so rare.

00:06:00,200 --> 00:06:14,600
So, for common diseases you can have very rare alleles that have big effects. So, breast cancer is a perfect model for that. We know; we detected

00:06:14,600 --> 00:06:25,933
BRCA1 and BRCA2 with linkage. Once we had more than thirty loci—yeah, I’m really, really old—once we had more than 30 markers it was found

00:06:25,933 --> 00:06:36,866
quickly and all of these mutations have been found, but they’re very rare and they don’t account for all of the population risk. So, other

00:06:36,866 --> 00:06:46,799
loci, then, with major alleles have also been detected, but all of them were not enough to explain all the risk, so GWASs came along. We’ve

00:06:46,800 --> 00:06:58,000
identified a lot more loci where there are variants with small effects but they’re more common, and as I said, these common risk alleles tend to have

00:06:58,000 --> 00:07:12,100
small effects. So, back to this study. We’ve done a pretty good job at getting those Mendelian-type disease alleles. We’ve done a pretty good job at

00:07:12,100 --> 00:07:20,000
getting the common alleles of small effect out of the GWASs, and now we’d like to try and look at some of the other things. So, the Genetic

00:07:20,000 --> 00:07:31,466
Analysis Workshop has actually been held 17 times every 2 years, so for a long time it’s been going on, and what statistical geneticists like to

00:07:31,466 --> 00:07:40,766
do with it is test and compare methods. Everybody is developing new methods all the time. Suzanne gave you a list of some of the

00:07:40,766 --> 00:07:50,566
ones that are kind of the most popular right now and very interesting ones, and one of the things we like to do is test them on the same data, so

00:07:50,566 --> 00:08:01,099
that’s what this is. This workshop focused on study designs and analysis methods for whole exome sequence data and it was a partial

00:08:01,100 --> 00:08:14,300
exome—it wasn’t a full exome—and it was DNA sequence data from the 1000 Genomes Project that was used as the genomes for 697 people,

00:08:14,300 --> 00:08:25,133
and then those real genomes…they simulated phenotypes from the genotypes. So there were, of course, many individually very rare variants as

00:08:25,133 --> 00:08:36,699
well as many common variants. So these simulated traits, based on the sequence data, were extremely complex. They wanted to make

00:08:36,700 --> 00:08:48,233
this really hard to see how well methods could do. So, there were a few rare variants that had moderate to small effects on an individual’s risk of

00:08:48,233 --> 00:08:58,666
disease or on a quantitative trait; there were both. There were other rare variants that had very small effects on the traits and there were

00:08:58,666 --> 00:09:09,799
many, many—most rare variants—with no effect. And then, there were some common sequence variants that had small to very small effects on an

00:09:09,800 --> 00:09:19,300
individual’s risk of the disease or on the quantitative traits, but again, most had no effect; sort of what we think common diseases might be

00:09:19,300 --> 00:09:33,300
like. So, there were the 697 unrelated individuals who had their phenotypes simulated and then they also used the same genotype data to

00:09:33,300 --> 00:09:39,966
simulate 8 extended pedigrees. They wanted to have just the same number of people, so what they did is they randomly sampled the founders

00:09:39,966 --> 00:09:53,566
of the pedigrees and then they simulated meioses using Mendelian rules and recombination, etc. to take the genotypes down the pedigrees across

00:09:53,566 --> 00:10:05,199
the genome, and then they used the exact same simulation model to produce the phenotypes. So, a variant in the unrelated sample had the exact

00:10:05,200 --> 00:10:21,133
same effect size for an individual as it did in the family sample, so it biologically would behave the same way. However, one of the problems with

00:10:21,133 --> 00:10:31,299
this is locus-specific heritabilities with quantitative traits, or risk of the disease for the qualitative traits were really, really small, and this is the kind

00:10:31,300 --> 00:10:45,066
of thing Suzanne was talking about, where you have rare variants that are not highly penetrant; they’re pretty low penetrance. So, that meant that

00:10:45,066 --> 00:10:55,599
the locus-specific heritabilities for the causal variants were quite small and there were only two causal genes in the unrelated individuals that

00:10:55,600 --> 00:11:08,766
actually had locus-specific heritabilities over .01, so they were really tiny, and this was not because they didn’t have a detectable effect in

00:11:08,766 --> 00:11:22,732
an individual, it was because they were so rare. So, the population amount of the heritability explained by any of these rare variants was tiny.
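The arithmetic behind that point can be made explicit. Under the standard additive single-locus model, the variance a biallelic locus contributes is 2p(1-p)a²; the short sketch below uses illustrative values (not the actual GAW17 simulation parameters) to show how the same per-allele effect yields a tiny locus-specific heritability once the allele is rare:

```python
def locus_h2(p, a, total_var=1.0):
    """Fraction of total phenotypic variance explained by an additive
    biallelic locus: 2p(1-p)a^2, where p is the risk-allele frequency
    and a is the per-allele (additive) effect on the trait."""
    return 2.0 * p * (1.0 - p) * a ** 2 / total_var

# Same per-allele effect on an individual's trait, very different
# population-level (locus-specific) heritability:
common_h2 = locus_h2(p=0.30, a=0.5)    # ~0.105
rare_h2 = locus_h2(p=0.0005, a=0.5)    # ~0.00025
```

The variant's effect on a carrier is identical in both cases; only the frequency term 2p(1-p) changes, which is exactly why the rare causal variants looked invisible at the population level.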

00:11:22,733 --> 00:11:33,033
So of course, just about everybody that analyzed the data used some sort of a collapsing scheme of rare variants. Some people collapsed within

00:11:33,033 --> 00:11:41,499
genes, some people collapsed within LD blocks. There were lots of different collapsing schemes, but pretty much everybody did collapsing to have any

00:11:41,500 --> 00:11:56,033
power. However, almost all of the association methods had incredibly low power. Some of the methods had moderate power to detect these

00:11:56,033 --> 00:12:12,099
two loci that had multiple variants with moderate effect sizes, and the very standard methods like linear regression or logistic regression, and the

00:12:12,100 --> 00:12:24,533
whole host of different machine learning methods that were used, all had very similar power; meaning, pretty poor. So of course, that told us

00:12:24,533 --> 00:12:38,966
what we already expected, that if we were looking for risk alleles that are very rare and don’t have huge population-specific heritabilities, we

00:12:38,966 --> 00:12:50,366
were going to need gigantic samples. I think Andrey showed you that 6,000-and-some people are needed if the variant is only moderately rare with a sort of

00:12:50,366 --> 00:13:03,066
medium-high risk. So, it told us huge samples were going to be needed to have good power. Another thing that we expected to see and we

00:13:03,066 --> 00:13:13,632
did observe, was that if you enriched your subsamples…if you took subsamples of the people with extreme values—Steve’s study

00:13:13,633 --> 00:13:25,533
design—then you could, in fact, get better power in smaller samples. This is classic genetics, which is why he did it, and it was

00:13:25,533 --> 00:13:35,133
interesting at least to see that you did increase your power to detect causal variants. One of the things that was very worrying, though, was that the

00:13:35,133 --> 00:13:45,033
false positive rates were inflated even when you collapsed the rare variants, thereby cutting down your numbers of tests, and one of the things that

00:13:45,033 --> 00:13:56,533
was very clear is that the issue of ignoring correlations between your variables was huge, and there were, of course, the correlations due

00:13:56,533 --> 00:14:09,233
to linkage disequilibrium of close SNPs but there also were correlations across chromosomes where you would have a variant that was a

00:14:09,233 --> 00:14:22,133
singleton in the data that was then perfectly correlated with every other singleton variant in that individual in your sample and this is the kind

00:14:22,133 --> 00:14:35,066
of thing that one needs to look at and correct for: is this variant that’s giving me… is this gene that’s giving me a signal highly correlated with

00:14:35,066 --> 00:14:44,499
other genes elsewhere in the genome? These are interesting things that people are trying to look at and these kinds of correlations of your

00:14:44,500 --> 00:14:55,300
predictors can be problematic, so it’s sort of another thing to think about in your QC. Some methods did a little better than others at

00:14:55,300 --> 00:15:08,933
controlling those false positive rates in the unrelated individuals. For extended pedigrees the families now do not have all of the sequence

00:15:08,933 --> 00:15:19,799
variants that were in the unrelated individuals. They are a sample from that population of variants because only some of the unrelated

00:15:19,800 --> 00:15:32,966
individuals in the unrelated set became founders. So, what you saw was some of the causal rare variants that were in the unrelated set don’t

00:15:32,966 --> 00:15:42,299
appear in the families at all and this again would be typical. If you are taking aggregated families out of the population you will only be able to

00:15:42,300 --> 00:15:55,166
discover the causal variants that actually occur in those founders. However, by doing this, what we found, and again what we would expect,

00:15:55,166 --> 00:16:07,132
was that you saw enrichment in frequency of some of the really rare causal variants because they occurred in a founder and then segregated

00:16:07,133 --> 00:16:17,199
down through the family, so suddenly, instead of having one person in the unrelated sample with that variant, you had multiple people in a family

00:16:17,200 --> 00:16:31,200
with it and other people without it, so you had enough power to detect the effect of that variant on individuals’ traits. So, there were several

00:16:31,200 --> 00:16:45,033
genes, then, that could be detected as causal where there was no power to detect them in the unrelated samples. The locus-specific heritability

00:16:45,033 --> 00:16:56,233
is increased in the family sample for those variants, not because their actual effect on the trait has changed, but only because their

00:16:56,233 --> 00:17:07,633
frequency in the sample being analyzed has changed. None of this is new. A lot of this is called “Old Lessons Learned Anew,” and those

00:17:07,633 --> 00:17:17,099
of us who are sort of old curmudgeons were going, “We knew this in 1970.” Well, I didn’t because I was still in high school but my mentor, Robert

00:17:17,100 --> 00:17:30,566
Elston, knew these things in 1970. He knew them earlier than that, probably. So, locus-specific heritabilities increased in the family samples

00:17:30,566 --> 00:17:41,266
because of the change in allele frequency. That meant that they had very good power if you used family-based association analysis to detect those

00:17:41,266 --> 00:17:51,566
causal variants. In fact, in the family-based association analysis, for two of the methods they were detected in 100% of replicates; it’s not quite

00:17:51,566 --> 00:18:06,899
power but it’s showing you it has very good power. The other thing that was striking was if you used linkage analysis in your extended pedigrees you

00:18:06,900 --> 00:18:18,600
had absolutely genome-wide significant linkage evidence for these rare variants just doing old, old, old-fashioned two-point linkage analysis; not

00:18:18,600 --> 00:18:34,900
even multi-point, just two-point, which is really old. So again, extended pedigrees as Steve pointed out, were incredibly powerful to

00:18:34,900 --> 00:18:48,866
determine which were the true causal variants, and the Type 1 error of these methods has been shown over the years to be well, well controlled.
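For readers who want the mechanics, the old-fashioned two-point linkage analysis mentioned above reduces to a likelihood ratio over the recombination fraction. Here is a minimal phase-known sketch, under the simplifying assumption of fully informative meioses (which real pedigrees rarely provide):

```python
import math

def two_point_lod(theta, nonrec, rec):
    """Two-point LOD score for phase-known, fully informative meioses:
    log10( L(theta) / L(theta = 0.5) ), where L(theta) is
    (1 - theta)^nonrec * theta^rec."""
    if theta == 0.0 and rec > 0:
        return float("-inf")  # any recombinant makes theta = 0 impossible
    log_num = nonrec * math.log10(1.0 - theta)
    if rec > 0:
        log_num += rec * math.log10(theta)
    return log_num - (nonrec + rec) * math.log10(0.5)

# Ten fully informative meioses in a pedigree, all nonrecombinant:
# LOD at theta = 0 is 10 * log10(2), about 3.01 -- past the classic
# threshold of 3 from a single extended family.
single_family_lod = two_point_lod(0.0, nonrec=10, rec=0)
```

This is why a handful of co-segregating meioses in one loaded pedigree can reach genome-wide significance for a variant that would be a singleton in an unrelated sample.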

00:18:48,866 --> 00:19:01,932
So, we know that these methods control Type 1 error very well and they are very powerful when you have extremely rare variants. Note though,

00:19:01,933 --> 00:19:11,399
remember I told you, you won’t find all of the causal variants this way; you’ll only find the ones that are actually segregating in those pedigrees,

00:19:11,400 --> 00:19:24,833
so you will, by definition, miss stuff, but you miss stuff in the unrelateds as well. So basically, linkage is very powerful to detect these high

00:19:24,833 --> 00:19:37,499
penetrance risk alleles in families. They tend to be rare in the population, but if you ascertain pedigrees correctly, such as ascertaining loaded

00:19:37,500 --> 00:19:48,400
families for disease or ascertaining data for quantitative traits where there are at least some people in the family who have extreme values of

00:19:48,400 --> 00:20:01,566
the quantitative traits, then this can be a very powerful way to detect these extremely rare variants that have moderate effect sizes.

00:20:01,566 --> 00:20:13,799
Association methods, on the other hand, we know are most powerful for common variants, and we’ve also all agreed that most common

00:20:13,800 --> 00:20:27,900
variants have small effects. But even there, there are ways that you can increase your power, and that is you can ascertain extreme families,

00:20:27,900 --> 00:20:41,266
extreme families with multiple affecteds or families with high and low values of the trait, and use family-based methods, or you can ascertain

00:20:41,266 --> 00:20:57,299
cases who have a family history of a disease. You can take cases from pedigrees that have extreme high values and low values of the trait;

00:20:57,300 --> 00:21:11,666
things like that. So, there are ways by using ascertainment that you can increase your power for detection. What we’re hoping then, is that by

00:21:11,666 --> 00:21:21,366
doing ascertainment—this is kind of that typical power slide you’ve seen forever—and what we’re hoping by doing ascertainment is that…I

00:21:21,366 --> 00:21:37,499
don’t have a pointer, do I? Oh, here it is. We are hoping by doing ascertainment, we’re hoping to take the yellow thing and move it over, because

00:21:37,500 --> 00:21:48,700
by ascertaining pedigrees you’re hoping to enrich your sample for those very rare variants, and for the common variants in GWASs, by ascertaining,

00:21:48,700 --> 00:21:57,933
say, on family history, we’re hoping to increase the frequency of those things that are in-between the little blue circle and the GWAS

00:21:57,933 --> 00:22:12,366
yellow circle in your sample. So, it’s basically an oversampling strategy. So, the other thing that we have to worry about is I really don’t want you all

00:22:12,366 --> 00:22:23,899
to start thinking that rare variants are going to solve everything, because some of the missing heritability may not be due to the fact that we just

00:22:23,900 --> 00:22:35,933
haven’t found the rare variants. Complex traits are complex and we expect that you need to include environmental risk of course, but we

00:22:35,933 --> 00:22:44,699
expect that there are non-linear effects; that there are gene-by-gene and maybe gene-by-gene-by-gene-by-gene pathway-type

00:22:44,700 --> 00:22:57,333
interactions, etc., and most analyses ignore these because it’s very difficult to deal with them. So, this brings me into the next bit of my talk, which is:

00:22:57,333 --> 00:23:07,599
how do we do some of this aggregation of our overall look at the genome? Statistical learning machines are one approach. These are very

00:23:07,600 --> 00:23:17,166
computer-intensive methods. They were, in fact, developed in computer science, not in statistics really, and they are designed to produce

00:23:17,166 --> 00:23:30,432
classifications (affected/unaffected) or predictions of a quantitative trait from extremely large numbers of potential predictors; what we in

00:23:30,433 --> 00:23:40,633
statistics call our independent variables and what they in machine learning tend to call features. They use samples where, for each individual,

00:23:40,633 --> 00:23:49,966
you know the outcome (case/control status or the value of a quantitative trait) and you’ve got all these predictors. So far it’s the same as all of our other analyses.
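As a concrete toy version of that setup (known outcomes, a set of features, and an algorithm that builds a model to predict the outcomes), here is a minimal one-feature decision stump in Python. It is purely illustrative, not any method actually used at GAW17:

```python
def fit_stump(feature, outcomes):
    """A toy 'learning machine': a one-feature decision stump.
    It scans candidate thresholds and keeps whichever split best
    reproduces the observed binary outcomes -- no coefficients to
    estimate and no distributional assumptions, just a heuristic
    search over splits."""
    best = None  # (errors, threshold, class predicted above threshold)
    for t in sorted(set(feature)):
        for above in (0, 1):
            preds = [above if x > t else 1 - above for x in feature]
            errors = sum(p != y for p, y in zip(preds, outcomes))
            if best is None or errors < best[0]:
                best = (errors, t, above)
    return best

# Outcome switches from 0 to 1 above x = 3; the stump finds that split.
errors, threshold, above = fit_stump([1, 2, 3, 4, 5, 6], [0, 0, 0, 1, 1, 1])
```

The decision trees discussed in the talk are essentially stacks of such splits, one per node, chosen over many features at once.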

00:23:49,966 --> 00:23:58,466
What the computer algorithms do is they use the predictors or the features to create models to best predict the observed outcomes. Again, just

00:23:58,466 --> 00:24:13,899
like regular statistics. So, in fact, linear and logistic regression are considered by a lot of people as very simple types of machines. Those

00:24:13,900 --> 00:24:22,666
machines though, they are what we call parametric and model-based. They use the data to estimate coefficients of the predictors—that’s

00:24:22,666 --> 00:24:32,899
the betas in a regression. The betas are your parameters, and they also make assumptions, generally that the

00:24:32,900 --> 00:24:44,633
data follow a probability model: that they’re logistically distributed, normally distributed, etc. Most methods that people currently think of as learning

00:24:44,633 --> 00:24:56,766
machines are non-linear and non-parametric: they don’t estimate coefficients of the predictors, they don’t assume the data follow a specific

00:24:56,766 --> 00:25:06,599
distribution. There are many, many different approaches to machines, and each one performs better or worse in different situations,

00:25:06,600 --> 00:25:15,733
and there is no single optimal machine. In fact it has been proven mathematically that there cannot be an optimal machine. That math is above me; I

00:25:15,733 --> 00:25:30,766
take my friend Jim O’Malley’s word that the proof is valid. In the Genetic Epidemiology issue that came out of the Genetic

00:25:30,766 --> 00:25:39,566
Analysis Workshop, we have a brief review of some of these methods if any of you want to learn more about them. But here is a little just

00:25:39,566 --> 00:25:56,432
example of how this works. So, if we said a person is at risk for heart disease, if any one of these three conditions is true—either their LDL

00:25:56,433 --> 00:26:08,533
plus their HDL is greater than 240 mg/dL, or HDL is less than 35 mg/dL, or the ratio of LDL to HDL is greater than 4. This is simulated data; a

00:26:08,533 --> 00:26:20,866
made-up model. So, if that’s true then those dotted lines are those diagnosis lines, and so the people—those light circles that are inside those

00:26:20,866 --> 00:26:34,699
dotted lines—they are the unaffecteds and the people outside are the people who are at risk for cardiovascular disease—for heart disease. This

00:26:34,700 --> 00:26:47,633
can be modeled by that very simple decision tree on the right which is just these three conditions. That’s pretty much the simplest example I could

00:26:47,633 --> 00:26:57,333
come up with of a tree and this is from Jim O’Malley’s really nice introductory book on this that just came out this past year. So, if you’re

00:26:57,333 --> 00:27:09,599
looking for a good introduction to machines, this is really useful for explaining and translating from computer science to statistics. But here’s an

00:27:09,600 --> 00:27:22,933
example of a more complex thing. The red are cases, the greens are controls. This would be really, really hard to model linearly. You can model

00:27:22,933 --> 00:27:36,799
this with this decision tree, which just makes simple yes/no, is this predictor above a value or below it, and it classifies each time it splits the

00:27:36,800 --> 00:27:50,000
data, and it’s trying to get at the end a division. You can sort of see the little red lines where most of the cases are on one side of those division

00:27:50,000 --> 00:28:01,066
lines and most of the controls are on the other side. It’s very much a heuristic process; it makes no assumptions about how those data should be

00:28:01,066 --> 00:28:14,199
distributed. Now remember, I said there is no one machine that works perfectly every time, so there are ensembles of machines where you can take

00:28:14,200 --> 00:28:22,300
multiple decision trees, multiple methods, put them together with majority votes, and that often gives better results. Also you can use bootstrapping

00:28:22,300 --> 00:28:33,300
and permutation to get better prediction for a new data set and you can also use it to evaluate the importance of each of the predictors that are

00:28:33,300 --> 00:28:40,133
included in any of these trees, because of course that’s what somebody like me cares about. I want to know which of these predictors

00:28:40,133 --> 00:28:50,299
is really driving my tree, so I can go take it back to the lab, and figure out: what does that predictor actually do? What does that rare genetic variant

00:28:50,300 --> 00:29:06,700
actually do biologically? So, these are interesting methods. A lot of people are working on trying to apply these to genetic data. Another method that I

00:29:06,700 --> 00:29:16,533
am going to talk about—because my husband’s developing it here at NHGRI, so we are interested in it—is called Tiled Regression and it was the

00:29:16,533 --> 00:29:27,666
one I was telling you actually did a better job than most at controlling Type 1 error in the unrelated individuals, and it’s a method to look at

00:29:27,666 --> 00:29:41,099
aggregating variants across the genome, dealing with the fact that there are extreme correlations among your predictors. So you pick predefined

00:29:41,100 --> 00:29:52,766
regions, and those can be predefined any way you want. We tend to like hot spot blocks based on linkage disequilibrium blocks. So, the tiles are

00:29:52,766 --> 00:30:05,766
the regions within a haplotype LD block. Collapse the rare sequence variants within the blocks, but you can use any predefined region you want; it

00:30:05,766 --> 00:30:15,966
can be genes, it can be whatever. For each tile then, you use multiple regression to say: are any of the variants in that region showing any

00:30:15,966 --> 00:30:29,732
association at all with the trait? If so, you keep the tile. If not, you toss all of those SNPs. You do this across all of your tiles and in the tiles that

00:30:29,733 --> 00:30:40,899
you have kept, you use stepwise regression to select important independent variables within the tile and bring them up to the next tile, where you

00:30:40,900 --> 00:30:55,566
start merging tiles together. So, then they are tested in stepwise higher order regressions to come up with chromosome-wide SNPs that have

00:30:55,566 --> 00:31:06,966
gotten rid of the correlations between them and still have some signal. Eventually, you get to all of the variants across the chromosome and the

00:31:06,966 --> 00:31:16,599
genome level, and then you use a lot of permutations to try and control your Type 1 errors. A lot of research is still going on on this

00:31:16,600 --> 00:31:31,900
method. The software is available; the Genetic Analysis Workshop paper about that is there. So, I agree with Steve, with Suzanne, with Andrey. I

00:31:31,900 --> 00:31:47,166
think that in the future we are going to be using a lot of Next-Gen sequencing to be examining a much broader spectrum of possible variants, not

00:31:47,166 --> 00:31:57,932
just non-synonymous changes in exons. I think we are going to need it in large numbers of people to really have power, and I don’t want you

00:31:57,933 --> 00:32:06,966
all to forget about GXG and GXE interactions because I do believe they are going to be quite important, and what a lot of folks are working on

00:32:06,966 --> 00:32:30,999
is building networks and pathways, to try and better understand what’s going on. So, I’ll take questions. I at least didn’t put any equations in.
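The screen-then-combine idea behind the tiled regression discussion above can be outlined in a few lines of Python. The absolute-correlation screen below is a simplified stand-in for the method's actual per-tile multiple regression and stepwise selection, and all names and data are hypothetical:

```python
def corr(x, y):
    """Pearson correlation, computed directly (no external libraries)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sum((a - mx) ** 2 for a in x) ** 0.5
    sy = sum((b - my) ** 2 for b in y) ** 0.5
    return sxy / (sx * sy) if sx and sy else 0.0

def screen_tiles(tiles, genotypes, trait, threshold=0.3):
    """First stage of a tiled-regression-style analysis: test each
    predefined tile for any signal, discard every variant in tiles
    with none, and pass the survivors up for joint modelling.
    'tiles' maps tile name -> variant ids; 'genotypes' maps
    variant id -> 0/1/2 allele counts per person."""
    kept = []
    for name, variants in tiles.items():
        if any(abs(corr(genotypes[v], trait)) > threshold for v in variants):
            kept.extend(variants)
    return kept

tiles = {"tile1": ["v1", "v2"], "tile2": ["v3"]}
genotypes = {"v1": [0, 0, 1, 1, 2, 2],
             "v2": [0, 1, 0, 1, 0, 1],
             "v3": [1, 0, 0, 1, 0, 1]}
trait = [0.1, 0.2, 1.0, 1.1, 2.0, 2.2]
# tile1 carries signal via v1, so v1 and v2 survive; tile2 is dropped.
survivors = screen_tiles(tiles, genotypes, trait)
```

In the real method the surviving variants are then combined in stepwise higher-order regressions up to the chromosome and genome level, with permutation used to control Type 1 error.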

00:32:31,000 --> 00:32:33,166
Yeah? MALE: Thanks for the talk. I was wondering if

00:32:33,166 --> 00:32:39,066
copy number variants were taken into account in any of these analyses and how that works and is accounted for?

00:32:39,066 --> 00:32:50,966
JOAN BAILEY-WILSON: Some people did look a little bit at CNVs but the data we had were not optimal for looking at CNVs and I don’t think there

00:32:50,966 --> 00:33:00,499
were any new lessons about CNVs that we didn’t already have from some of the other platforms. I may be missing something. There

00:33:00,500 --> 00:33:10,466
were many hundreds of papers at this workshop. It was the most well attended workshop we’ve ever had; it was huge. So, I

00:33:10,466 --> 00:33:17,232
may be missing someone’s paper but I don’t recall any big lessons learned about CNVs.

00:33:17,233 --> 00:33:24,233
MALE: And this is just a follow-up in general. We’ve talked a lot about sequence variants and the future being whole exome and whole genome

00:33:24,233 --> 00:33:28,733
and complex traits. I know that CNVs have been shown to have an impact, especially in developmental disorders—brain, congenital

00:33:28,733 --> 00:33:30,766
heart—and I’m wondering about the future impact of CNVs and these kinds of diseases.

00:33:30,766 --> 00:33:40,166
JOAN BAILEY-WILSON: Absolutely; I have that in my later slides. Let’s not forget all these other things, because in a lot of the traits I study I’m not

00:33:40,166 --> 00:33:49,932
convinced it’s going to be in the exons. I mean, I’m hoping, but I’m not convinced it will be. I think some of it may be in the regulatory regions. All of

00:33:49,933 --> 00:33:57,799
these other things that we know are so interesting and that’s why I think most of us do not want to throw away any of these variants.

00:33:57,800 --> 00:34:05,033
We really want to look at everything. Yeah? MALE: Just to comment about the CNVs. One of

00:34:05,033 --> 00:34:15,099
the things about CNVs is that most of the work that’s been done with CNVs previously have focused on CNVs that are tagged by SNPs on

00:34:15,100 --> 00:34:26,533
genotyping arrays, and typically CNVs represent sort of not well behaved regions of the genome because they are structural variants. Yet, the

00:34:26,533 --> 00:34:35,599
genotyping arrays have SNPs that are robustly genotyped all the time, so they’re in well behaved regions. So, you really don’t see a lot of CNVs

00:34:35,600 --> 00:34:45,166
from the genotyping arrays that are tagged by SNPs. So, part of 1000 Genomes and these other sequencing projects are now getting into regions

00:34:45,166 --> 00:34:58,899
that are not well behaved, and so by sequencing you’re going to pick up the smaller indels, the different types of CNVs that we have never seen

00:34:58,900 --> 00:35:10,633
before. I think this is going to be a whole new era in looking at the role of copy number variants with respect to phenotypes in disease.

00:35:10,633 --> 00:35:20,966
JOAN BAILEY-WILSON: Absolutely. One of the things my friends who do a lot of CNV work with SNPs say is that the only CNVs off of SNP

00:35:20,966 --> 00:35:33,099
chips that they believe are the really big ones, whereas in some of the sequence data that we have there are really interesting small indels that

00:35:33,100 --> 00:35:39,133
we actually believe really are real. So, I absolutely agree with you.

00:35:39,133 --> 00:35:51,333
MARTIN POLLAK: Martin Pollak from Beth Israel Deaconess in Boston. Your last slide mentioned interactions between genes and pathways; I’m

00:35:51,333 --> 00:35:58,733
curious. In the aggregation methods—and I guess this question is actually relevant to Dr. Leal’s talk as well—are there rational ways to look at

00:35:58,733 --> 00:36:09,999
accumulation of rare variants in multiple genes and related pathways? I ask because it seems there’s a complicated multiple hypothesis testing

00:36:10,000 --> 00:36:14,700
issue that comes up. JOAN BAILEY-WILSON: It is.

00:36:14,700 --> 00:36:23,300
MARTIN POLLAK: Because if we see variants in these five genes which, according to some gene ontology program, are all part of some particular

00:36:23,300 --> 00:36:28,166
pathway, how do you deal with the multiple hypothesis issues there and rationally do these analyses?

00:36:28,166 --> 00:36:39,199
JOAN BAILEY-WILSON: I agree. Multiple testing is a huge issue. In fact, in one of our GAW papers we did look at collapsing rare variants within

00:36:39,200 --> 00:36:51,166
pathways, and in this particular simulation it didn’t make much difference because they hadn’t really simulated things that way, but it’s an intriguing

00:36:51,166 --> 00:37:00,266
thing that a lot of people are thinking about: should we collapse only in genes or should we try within a pathway? There was a talk by

00:37:00,266 --> 00:37:14,399
someone from the Broad, a guy named Or Zuk—Z-U-K—who was looking at pathway-based association, and those were really interesting kinds of

00:37:14,400 --> 00:37:24,966
things and it is something we worry about and think about, but once you start these kinds of analyses they really have to be exploratory to be

00:37:24,966 --> 00:37:34,799
hypothesis-generating so that then you can test it in an independent set of data. Once you start doing all these other extra things I can only say

00:37:34,800 --> 00:37:40,066
it’s got to be hypothesis-generating. MALE: Looking at the applications of the

00:37:40,066 --> 00:37:44,332
machine-learning techniques to say micro… JOAN BAILEY-WILSON: I’m sorry, I’m deaf. Could

00:37:44,333 --> 00:37:46,699
you talk a little louder into the mic? MALE: When you look at the applications of the

00:37:46,700 --> 00:37:57,233
machine-learning techniques to microarray data, you find that they fail notoriously when you look at heterogeneous populations, such

00:37:57,233 --> 00:38:05,533
as predicting survival from cancer data or autoimmune diseases. Do we have any intuition why they should work on this kind of data?

00:38:05,533 --> 00:38:10,633
JOAN BAILEY-WILSON: Will heterogeneity be as much of a problem, do you mean?

00:38:10,633 --> 00:38:14,566
MALE: For instance, yes. JOAN BAILEY-WILSON: Heterogeneity is the

00:38:14,566 --> 00:38:26,899
bane of all geneticists’ existence and that is why a lot of us like…if you can find at least some aggregated families, you then are increasing the

00:38:26,900 --> 00:38:37,333
chance to detect it, but you pay the price of missing the other variants that weren’t in your family

00:38:37,333 --> 00:38:47,766
sample. There is no perfect study design, and this is why I tend to say to my post-docs over and over again, I want to use every tool in my

00:38:47,766 --> 00:38:58,266
toolbox. Just because I have a new power drill does not mean I am throwing away my hammer and my saw because they all do different things.

00:38:58,266 --> 00:39:05,466
All of these study designs have different strengths and weaknesses and do different things, and heterogeneity is one of the reasons

00:39:05,466 --> 00:39:14,699
that I really like family studies as well as case-control studies. I do both.

00:39:14,700 --> 00:39:29,466
MALE: You began your talk by talking about a simulation that GAW17 did, if I recall. I imagine it is like a treasure hunt where the simulator has

00:39:29,466 --> 00:39:39,399
chosen a gene; FLT1 appeared later on, so I’m going to guess maybe they said that links to preeclampsia. We’re going to concoct data where

00:39:39,400 --> 00:39:47,133
people with particular variants are said to have the phenotype, and then see if it’s found. Is that the way it was working?

00:39:47,133 --> 00:39:52,433
JOAN BAILEY-WILSON: That’s exactly what the Genetic Analysis Workshop does, and it’s what Suzanne was describing with her

00:39:52,433 --> 00:40:03,733
simulation program as well. You make up a model, you say this is biological truth, you simulate the data—so you make up your data—to follow that

00:40:03,733 --> 00:40:11,199
biological truth and you throw the data out there and say, “Can you guys find it?” And it is a treasure hunt, but it’s really useful kinds of

00:40:11,200 --> 00:40:16,366
treasure hunts that we all do to say, “How do our methods perform?”
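The simulate-and-recover exercise described here can be sketched in a few lines. This is only a toy illustration, not the actual GAW17 generating model: the sample size, allele frequencies, choice of causal variants, and effect sizes below are invented for the example, and the "method" being evaluated is a naive burden-style carrier comparison rather than any specific published test.

```python
# Toy version of the GAW-style "treasure hunt": make up a biological
# truth, simulate data under it, then see whether a method recovers it.
# All parameters here are illustrative, not GAW17's actual values.
import random

random.seed(17)

N = 2000              # subjects
M = 20                # rare variants in one gene
CAUSAL = {2, 7, 11}   # the variants we secretly made causal

def simulate_subject():
    # Each rare variant is carried with small probability (MAF ~ 1%).
    geno = [1 if random.random() < 0.01 else 0 for _ in range(M)]
    # Biological truth: carrying any causal variant raises disease risk.
    risk = 0.90 if any(geno[i] for i in CAUSAL) else 0.01
    return geno, random.random() < risk

genos, cases = zip(*(simulate_subject() for _ in range(N)))

# Naive burden-style check: collapse the gene to "carries any rare
# variant" and compare carrier rates in cases versus controls.
def carrier_rate(is_case):
    idx = [i for i in range(N) if cases[i] == is_case]
    return sum(1 for i in idx if any(genos[i])) / len(idx)

print("carrier rate in cases:    %.3f" % carrier_rate(True))
print("carrier rate in controls: %.3f" % carrier_rate(False))
```

Because the simulated effect is large, the carrier rate among cases comes out well above the rate among controls, which is the sense in which a method "finds the treasure"; shrinking the effect sizes or scattering the causal variants across genes makes the hunt correspondingly harder, as with the moderately hard and very hard variants in the GAW17 answers.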

00:40:16,366 --> 00:40:22,899
MALE: And is it generally a single gene that you do or do you have a polygenic model? Polygenic for one disease?

00:40:22,900 --> 00:40:33,766
JOAN BAILEY-WILSON: For this there were, like, 50 causal variants per trait; these were incredibly complex traits. Several of them had environmental

00:40:33,766 --> 00:40:46,032
covariates as well, but it was really going for the incredibly complex. It’s going to be really hard to find any of these things, so you could see, all

00:40:46,033 --> 00:40:55,666
right, yeah, a lot of methods found the easier ones, a few methods found the moderately hard ones, and no methods found the really, really

00:40:55,666 --> 00:40:58,099
difficult ones. ROBERT STAR: Rob Star, NIH. That was

00:40:58,100 --> 00:41:02,866
beautiful. I have a question which I think is about the underlying hypotheses.

00:41:02,866 --> 00:41:10,132
JOAN BAILEY-WILSON: Sorry, this hearing aid just died—the battery died—so you have to talk really loud.

00:41:10,133 --> 00:41:21,566
ROBERT STAR: Sorry. So the question is: do these aggregation methods really change the false discovery rate? Because you’re still taking

00:41:21,566 --> 00:41:31,399
data and adding 10 things into one imputed variable and then bringing it up into the analysis, but you still have the underlying structure to

00:41:31,400 --> 00:41:37,333
worry about. So, how does that really help you? JOAN BAILEY-WILSON: Do you mean the

00:41:37,333 --> 00:41:40,033
machines or…? ROBERT STAR: Any of these methods because

00:41:40,033 --> 00:41:49,899
you’re still starting with…let’s say you have 100 SNPs per gene or variants per gene, you still have that. Are you fooling yourself? Are you

00:41:49,900 --> 00:41:52,433
deluding yourself? JOAN BAILEY-WILSON: You mean the collapsing

00:41:52,433 --> 00:41:54,699
of rare variants? ROBERT STAR: Yes.

00:41:54,700 --> 00:42:01,966
JOAN BAILEY-WILSON: Well, this is what Suzanne was saying. If you collapse wrong, you lose power, so that’s why there are so many

00:42:01,966 --> 00:42:11,032
different strategies for collapsing out there. As she showed in her simulations, some work better than others and it depends on what biological

00:42:11,033 --> 00:42:27,733
truth is. Look, if biological truth is that there is a single gene, like BRCA1, and even better than BRCA1, let’s talk sickle-cell anemia. There is a

00:42:27,733 --> 00:42:40,099
single gene and there is one mutation in that gene that causes that specific phenotype. Okay. Under that biological truth almost any method we use is

00:42:40,100 --> 00:42:53,700
going to work to find that gene, but as she simulated multiple different kinds of things—if you assume this and it’s true—you get a bump in

00:42:53,700 --> 00:43:06,866
power, but if you assume something that’s not true, then your power is not as good. That is part of doing any statistical analysis. If your

00:43:06,866 --> 00:43:18,066
assumptions about your statistics are true, great, but if your assumptions are wrong, you almost always lose power and sometimes you also

00:43:18,066 --> 00:43:26,666
inflate your Type 1 errors, which is why we all tend to try multiple things and—as she was saying—try to come up with

00:43:26,666 --> 00:43:36,699
methods that are robust to these errors. FEMALE: Just a question to see whether I

00:43:36,700 --> 00:43:49,100
understand your story correctly, and to see how we can get this closer to biological truth. In these models—these heuristic discovery models—

00:43:49,100 --> 00:44:04,766
permutation is important, am I right? Would it make sense to use clinical intervention data—so a biological experiment—to consider those as a

00:44:04,766 --> 00:44:16,532
permutation and feed intervention data into your model to give, let’s say, a biological validation of what you predict? And is this being done?

00:44:16,533 --> 00:44:22,599
JOAN BAILEY-WILSON: Can you repeat that louder? FEMALE: I said that permutation is being

00:44:22,600 --> 00:44:35,266
done and, if that’s true, whether you could do clinical intervention based on that information.

00:44:35,266 --> 00:44:43,899
JOAN BAILEY-WILSON: Oh, that would be cool! You know, one of my post-docs is a pharmacogenomicist who wanted to also learn

00:44:43,900 --> 00:44:50,133
this, and one of the things she wants to know is: if you do clinical intervention, will different people respond differently, and will that help us identify

00:44:50,133 --> 00:45:01,599
causal genes as well as response to treatment genes? So I think absolutely, those are all things that can be helpful, and certainly in some of the

00:45:01,600 --> 00:45:14,466
cancers, response to treatment has been one of the things that people have been using to try and get at what are important mutations somatically

00:45:14,466 --> 00:45:18,532
and perhaps back to the germline as well, so, absolutely.

00:45:18,533 --> 00:45:22,366
FEMALE: I’m delighted to hear that this makes sense. Thanks.

00:45:22,366 --> 00:45:26,932
JOAN BAILEY-WILSON: Sorry, this hearing aid went out.

Date Last Updated: 9/18/2012
