Whole Genome Approaches to Complex Kidney Disease
February 11-12, 2012 Conference Videos

Lessons From Genetic Analysis Workshop 17: Aggregation, Machine Learning, and Data Mining Approaches
Joan Bailey-Wilson, NHGRI

Video Transcript

1
00:00:02,100 --> 00:00:10,333
ANDREY SHAW: Our next speaker is Joan Bailey-Wilson. She is the head of the Statistical Genetics Section in the Inherited Disease Branch

2
00:00:10,333 --> 00:00:19,966
of the NHGRI. She’s a statistical geneticist interested in the genetics of complex disease and she is going to tell us about the lessons from

3
00:00:19,966 --> 00:00:28,766
Genetic Analysis Workshop 17. JOAN BAILEY-WILSON: I’m actually only going to

4
00:00:28,766 --> 00:00:38,832
tell you about a few of those lessons because there were a lot of them, but I am going to talk about several that I think are important and that

5
00:00:38,833 --> 00:00:53,833
really flow very well from Steve’s and Suzanne’s talks. So, I am going to cover some issues about aggregation; multiple different kinds of

6
00:00:53,833 --> 00:01:02,333
aggregation. A brief overview of what machine learning is, for those of you who may not know much about it, and how machine learning

7
00:01:02,333 --> 00:01:13,766
and data mining approaches might be useful. And we’re going to talk about linkage analysis as a type of aggregation that can be very powerful for helping

8
00:01:13,766 --> 00:01:23,899
to interpret the results of sequence data. So, one of the things you’ve been hearing from all of us is that complex trait genetics is probably

9
00:01:23,900 --> 00:01:34,966
really pretty complex, so I am not going to spend a lot of time on this, but I study lung cancer and this really hits what we have been saying: if I

10
00:01:34,966 --> 00:01:44,966
just filter variants out because they occur in a database, then for something like lung cancer there are probably tons of non-smokers in those databases

11
00:01:44,966 --> 00:01:54,966
who might have risk variants for lung cancer, but because the environmental effect is so huge, they’re not going to be penetrant. So, we always

12
00:01:54,966 --> 00:02:06,699
have to not ignore the environment and, therefore, non-penetrance. So, we know that there are genes with rare alleles of large effect.

13
00:02:06,700 --> 00:02:16,066
We’ve known about them for a long time in cancer and we also, as I said, know that major environmental risk factors can be common, but

14
00:02:16,066 --> 00:02:26,499
major genetic risk alleles for serious diseases tend to be rare in the population and we think that’s due to selection over time against these

15
00:02:26,500 --> 00:02:41,366
things. BRCA1, a major risk locus for breast cancer, has over a thousand different individually really, really rare high penetrance risk alleles. So,

16
00:02:41,366 --> 00:02:53,999
it’s sort of the model for what we are all starting to go after these days, but then there are also really common risk factors that increase risk as

17
00:02:54,000 --> 00:03:07,900
well. You all have probably seen this many times from the GWAS story, but it is really important for sequencing as well. Remember, it is easy to find

18
00:03:07,900 --> 00:03:22,500
the really high-penetrance things with family studies, but not so easy to find rare, lower-penetrance things in family studies. Common

19
00:03:22,500 --> 00:03:33,033
variants that have low effect sizes are what we’ve tended to find in GWAS studies, because that’s what they’re powered to find. And for common

20
00:03:33,033 --> 00:03:45,333
diseases, common risk alleles of large effect, so far, have seemed to be pretty rare across diseases—we don’t tend to see a lot of

21
00:03:45,333 --> 00:03:53,533
those—which is why GWAS haven’t found them. What we are hoping to find is some of the things we haven’t been able to find to date with our

22
00:03:53,533 --> 00:04:02,866
existing tools. Now, linkage is something that, when Steve was giving his talk, Suzanne and I were both going, “Yeah, yeah, linkage and

23
00:04:02,866 --> 00:04:14,832
pedigrees,” and that’s because linkage is looking at co-segregation of the disease or trait with a genetic variant within a family. Now, if you have a

24
00:04:14,833 --> 00:04:27,266
really rare variant that is segregating through a family, if it has a big effect on the disease or trait, then you get familial aggregation; exactly the kind

25
00:04:27,266 --> 00:04:38,032
of pedigrees that Steve was talking about. If you can collect them you may have good power to detect very rare, important causal variants.

26
00:04:38,033 --> 00:04:47,533
Association, on the other hand, is looking to see whether a specific genetic variant is more common, say, among cases than among controls,

27
00:04:47,533 --> 00:04:59,299
or among people with extremes of a phenotypic distribution; there are lots of methods available to test both of those things. A really important thing to

28
00:04:59,300 --> 00:05:10,066
just get in your minds is that power to detect any sequence variant depends on the size of the effect, at what level you’re measuring the effect,

29
00:05:10,066 --> 00:05:21,099
and its frequency in the sample that you’re studying. So, the variant for familial hypercholesterolemia, which has been known

30
00:05:21,100 --> 00:05:34,400
forever, because I’m old, has a very large effect on the trait value of individuals who carry two copies of this variant. It has a moderate

31
00:05:34,400 --> 00:05:45,100
effect in relatives of those people with two copies who generally have one copy, but it has virtually no effect in the population if you start

32
00:05:45,100 --> 00:06:00,200
talking about population heritability of that variant. It’s tiny, not because the variant does not have a big effect on the trait, but because it is so rare.
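
To make the frequency point concrete, here is the textbook additive single-locus model (a standard formalization, not something from the slides). With risk-allele frequency p and an additive allelic effect a on the trait, the locus contributes variance 2p(1-p)a², so its heritability collapses as p shrinks, no matter how large a is:

```latex
% Standard additive single-locus model (textbook, not from the talk):
% p = risk-allele frequency, a = allelic effect, \sigma^2_P = trait variance.
\sigma^2_{\mathrm{locus}} = 2p(1-p)a^2, \qquad
h^2_{\mathrm{locus}} = \frac{2p(1-p)a^2}{\sigma^2_P}
% Example: a huge effect of a = 1 SD at a rare allele, p = 0.0005, gives
% h^2_{\mathrm{locus}} = 2(0.0005)(0.9995)(1)^2 \approx 0.001.
```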

33
00:06:00,200 --> 00:06:14,600
So, for common diseases you can have very rare alleles that have big effects. So, breast cancer is a perfect model for that. We detected

34
00:06:14,600 --> 00:06:25,933
BRCA1 and BRCA2 with linkage. Once we had more than 30 markers—yeah, I’m really, really old—it was found

35
00:06:25,933 --> 00:06:36,866
quickly and all of these mutations have been found, but they’re very rare and they don’t account for all of the population risk. So, other

36
00:06:36,866 --> 00:06:46,799
loci, then, with major alleles have also been detected, but all of them together were not enough to explain all the risk, so GWASs came along. We’ve

37
00:06:46,800 --> 00:06:58,000
identified a lot more loci where there are variants with small effects but they’re more common, and as I said, these common risk alleles tend to have

38
00:06:58,000 --> 00:07:12,100
small effects. So, back to this study. We’ve done a pretty good job at getting those Mendelian-type disease alleles. We’ve done a pretty good job at

39
00:07:12,100 --> 00:07:20,000
getting the common alleles of small effect out of the GWASs, and now we’d like to try and look at some of the other things. So, the Genetic

40
00:07:20,000 --> 00:07:31,466
Analysis Workshop has actually been held 17 times, every 2 years, so it’s been going on for a long time, and what statistical geneticists like to

41
00:07:31,466 --> 00:07:40,766
do with it is test and compare methods. Everybody is developing new methods all the time. Suzanne gave you a list of some of the

42
00:07:40,766 --> 00:07:50,566
ones that are kind of the most popular right now and very interesting ones, and one of the things we like to do is test them on the same data, so

43
00:07:50,566 --> 00:08:01,099
that’s what this is. This workshop focused on study designs and analysis methods for whole exome sequence data and it was a partial

44
00:08:01,100 --> 00:08:14,300
exome—it wasn’t a full exome—and it was DNA sequence from the 1000 Genomes Project, which was used as the genomes for 697 people,

45
00:08:14,300 --> 00:08:25,133
and then those real genomes…they simulated phenotypes from the genotypes. So there were, of course, many individually very rare variants as

46
00:08:25,133 --> 00:08:36,699
well as many common variants. So these simulated traits, based on the sequence data, were extremely complex. They wanted to make

47
00:08:36,700 --> 00:08:48,233
this really hard to see how well methods could do. So, there were a few rare variants that had moderate to small effects on an individual’s risk of

48
00:08:48,233 --> 00:08:58,666
disease or on a quantitative trait; there were both. There were other rare variants that had very small effects on the traits and there were

49
00:08:58,666 --> 00:09:09,799
many, many—in fact most rare variants—had no effect. And then, there were some common sequence variants that had small to very small effects on an

50
00:09:09,800 --> 00:09:19,300
individual’s risk of the disease or on the quantitative traits, but again, most had no effect; sort of what we think common diseases might be

51
00:09:19,300 --> 00:09:33,300
like. So, there were the 697 unrelated individuals who had their phenotypes simulated and then they also used the same genotype data to

52
00:09:33,300 --> 00:09:39,966
simulate 8 extended pedigrees. They wanted to have just the same number of people, so what they did is they randomly sampled the founders

53
00:09:39,966 --> 00:09:53,566
of the pedigrees and then they simulated meioses using Mendelian rules and recombination, etc. to take the genotypes down the pedigrees across

54
00:09:53,566 --> 00:10:05,199
the genome, and then they used the exact same simulation model to produce the phenotypes. So, a variant in the unrelated sample had the exact

55
00:10:05,200 --> 00:10:21,133
same effect size for an individual as it did in the family sample, so it would behave the same way biologically.
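
The gene-dropping scheme just described can be sketched in a few lines. This is a minimal illustration under my own simplifying assumptions (a single biallelic locus, no recombination), not the GAW17 simulation code:

```python
import random

def drop_genotypes(founder_genotypes, pedigree):
    """Gene dropping at one biallelic locus: founders keep their sampled
    genotypes, and every meiosis transmits one randomly chosen allele
    from each parent (Mendelian rules; recombination omitted here)."""
    geno = dict(founder_genotypes)
    for child, father, mother in pedigree:  # parents listed before children
        geno[child] = (random.choice(geno[father]),
                       random.choice(geno[mother]))
    return geno

# Toy pedigree: founder f1 carries one copy of a rare allele (coded 1).
founders = {"f1": (0, 1), "m1": (0, 0)}
children = [("c1", "f1", "m1"), ("c2", "f1", "m1"), ("c3", "f1", "m1")]
print(drop_genotypes(founders, children))
```

A founder who carries a rare allele transmits it to about half of his or her offspring, which is exactly the within-pedigree enrichment that comes up later in the talk.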

56
00:10:21,133 --> 00:10:31,299
However, one of the problems with this was that the locus-specific heritabilities for the quantitative traits, or disease risks for the qualitative traits, were really, really small, and this is the kind

57
00:10:31,300 --> 00:10:45,066
of thing Suzanne was talking about, where you have rare variants that are not high penetrance; they’re pretty low penetrance. So, that meant that

58
00:10:45,066 --> 00:10:55,599
the locus-specific heritabilities for the causal variants were quite small, and there were only two causal genes in the unrelated individuals that

59
00:10:55,600 --> 00:11:08,766
actually had locus-specific heritabilities over 0.01, so they were really tiny, and this was not because they didn’t have a detectable effect in

60
00:11:08,766 --> 00:11:22,732
an individual, it was because they were so rare. So, the population amount of the heritability explained by any of these rare variants was tiny.

61
00:11:22,733 --> 00:11:33,033
So of course, just about everybody that analyzed the data used some sort of a collapsing scheme of rare variants. Some people collapsed within

62
00:11:33,033 --> 00:11:41,499
genes, some people collapsed within LD blocks. There were lots of different collapsing schemes, but pretty much everybody did collapsing to have any power.
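
As a sketch of what a generic collapsing (burden-style) analysis looks like, assuming a 0/1/2 genotype matrix for one gene or LD block and a case/control phenotype; this illustrates the general idea, not any particular GAW17 contribution:

```python
import numpy as np
import statsmodels.api as sm

def burden_test(genotypes, phenotype, maf_threshold=0.01):
    """Collapse the rare variants in one gene or LD block into a single
    carrier indicator and test it with logistic regression.
    genotypes: (people x variants) array coded 0/1/2; phenotype: 0/1."""
    maf = genotypes.mean(axis=0) / 2.0
    maf = np.minimum(maf, 1.0 - maf)               # fold to the minor allele
    rare = genotypes[:, maf < maf_threshold]       # keep only rare variants
    burden = (rare > 0).any(axis=1).astype(float)  # carries any rare allele?
    fit = sm.Logit(phenotype, sm.add_constant(burden)).fit(disp=0)
    return fit.pvalues[1]                          # p-value for the burden term
```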

63
00:11:41,500 --> 00:11:56,033
However, almost all of the association methods had incredibly low power. Some of the methods had moderate power to detect these

64
00:11:56,033 --> 00:12:12,099
two loci that had multiple variants with moderate effect sizes, and the very standard methods like linear regression or logistic regression, and the

65
00:12:12,100 --> 00:12:24,533
whole host of different machine learning methods that were used, all had very similar power; meaning, pretty poor. So of course, that told us

66
00:12:24,533 --> 00:12:38,966
what we already expected, that if we were looking for risk alleles that are very rare and don’t have huge population-specific heritabilities, we

67
00:12:38,966 --> 00:12:50,366
were going to need gigantic samples. I think Andrey showed you that 6,000-and-some people are needed if it’s only moderately rare with a sort of

68
00:12:50,366 --> 00:13:03,066
medium-high risk. So, it told us huge samples were going to be needed to have good power. Another thing that we expected to see and we

69
00:13:03,066 --> 00:13:13,632
did observe, is that if you enrich your subsamples…if you took subsamples of the people with extreme values—Steve’s study

70
00:13:13,633 --> 00:13:25,533
design—then you could, in fact, get better power in smaller samples. This is classic genetics, which is why he did it, and it was

71
00:13:25,533 --> 00:13:35,133
interesting at least to see that you did increase your power to detect causal variants.
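
Here is a toy simulation of that extreme-sampling idea, with made-up numbers of my own choosing: a rare variant that shifts a quantitative trait becomes many times more frequent in the upper tail of the distribution than in a random sample of the same size.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 100_000
carrier = rng.random(n) < 0.005             # rare variant, frequency 0.5%
trait = rng.normal(size=n) + 1.5 * carrier  # carriers shifted up 1.5 SD

extreme = np.argsort(trait)[-1000:]                 # top 1% of the trait
random_sample = rng.choice(n, 1000, replace=False)  # same-size random draw
print(carrier[extreme].mean())        # roughly 0.1: about 20-fold enrichment
print(carrier[random_sample].mean())  # roughly 0.005, the population rate
```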

72
00:13:35,133 --> 00:13:45,033
One of the things that was very worrying, though, was that the false positive rates were inflated even when you collapsed the rare variants, thereby cutting down your number of tests, and one of the things that

73
00:13:45,033 --> 00:13:56,533
was very clear is that the issue of ignoring correlations between your variables was huge. There were, of course, the correlations due

74
00:13:56,533 --> 00:14:09,233
to linkage disequilibrium of close SNPs but there also were correlations across chromosomes where you would have a variant that was a

75
00:14:09,233 --> 00:14:22,133
singleton in the data that was then perfectly correlated with every other singleton variant in that individual in your sample and this is the kind

76
00:14:22,133 --> 00:14:35,066
of thing that one needs to look at and correct for: is this variant that’s giving me… is this gene that’s giving me a signal highly correlated with other genes elsewhere in the genome?
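
To see why singletons cause trouble, consider this toy example of mine: two variants on different chromosomes, each observed exactly once and in the same person, have genotype columns that are perfectly correlated.

```python
import numpy as np

# Ten people; variant A (say, chromosome 1) and variant B (say,
# chromosome 22) are each a singleton carried by person 0.
variant_a = np.zeros(10); variant_a[0] = 1
variant_b = np.zeros(10); variant_b[0] = 1
print(np.corrcoef(variant_a, variant_b)[0, 1])  # 1.0, perfect correlation
```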

77
00:14:35,066 --> 00:14:44,499
These are interesting things that people are trying to look at, and these kinds of correlations of your

78
00:14:44,500 --> 00:14:55,300
predictors can be problematic, so it’s sort of another thing to think about in your QC. Some methods did a little better than others at

79
00:14:55,300 --> 00:15:08,933
controlling those false positive rates in the unrelated individuals. For extended pedigrees the families now do not have all of the sequence

80
00:15:08,933 --> 00:15:19,799
variants that were in the unrelated individuals. They are a sample from that population of variants because only some of the unrelated

81
00:15:19,800 --> 00:15:32,966
individuals in the unrelated set became founders. So, what you saw was some of the causal rare variants that were in the unrelated set don’t

82
00:15:32,966 --> 00:15:42,299
appear in the families at all and this again would be typical. If you are taking aggregated families out of the population you will only be able to

83
00:15:42,300 --> 00:15:55,166
discover the causal variants that actually occur in those founders. However, by doing this, what we found, and again what we would expect,

84
00:15:55,166 --> 00:16:07,132
was that you saw enrichment in frequency of some of the really rare causal variants because they occurred in a founder and then segregated

85
00:16:07,133 --> 00:16:17,199
down through the family, so suddenly, instead of having one person in the unrelated sample with that variant, you had multiple people in a family

86
00:16:17,200 --> 00:16:31,200
with it and other people without it, so you had enough power to detect the effect of that variant on individuals’ traits. So, there were several

87
00:16:31,200 --> 00:16:45,033
genes, then, that could be detected as causal where there was no power to detect them in the unrelated samples. The locus-specific heritability

88
00:16:45,033 --> 00:16:56,233
is increased in the family sample for those variants, not because their actual effect on the trait has changed, but only because their

89
00:16:56,233 --> 00:17:07,633
frequency in the sample being analyzed has changed. None of this is new. A lot of this is called “Old Lessons Learned Anew,” and those

90
00:17:07,633 --> 00:17:17,099
of us that are sort of old curmudgeons were going, “We knew this in 1970.” Well, I didn’t because I was still in high school, but my mentor, Robert

91
00:17:17,100 --> 00:17:30,566
Elston, knew these things in 1970. He knew them earlier than that, probably. So, locus-specific heritabilities increased in the family samples

92
00:17:30,566 --> 00:17:41,266
because of the change in allele frequency. That meant that you had very good power if you used family-based association analysis to detect those

93
00:17:41,266 --> 00:17:51,566
causal variants. In fact, in the family-based association analysis, for two of the methods they were detected in 100% of replicates; that’s not quite

94
00:17:51,566 --> 00:18:06,899
a power estimate, but it’s showing you there is very good power. The other thing that was striking was that if you used linkage analysis in your extended pedigrees you

95
00:18:06,900 --> 00:18:18,600
had absolutely genome-wide significant linkage evidence for these rare variants just doing old, old, old-fashioned two-point linkage analysis; not

96
00:18:18,600 --> 00:18:34,900
even multipoint, just two-point, which is really old. So again, extended pedigrees, as Steve pointed out, were incredibly powerful to

97
00:18:34,900 --> 00:18:48,866
determine which were the true causal variants, and the Type 1 error of these methods has been shown over the years to be well, well controlled.
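
For reference, the two-point LOD score in question is the classic statistic (the standard textbook definition, not anything specific to GAW17): the log10 odds of the pedigree data at recombination fraction theta versus free recombination, with the traditional significance threshold at about 3.

```latex
% Two-point LOD score between a marker and a putative trait locus:
\mathrm{LOD}(\theta) =
  \log_{10} \frac{L(\text{pedigree data} \mid \theta)}
                 {L(\text{pedigree data} \mid \theta = \tfrac{1}{2})}
% \theta is the recombination fraction; LOD >= 3 is the classic threshold.
```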

98
00:18:48,866 --> 00:19:01,932
So, we know that these methods control Type 1 error very well and they are very powerful when you have extremely rare variants. Note though,

99
00:19:01,933 --> 00:19:11,399
remember I told you, you won’t find all of the causal variants this way; you’ll only find the ones that are actually segregating in those pedigrees,

100
00:19:11,400 --> 00:19:24,833
so you will, by definition, miss stuff, but you miss stuff in the unrelateds as well. So basically, linkage is very powerful to detect these high

101
00:19:24,833 --> 00:19:37,499
penetrance risk alleles in families. These alleles tend to be rare in the population, but if you ascertain pedigrees correctly, such as ascertaining loaded

102
00:19:37,500 --> 00:19:48,400
families for disease or ascertaining data for quantitative traits where there are at least some people in the family who have extreme values of

103
00:19:48,400 --> 00:20:01,566
the quantitative traits, then this can be a very powerful way to detect these extremely rare variants that have moderate effect sizes.

104
00:20:01,566 --> 00:20:13,799
Association methods, on the other hand, we know are most powerful for common variants, and we’ve all agreed that most common

105
00:20:13,800 --> 00:20:27,900
variants have small effects. But even there, there are ways that you can increase your power, and that is you can ascertain extreme families,

106
00:20:27,900 --> 00:20:41,266
extreme families with multiple affecteds or families with high and low values of the trait, and use family-based methods, or you can ascertain

107
00:20:41,266 --> 00:20:57,299
cases who have a family history of a disease. You can take cases from pedigrees that have extreme high values and low values of the trait;

108
00:20:57,300 --> 00:21:11,666
things like that. So, there are ways by using ascertainment that you can increase your power for detection. What we’re hoping then, is that by

109
00:21:11,666 --> 00:21:21,366
doing ascertainment—this is kind of that typical power slide you’ve seen forever—and what we’re hoping by doing ascertainment is that…I

110
00:21:21,366 --> 00:21:37,499
don’t have a pointer, do I? Oh, here it is. We are hoping by doing ascertainment, we’re hoping to take the yellow thing and move it over, because

111
00:21:37,500 --> 00:21:48,700
by ascertaining pedigrees you’re hoping to enrich your sample for those very rare variants, and for the common variants in GWASs, by ascertaining,

112
00:21:48,700 --> 00:21:57,933
say, on family history, we’re hoping to increase the frequency of those things that are in-between the little blue circle and the GWAS

113
00:21:57,933 --> 00:22:12,366
yellow circle in your sample. So, it’s basically an oversampling strategy. So, the other thing that we have to worry about is I really don’t want you all

114
00:22:12,366 --> 00:22:23,899
to start thinking that rare variants are going to solve everything, because some of the missing heritability may not be due to the fact that we just

115
00:22:23,900 --> 00:22:35,933
haven’t found the rare variants. Complex traits are complex and we expect that you need to include environmental risk of course, but we

116
00:22:35,933 --> 00:22:44,699
expect that there are non-linear effects; that there are gene-by-gene and maybe gene-by-gene-by-gene, pathway-type

117
00:22:44,700 --> 00:22:57,333
interactions, etc., and most analyses ignore these because it’s very difficult to deal with them. So, this brings me to the next bit of my talk, which is:

118
00:22:57,333 --> 00:23:07,599
how do we do some of this aggregation of our overall look at the genome? Statistical learning machines are one approach. These are very

119
00:23:07,600 --> 00:23:17,166
computer-intensive methods. They were, in fact, developed in computer science, not in statistics really, and they are designed to produce

120
00:23:17,166 --> 00:23:30,432
classifications—affected/unaffected—or predictions of a quantitative trait from extremely large numbers of potential predictors; what we in

121
00:23:30,433 --> 00:23:40,633
statistics call our independent variables and what they in machine learning tend to call features. They use samples where, for each individual,

122
00:23:40,633 --> 00:23:49,966
you know the outcome (case/control status, the value of a quantitative trait) and you’ve got all these predictors. So far it’s the same as all of our other analyses.

123
00:23:49,966 --> 00:23:58,466
What the computer algorithms do is they use the predictors or the features to create models to best predict the observed outcomes. Again, just

124
00:23:58,466 --> 00:24:13,899
like regular statistics. So, in fact, linear and logistic regression are considered by a lot of people as very simple types of machines. Those

125
00:24:13,900 --> 00:24:22,666
machines, though, are what we call parametric and model-based. They use the data to estimate coefficients of the predictors—that’s

126
00:24:22,666 --> 00:24:32,899
the betas in a regression. The betas are your parameters, and these machines also make assumptions, generally that the

127
00:24:32,900 --> 00:24:44,633
data follow a probability model: logistically distributed, normally distributed, etc. Most methods that people currently think of as learning

128
00:24:44,633 --> 00:24:56,766
machines are non-linear and non-parametric: they don’t estimate coefficients of the predictors, they don’t assume the data follow a specific

129
00:24:56,766 --> 00:25:06,599
distribution. There are many, many different approaches to machines, and each one performs better or worse in different situations,

130
00:25:06,600 --> 00:25:15,733
and there is no single optimal machine. In fact it has been proven mathematically that there cannot be an optimal machine. That math is above me; I

131
00:25:15,733 --> 00:25:30,766
take my friend Jim O’Malley’s word that the proof is valid. In the Genetic Epidemiology supplement that came out of the Genetic

132
00:25:30,766 --> 00:25:39,566
Analysis Workshop, we have a brief review of some of these methods, if any of you want to learn more about them. But here is just a little

133
00:25:39,566 --> 00:25:56,432
example of how this works. So, say a person is at risk for heart disease if any one of these three conditions is true—either their LDL

134
00:25:56,433 --> 00:26:08,533
is greater than 240 mg/dL, or their HDL is less than 35 mg/dL, or the ratio of LDL to HDL is greater than 4. This is simulated data; a

135
00:26:08,533 --> 00:26:20,866
made-up model. So, if that’s true then those dotted lines are those diagnosis lines, and so the people—those light circles that are inside those

136
00:26:20,866 --> 00:26:34,699
dotted lines—they are the unaffecteds and the people outside are the people who are at risk for cardiovascular disease—for heart disease. This

137
00:26:34,700 --> 00:26:47,633
can be modeled by that very simple decision tree on the right, which is just these three conditions.
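
Those three conditions translate directly into code; a minimal sketch using the thresholds quoted above, which come from the simulated, made-up model rather than from any clinical guideline:

```python
def at_risk(ldl, hdl):
    """The three-rule 'tree' from the simulated example (made-up model)."""
    return (ldl > 240          # LDL above 240 mg/dL
            or hdl < 35        # HDL below 35 mg/dL
            or ldl / hdl > 4)  # LDL:HDL ratio above 4

print(at_risk(ldl=250, hdl=50))  # True: the first condition fires
print(at_risk(ldl=120, hdl=60))  # False: inside the 'unaffected' region
```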

138
00:26:47,633 --> 00:26:57,333
That’s pretty much the simplest example of a tree that I could come up with, and this is from Jim O’Malley’s really nice introductory book on this that just came out this past year. So, if you’re

139
00:26:57,333 --> 00:27:09,599
looking for a good introduction to machines, this is really useful for explaining and translating from computer science to statistics. But here’s an

140
00:27:09,600 --> 00:27:22,933
example of a more complex thing. The red are cases, the greens are controls. This would be really, really hard to model linearly. You can model

141
00:27:22,933 --> 00:27:36,799
this with this decision tree, which just makes simple yes/no decisions (is this predictor above a value or below it?), and it classifies each time it splits the

142
00:27:36,800 --> 00:27:50,000
data, and it’s trying to get, in the end, a division. You can sort of see the little red lines where most of the cases are on one side of those division

143
00:27:50,000 --> 00:28:01,066
lines and most of the controls are on the other side. It’s very much a heuristic process; it makes no assumptions about how those data should be

144
00:28:01,066 --> 00:28:14,199
distributed. Now remember, I said there is no one machine that works perfectly every time, so there are ensembles of machines where you can take

145
00:28:14,200 --> 00:28:22,300
multiple decision trees, multiple methods, put them together with majority votes, and that often gives better results. Also you can use bootstrapping

146
00:28:22,300 --> 00:28:33,300
and permutation to get better prediction for a new data set and you can also use it to evaluate the importance of each of the predictors that are

147
00:28:33,300 --> 00:28:40,133
included in any of these trees, because of course that’s what somebody like me cares about. I want to know which of these predictors

148
00:28:40,133 --> 00:28:50,299
is really driving my tree, so I can go take it back to the lab, and figure out: what does that predictor actually do? What does that rare genetic variant

149
00:28:50,300 --> 00:29:06,700
actually do biologically? So, these are interesting methods, and a lot of people are working on trying to apply them to genetic data.
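
A minimal sketch of such an ensemble, using scikit-learn’s random forest on made-up SNP data (my illustration, not one of the workshop analyses): hundreds of bootstrapped trees vote on the classification, and the fitted ensemble scores the importance of every predictor.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)
X = rng.binomial(2, 0.3, size=(500, 20)).astype(float)  # 20 toy SNPs, 500 people
y = (X[:, 3] + rng.normal(scale=0.5, size=500) > 1).astype(int)  # SNP 3 is causal

forest = RandomForestClassifier(n_estimators=500, random_state=0).fit(X, y)
ranking = np.argsort(forest.feature_importances_)[::-1]
print(ranking[:3])  # SNP 3 should head the importance ranking
```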

150
00:29:06,700 --> 00:29:16,533
Another method that I am going to talk about—because my husband is developing it here at NHGRI, so we are interested in it—is called Tiled Regression, and it was the

151
00:29:16,533 --> 00:29:27,666
one I was telling you actually did a better job than most at controlling Type 1 error in the unrelated individuals, and it’s a method to look at

152
00:29:27,666 --> 00:29:41,099
aggregating variants across the genome, dealing with the fact that there are extreme correlations among your predictors. So you pick predefined

153
00:29:41,100 --> 00:29:52,766
regions, and those can be predefined any way you want. We tend to like hot-spot blocks based on linkage disequilibrium. So, the tiles are

154
00:29:52,766 --> 00:30:05,766
the regions within a haplotype LD block, and you collapse the rare sequence variants within the blocks, but you can use any predefined region you want; it

155
00:30:05,766 --> 00:30:15,966
can be genes, it can be whatever. For each tile then, you use multiple regression to say: are any of the variants in that region showing any

156
00:30:15,966 --> 00:30:29,732
association at all with the trait? If so, you keep the tile. If not, you toss all of those SNPs. You do this across all of your tiles and in the tiles that

157
00:30:29,733 --> 00:30:40,899
you have kept, you use stepwise regression to select important independent variables within the tile and bring them up to the next level, where you

158
00:30:40,900 --> 00:30:55,566
start merging tiles together. So, then they are tested in stepwise higher-order regressions to come up with chromosome-wide SNPs that have

159
00:30:55,566 --> 00:31:06,966
gotten rid of the correlations between them and still have some signal. Eventually, you get to all of the variants across the chromosome and the

160
00:31:06,966 --> 00:31:16,599
genome level, and then you use a lot of permutations to try and control your Type 1 errors. A lot of research is still going on on this method; the software is available, and the Genetic Analysis Workshop paper about it is there.
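
Here is a rough sketch of that screening-and-stepwise flow under my own simplifying assumptions (one level of merging, an overall F-test for screening, greedy forward selection); the actual Tiled Regression software certainly differs in its details.

```python
import numpy as np
import statsmodels.api as sm

def screen_tiles(y, tiles, alpha=0.05):
    """Step 1 (sketch): regress the trait on all variants in each tile and
    keep the tiles whose overall regression F-test shows any association."""
    return [X for X in tiles
            if sm.OLS(y, sm.add_constant(X)).fit().f_pvalue < alpha]

def forward_select(y, X, enter=0.01):
    """Step 2 (sketch): greedy forward selection over the pooled variants
    from the kept tiles, standing in for the stepwise regressions."""
    chosen, remaining = [], list(range(X.shape[1]))
    while remaining:
        pvals = {j: sm.OLS(y, sm.add_constant(X[:, chosen + [j]]))
                       .fit().pvalues[-1]
                 for j in remaining}
        best = min(pvals, key=pvals.get)
        if pvals[best] > enter:
            break
        chosen.append(best)
        remaining.remove(best)
    return chosen

# The selected variants would then move up to chromosome- and genome-level
# regressions, with permutations of y used to calibrate the Type 1 error.
```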

161
00:31:16,600 --> 00:31:31,900
So, I agree with Steve, with Suzanne, and with Andrey. I

162
00:31:31,900 --> 00:31:47,166
think that in the future we are going to be using a lot of next-gen sequencing to examine a much broader spectrum of possible variants, not

163
00:31:47,166 --> 00:31:57,932
just non-synonymous changes in exons. I think we are going to need it in large numbers of people to really have power, and I don’t want you

164
00:31:57,933 --> 00:32:06,966
all to forget about gene-by-gene (GxG) and gene-by-environment (GxE) interactions, because I do believe they are going to be quite important, and what a lot of folks are working on

165
00:32:06,966 --> 00:32:30,999
is building networks and pathways, to try and better understand what’s going on. So, I’ll take questions. I at least didn’t put any equations in.

166
00:32:31,000 --> 00:32:33,166
Yeah? MALE: Thanks for the talk. I was wondering if

167
00:32:33,166 --> 00:32:39,066
copy number variants were taken into account in any of these analyses and how that works and is accounted for?

168
00:32:39,066 --> 00:32:50,966
JOAN BAILEY-WILSON: Some people did look a little bit at CNVs but the data we had were not optimal for looking at CNVs and I don’t think there

169
00:32:50,966 --> 00:33:00,499
were any new lessons about CNVs that we didn’t already have from some of the other platforms. I may be missing something. There

170
00:33:00,500 --> 00:33:10,466
were many hundreds of papers at this workshop. It was the most well-attended workshop we’ve ever had; it was huge. So, I

171
00:33:10,466 --> 00:33:17,232
may be missing someone’s paper but I don’t recall any big lessons learned about CNVs.

172
00:33:17,233 --> 00:33:24,233
MALE: And this is just a follow-up in general. We’ve talked a lot about sequence variants and the future being whole exome and whole genome

173
00:33:24,233 --> 00:33:28,733
and complex traits. I know that CNVs have been shown to have an impact, especially in developmental disorders—brain, congenital

174
00:33:28,733 --> 00:33:30,766
heart—and I’m wondering about the future impact of CNVs and these kinds of diseases.

175
00:33:30,766 --> 00:33:40,166
JOAN BAILEY-WILSON: Absolutely; that was on my later slides. Let’s not forget all these other things, because in a lot of the traits I study I’m not

176
00:33:40,166 --> 00:33:49,932
convinced it’s going to be in the exons. I mean, I’m hoping, but I’m not convinced it will be. I think some of it may be in the regulatory regions. All of

177
00:33:49,933 --> 00:33:57,799
these other things that we know are so interesting and that’s why I think most of us do not want to throw away any of these variants.

178
00:33:57,800 --> 00:34:05,033
We really want to look at everything. Yeah? MALE: Just to comment about the CNVs. One of

179
00:34:05,033 --> 00:34:15,099
the things about CNVs is that most of the work that’s been done with CNVs previously has focused on CNVs that are tagged by SNPs on

180
00:34:15,100 --> 00:34:26,533
genotyping arrays, and typically CNVs represent sort of not-well-behaved regions of the genome because they are structural variants. Yet, the

181
00:34:26,533 --> 00:34:35,599
genotyping arrays have SNPs that are robustly genotyped all the time, so they’re in well-behaved regions. So, you really don’t see a lot of CNVs

182
00:34:35,600 --> 00:34:45,166
from the genotyping arrays that are tagged by SNPs. So, part of 1000 Genomes and these other sequencing projects are now getting into regions

183
00:34:45,166 --> 00:34:58,899
that are not well behaved, and so by sequencing you’re going to pick up the smaller indels, the different types of CNVs that we have never seen

184
00:34:58,900 --> 00:35:10,633
before. I think this is going to be a whole new era in looking at the role of copy number variants with respect to phenotypes in disease.

185
00:35:10,633 --> 00:35:20,966
JOAN BAILEY-WILSON: Absolutely. One of the things that my friends who do a lot of CNV work with SNPs say is that the only CNVs off of SNP

186
00:35:20,966 --> 00:35:33,099
chips that they believe are the really big ones, whereas in some of the sequence data that we have there are really interesting small indels that

187
00:35:33,100 --> 00:35:39,133
we actually believe really are real. So, I absolutely agree with you.

188
00:35:39,133 --> 00:35:51,333
MARTIN POLLAK: Martin Pollak from Beth Israel Deaconess in Boston. Your last slide mentioned interactions between genes and pathways; I’m

189
00:35:51,333 --> 00:35:58,733
curious. In the aggregation methods—and I guess this question is actually relevant to Dr. Leal’s talk as well—are there rational ways to look at

190
00:35:58,733 --> 00:36:09,999
accumulation of rare variants in multiple genes and related pathways? I ask because it seems there’s a complicated multiple hypothesis testing

191
00:36:10,000 --> 00:36:14,700
issue that comes up. JOAN BAILEY-WILSON: It is.

192
00:36:14,700 --> 00:36:23,300
MARTIN POLLAK: Because if we see variants in these five genes which, according to some gene ontology program, are all part of some particular

193
00:36:23,300 --> 00:36:28,166
pathway, how do you deal with the multiple hypothesis issues there and rationally do these analyses?

194
00:36:28,166 --> 00:36:39,199
JOAN BAILEY-WILSON: I agree. Multiple testing is a huge issue. In fact, in one of our GAW papers we did look at collapsing rare variants within

195
00:36:39,200 --> 00:36:51,166
pathways, and in this particular simulation it didn’t make much difference because they hadn’t really simulated things that way, but it’s an intriguing

196
00:36:51,166 --> 00:37:00,266
thing that a lot of people are thinking about: should we collapse only in genes or should we try within a pathway? There was a talk by

197
00:37:00,266 --> 00:37:14,399
someone from the Broad, a guy named Or Zuk—Z-U-K—who was looking at pathway-based association and it was really interesting kinds of

198
00:37:14,400 --> 00:37:24,966
things and it is something we worry about and think about, but once you start these kinds of analyses they really have to be exploratory to be

199
00:37:24,966 --> 00:37:34,799
hypothesis-generating so that then you can test it in an independent set of data. Once you start doing all these other extra things I can only say

200
00:37:34,800 --> 00:37:40,066
it’s got to be hypothesis-generating. MALE: Looking at the applications of the

201
00:37:40,066 --> 00:37:44,332
machine-learning techniques to say micro… JOAN BAILEY-WILSON: I’m sorry, I’m deaf. Could

202
00:37:44,333 --> 00:37:46,699
you talk a little louder into the mic? MALE: When you look at the applications of the

203
00:37:46,700 --> 00:37:57,233
machine-learning techniques to microarray data, you find that they fail notoriously when you look at heterogeneous populations, such

204
00:37:57,233 --> 00:38:05,533
as predicting survival from cancer data or autoimmune diseases. Do we have any intuition why it should work in this kind of data?

205
00:38:05,533 --> 00:38:10,633
JOAN BAILEY-WILSON: Will heterogeneity be as much of a problem, do you mean?

206
00:38:10,633 --> 00:38:14,566
MALE: For instance, yes. JOAN BAILEY-WILSON: Heterogeneity is the

207
00:38:14,566 --> 00:38:26,899
bane of all geneticists’ existence and that is why a lot of us like…if you can find at least some aggregated families, you then are increasing the

208
00:38:26,900 --> 00:38:37,333
frequency of that variant. So you have a better chance to detect it, but you pay the price of: you miss the other variants that weren’t in your family

209
00:38:37,333 --> 00:38:47,766
sample. There is no perfect study design, and this is why I tend to say to my post-docs over and over again, I want to use every tool in my

210
00:38:47,766 --> 00:38:58,266
toolbox. Just because I have a new power drill does not mean I am throwing away my hammer and my saw because they all do different things.

211
00:38:58,266 --> 00:39:05,466
All of these study designs have different strengths and weaknesses and do different things, and heterogeneity is one of the reasons

212
00:39:05,466 --> 00:39:14,699
that I really like family studies as well as case-control studies. I do both.

213
00:39:14,700 --> 00:39:29,466
MALE: You began your talk by talking about a simulation that GAW-17 did, if I recall. Imagine it is like a treasure hunt where the simulator has

214
00:39:29,466 --> 00:39:39,399
chosen a gene; FLT1 appeared later on, so I’m going to guess maybe they said that links to preeclampsia: we’re going to concoct data where

215
00:39:39,400 --> 00:39:47,133
people with particular variants are said to have the phenotype, and then see if it’s found. Is that the way it was working?

216
00:39:47,133 --> 00:39:52,433
JOAN BAILEY-WILSON: That’s exactly what the Genetic Analysis Workshop does, and it’s what Suzanne was talking about with her

217
00:39:52,433 --> 00:40:03,733
simulation program as well. You make up a model, you say this is biological truth, you simulate the data—so you make up your data—to follow that

218
00:40:03,733 --> 00:40:11,199
biological truth and you throw the data out there and say, “Can you guys find it?” And it is a treasure hunt, but it’s really useful kinds of

219
00:40:11,200 --> 00:40:16,366
treasure hunts that we all do to say, “How do our methods perform?”

220
00:40:16,366 --> 00:40:22,899
MALE: And is it generally a single gene that you do or do you have a polygenic model? Polygenic for one disease?

221
00:40:22,900 --> 00:40:33,766
JOAN BAILEY-WILSON: For this there were, like, 50 causal variants per trait; these were incredibly complex traits. Several of them had environmental

222
00:40:33,766 --> 00:40:46,032
covariates as well, but it was really going for the incredibly complex; it’s going to be really hard to find any of these things. So you could see, all

223
00:40:46,033 --> 00:40:55,666
right, yeah, a lot of methods found the easier ones, a few methods found the moderately hard ones, and no methods found the really, really

224
00:40:55,666 --> 00:40:58,099
difficult ones. ROBERT STAR: Rob Star, NIH. That was

225
00:40:58,100 --> 00:41:02,866
beautiful. I have a question which I think is about the underlying hypotheses.

226
00:41:02,866 --> 00:41:10,132
JOAN BAILEY-WILSON: Sorry, this hearing aid just died—the battery died—so you have to talk really loud.

227
00:41:10,133 --> 00:41:21,566
ROBERT STAR: Sorry. So the question is: do these aggregation methods really change the false discovery rate? Because you’re still taking

228
00:41:21,566 --> 00:41:31,399
data and adding 10 things into one imputed variable and then bringing it up into the analysis, but you still have the underlying structure to

229
00:41:31,400 --> 00:41:37,333
worry about. So, how does that really help you? JOAN BAILEY-WILSON: Do you mean the

230
00:41:37,333 --> 00:41:40,033
machines or…? ROBERT STAR: Any of these methods because

231
00:41:40,033 --> 00:41:49,899
you’re still starting with…let’s say you have 100 SNPs per gene or variants per gene, you still have that. Are you fooling yourself? Are you

232
00:41:49,900 --> 00:41:52,433
deluding yourself? JOAN BAILEY-WILSON: You mean the collapsing

233
00:41:52,433 --> 00:41:54,699
of rare variants? ROBERT STAR: Yes.

234
00:41:54,700 --> 00:42:01,966
JOAN BAILEY-WILSON: Well, this is what Suzanne was saying. If you collapse wrong, you lose power, so that’s why there are so many

235
00:42:01,966 --> 00:42:11,032
different strategies for collapsing out there. As she showed in her simulations, some work better than others and it depends on what biological

236
00:42:11,033 --> 00:42:27,733
truth is. Look, if biological truth is that there is a single gene, like BRCA1, and even better than BRCA1, let’s talk sickle-cell anemia. There is a

237
00:42:27,733 --> 00:42:40,099
single gene and there is one mutation in that gene that causes that specific phenotype. Okay. Under that biological truth almost any method we use is

238
00:42:40,100 --> 00:42:53,700
going to work to find that gene, but as she simulated multiple different kinds of things—if you assume this and it’s true—you get a bump in

239
00:42:53,700 --> 00:43:06,866
power, but if you assume something that’s not true, then your power is not as good. It is part of doing statistical analysis at all. If your

240
00:43:06,866 --> 00:43:18,066
assumptions about your statistics are true, great, but if your assumptions are wrong, you almost always lose power and sometimes you also

241
00:43:18,066 --> 00:43:26,666
inflate your Type 1 errors, which is why we all tend to try multiple things and try to come up with—as she was saying—we try to come up with

242
00:43:26,666 --> 00:43:36,699
methods that are robust to the errors. FEMALE: Just a question to see whether I

243
00:43:36,700 --> 00:43:49,100
understand your story correctly, and to see how we can get this closer to biological truth. In these models—these heuristic discovery models—

244
00:43:49,100 --> 00:44:04,766
permutation is important, am I right? Would it make sense to use clinical intervention data—so a biological experiment—to consider those as a

245
00:44:04,766 --> 00:44:16,532
permutation and feed intervention data in your model to give, let’s say, a biological validation of what you predict? And is this being done?

246
00:44:16,533 --> 00:44:22,599
JOAN BAILEY-WILSON: Can you repeat that louder? FEMALE: I said that permutation is being

247
00:44:22,600 --> 00:44:35,266
done and, if that’s true, then whether you could do clinical intervention from that information.

248
00:44:35,266 --> 00:44:43,899
JOAN BAILEY-WILSON: Oh, that would be cool! You know, one of my post-docs is a pharmacogenomicist who wanted to also learn

249
00:44:43,900 --> 00:44:50,133
this and one of the things she wants to know is if you do clinical intervention, will different people respond differently and will that help us identify

250
00:44:50,133 --> 00:45:01,599
causal genes as well as response to treatment genes? So I think absolutely, those are all things that can be helpful, and certainly in some of the

251
00:45:01,600 --> 00:45:14,466
cancers, response to treatment has been one of the things that people have been using to try and get at what are important mutations somatically

252
00:45:14,466 --> 00:45:18,532
and perhaps back to the germline as well, so, absolutely.

253
00:45:18,533 --> 00:45:22,366
FEMALE: I’m delighted to hear that this makes sense. Thanks.

254
00:45:22,366 --> 00:45:26,932
JOAN BAILEY-WILSON: Sorry this hearing aid went out.




Date Last Updated: 9/18/2012
