Whole Genome Approaches to Complex Kidney Disease
February 11-12, 2012 Conference Videos

Rare Variant Analysis: Aggregation Methods
Suzanne Leal, Baylor College of Medicine

Video Transcript

1
00:00:00,700 --> 00:00:07,733
JEFFREY KOPP: So, why don’t we move on to the first speaker of this session, which is Dr. Suzanne Leal, who’s a professor in the

2
00:00:07,733 --> 00:00:14,733
Department of Molecular and Human Genetics at Baylor College of Medicine, and she’s an expert in statistical genetics and genetic

3
00:00:14,733 --> 00:00:27,133
epidemiology and I think she’s going to tell us more about rare variant analysis aggregation methods.

4
00:00:27,133 --> 00:00:39,066
SUZANNE LEAL: All right. So, this is my title that was kind of given to me and I plan to talk about more than just aggregate methods and one thing I

5
00:00:39,066 --> 00:00:46,132
would like to talk about is QC, and I can’t stress enough how important QC is; I could probably give a few hours’ talk on QC. And actually,

6
00:00:46,133 --> 00:00:55,866
when we do analysis of rare variants, I’d like to say up front, what we spend the most time on is QC, and I’ll bring up some of the problems, but

7
00:00:55,866 --> 00:01:09,332
it’s certainly not an exhaustive list of the problems we run into. So, I would say that GWAS have been a huge success in understanding the

8
00:01:09,333 --> 00:01:19,599
etiology of complex traits, but of course, common variants don’t explain all of the etiology and in most cases they explain less than 10% of the

9
00:01:19,600 --> 00:01:29,100
heritability of complex traits. And so therefore, there’s been a huge interest in looking at rare variants and for the analysis of rare variants,

10
00:01:29,100 --> 00:01:38,833
unlike common variants, we’re using direct mapping, where we actually are analyzing the causal variants, whereas when we were looking at

11
00:01:38,833 --> 00:01:49,899
common variants and using these genome-wide association studies, we were using indirect mapping in order to detect associations. So,

12
00:01:49,900 --> 00:01:57,600
rare variants, although they’re rare, are much newer in the population than the common variants, which are old and

13
00:01:57,600 --> 00:02:06,100
widespread, and you see quite a bit of difference in the rare variant distributions and types of rare variants in different populations. So,

14
00:02:06,100 --> 00:02:16,300
they do generally have a larger phenotypic effect than the common variants but we have to remember that these phenotypic effects are not

15
00:02:16,300 --> 00:02:25,800
huge. In most cases, we are not observing familial aggregation. So although they have stronger effects probably in most cases than in

16
00:02:25,800 --> 00:02:33,100
the common variants, we are not talking about effect sizes that would cause familial aggregation, and although the variants are quite

17
00:02:33,100 --> 00:02:43,366
rare, collectively they’re quite common. So, in order to analyze these rare variants we have to perform direct mapping, so we have to identify

18
00:02:43,366 --> 00:02:52,099
the rare variants within our sample, so one approach, of course, is by using next generation sequencing. Currently most of the studies that

19
00:02:52,100 --> 00:03:01,400
are being carried out—they’re exome studies—but I believe this is just a stop-gap until the price of whole genome sequencing falls to a price

20
00:03:01,400 --> 00:03:12,166
that’s reasonable where we could start sequencing relatively large samples. So, this was mentioned earlier by Steve about screening

21
00:03:12,166 --> 00:03:23,966
databases and I just would like to point this out even more that this is really not going to work for complex traits because, for complex traits, we

22
00:03:23,966 --> 00:03:33,899
really do expect to see those variants in the databases in people who are perfectly healthy. These variants are not fully penetrant, so

23
00:03:33,900 --> 00:03:43,333
perfectly healthy people are running around carrying these variants, too, and of course our databases also do contain cases, but even if you

24
00:03:43,333 --> 00:03:51,066
had a fully healthy population you would expect to see these causal variants in those healthy individuals. So, how can we go about analyzing

25
00:03:51,066 --> 00:03:57,932
these rare variants? Of course, we could go about using the same methods we use for common variants, using a variety of different

26
00:03:57,933 --> 00:04:06,599
tests, but what was shown is that, you know, you would need extremely large sample sizes in order to detect associations using these common

27
00:04:06,600 --> 00:04:15,300
variant methods and this is due to the low allele frequency and also the extreme allelic heterogeneity. So, there have been a number of

28
00:04:15,300 --> 00:04:26,266
tests that have been developed and these tests all work on aggregating variants across a region, which is usually a gene, and one big problem we

29
00:04:26,266 --> 00:04:35,632
have when we’re performing these aggregate tests is misclassification. There are two types of misclassification: we can

30
00:04:35,633 --> 00:04:46,366
include non-causal variants in our aggregate analysis, which of course attenuates our signal, and we can also have exclusion of causal

31
00:04:46,366 --> 00:04:56,066
variants. This can occur when we are not including regions that contain the causal variants; it can also occur because we’re filtering out

32
00:04:56,066 --> 00:05:08,532
variants either in our QC process or by using bioinformatic tools. I would really like to stress the point that bioinformatic tools are telling us

33
00:05:08,533 --> 00:05:17,366
information about functionality and not causality, and people frequently use these words as though they had exactly the same meaning. And

34
00:05:17,366 --> 00:05:26,266
yes, a variant has to be functional in order to be causal, but a variant can be functional and not be causal for your phenotype of interest, so I really

35
00:05:26,266 --> 00:05:36,799
would like to stress that. So, because we know we have this misclassification problem, we really need methods that are robust to misclassification,

36
00:05:36,800 --> 00:05:44,700
in particular, inclusion of non-causal variants. So, there are different types of aggregate analysis you can use and you can just analyze rare

37
00:05:44,700 --> 00:05:53,800
variants. I’m just using 1% frequency as an example of what would be considered rare—there are many different definitions—or you could

38
00:05:53,800 --> 00:06:04,466
do a joint analysis of rare and low-frequency variants, and then there are also methods that were specifically developed for variants whose

39
00:06:04,466 --> 00:06:13,132
effects are bidirectional. So, some variants within your gene region are detrimental and others are protective, or for a quantitative

40
00:06:13,133 --> 00:06:20,966
trait they just go in two different directions. And so, I would like to go through some of the methods that have been developed to analyze

41
00:06:20,966 --> 00:06:30,132
rare variants. This is an abbreviated list; I guess I should say, maybe a very abbreviated list. I have to admit, I have a very difficult time keeping up

42
00:06:30,133 --> 00:06:42,033
with the rare variant methods. It seems like every week there’s a new one in the literature. But let’s just go through a few of them. So, one of the first

43
00:06:42,033 --> 00:06:51,533
tests was this combined multivariate and collapsing (CMC) method. And so, this is just looking at whether, within an individual, be it a case or a control or

44
00:06:51,533 --> 00:07:01,399
for a quantitative trait, they have a rare variant within the gene region, so you’re only counting the observation once, and then in the

45
00:07:01,400 --> 00:07:12,033
simplest form you could just perform a Fisher’s exact test, or you can incorporate the 0-1 coding within a regression framework.

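As a minimal illustration of that collapsing coding (not the published implementation; the genotype matrix and all names here are hypothetical), each individual is marked as a carrier or non-carrier of any rare variant in the gene, and the resulting 2x2 table is tested:

    import numpy as np
    from scipy.stats import fisher_exact

    def cmc_collapse_test(genos, status):
        # genos: n x m numpy matrix of minor-allele counts for rare variants in one gene
        # status: length-n numpy array with 1 = case, 0 = control
        carrier = genos.sum(axis=1) > 0            # each individual counted only once
        a = int(np.sum(carrier & (status == 1)))   # case carriers
        b = int(np.sum(~carrier & (status == 1)))  # case non-carriers
        c = int(np.sum(carrier & (status == 0)))   # control carriers
        d = int(np.sum(~carrier & (status == 0)))  # control non-carriers
        return fisher_exact([[a, b], [c, d]])      # returns (odds ratio, P value)

The same 0-1 carrier indicator can instead go into a regression model, which is how the regression-framework version accommodates quantitative traits and covariates.
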
46
00:07:12,033 --> 00:07:22,833
Then there’s the weighted sum statistic, and this method is up-weighting the rare variants within a region, so you’re developing weights and then you’re

47
00:07:22,833 --> 00:07:34,533
aggregating the variants across the region, and because you’re using internal information, you need to empirically estimate the P values for this

48
00:07:34,533 --> 00:07:48,066
second test; you cannot analytically obtain the P values.

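A sketch of that idea, loosely following Madsen-Browning-style weighting (details simplified; all names are illustrative). Because the weights are estimated from the controls, the P value comes from permuting the phenotype labels:

    import numpy as np

    def weighted_sum_stat(genos, status):
        # Weight variants by their estimated control frequency:
        # rarer variants get larger weights.
        n = genos.shape[0]
        controls = genos[status == 0]
        q = (controls.sum(axis=0) + 1) / (2 * controls.shape[0] + 2)
        weights = np.sqrt(n * q * (1 - q))
        burden = (genos / weights).sum(axis=1)   # per-individual weighted burden
        ranks = burden.argsort().argsort() + 1   # rank everyone by burden
        return ranks[status == 1].sum()          # statistic: sum of case ranks

    def empirical_p(genos, status, n_perm=1000, seed=0):
        # The weights use phenotype information, so significance must come
        # from permuting the case/control labels, not from an analytic null.
        rng = np.random.default_rng(seed)
        observed = weighted_sum_stat(genos, status)
        hits = sum(weighted_sum_stat(genos, rng.permutation(status)) >= observed
                   for _ in range(n_perm))
        return (hits + 1) / (n_perm + 1)
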
49
00:07:48,066 --> 00:07:59,166
The next one is a kernel-based adaptive cluster method, and this uses adaptive weighting based on the multivariate genotype within a region. Again, because we’re using internal information for these weights, the P values must

50
00:07:59,166 --> 00:08:08,666
be estimated empirically. The next test is a variable threshold method. You can use different coding for this particular method. You could use

51
00:08:08,666 --> 00:08:22,166
the CMC coding, or the coding used in the Morris and Zeggini test, which

52
00:08:22,166 --> 00:08:30,666
is very similar to the CMC, but here you’re actually counting the number of rare variants which are within the gene region and not just whether or

53
00:08:30,666 --> 00:08:40,366
not an individual has a rare variant. But what the variable threshold does is it doesn’t depend on a very strict cutoff, so you’re not saying, “I’m only

54
00:08:40,366 --> 00:08:48,532
going to analyze variants, say, with 1% frequency or less or 5% frequency or less.” What it does is it maximizes the test statistic over

55
00:08:48,533 --> 00:09:00,699
a variety of allele frequency cutoffs. So, RareCover is also a maximization approach, but instead of just using different allele frequency cutoffs, it actually

56
00:09:00,700 --> 00:09:11,566
maximizes the test statistic over all variants within a genetic region. So both of these methods, the variable threshold method and the

57
00:09:11,566 --> 00:09:20,332
RareCover method, require you to empirically estimate the P values, because you have to adjust for multiple testing in both schemes, and of

58
00:09:20,333 --> 00:09:28,633
course, you do also pay a penalty for multiple testing here; but I actually rather like the variable threshold method because you’re not keeping to a very strict cutoff.

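A sketch of the variable-threshold maximization, using a simple burden-trait correlation as the statistic (the published method uses a different score; this only illustrates searching over the observed frequency cutoffs, which is why the P value has to come from permuting the trait):

    import numpy as np
    from scipy.stats import pearsonr

    def vt_statistic(genos, trait):
        # Try every observed minor allele frequency as a cutoff and keep
        # the best score across cutoffs.
        maf = genos.mean(axis=0) / 2.0
        best = 0.0
        for cutoff in np.unique(maf):
            burden = genos[:, maf <= cutoff].sum(axis=1)
            if burden.std() > 0:
                r, _ = pearsonr(burden, trait)   # simple stand-in statistic
                best = max(best, abs(r))
        return best   # compare against its permutation distribution
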
59
00:09:28,633 --> 00:09:40,499
So, I mentioned already the MZ method, and these last two tests (there are more out there now) were both

60
00:09:40,500 --> 00:09:47,633
developed to look at when you have both protective and detrimental variants within a region. So, C-alpha is just limited to analyzing

61
00:09:47,633 --> 00:10:00,233
case-control data, and it’s looking at deviations from the expected binomial distribution, whereas with SKAT you can also analyze quantitative traits

62
00:10:00,233 --> 00:10:09,999
and also it allows you to weight based on sampled variant frequencies, and again here for both of these methods, the P values are

63
00:10:10,000 --> 00:10:18,800
estimated empirically. So, it’s very difficult when you go and read the literature to compare the power of the different methods. You know,

64
00:10:18,800 --> 00:10:28,766
everybody wants their method to be, you know, published in a top journal, and so unfortunately we’re all human so we tend to tweak our

65
00:10:28,766 --> 00:10:38,799
simulated data to make our method look better than all the other methods, even though that particular underlying model that you have there

66
00:10:38,800 --> 00:10:46,266
might not be realistic at all. So of course, when you look at the literature, everybody’s method is, of course, the best method out there and so you

67
00:10:46,266 --> 00:10:55,166
really kind of wonder, well, what is really the best method? So, we went about comparing the power of these different methods, and I don’t

68
00:10:55,166 --> 00:11:05,699
have enough time to show you all the different power comparisons. You know, in one talk I said this is probably the worst slide you’ve

69
00:11:05,700 --> 00:11:17,933
ever seen, because the entire power curves are all on top of each other, and the point I’m trying to make is that most of these tests had very

70
00:11:17,933 --> 00:11:27,199
small differences in their power and, depending on the underlying model, the one that’s the most powerful tends to flip around. The one thing you

71
00:11:27,200 --> 00:11:38,633
do see a big difference in power, though, is between the tests that were developed for all detrimental or all protective variants and

72
00:11:38,633 --> 00:11:47,066
those tests that were developed to look at when you have effects that are bidirectional in the gene. And so, these tests that were developed to

73
00:11:47,066 --> 00:11:56,932
look at when you have bidirectional effects generally are not as powerful when all your variants are unidirectional as using one of these

74
00:11:56,933 --> 00:12:08,499
tests here, for example. So, there are also additional considerations when you’re choosing a test to carry out your analysis. We know that

75
00:12:08,500 --> 00:12:17,733
confounding can lead to spurious association, and one particular problem we have is population substructure, and this has been well-documented

76
00:12:17,733 --> 00:12:26,899
for common variants as a potential source of spurious association. However, the problem is even greater for rare variants

77
00:12:26,900 --> 00:12:35,766
because we really have different allelic spectra, even within European populations, for these rare variants. So, this should be an

78
00:12:35,766 --> 00:12:45,166
additional consideration when you’re doing your analysis: use a method that allows you to include covariates in the analysis to control for

79
00:12:45,166 --> 00:12:55,466
potential confounders. So, how about choosing a lot of different tests, running them all on your data, and saying, “I’ll just take the best one?”

80
00:12:55,466 --> 00:13:05,899
Well, although people do this, this isn’t really the best thing to do at all. If you’re going to do this you really have to correct for multiple testing, and

81
00:13:05,900 --> 00:13:17,366
in performing this multiple testing correction you might have a loss in power instead of a gain in power.

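For completeness, a sketch of one standard way to pay that penalty: permute the phenotype and compare the observed minimum P value across all the tests to its permutation distribution (`tests` is a hypothetical list of functions returning P values):

    import numpy as np

    def min_p_adjusted(genos, status, tests, n_perm=1000, seed=0):
        # tests: list of functions (genos, phenotype) -> P value (hypothetical)
        rng = np.random.default_rng(seed)
        observed = min(test(genos, status) for test in tests)
        hits = 0
        for _ in range(n_perm):
            perm = rng.permutation(status)
            hits += min(test(genos, perm) for test in tests) <= observed
        return (hits + 1) / (n_perm + 1)   # adjusted P for the best of all tests
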
82
00:13:17,366 --> 00:13:28,732
So, I know I don’t really work on a trait related to the theme of this conference, but I would like to use this as an example of analyzing

83
00:13:28,733 --> 00:13:42,299
exome data. Right now one of the traits that we’re analyzing using exome data is age at menarche, and this is part of the NHLBI ESP study,

84
00:13:42,300 --> 00:13:51,333
for which we have information on age at menarche. So, this is a study which you’ve heard about before, and currently we have exome data on

85
00:13:51,333 --> 00:14:00,133
around 5,400 individuals, and it consists of both African and European Americans. These individuals were selected from a variety of

86
00:14:00,133 --> 00:14:12,033
cohorts. Also, as you’ve heard, many of the traits were sampled on extremes, so we have extremes of BMI, blood pressure, LDL. We also

87
00:14:12,033 --> 00:14:22,633
have dichotomous traits such as stroke, Type II diabetes, lung-related traits, early-onset MI, and then we also have controls—around 1,000

88
00:14:22,633 --> 00:14:35,966
controls—that were included because they are well-phenotyped, with phenotypic information available for a large variety of traits. So, the more you look at

89
00:14:35,966 --> 00:14:46,499
this data, the more you say, “Oh, no, another problem,” and this keeps on arising. And so one thing is, not only is the sequencing being done in

90
00:14:46,500 --> 00:14:55,400
two different centers, the Broad, which you heard, and the University of Washington, but they all decided that they were going to use different

91
00:14:55,400 --> 00:15:03,233
capture arrays. So, the Broad has been very consistent and stuck to this one array while the University of Washington has gone on to use

92
00:15:03,233 --> 00:15:12,966
three different arrays, and I won’t go into the detail, but not only do we have the different arrays; over time they also changed

93
00:15:12,966 --> 00:15:24,466
sequencing machines. So, we have some of the samples done on Genome Analyzer II and also HiSeq machines. So, these are all things to

94
00:15:24,466 --> 00:15:34,966
contend with. So, here are just the different arrays and we can see that the older arrays do capture less of the genome. And so, one thing

95
00:15:34,966 --> 00:15:49,732
we have to do is some QC on our data before we even get started on further QC. So, one thing we filter on is variants that fail

96
00:15:49,733 --> 00:16:02,166
this support vector machine that was developed by Gonçalo Abecasis and his group. Now this actually just pulls out bad variant sites or

97
00:16:02,166 --> 00:16:13,799
allegedly bad variant sites, but it doesn’t pull out individual genotypes. So on top of using this support vector machine, it’s also good to filter on, say, a

98
00:16:13,800 --> 00:16:23,166
depth of 10, so for each individual, at that particular variant site, you need at least 10 reads to call the genotype, and at first we were

99
00:16:23,166 --> 00:16:32,499
looking at it without using the genotype quality score, because we know this biases towards homozygous wild-type calls, but after further looking

100
00:16:32,500 --> 00:16:41,866
at the data we decided maybe that wasn’t the best thing to do, so we got a little bit more stringent and used this GQ score; we cleaned

101
00:16:41,866 --> 00:16:52,732
out the data with a GQ score of less than 30.

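A sketch of that per-genotype filter on a plain-text VCF, masking individual calls with depth below 10 or GQ below 30 while keeping the site itself (field positions are read from each record’s FORMAT column; paths and cutoffs are illustrative):

    import gzip

    def _get_int(fields, idx):
        # Missing or non-numeric subfields ('.' etc.) are treated as unknown.
        if idx is None or idx >= len(fields) or not fields[idx].isdigit():
            return None
        return int(fields[idx])

    def filter_genotypes(path_in, path_out, min_dp=10, min_gq=30):
        with gzip.open(path_in, "rt") as fin, gzip.open(path_out, "wt") as fout:
            for line in fin:
                if line.startswith("#"):           # keep all header lines
                    fout.write(line)
                    continue
                cols = line.rstrip("\n").split("\t")
                fmt = cols[8].split(":")           # FORMAT column, e.g. GT:DP:GQ
                dp_i = fmt.index("DP") if "DP" in fmt else None
                gq_i = fmt.index("GQ") if "GQ" in fmt else None
                for i in range(9, len(cols)):      # one column per individual
                    sub = cols[i].split(":")
                    dp, gq = _get_int(sub, dp_i), _get_int(sub, gq_i)
                    if (dp is not None and dp < min_dp) or \
                       (gq is not None and gq < min_gq):
                        sub[0] = "./."             # set genotype to missing, keep site
                        cols[i] = ":".join(sub)
                fout.write("\t".join(cols) + "\n")
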
102
00:16:52,733 --> 00:17:02,899
And so, we also have some data on duplicates, so this is just the concordance rates between the duplicates for different screenings. So, you see just SVM, and then you have SVM plus depth 10 in the red…oops, sorry about that…and SVM with

103
00:17:02,900 --> 00:17:12,333
this genotype quality score of 20 here, so you’re getting much better concordance. So, here’s a little bit of information in general about…we’re

104
00:17:12,333 --> 00:17:26,399
actually, in reality, analyzing more genes than this but this is the intersect of the different capture arrays, and so we have around…if you take the

105
00:17:26,400 --> 00:17:39,933
intersect of all capture arrays you have around 15,500 genes, and the number of variant sites per individual varies drastically,

106
00:17:39,933 --> 00:17:54,266
from 0 to over 500 in African Americans. There are very nice Ti/Tv ratios on this particular data, which tell you how good your data is, and

107
00:17:54,266 --> 00:18:02,999
we have around a ratio of 1.3. This is for an individual; it’s not what you get if you look at your whole data set. You would not see this ratio if you looked at

108
00:18:03,000 --> 00:18:12,633
your entire data set. But on average for an individual, we have a ratio of about 1.3:1 of synonymous to nonsynonymous variants, and

109
00:18:12,633 --> 00:18:29,066
we see more variants in African Americans than in European Americans but this ratio does hold. So, one of the first steps in our QC is to perform

110
00:18:29,066 --> 00:18:44,799
PC analysis, or MDS, and so this is to remove outliers and we also wanted to assign our individuals to an ethnic group to analyze African

111
00:18:44,800 --> 00:18:56,000
Americans and European Americans separately, and instead of just using self-report we wanted to use the information from this PCA analysis. And

112
00:18:56,000 --> 00:19:06,133
so the first thing we did is we trimmed our data to get rid of those variants that are in strong LD with each other and then we used those variants with

113
00:19:06,133 --> 00:19:15,499
a minor allele frequency of less than 0.1% for the PCA analysis.

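A sketch of the PCA step on the LD-pruned genotype matrix, using the usual mean-centering and frequency standardization before extracting the top components (a simplification; dedicated tools such as EIGENSTRAT do more):

    import numpy as np

    def top_pcs(genos, k=10):
        # genos: n x m matrix of allele counts for the LD-pruned variant set
        g = genos.astype(float)
        p = g.mean(axis=0) / 2.0
        keep = (p > 0) & (p < 1)                 # drop monomorphic sites
        g = (g[:, keep] - 2 * p[keep]) / np.sqrt(2 * p[keep] * (1 - p[keep]))
        u, s, _ = np.linalg.svd(g, full_matrices=False)
        return u[:, :k] * s[:k]                  # per-individual PC scores
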
114
00:19:15,500 --> 00:19:23,333
And we see that we definitely have two groups here. Over here are the European Americans and here are the African Americans, and there’s generally good agreement with the self-report, although we do see some European

115
00:19:23,333 --> 00:19:37,399
Americans right here in the middle of the African Americans. We also wanted to look in our sample for duplicate samples and related individuals and

116
00:19:37,400 --> 00:19:49,533
so we want to remove the related individuals and also cryptic duplicate samples, and in order to do this we used a program called KING.

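For example, after running something like `king -b data.bed --kinship`, the pairwise output can be binned with KING’s documented kinship cutoffs (the .kin0 column layout used here follows KING’s documentation; treat it as an assumption):

    def classify_kinship(phi):
        # KING's documented cutoffs for the estimated kinship coefficient
        if phi > 0.354:
            return "duplicate/MZ twin"
        if phi > 0.177:
            return "1st-degree"
        if phi > 0.0884:
            return "2nd-degree"
        if phi > 0.0442:
            return "3rd-degree"
        return "unrelated"

    def related_pairs(kin0_path="king.kin0"):
        # Yields (ID1, ID2, relationship) for every related pair in the file.
        with open(kin0_path) as fh:
            header = fh.readline().split()
            k = header.index("Kinship")   # assumed column name in KING output
            for line in fh:
                fields = line.split()
                phi = float(fields[k])
                if phi > 0.0442:
                    yield fields[1], fields[3], classify_kinship(phi)
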
117
00:19:49,533 --> 00:19:56,799
And we can see that we have a number of duplicates here. We have first-degree relatives and some of these we knew about. A number of the samples

118
00:19:56,800 --> 00:20:06,266
were from small pedigrees that were included and then we have a number of second- and third-degree relatives. And so, here is our

119
00:20:06,266 --> 00:20:19,832
distribution of related individuals. And so, for each pair of duplicates we took the one that had the most information, and then we just limited it to one

120
00:20:19,833 --> 00:20:27,966
individual from a family, also based on the quality of their exome data and the available phenotypic data for them. So, here’s a bit about

121
00:20:27,966 --> 00:20:37,132
evaluating our phenotype data. Other QC measures that we did included checking the sex of the individuals and looking at Hardy-Weinberg

122
00:20:37,133 --> 00:20:48,233
equilibrium, which isn’t very powerful because most of these variants are extremely rare and so you’ll have no power at all to detect a deviation

123
00:20:48,233 --> 00:20:59,366
from Hardy-Weinberg equilibrium, and cleaning up the data on missingness by target because, of course, if we just looked across all samples,

124
00:20:59,366 --> 00:21:06,999
there would be some sites that are just missing because they’re not included on a particular target, so that’s other QC that we included here.

125
00:21:07,000 --> 00:21:17,366
And so, here we’re going to evaluate our phenotypic data. So, the phenotype that I’m going to present here is age at menarche. We have a

126
00:21:17,366 --> 00:21:30,666
mean age of almost 13 in the women from this study, with a median of 13. We have almost 1,500 European Americans with this phenotype and

127
00:21:30,666 --> 00:21:45,266
close to 1,300 African Americans, and here is the distribution of the age at menarche in our women, and it’s pretty normally distributed. We

128
00:21:45,266 --> 00:21:55,432
cleaned out the duplicates, the related individuals, and the women with missing phenotypes or exome data; we also took out

129
00:21:55,433 --> 00:22:09,899
extreme outliers, and after that we had slightly over 2,200 women to work with. So, one thing I would like to point out here is that we had this

130
00:22:09,900 --> 00:22:20,466
little problem with the study design here where these individuals…most of them are not a random sample except those 1,000 individuals who were

131
00:22:20,466 --> 00:22:29,799
selected because they were well-phenotyped. All the other individuals were ascertained because of some other phenotype, either a phenotype or

132
00:22:29,800 --> 00:22:40,200
an extreme, as I mentioned before. And so, we wanted to look at the correlation between age at menarche and these different phenotypes,

133
00:22:40,200 --> 00:22:50,033
and lucky for us with this particular phenotype, it’s not particularly correlated. We’re also working with waist-to-hip ratio and that’s a totally different

134
00:22:50,033 --> 00:23:00,099
matter. As you’ll see in a minute, I include the cohort of ascertainment as a covariate in the analysis, and it does work okay

135
00:23:00,100 --> 00:23:09,966
here. I think it’s working fine for this particular phenotype because it’s not particularly correlated with the other phenotypes which were used for

136
00:23:09,966 --> 00:23:20,766
ascertainment. But I just would like to stress that, in many circumstances, this is not going to get rid of the problem, and you can

137
00:23:20,766 --> 00:23:32,232
definitely have an inflated Type 1 error when you do this because if the other trait is associated with a particular gene, then you can also pick up

138
00:23:32,233 --> 00:23:41,599
this association due to the correlation with the first trait; it’s not because of pleiotropy, it’s just because those two traits are correlated. So,

139
00:23:41,600 --> 00:23:49,233
you have to be very careful. There are methods out there to adjust for this but it’s something you should be very aware of when you’re analyzing

140
00:23:49,233 --> 00:24:02,866
these secondary traits within a cohort. All right. One thing we noticed right away was the age at baseline. So, this is the age of these

141
00:24:02,866 --> 00:24:10,699
women when they were ascertained in the study and you can see that we have a lot of older women, which isn’t surprising, because a lot of

142
00:24:10,700 --> 00:24:17,766
these women are coming from the Women’s Health Initiative, which is a study of postmenopausal women, and so we have quite

143
00:24:17,766 --> 00:24:31,332
an older population. So, it is well known that there’s a generational effect for age at menarche, and when we looked at our data with

144
00:24:31,333 --> 00:24:44,166
both the European American and the African American women combined, we just saw a very weak P value here. However, we don’t fully

145
00:24:44,166 --> 00:24:55,032
understand, or can’t fully explain, why we see a very strong effect in the African American women but not in the European American women. Of course,

146
00:24:55,033 --> 00:25:02,833
the European American women are much older; maybe we don’t have the power to detect this. We also were wondering if this had something to

147
00:25:02,833 --> 00:25:12,633
do with BMI. We adjusted for BMI; we still saw the same effect. So, this is what we would expect, to see this generational effect, but also in

148
00:25:12,633 --> 00:25:22,133
European American women, which we’re just not seeing in our sample. So, we are analyzing the African American and European American

149
00:25:22,133 --> 00:25:32,033
women separately and this is to allow for adequate control of population substructure and then we can also model separately in African and

150
00:25:32,033 --> 00:25:38,133
European American women, and one thing we’ve noticed when we look at the variants within a gene in African and European

151
00:25:38,133 --> 00:25:47,733
Americans is that there’s not a huge overlap of the variants. So, I really think this is the right way to go in the analysis: to analyze them separately.

152
00:25:47,733 --> 00:26:01,766
Here we have the two models and our model is slightly different for African Americans and European Americans, as I mentioned. So, what

153
00:26:01,766 --> 00:26:10,899
we did is we went back and we recalculated our PCA components separately in the European Americans and the African Americans, and when

154
00:26:10,900 --> 00:26:20,966
we were building our model we saw that, in order to control for population substructure, it was adequate to include one PCA component in

155
00:26:20,966 --> 00:26:29,899
the European Americans, but in the African Americans we needed to include two, and then we also included this generational effect, which

156
00:26:29,900 --> 00:26:38,466
is their age at ascertainment, into the model for the African Americans, but we didn’t put it in for the European Americans because it was not statistically significant.

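A sketch of what those two per-gene regressions could look like (variable and data-frame names are hypothetical; `burden` is the gene’s aggregated rare-variant score):

    import statsmodels.formula.api as smf

    def burden_test(df, formula):
        # df: one row per woman; columns include age_menarche, burden,
        # principal components, and cohort of ascertainment.
        fit = smf.ols(formula, data=df).fit()
        return fit.params["burden"], fit.pvalues["burden"]

    # European Americans: one principal component was adequate.
    # beta, p = burden_test(ea_df, "age_menarche ~ burden + PC1 + C(cohort)")

    # African Americans: two PCs plus age at ascertainment (generational effect).
    # beta, p = burden_test(
    #     aa_df, "age_menarche ~ burden + PC1 + PC2 + age_baseline + C(cohort)")
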
157
00:26:38,466 --> 00:26:48,699
Okay. One thing we ran into from the start is we really didn’t have a way of analyzing this data; there was nothing out

158
00:26:48,700 --> 00:27:00,433
there. So, we went about developing our own association tool, whose current name is VAT—we seem to change this quite often—and so

159
00:27:00,433 --> 00:27:12,666
this analysis tool allows not only single-variant analysis, which you can perform with many of the current software packages out there, such as

160
00:27:12,666 --> 00:27:20,899
PLINK and GenABEL, but it also performs these aggregate tests, and it’s regression-based, so you can analyze both qualitative and quantitative

161
00:27:20,900 --> 00:27:28,733
traits, and you can also include covariates in the analysis. So, we needed to analyze quantitative traits and also needed to include these covariates

162
00:27:28,733 --> 00:27:43,766
in our analysis. It also allows you to work with VCF files, it allows you to do some QC measures, and it also has annotation, and the

163
00:27:43,766 --> 00:27:57,132
annotation is done with this variant tools software. Another program that’s out there to analyze this sequence data to perform

164
00:27:57,133 --> 00:28:07,666
association analysis with sequence data is PSEQ. However, we could not use that in its current form because you can only analyze

165
00:28:07,666 --> 00:28:18,432
case-control data and it’s also not possible to control for covariates in the analysis. So, the first thing we did was we analyzed the more frequent

166
00:28:18,433 --> 00:28:29,899
variants individually. So, here we analyzed all the variants and our only cutoff was on their frequency and we used linear regression and

167
00:28:29,900 --> 00:28:39,100
an additive model, using those two models I showed you for the African and European Americans, and we don’t see anything in African

168
00:28:39,100 --> 00:28:48,900
Americans. I mean, we certainly know we don’t have a Type 1 error problem, but you would rather not see that type of Q-Q plot, but we do

169
00:28:48,900 --> 00:29:01,500
have some points here in the European Americans.

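One way to put a number on what those Q-Q plots show is the genomic-control inflation factor computed from the single-variant P values; a minimal sketch:

    import numpy as np
    from scipy.stats import chi2

    def lambda_gc(p_values):
        # Convert P values to 1-df chi-square statistics and compare the
        # observed median to the expected median (about 0.455); values near 1
        # correspond to the well-behaved Q-Q plots described above.
        stats = chi2.isf(np.asarray(p_values), df=1)
        return np.median(stats) / chi2.ppf(0.5, df=1)
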
170
00:29:01,500 --> 00:29:13,066
Then we went on to do the aggregate analysis in the European and African Americans, and here we limited ourselves to variants that had a frequency of less than 1%, and we only analyzed nonsynonymous variants,

171
00:29:13,066 --> 00:29:22,299
stop-loss/stop-gain, and splice sites when we performed the analysis. I wouldn’t do both of these, but just for comparison purposes, I’m

172
00:29:22,300 --> 00:29:30,766
showing both of them. These are the results in the European Americans. Again, I don’t think we have

173
00:29:37,800 --> 00:29:40,000
any inflation here. This is a very nice Q-Q plot, and then we have, like, four genes coming up; it’s the same four genes here. However,

174
00:29:40,000 --> 00:29:50,500
when we go to our African Americans again, our plots are totally flat here. Also, we combined the results from African Americans and European

175
00:29:50,500 --> 00:30:01,266
Americans using meta-analysis. I don’t show you the results, but basically any result we have is pretty much just driven by one ethnic group or the

176
00:30:01,266 --> 00:30:13,832
other; we really don’t see both of them contributing to the signal.

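The talk does not specify the meta-analysis scheme; a common choice for combining the two groups’ per-gene results is fixed-effects inverse-variance weighting, sketched here:

    import math
    from scipy.stats import norm

    def fixed_effects_meta(betas, ses):
        # Combine per-group effect estimates with inverse-variance weights.
        w = [1.0 / se ** 2 for se in ses]
        beta = sum(wi * bi for wi, bi in zip(w, betas)) / sum(w)
        se = math.sqrt(1.0 / sum(w))
        p = 2 * norm.sf(abs(beta / se))
        return beta, se, p

    # e.g., fixed_effects_meta([beta_ea, beta_aa], [se_ea, se_aa])
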
177
00:30:13,833 --> 00:30:23,899
Our future direction with this is that we are going to try to replicate our findings using the exome chip data, and we have looked at those genes with the top hits from the aggregate analysis, and they’re quite

178
00:30:23,900 --> 00:30:35,766
well-represented on the exome chip. So, in conclusion, I would just like to acknowledge some of the people from my group who have been

179
00:30:35,766 --> 00:30:48,899
working on this project and also the NHLBI Exome Sequencing Project team who have been really helpful. We work very closely with Paul Auer and

180
00:30:48,900 --> 00:30:55,600
Gonçalo Abecasis and a few other people. And so, I would like to open the floor to questions.

181
00:30:55,600 --> 00:31:10,866
FEMALE: A beautiful talk, thank you very much. My question is with regard to the replication studies.

182
00:31:10,866 --> 00:31:16,166
SUZANNE LEAL: Yes? FEMALE: So, when you identify genes using

183
00:31:16,166 --> 00:31:20,666
collapsing methods, how do you replicate them in independent cohorts?

184
00:31:20,666 --> 00:31:29,232
SUZANNE LEAL: I mean, there are several ways you could go about this. You could go and you could sequence the genes in another cohort, you

185
00:31:29,233 --> 00:31:39,866
could genotype the variants you have found because most of these signals are being driven by the more common of the rare variants—I didn’t

186
00:31:39,866 --> 00:31:48,466
show that here. So, it’s not being driven by the singletons or the doubletons, so losing those variants is not going to, you know, reduce your

187
00:31:48,466 --> 00:31:58,399
power by very much so you could just genotype the variants that you’ve found. The reason why we’re considering using the exome chip is

188
00:31:58,400 --> 00:32:07,633
because basically we have cohorts available to us that are being exome chipped and they have our phenotype of interest. Probably, we would do

189
00:32:07,633 --> 00:32:15,733
better if we could actually genotype the variants within those genes, but we’re hoping that the loss of some of the variants will be made up by

190
00:32:15,733 --> 00:32:18,866
the sample size. FEMALE: And the second question, with regard

191
00:32:18,866 --> 00:32:26,832
to your software. So, we are all familiar with PLINK and it became very popular for the analysis of GWAS studies. Is there anything in the works

192
00:32:26,833 --> 00:32:32,533
that could be widely used as a gold standard for the analysis of rare variants?

193
00:32:32,533 --> 00:32:41,533
SUZANNE LEAL: It’s coming. It’s on its way. I mean, we developed that pipeline and there’s PSEQ, which I mentioned, which should be

194
00:32:41,533 --> 00:32:51,533
pronounced not “P-Seek” but “Seq” as Shaun told me. So yeah. It takes a little while but this is definitely going to be…those tools will be

195
00:32:51,533 --> 00:32:57,533
available. FEMALE: Thank you.

196
00:32:57,533 --> 00:33:03,766
MALE: You mentioned the problem with the control datasets having cases but I don’t actually see why that’s a problem, because your

197
00:33:03,766 --> 00:33:08,532
probands here—your study set—should be enriched in just cases…

198
00:33:08,533 --> 00:33:16,833
SUZANNE LEAL: I’m not saying…the point I’m making is that you can’t screen databases; that’s my point. You have to carry out an

199
00:33:16,833 --> 00:33:31,899
association study, and this data, like from this ESP project, is going to be available through dbGaP, and so you could use this dataset. So, I think we

200
00:33:31,900 --> 00:33:43,033
will see, you know, people using these so-called convenience control datasets, but we’ll have bigger problems with the sequence data—at least

201
00:33:43,033 --> 00:33:52,699
initially—than we had with the genotype data because there’s going to be great variability in the capture arrays, the depth of coverage—I didn’t

202
00:33:52,700 --> 00:34:00,966
discuss that. Those are quite variable. This is very high-depth coverage; we have a median coverage of 110X, but some have lower

203
00:34:00,966 --> 00:34:12,266
coverage. There can be differences in the alignment, and so there’s going to be very big differences in the variants that are called within

204
00:34:12,266 --> 00:34:25,666
genes, so you’re going to have problems with false positives. So, I just don’t believe for complex traits that you can, you know, just be screening

205
00:34:25,666 --> 00:34:33,199
databases and finding things that way; I just don’t think that’s going to work at all.

206
00:34:33,200 --> 00:34:37,633
MATT SAMPSON: Thanks for that talk. Matt Sampson, University of Michigan. Given that a lot of variant callers seem to be performing equally

207
00:34:37,633 --> 00:34:45,966
well, I’m wondering if there are certain programs that work better for smaller sample sizes or depending on the capture array that was used

208
00:34:45,966 --> 00:34:50,432
to generate the sequence data. SUZANNE LEAL: I’m not really the right person to

209
00:34:50,433 --> 00:35:03,133
ask about comparisons of variant callers. One thing I would like to say is that when you call your variants, sometimes you get files where it doesn’t

210
00:35:03,133 --> 00:35:13,766
really have information on some individuals. So, when you call your variants for this type of study or even if you’re using pedigrees, you really need

211
00:35:13,766 --> 00:35:23,232
to know the status for people who are homozygous wild-type. You need to know: are they homozygous wild-type or was this variant

212
00:35:23,233 --> 00:35:31,199
not called? We ran into this immediately with some trio data. We were like, “What is this?” because we were missing all these sites. We

213
00:35:31,200 --> 00:35:41,800
were like, “We can’t use this data; you have to give us something else.” So, that’s one thing, but there is some variability in their abilities to call

214
00:35:41,800 --> 00:35:53,400
indels, and there are also differences in filtering the data for QC.

215
00:35:53,400 --> 00:36:00,800
JIRONG LONG: Jirong Long from Vanderbilt University. Thank you very much for your talk. I totally agree about the variant calling issues. We

216
00:36:00,800 --> 00:36:10,600
have a lot of problems about the [---]. So, I’m wondering whether you have a website or you have a paper or…I mean, everybody has a paper

217
00:36:10,600 --> 00:36:21,500
to describe the details or procedures for [---] for the [---] data. With GWAS, in the beginning we had problems, but later we saw

218
00:36:21,500 --> 00:36:29,766
publications and we got some ideas of how to do [---], so we want to see whether you have any suggestions or papers or…?

219
00:36:29,766 --> 00:36:38,666
SUZANNE LEAL: I don’t know of any papers. At first, I kind of poo-pooed having a paper on this, but the more I see of the problems, the more I

220
00:36:38,666 --> 00:36:45,666
think it probably is a good idea, definitely, to write something up on this, yeah.

221
00:36:45,666 --> 00:36:58,799
MALE: I’m curious about…so you say you have different numbers of rare variants in the African American compared to the European set. I’m just

222
00:36:58,800 --> 00:37:09,900
curious, how much do rare variants then intersect? I mean, do you actually have a very strong separation of the rare variants or…how

223
00:37:09,900 --> 00:37:15,366
do they intersect, if at all? SUZANNE LEAL: They do intersect somewhat,

224
00:37:15,366 --> 00:37:26,732
but it’s less of an overlap than…there is some intersection but it’s not that great of an intersection, I guess, for the really rare variants.

225
00:37:26,733 --> 00:37:32,866
Yes? MALE: Thank you, Suzanne. That was a great

226
00:37:32,866 --> 00:37:40,066
talk. A lot of it, I have to say, went over my head because I don’t really work in this area directly. I was interested that, as part of QC, you exclude

227
00:37:40,066 --> 00:37:44,332
variants that are in linkage disequilibrium. SUZANNE LEAL: Oh, no, no. I’m sorry. That’s just

228
00:37:44,333 --> 00:37:52,433
for my PCA analysis that I exclude variants that are in linkage disequilibrium—when I perform my PCA—

229
00:37:52,433 --> 00:37:56,233
but they are not excluded from the analysis.



