Whole Genome Approaches to Complex Kidney Disease
February 11-12, 2012 Conference Videos

Rare Variant Analysis: Design and Analysis Strategies
Xihong Lin, Harvard University

Video Transcript

1
00:00:00,000 --> 00:00:14,066
ANDREY SHAW: Our next speaker is Xihong Lin, who is a professor of biostatistics from Harvard. She has an interest in developing statistical methods for high dimensional and correlated data, and the title of her talk

2
00:00:14,066 --> 00:00:25,032
today is Rare Variant Discovery, Design and Analytic Strategies. XIHONG LIN: Thank you. First of all, I would like to thank the organizers for

3
00:00:25,033 --> 00:00:46,733
inviting me. Can people hear? Thank you. So, both Suzanne and Joan have given excellent talks and built a strong foundation for my talk. So, my talk is an extension of what they have talked about. So as we know, in

4
00:00:46,733 --> 00:01:01,199
GWAS studies we have been using individual SNP analysis, but for exome sequencing and whole genome sequencing studies such individual SNP-based analysis has little power for detecting rare

5
00:01:01,200 --> 00:01:19,500
variants. Therefore, this talk will focus on three topics. The first topic is that people have proposed Burden tests, that is, collapsing tests, and also non-Burden tests, so a natural question is which test to use. The second

6
00:01:19,500 --> 00:01:34,366
question we will discuss in this talk is how we can design the next-generation whole exome or whole genome sequencing study; in particular, whether extreme phenotype sampling will be better than random

7
00:01:34,366 --> 00:01:49,599
sampling. The third question I want to discuss is that many of us have GWAS data, so if we use the GWAS data to impute rare variants, how does that work? Would that be valuable, or would the imputation be too inaccurate to be

8
00:01:49,600 --> 00:02:05,866
valuable? So to start, the first thing we need to realize is that when we work with rare variants, things are complicated. First of all, we need many more subjects to observe the rare variants; even if rare variants exist

9
00:02:05,866 --> 00:02:20,832
in the population, if we don’t have enough subjects we will not be able to observe them. To illustrate, one can do a very simple binomial calculation to find the

10
00:02:20,833 --> 00:02:32,399
minimum sample size needed to observe at least one copy of a variant for a given population minor allele frequency with a high percentage of chance; for example, a 99% chance. So, that means you will almost certainly observe

11
00:02:32,400 --> 00:02:46,200
these variants in your sample. So, if the population minor allele frequency is 10%, then you need at least 33 people to observe this variant, and therefore in almost every single GWAS study you will be able to

12
00:02:46,200 --> 00:03:01,966
observe this variant. However, if the minor allele frequency is 1%, then you need at least 300 people to observe this variant. And now, if the minor allele frequency becomes smaller, like 0.1%, then you need at least

13
00:03:01,966 --> 00:03:15,066
3,000 people to observe these variants. Even though the variants exist in the population, if you don’t have enough people you may not observe them. So, this makes analysis of rare variants more challenging.
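
To make this concrete, here is a minimal Python sketch of that binomial calculation (assuming each subject contributes two independent allele draws and a 99% target probability; the exact convention behind the round numbers on the slide may differ slightly):

    import math

    def min_subjects_to_observe(maf, prob=0.99):
        """Smallest n so that at least one copy of an allele with population
        frequency `maf` appears among 2n alleles with probability `prob`."""
        # P(no copy among 2n alleles) = (1 - maf)^(2n) <= 1 - prob
        n_alleles = math.log(1.0 - prob) / math.log(1.0 - maf)
        return math.ceil(n_alleles / 2.0)

    for maf in (0.10, 0.01, 0.001):
        print(maf, min_subjects_to_observe(maf))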

14
00:03:15,066 --> 00:03:33,132
To illustrate, suppose you can only afford to sequence 100 people; then you are able to observe almost all the variants with minor allele frequency above 1%, but almost 90% of the variants with minor allele

15
00:03:33,133 --> 00:03:46,199
frequency less than 1% will not be observed. Because they are too rare, if you don’t have enough people you cannot observe them even if they exist in the population. Suzanne talked about the analysis pipeline, so I

16
00:03:46,200 --> 00:04:00,433
won’t reiterate the analysis pipeline for sequencing studies. First of all, as Suzanne mentioned, the QC is very, very important, and I will illustrate that. If you don’t do the QC well, if you don’t do the filtering well, then

17
00:04:00,433 --> 00:04:14,999
you may have the chi-[---] effect. The second point is that for sequencing studies, to study rare variants one needs to move away from single individual SNP analysis and focus on region-based analysis.

18
00:04:15,000 --> 00:04:28,933
So a natural question to ask is how to define the regions. If you are smart enough to define the region well, you may have power, but if you don’t define the region well and the region has too many null variants not

19
00:04:28,933 --> 00:04:41,366
associated with disease, you might lose power. So, in exome sequencing studies a natural way to define a region is a gene, and in a whole genome study one can think about using a moving window to cover the intergenic regions.
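
As a trivial sketch of the moving-window idea (the window and step sizes here are hypothetical; a real analysis would also respect chromosome boundaries and annotation):

    def sliding_windows(chrom_length, window=4_000, step=2_000):
        """Yield half-open (start, end) intervals tiling a chromosome with
        overlapping windows so that every position is covered."""
        start = 0
        while start < chrom_length:
            yield start, min(start + window, chrom_length)
            start += step

    # Example: regions covering a 10 kb stretch
    print(list(sliding_windows(10_000)))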

20
00:04:41,366 --> 00:04:55,932
Then, after you define the region, how can you do the analysis? You observe both common and rare variants, so you want to use all the information in the analysis, and also,

21
00:04:55,933 --> 00:05:10,333
for genetic studies, adjustment for covariates such as population stratification is important. So, we want to build a statistical model to account for the covariates. After you identify the top regions,

22
00:05:10,333 --> 00:05:24,933
then a natural question is: are those top regions real? So, you want to think about how to design the validation study and how to develop a strategy for functional analysis. So, every stage is

23
00:05:24,933 --> 00:05:42,733
challenging. So, let’s next talk about the statistical models to analyze the rare variants. So as I mentioned, for the rare variants we need to move away from the single SNP analysis to multi-marker analysis. So, suppose

24
00:05:42,733 --> 00:05:59,033
you have covariates like age, gender, and population stratification variables—like principal components (PCs)—to control for, and then suppose in this gene you have p SNPs—say, 100 SNPs—and many of them are rare. So therefore,

25
00:05:59,033 --> 00:06:16,166
for each column of the genotype matrix there are many, many zeros and only a small number of 1s or 2s. Now, what you want to do is like what we do in single SNP analysis: we fit a regression model. We regress the disease phenotype, for a

26
00:06:16,166 --> 00:06:31,099
continuous trait with a linear regression and for a case-control trait with a logistic regression, on the covariates and also the p SNPs. If all the betas associated with

27
00:06:31,100 --> 00:06:46,133
the p SNPs are equal to zero, that means there is no association between any SNP within the gene and the disease outcome. So, we want to test whether all 100 betas are equal to zero. How can we do that? So,

28
00:06:46,133 --> 00:06:58,933
you can see that a 100 degree of freedom test has no power, so one natural approach is to collapse; let’s collapse them. So, how can we collapse them? We restrict to the rare variants. So,

29
00:06:58,933 --> 00:07:12,066
how can we define the rare variants? We select variants based on some criterion, for example minor allele frequency less than 5%, and then we collapse all the rare variants. By doing that, you can see

30
00:07:12,066 --> 00:07:30,132
that if all the betas—beta 1 to beta 100—are the same and you factor the common beta out, then inside what you have is basically a dose, that is, the total number of rare variants within that gene. So, this basically is the

31
00:07:30,133 --> 00:07:43,299
collapsing method: you assume the beta associated with each individual variant is the same, factor it out, and compute the dose.
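
As a minimal sketch of this collapsing idea (not the speaker's code; `y`, `X`, and `G` are a hypothetical binary phenotype, covariate matrix, and genotype matrix for one gene, and statsmodels is used for the logistic fit):

    import numpy as np
    import statsmodels.api as sm

    def burden_test(y, X, G, maf_cutoff=0.05):
        """Collapse rare variants into one dose per subject and test it
        with logistic regression, adjusting for covariates."""
        freq = G.mean(axis=0) / 2.0
        maf = np.minimum(freq, 1.0 - freq)
        burden = G[:, maf < maf_cutoff].sum(axis=1)   # total rare-allele count
        design = sm.add_constant(np.column_stack([X, burden]))
        fit = sm.Logit(y, design).fit(disp=0)
        return fit.pvalues[-1]                        # p-value for the burden term

    # Toy example: 1,000 subjects, 2 covariates, 50 rare variants
    rng = np.random.default_rng(0)
    X = rng.normal(size=(1000, 2))
    G = rng.binomial(2, 0.01, size=(1000, 50))
    y = rng.binomial(1, 0.5, size=1000)
    print(burden_test(y, X, G))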

32
00:07:43,300 --> 00:07:57,166
There are various versions of the collapsing method, as Suzanne mentioned, like the CMC method, the WSS method, the MZ method, or the variable threshold method. This also explains in what situations the collapsing method works. The collapsing, or Burden, method is optimal

33
00:07:57,166 --> 00:08:10,499
when all the rare variants—all 100 variants within this gene—are causal variants, which means all the betas are different from zero, and also all the variants have effects in the same direction and with the same

34
00:08:10,500 --> 00:08:26,000
magnitude. In that situation the Burden test is going to be the best. However, the picture is not always that beautiful and it is rare that all the rare variants in this region are causal. Therefore, we want to account for

35
00:08:26,000 --> 00:08:40,666
the situation where a majority of the variants in this region are not causal, and also some variants may be protective and some may be harmful. One approach is the SKAT method that we developed, and

36
00:08:40,666 --> 00:08:54,366
its key feature is robustness. We don’t make strong assumptions: we allow most of the variants to have no effect, so there can be lots of null variants, and the effects could be in the same direction or in

37
00:08:54,366 --> 00:09:03,699
different directions. So, what is the basic idea? Recall that in the collapsing method we collapse the variants first and calculate the total dose. Therefore, if

38
00:09:03,700 --> 00:09:22,400
some of the effects are positive and some are negative, they cancel out. Now, for the SKAT method, we don’t collapse the variants; what we do is collapse the statistics. So, suppose you do a single marker

39
00:09:22,400 --> 00:09:35,433
analysis. For each individual marker in GWAS you calculate the [---] test or the likelihood ratio test. Here we don’t do the likelihood ratio, we don’t do the [---]; we calculate the score statistic,

40
00:09:35,433 --> 00:09:48,099
and the score statistic is what is called the Mantel-Haenszel statistic in epidemiology. We aggregate these individual Mantel-Haenszel statistics across markers and then we put in a weight.
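
A minimal sketch of this "collapse the statistics" idea for a continuous trait (this is not the SKAT package itself; the Beta(1, 25) density is the commonly used default weight that up-weights rarer variants):

    import numpy as np
    from scipy import stats

    def skat_style_statistic(y, X, G, a=1.0, b=25.0):
        """Weighted sum of squared per-variant score statistics, computed
        from residuals of the null model (phenotype ~ covariates only)."""
        X1 = np.column_stack([np.ones(len(y)), X])
        beta0 = np.linalg.lstsq(X1, y, rcond=None)[0]
        resid = y - X1 @ beta0                         # null-model residuals
        freq = G.mean(axis=0) / 2.0
        maf = np.minimum(freq, 1.0 - freq)
        w = stats.beta.pdf(maf, a, b)                  # larger weight for rarer variants
        score = G.T @ resid                            # per-variant score statistics
        return np.sum((w * score) ** 2)                # squares avoid sign cancellation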

41
00:09:48,100 --> 00:10:01,666
So what is the weight? The weight is basically how we incorporate biological knowledge into the analysis: we want to up-weight the rare variants and down-weight the common variants. You can see this SKAT statistic is

42
00:10:01,666 --> 00:10:17,432
very easy to calculate: we collapse the individual marker test statistics. Because it is a score statistic, like the Mantel-Haenszel test people learn in graduate school, we only need the null

43
00:10:17,433 --> 00:10:30,266
model. This is the model without any genetic effects, without any SNPs, with only covariates. It is the same across the genome for every single gene. Therefore, you just need to fit the null model once, so

44
00:10:30,266 --> 00:10:45,299
computationally this is very, very fast because you only need to fit the model once. The P value can be calculated very quickly and analytically using a mixture of chi-square distributions. The C-alpha test is a special case of SKAT.
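
Under the null, a statistic like this follows a weighted mixture of chi-square variables, with mixture weights given by eigenvalues of a matrix built from the genotypes and the null model. The SKAT software computes the p-value with accurate analytic methods (e.g. Davies' algorithm); a crude two-moment (Satterthwaite-type) sketch of the same idea looks like this:

    import numpy as np
    from scipy import stats

    def mixture_chi2_pvalue(q_obs, lam):
        """Approximate P(sum_j lam_j * chi2_1 > q_obs) by matching the first
        two moments of the mixture to a single scaled chi-square."""
        lam = np.asarray(lam, dtype=float)
        scale = (lam ** 2).sum() / lam.sum()
        df = lam.sum() ** 2 / (lam ** 2).sum()
        return stats.chi2.sf(q_obs / scale, df)

    # Toy example with three mixture weights
    print(mixture_chi2_pvalue(10.0, [2.0, 1.0, 0.5]))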

45
00:10:45,300 --> 00:11:01,666
Before showing you the speed, first I will show you the weight function. To improve power, we want to incorporate a weight when we collapse the individual test statistics. What we want

46
00:11:01,666 --> 00:11:20,866
to do is this: if there is no weight, every variant is treated equally. We use the beta density as a weight, which gives a broad class of different shapes; no weight is a special case of the beta density. So, you

47
00:11:20,866 --> 00:11:33,866
can see how we weight the rare variants more and the common variants less. Based on population genetics, we believe rare variants are more likely to be causal variants and to have a bigger

48
00:11:33,866 --> 00:11:51,832
effect, and so we up-weight them more.

49
00:11:51,833 --> 00:12:07,433
To show you the speed: we can do this using the score test and it is really fast. If you analyze the whole genome on a laptop, taking a 300 kb region at a time, you will need about seven hours for the whole genome, so computationally this is attractive. I don’t need permutation and I can also control for the covariates. So, this

50
00:12:07,433 --> 00:12:20,899
raises a question. In summary, a Burden test is more powerful when a large fraction of the variants in the region are causal and the effects are in the same direction. In this situation the Burden test is

51
00:12:20,900 --> 00:12:36,500
going to perform well and is going to be better than SKAT. SKAT is more powerful when a small fraction of the variants are causal in the region or the effects have mixed directions—some are positive, some negative. In

52
00:12:36,500 --> 00:12:53,700
reality, when we scan the genome both scenarios can happen and we don’t know which one is which. Therefore, a natural question is: which test to use? To address this, we developed a test called the Optimal Unified Test. What is

53
00:12:53,700 --> 00:13:09,366
the Optimal Unified Test? Let’s talk about the Unified Test first. The unified test is a weighted average of the Burden test statistic and the SKAT test statistic, with weight rho. So, you can see, if I make

54
00:13:09,366 --> 00:13:22,099
rho=0, it becomes SKAT. If I make rho=1, it becomes the Burden test. Then the question is: how can we choose rho? You

55
00:13:22,100 --> 00:13:30,400
can see that for this unified test both the Burden test and the SKAT test are special cases. So the question is: how can you choose rho? What

56
00:13:30,400 --> 00:13:49,500
we can do is use the data to estimate the optimal rho to maximize power. It turns out this unified test is an extended SKAT test in which the correlation among the betas is rho. So, if you think about the

57
00:13:49,500 --> 00:14:06,900
correlation of the betas: if all the betas are the same, that means the correlation among the betas is 1, and then you collapse, right? If the betas are not related to each other, then that is SKAT. So

58
00:14:06,900 --> 00:14:30,800
therefore, you can see that rho basically measures the correlation among the betas. This is the extended SKAT family, and the P value of the optimal unified test, which we call SKAT-O—O stands for optimal—is quickly

59
00:14:30,800 --> 00:14:47,200
calculated analytically, so I don’t need to do permutation; I can compute it analytically very quickly.
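
In symbols, the unified statistic is just Q_rho = (1 - rho) * Q_SKAT + rho * Q_Burden. A minimal sketch of the grid of candidate statistics that SKAT-O searches over (the statistics below are toy numbers; the real method then computes an analytic p-value that accounts for the search over rho):

    import numpy as np

    def unified_statistic(q_skat, q_burden, rho):
        """Q_rho = (1 - rho) * Q_SKAT + rho * Q_Burden.
        rho = 0 recovers SKAT; rho = 1 recovers the Burden test."""
        return (1.0 - rho) * q_skat + rho * q_burden

    # Toy statistics; SKAT-O picks the rho giving the strongest evidence and
    # adjusts the p-value for having searched over the grid.
    q_skat, q_burden = 12.3, 8.7
    for rho in np.linspace(0.0, 1.0, 11):
        print(round(rho, 1), unified_statistic(q_skat, q_burden, rho))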

60
00:14:47,200 --> 00:15:00,600
One thing we need to worry about in whole genome sequencing is the size of the test; whether we can make sure the Type I error rate is right. So, suppose we do whole exome sequencing and in total there are 20,000 genes. Therefore, the genome-wide significance alpha level is

61
00:15:00,600 --> 00:15:23,500
about 10^-6. If you look here, you can see that at the 10^-6 level, for the continuous trait they are all close to 10^-6. For the binary trait they are all close to 10^-6 except when the sample size is about 500; for the case-

62
00:15:23,500 --> 00:15:39,900
control trait the SKAT is then a little bit conservative, and this is what we are going to fix later. To show you the power, first of all we assume the effects of all the causal variants

63
00:15:39,900 --> 00:15:59,933
are in the same direction. If I assume 10% of variants are causal (the blue is the collapsing method, the orange is SKAT, and the red is SKAT-O), you can see that if a majority of the variants are noise, SKAT is

64
00:15:59,933 --> 00:16:13,999
better than the collapsing method and the SKAT-O tries to approximate the SKAT. Now, if I make 50% of variants causal then you can see the collapsing method is better because all the causal variants are in the

65
00:16:14,000 --> 00:16:30,200
same direction and a majority of them are causal; then the collapsing method is better than the SKAT but then the SKAT-O tries to approach the collapsing method. Now, if you see the situation where the variants are in

66
00:16:30,200 --> 00:16:43,466
different directions—80% of the causal variants are positive, 20% are negative—then you can see that in all those situations the SKAT is better than the collapsing method and the SKAT-O tries to approximate the SKAT.

67
00:16:43,466 --> 00:17:01,099
Therefore, SKAT-O adapts to different biological situations. In reality, both scenarios can happen. This can be seen from this early gene sequencing study, the Dallas Heart Study. In this

68
00:17:01,100 --> 00:17:17,633
study, they sequenced about 3,500 people in three gene regions, and in total there are about 93 variants in the 3 genes. The majority of them are singletons; almost 50% are singletons. There are three ethnic

69
00:17:17,633 --> 00:17:35,299
groups, and the trait is continuous; it's triglycerides (TG). So here is the result. If I put all three together and analyze them simultaneously, the Burden P value is about 10^-4 and

70
00:17:35,300 --> 00:17:51,533
the SKAT P value is about 10^-5. For SKAT-O, you can see it tries to approximate the SKAT. In this situation you can estimate rho, and it is pretty small, so that tells you the SKAT should

71
00:17:51,533 --> 00:18:07,066
be better than the Burden test. Now, let's look at a different scenario for ANGPTL3. The Burden test has a P value of about 2x10^-3, the SKAT P value is about 9x10^-3, and SKAT-O tries to approximate the Burden test; in this

72
00:18:07,066 --> 00:18:20,366
situation you can see the betas are more in the same direction. Therefore, SKAT-O is more robust and picks up the more powerful test in both scenarios. Similarly, for the next gene, you can see

73
00:18:20,366 --> 00:18:33,666
that the rho is very small, which indicates the variant effects may be in different directions, so the SKAT is more powerful than the Burden test,

74
00:18:33,666 --> 00:18:49,832
and the SKAT-O tries to approximate the SKAT; you can see its P value is closer to the SKAT one. As I mentioned, in the simulation study we found that the SKAT test may be conservative when the sample size is small.

75
00:18:49,833 --> 00:19:04,733
Do we care? Yes, we do. The reason is, in the current exome sequencing study the sample sizes are pretty small because the cost is high; right now it’s about $1,500 a person, so one cannot sequence a lot

76
00:19:04,733 --> 00:19:18,499
of people. So, the large-sample P values for both the Burden test and the SKAT test are conservative. For example, in this simulation study we simulated 100 cases and 100 controls under the

77
00:19:18,500 --> 00:19:35,500
null—no genetic effect. You can see the Q-Q plot is off the 45-degree line, so we need to improve it to fix the size. So, how can we do it? It turns out we can calculate a small sample adjusted P value analytically

78
00:19:35,500 --> 00:19:50,166
by more accurately approximating the distribution of the test statistic. The green line here is before the correction and the black line is after the correction. You can see that, with the small sample P value

79
00:19:50,166 --> 00:20:07,266
correction, they are all on the straight line, so I am able to correct the size for small-sample exome sequencing studies and make the size right. To illustrate, this work was motivated by the NHLBI Exome Sequencing Project,

80
00:20:07,266 --> 00:20:26,899
so we are part of the lung [---], and the particular phenotype that we have is Acute Lung Injury (ALI). In this study, the Utah sequencing center sequenced about 90 subjects; about half were cases and

81
00:20:26,900 --> 00:20:41,000
half were controls, using whole exome sequencing technology. First, we needed to filter the SNPs, so we used the Broad GSA criteria, the criteria that the Broad provided, and they seemed to work

82
00:20:41,000 --> 00:21:01,766
quite well. After the filtering there were about 106K SNPs left for analysis. So, why is the filtering important? Here I will show you that if you don't filter, there is likely to be a batch effect, so one needs to be

83
00:21:01,766 --> 00:21:16,866
very careful. This is before filtering. This is batch 1, from the Utah sequencing center using an older Illumina platform; in between, Illumina updated the platform, and this is the new platform. You

84
00:21:16,866 --> 00:21:32,199
can see most of the controls were sequenced in the first batch and most of the cases were sequenced in the second batch. So, if you do the principal component analysis, you can clearly see this big principal

85
00:21:32,200 --> 00:21:44,300
component separation into two clusters. So, we were really worried, but after the filtering you can see they were all mixed together, so that is good. One thing that matters is that filtering is very, very important to do

86
00:21:44,300 --> 00:22:06,133
when you analyze your data. Also, when they did the sequencing we shipped the samples to them, and probably the controls and the cases were not on the same plates, and the

87
00:22:06,133 --> 00:22:13,966
sequencing center just sequenced the controls first and the cases later. So, I think if we had known better we would have mixed the samples before sending them to the sequencing center. This is the SKAT

88
00:22:13,966 --> 00:22:29,132
result and this is the SKAT-O result, which, as you recall, is the optimal combination of the Burden and SKAT tests. Before the small sample correction you can see the Q-Q plot is off. After the small sample correction you can

89
00:22:29,133 --> 00:22:46,533
see that they all lie on the straight line, so that is good for this dataset. Okay. Now I will move on to design. Several people have discussed the design of sequencing studies. The first question we answer is: suppose I know

90
00:22:46,533 --> 00:22:58,566
what design to use; say, a case-control design or a continuous phenotype. Then, what are the key parameters needed to design the sequencing study and to do the

91
00:22:58,566 --> 00:23:11,332
power calculation? There are a few things. First, we need to specify the percentage of causal variants. Remember, in my regression model I have 100 betas and I want to know what percentage of the betas are

92
00:23:11,333 --> 00:23:24,733
not equal to 0; those are the causal variants. Second, I need to know the direction of the causal effects. Suppose 20 betas are not equal to 0, so 20% of variants are causal; I want to know, among those 20 betas,

93
00:23:24,733 --> 00:23:38,633
what percentage are protective and what percentage are harmful? Also, I need to specify the effect size of the betas, and here we allow the effect size to be a decreasing function of minor allele frequency. So, if

94
00:23:38,633 --> 00:23:52,466
the minor allele frequency is low, we assume the effect size is high, and if the minor allele frequency is high, we assume the

95
00:23:52,466 --> 00:24:07,966
effect size is low. Also, for case-control sampling we need to specify the prevalence as well. So, we can do the analytic power and sample size calculation for both SKAT and SKAT-O using the SKAT package.
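
As an illustration of how these inputs drive a power calculation, here is a minimal Monte Carlo sketch for a continuous trait and a simple burden test (all parameter values are hypothetical; the SKAT package does this analytically and also handles case-control designs with a specified prevalence):

    import numpy as np
    from scipy import stats

    def burden_power(n=1000, n_snps=50, pct_causal=0.2, pct_protective=0.0,
                     c=0.4, alpha=1e-6, n_rep=200, seed=1):
        """Monte Carlo power of a simple burden test for a continuous trait,
        with effect sizes that decrease with minor allele frequency:
        beta_j = c * |log10(maf_j)| for causal variants."""
        rng = np.random.default_rng(seed)
        hits = 0
        for _ in range(n_rep):
            maf = 10 ** rng.uniform(-3, np.log10(0.05), size=n_snps)  # rare MAFs
            G = rng.binomial(2, maf, size=(n, n_snps))
            causal = rng.random(n_snps) < pct_causal
            beta = np.where(causal, c * np.abs(np.log10(maf)), 0.0)
            beta[causal & (rng.random(n_snps) < pct_protective)] *= -1.0
            y = G @ beta + rng.normal(size=n)
            burden = G.sum(axis=1)                    # collapse all variants
            r, _ = stats.pearsonr(burden, y)
            t = r * np.sqrt((n - 2) / (1 - r ** 2))
            pval = 2 * stats.t.sf(abs(t), n - 2)
            hits += pval < alpha
        return hits / n_rep

    print(burden_power())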

96
00:24:07,966 --> 00:24:23,766
To illustrate, here is just one example. Suppose I use a window size of 3 kb to mimic exome sequencing, the target power is 80%, and the genome-wide significance level is 10^-6. If I assume a

97
00:24:23,766 --> 00:24:36,999
continuous trait with heritability of 6% (this is a little bit high; you can lower it), then if 20% of variants are causal you need to sequence about 5,000 people, and if 30% of variants are causal, you need to sequence about 2,000

98
00:24:37,000 --> 00:24:52,633
people. For the case-control trait, if the prevalence is 10%, you need to sequence about 9,000 people, and if 30% of variants are causal then you need to sequence about 4,000 people. If the prevalence is lower, then

99
00:24:52,633 --> 00:25:05,399
you need to sequence fewer people. Here we assumed, as shown, an effect size that decreases with minor allele frequency, with an odds ratio of about 2.5 at a minor allele frequency of 3%. You can do

100
00:25:05,400 --> 00:25:22,200
this sample size calculation any way you would like in the SKAT package. Okay, so here is another question we face, because right now exome sequencing is still pretty expensive, so

101
00:25:22,200 --> 00:25:33,833
suppose my cohort has 10,000 people and I can only afford to sequence 500 of them. What sampling strategy should I use to increase the analysis

102
00:25:33,833 --> 00:25:48,633
power to study the rare variant effects? Should I do random sampling and randomly select 500 people? Or should I do extreme sampling and pick 250 people in the upper tail and 250 people in the lower tail? So,

103
00:25:48,633 --> 00:26:02,233
which sampling will give me better power? Here are our choices. The first is random sampling: I have a continuous trait, for example BMI, and I can analyze the continuous trait and apply the

104
00:26:02,233 --> 00:26:17,399
SKAT-O method. Or I can do extreme phenotype sampling, basically like this: think about the whole BMI distribution; I pick 250 people here and 250 people here. After I sample through extreme phenotype

105
00:26:17,400 --> 00:26:28,733
sampling, there are several analysis strategies. The first is to treat the upper group as cases and the lower group as controls; you treat the outcome as a yes/no outcome and do

106
00:26:28,733 --> 00:26:42,433
logistic regression. Or you can keep the phenotype as continuous. But you can see that for these 500 people the phenotype distribution is not normal anymore; it is a truncated normal. So,

107
00:26:42,433 --> 00:26:55,533
in the analysis you can use the continuous phenotype, but you need to use a truncated normal model. So, we have three analysis strategies. The first is to treat the data as case-control, high versus low,

108
00:26:55,533 --> 00:27:10,866
with logistic regression and do the Burden test; we call this EDP, where the "D" stands for dichotomous trait, and I do the Burden test. Second, I can treat this as a case-control analysis and apply SKAT-O, so this is a

109
00:27:10,866 --> 00:27:24,599
dichotomous trait with SKAT-O. The last strategy is to use the observed BMI and treat it as continuous, but the regression is not simple linear regression; one needs to use truncated normal linear regression, and I

110
00:27:24,600 --> 00:27:41,333
apply SKAT-O. We have developed this analysis strategy, and you can see it is going to be the most powerful one.
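
A minimal sketch of the truncated-normal idea (a simplified illustration, not the exact procedure from the talk): the likelihood of each sampled subject is the normal density divided by the probability of falling in either tail, and a genetic term, such as a burden score, can then be tested, for example with a likelihood-ratio test.

    import numpy as np
    from scipy import stats, optimize

    def eps_negloglik(params, y, design, c_low, c_high):
        """Negative log-likelihood of a linear model when subjects were kept
        only if y <= c_low or y >= c_high (extreme phenotype sampling)."""
        beta, log_sigma = params[:-1], params[-1]
        sigma = np.exp(log_sigma)
        mu = design @ beta
        keep_prob = stats.norm.cdf(c_low, mu, sigma) + stats.norm.sf(c_high, mu, sigma)
        return -np.sum(stats.norm.logpdf(y, mu, sigma) - np.log(keep_prob))

    def fit_eps(y, design, c_low, c_high):
        """Maximum likelihood fit under the two-sided truncation."""
        x0 = np.zeros(design.shape[1] + 1)            # betas plus log(sigma)
        res = optimize.minimize(eps_negloglik, x0, method="L-BFGS-B",
                                args=(y, design, c_low, c_high))
        return res.x, res.fun                          # estimates, minimized -loglik

Comparing the minimized negative log-likelihoods with and without the genetic term gives a likelihood-ratio chi-square test; the strategy in the talk applies SKAT-O under this kind of truncated-normal model.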

111
00:27:41,333 --> 00:27:58,866
Here's a power comparison from the simulation; let's look at it. The sample size is 4,000. The blue indicates random sampling using the continuous trait, and the pink is the dichotomous analysis using the Burden test. The green is

112
00:27:58,866 --> 00:28:14,332
the dichotomous analysis using SKAT-O. The last one is the continuous analysis using the truncated normal and SKAT-O. For the first scenario, I assume 40% of variants are causal and

113
00:28:14,333 --> 00:28:28,966
some betas are positive, some betas are negative. Then you can see that if I use the truncated normal continuous SKAT-O strategy, it's going to be the most powerful. Similarly, if I still have 40% of variants causal but

114
00:28:28,966 --> 00:28:45,099
all the effects are in the same direction, then you can see the collapsing method picks up, but SKAT-O using the continuous trait is still more powerful. So, the next question is: why should extreme phenotype sampling

115
00:28:45,100 --> 00:29:00,933
help improve the analysis power? The reason is that extreme phenotype sampling helps enrich causal variants. Here is the minor allele frequency on the minus log10 scale. This side is more

116
00:29:00,933 --> 00:29:15,299
rare, this side is more common. This is the observed minor allele frequency divided by the population minor allele frequency. You can see that if no variant is causal there is no enrichment. But at a heritability of 2.6, you can

117
00:29:15,300 --> 00:29:30,833
see the rare variants enriched by about two times, and that's why it helps improve the analysis power. This is when you pick the top 10% and bottom 10%; if you pick the top 20% and bottom 20%, then the

118
00:29:30,833 --> 00:29:49,766
enrichment is going to be less; that makes sense. Okay. The last part of this talk: many of us have GWAS data, so I want to know, if I use GWAS data to impute exome data, how does that work? Will that work, or

119
00:29:49,766 --> 00:30:03,466
will it be junk? The first question we want to answer is how well imputing rare variants from the GWAS data and the 1000 Genomes reference works, compared with the exome data. We use the

120
00:30:03,466 --> 00:30:18,566
exome seq data as the gold standard, impute the rare variants, and then compare how good the imputation is. Second, after we impute the data, I want to see whether, if I do the association study using the

121
00:30:18,566 --> 00:30:36,966
imputed SNPs with either the Burden test or SKAT-O, the P value agrees with the P value using the exome seq data. The data set we have is these 89 ALI

122
00:30:36,966 --> 00:30:49,366
subjects. For these 89 subjects we have both exome seq data and GWAS data, because they are part of a larger GWAS of about 2,000 people. Okay,

123
00:30:49,366 --> 00:31:06,266
so let's see how that works. First, let's look at the imputation. We used the 1000 Genomes reference to impute; for the GWAS data we have the Illumina 610K array. In total, using 1000 Genomes we can impute about 38

124
00:31:06,266 --> 00:31:28,466
million SNPs, and this is the overlap. You can see that the imputed SNPs and the GWAS SNPs have very, very high overlap; almost 99% of the GWAS SNPs are in the imputed set. But if you look at the exome

125
00:31:28,466 --> 00:31:46,966
seq data against the imputed SNPs, you can see we have about 79,000 exome seq SNPs that we can impute. About 29,000 SNPs are not in the imputed data; so what are

126
00:31:46,966 --> 00:31:59,366
these 29,000 SNPs? Most of them are singletons, so they only appear in the exome seq data. They are not in the 1000 Genomes reference, so that's why we cannot impute them. Also, as for the agreement between the exome seq

127
00:31:59,366 --> 00:32:13,166
data and the GWAS data: for the GWAS data only a very small fraction of the SNPs are in the exome, and among those 9,000 SNPs the agreement is almost 99.9%. So that means the exome

128
00:32:13,166 --> 00:32:29,966
sequencing genotype calling is very good; we have good agreement here. Okay. To look a little bit closer, I'm going to focus on chromosome 1. For chromosome 1, in total there are about 11,000 exome seq SNPs

129
00:32:29,966 --> 00:32:53,466
from the exome sequencing platform, and among these 11,000 exome seq SNPs about 8,000 could be imputed from the GWAS and the 1000 Genomes data. Among these 8,000 SNPs, which

130
00:32:53,466 --> 00:33:09,866
are part of the exome seq data, almost 4,000—50%—are common variants, about 2,000 have minor allele frequency between 1% and 5%, and about 1,700 have minor allele frequency less than 1%.

131
00:33:09,866 --> 00:33:29,066
Now, if I look at the imputation quality scores of the imputed SNPs against the exome seq SNPs and keep those with a score greater than 0.3, I basically focus on the good quality imputed SNPs. You can see that for many of the common variants the

132
00:33:29,066 --> 00:33:42,766
imputation quality is good, but for the rare variants the imputation quality is not very good; that makes sense. To see how this works, here I plotted the imputed dose against the

133
00:33:42,766 --> 00:33:57,266
exome seq genotype, using the exome seq genotype as the gold standard. Let's look at the common variants first. The common variants have genotypes 0, 1, and 2; 0 stands for wild-type.

134
00:33:57,266 --> 00:34:10,666
So, the box plots of the dose agree almost exactly with 0, 1, and 2, which means that for common variants the imputation worked very well. For uncommon variants, with minor allele frequency between 1% and 5%,

135
00:34:10,666 --> 00:34:25,566
the wild-type agreed very well, but for the heterozygous and homozygous genotypes, even though the median imputed dose is close to 1 and 2, there is quite a bit of variability

136
00:34:25,566 --> 00:34:43,166
here. Also, for the rare variants, the wild-type agreed very well; for the heterozygous genotypes there is some variability, and some imputed doses do not agree with the exome seq. So overall, the imputation worked pretty well.
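
A minimal sketch of this kind of comparison (hypothetical arrays: `dose` and `geno` are subjects-by-variants matrices of imputed dosages and exome-seq genotype calls, and `maf` holds per-variant minor allele frequencies):

    import numpy as np

    def dosage_r2_by_maf(dose, geno, maf, bins=(0.0, 0.01, 0.05, 0.5)):
        """Squared correlation between imputed dosage and sequenced genotype,
        summarised within minor allele frequency bins (geno = gold standard)."""
        out = {}
        for lo, hi in zip(bins[:-1], bins[1:]):
            idx = (maf > lo) & (maf <= hi)
            d = dose[:, idx].ravel()
            g = geno[:, idx].ravel()
            out[(lo, hi)] = np.corrcoef(d, g)[0, 1] ** 2
        return out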

137
00:34:43,166 --> 00:34:58,966
Okay. Now, the next question we asked ourselves is: suppose I use the imputed data to do the association study, and I also have the exome seq data to do the association study. For

138
00:34:58,966 --> 00:35:14,166
each gene, I computed two P values: one using the exome seq data, the other using the imputed SNP data. How much do they agree? It turns out they are not bad. For example, if I use all SNPs and

139
00:35:14,166 --> 00:35:32,566
use SKAT-O (this is the P value on the log scale, with the imputed data on one axis and the exome seq data on the other), the correlation r is about 0.75. If I restrict to the rare variants, the r is about 0.74, so not

140
00:35:32,566 --> 00:35:49,466
too bad using the imputed data. This is using the CMC method, and you can see the agreement is similar: if I use all the SNPs, the agreement between the P values using the imputed data and the exome seq data is about

141
00:35:49,466 --> 00:36:11,366
0.9; if I restrict to rare variants, the r is about 0.7, so it's not too bad. So in summary, the Burden test and SKAT work well in different situations, and SKAT-O is an optimal unified test that works well in

142
00:36:11,366 --> 00:36:26,199
both situations, adapting to the [---] in the biology, which is often unknown. We have developed software to help people with power and sample size calculations for designing sequencing studies; especially if you need to

143
00:36:26,200 --> 00:36:42,200
write a large grant you need the power calculation, so you can use this package. We have found that extreme phenotype sampling can help enrich rare variants, and if you use a continuous truncated normal

144
00:36:42,200 --> 00:36:59,200
outcome to analyze the extreme phenotype sample, that can help increase analysis power. Also, based on the limited data we have, we found that analysis of rare variants imputed using the GWAS+1000

145
00:36:59,200 --> 00:37:16,100
Genomes data gives promising results. This is really team work with many people, especially students and post-docs; those are the people involved in the original SKAT paper and the SKAT-O paper. This is my

146
00:37:16,100 --> 00:37:27,100
post-doc, Seunggeun Lee, who is the driving force for the SKAT package. For the imputation, Lin Li is my post-doc and he is the driving force for that. Ian Barnett is my student who helped with the

147
00:37:27,100 --> 00:37:46,400
extreme phenotype project, and these are the investigators who have helped with the ALI NHLBI Exome Sequencing Project. The SKAT and SKAT-O package can be found

148
00:37:46,400 --> 00:37:59,600
from this website. As for the other references: SKAT was published in 2011 in the American Journal of Human Genetics, SKAT-O is under review, and the C-alpha test is a special case of SKAT.

149
00:37:59,600 --> 00:38:14,100
There are multiple Burden test references, which I think Suzanne mentioned, and there is one very nice paper we just published in PLoS Genetics this month that compares different association test methods, and if

150
00:38:14,100 --> 00:38:31,800
you are interested you can take a look at this paper. Thank you. MALE: One question concerning the use of GWAS data in imputation is

151
00:38:31,800 --> 00:38:48,900
that this GWAS was sort of the old version—the Illumina 660W, I guess—so you only had, like, 500,000 SNPs. If you were to use like a 2.5 million SNP GWAS, how much would that affect the imputation quality?

152
00:38:48,900 --> 00:38:58,000
XIHONG LIN: That’s a good question. I don’t know. We don’t have that data so it’s hard. I cannot answer that question. Probably other people have that experience and can help answer that question.

153
00:38:58,000 --> 00:39:06,033
MALE: Thank you. PETER SONG: I’m Peter Song from Biostat, University of Michigan. I help a

154
00:39:06,033 --> 00:39:27,333
lot of my colleagues from nephrology design studies, so I'm very interested in your software, SKAT. First of all, thank you for the interesting, informative talk and very nice work. In the design you told

155
00:39:27,333 --> 00:39:43,166
us that we need to specify the effect size of these SNPs, so I wonder if this effect size will be the same in the design. Let’s say you believe there are 20 SNPs that are causal, then in the design you will assign the same

156
00:39:43,166 --> 00:39:56,066
effect size or different effect sizes? XIHONG LIN: Well, nobody knows what is the best way to specify

157
00:39:56,066 --> 00:40:11,566
the effect size. We know that for the common variants, based on published GWAS, the effect size is small; it's about 1.2 or 1.3. So basically, we just assume it is decreasing; I think we

158
00:40:11,566 --> 00:40:23,966
assume it is a decreasing function, a log function of the minor allele frequency, but I think if more data and more results are published then probably we will have a better understanding of what the true effect size

159
00:40:23,966 --> 00:40:30,666
could be. PETER SONG: Another thing is rho, because rho is a weighting that you

160
00:40:30,666 --> 00:40:37,166
use in the SKAT-O. In the design do I need to specify the rho coefficient—the weighting?

161
00:40:37,166 --> 00:40:48,466
XIHONG LIN: Yeah, rho you can estimate. Suppose, for example, when you design the study you assume all the effects are positive, all in the same direction, and all the same. In that situation you can estimate

162
00:40:48,466 --> 00:40:57,566
the rho from how you specify the effect size. So in that situation, rho will be 1, and so whatever effects that you specify, you can estimate the rho.

163
00:40:57,566 --> 00:41:10,066
PETER SONG: Okay, so essentially I don’t need to specify rho in your software; I just input the number of the causal SNPs and effect size, then the software automatically estimates the rho?

164
00:41:10,066 --> 00:41:19,332
XIHONG LIN: Exactly. So as the investigator, you don't need to specify rho; you specify the effect size and how it changes with [---] frequency, and then the software will automatically estimate the rho.

165
00:41:19,333 --> 00:41:20,233
PETER SONG: Okay. Thanks.




Date Last Updated: 9/18/2012
