February 11-12, 2012 Conference Videos

Xihong Lin, Harvard University

Video Transcript

1

00:00:00,000 --> 00:00:14,066

ANDREY SHAW: Our next speaker is Xihong Lin, who is a professor of biostatistics at Harvard. She has an interest in developing statistical methods for high-dimensional and correlated data, and the title of her talk

2

00:00:14,066 --> 00:00:25,032

today is Rare Variant Discovery, Design and Analytic Strategies. XIHONG LIN: Thank you. First of all, I would like to thank the organizers for

3

00:00:25,033 --> 00:00:46,733

inviting me. Can people hear? Thank you. So, both Suzanne and Joan have given excellent talks and built a strong foundation for my talk. So, my talk is an extension of what they have talked about. So as we know, in

4

00:00:46,733 --> 00:01:01,199

GWAS studies we have been using individual SNP analysis, but for exome sequencing and whole genome sequencing studies, such individual SNP analysis has little power for detecting rare

5

00:01:01,200 --> 00:01:19,500

variants. Therefore, this talk will focus on three topics. The first topic is: people have proposed Burden tests, that is, collapsing tests, and also non-Burden tests, so a natural question is which test to use. The second

6

00:01:19,500 --> 00:01:34,366

question we will discuss in this talk is how to design the next-generation whole exome or whole genome sequencing study; in particular, whether extreme phenotype sampling will be better than random

7

00:01:34,366 --> 00:01:49,599

sampling. The third question I want to discuss is: many of us have GWAS data, so if we use the GWAS data to impute rare variants, how does that work? Would that be valuable, or would the imputation be too inaccurate to be

8

00:01:49,600 --> 00:02:05,866

valuable? So to start, the first thing we need to realize is that when we work with rare variants, things are complicated. First of all, we need many more subjects to observe the rare variants, even if the rare variants exist

9

00:02:05,866 --> 00:02:20,832

in the population; if we don’t have enough subjects, we will not be able to observe them. To illustrate, one can do a very simple binomial calculation to compute the

10

00:02:20,833 --> 00:02:32,399

minimum sample size needed to observe at least one variant for a given population minor allele frequency with a high probability; for example, a 99% chance. So, that means you will almost certainly observe

11

00:02:32,400 --> 00:02:46,200

these variants in your sample. So, if the population minor allele frequency is 10%, then you need at least 33 people to observe this variant, and therefore in almost every single GWAS study you will be able to

12

00:02:46,200 --> 00:03:01,966

observe this variant. However, if the minor allele frequency is 1%, then you need at least 300 people, and if the minor allele frequency becomes smaller, like 0.1%, then you need at least

13

00:03:01,966 --> 00:03:15,066

3,000 people. Even though the variants exist in the population, if you don’t have enough people you may not observe them. This makes analysis of rare variants more challenging.
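The binomial calculation described here can be sketched as follows. This is a generic illustration, not the speaker's exact computation: it assumes diploid subjects and independence of the 2n sampled chromosomes, so the counts come out the same order of magnitude as, but not identical to, the figures quoted in the talk.

```python
import math

def min_subjects(maf, prob=0.99):
    """Smallest number of diploid subjects n such that the chance of
    observing at least one copy of an allele with population minor
    allele frequency `maf` is at least `prob`, i.e.
        1 - (1 - maf)**(2*n) >= prob,
    assuming the 2n sampled chromosomes are independent."""
    # Solve (1 - maf)**(2n) <= 1 - prob for n and round up.
    return math.ceil(math.log(1 - prob) / (2 * math.log(1 - maf)))

for maf in (0.10, 0.01, 0.001):
    print(f"MAF {maf:.1%}: need at least {min_subjects(maf)} subjects")
```

Under these assumptions the required counts are 22, 230, and 2,302 subjects for MAFs of 10%, 1%, and 0.1%; the talk's quoted figures (33, 300, 3,000) are the same order of magnitude but evidently use a slightly different criterion.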

14

00:03:15,066 --> 00:03:33,132

To illustrate, suppose you can only afford to sequence 100 people; then you are able to observe almost all the variants with minor allele frequency above 1%, but almost 90% of the variants with minor allele

15

00:03:33,133 --> 00:03:46,199

frequency less than 1% will not be observed, because they are too rare; if you don’t have enough people, even if they exist in the population, you cannot observe them. Suzanne talked about the analysis pipeline, so I

16

00:03:46,200 --> 00:04:00,433

won’t reiterate the analysis pipeline for sequencing studies. First of all, as Suzanne mentioned, the QC is very, very important, so I will illustrate: if you don’t do the QC well, if you don’t do the filtering well, then

17

00:04:00,433 --> 00:04:14,999

you may have the chi-[---] effect. The second point is that for a sequencing study, to study the rare variants one needs to move away from individual SNP analysis and focus on region-based analysis.
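Region-based analysis groups variants into testable units. A minimal sketch of forming such units from sorted variant positions is below; the fixed window size and 50% overlap are illustrative choices, not prescriptions from the talk, and gene boundaries would work the same way.

```python
def sliding_windows(positions, size=3000, step=1500):
    """Group sorted variant positions (in base pairs) into overlapping
    windows of `size` bp that advance by `step` bp; returns a list of
    (window_start, [positions inside]) for non-empty windows."""
    if not positions:
        return []
    windows = []
    start = positions[0]
    last = positions[-1]
    while start <= last:
        inside = [p for p in positions if start <= p < start + size]
        if inside:
            windows.append((start, inside))
        start += step
    return windows

pos = [100, 900, 2500, 3100, 4800, 7200]  # toy variant positions
for start, members in sliding_windows(pos):
    print(start, members)
```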

18

00:04:15,000 --> 00:04:28,933

So a natural question to ask is how to define the regions. If you define the region well, you may have power, but if you don’t define the region well, the region may have too many variants not

19

00:04:28,933 --> 00:04:41,366

associated with disease, and you might lose power. So, in exome sequencing studies a natural way to define the region is a gene, and so in the whole genome study one can think about the moving window to

20

00:04:41,366 --> 00:04:55,932

cover the intergenic regions. Then after you define the region, how can you do the analysis? You observe both the common variants and the rare variants, so you want to use all the information in the analysis, and also

21

00:04:55,933 --> 00:05:10,333

for genetic studies, adjustment for covariates such as population stratification is important, so we want to build a statistical model to account for the covariates. After you have identified the top regions,

22

00:05:10,333 --> 00:05:24,933

then a natural question is: are those top regions really the truth? So, you want to think about: how can you design the validation study and how can you develop the strategy for functional analysis? So, every stage is

23

00:05:24,933 --> 00:05:42,733

challenging. So, let’s next talk about the statistical models to analyze the rare variants. So as I mentioned, for the rare variants we need to move away from the single SNP analysis to multi-marker analysis. So, suppose

24

00:05:42,733 --> 00:05:59,033

you have covariates like age and gender, and population stratification—like principal components—to control for, and then suppose in this gene you have P SNPs—say, 100 SNPs in this gene—and many of them are rare. So therefore

25

00:05:59,033 --> 00:06:16,166

for each column there are many, many zeros and only a small number of 1s or 2s. Now what we want to do is, like in single SNP analysis, fit a regression model. We regress the disease phenotype—for a

26

00:06:16,166 --> 00:06:31,099

continuous trait we can do a linear regression, and for a case-control trait we can do a logistic regression—on the covariates and also the P SNPs. Ideally, if all the betas associated with

27

00:06:31,100 --> 00:06:46,133

the P SNPs are equal to zero, then there is no association between any SNP within the gene and the disease outcome. So, we want to test whether all 100 betas equal zero. How can we do that? So,

28

00:06:46,133 --> 00:06:58,933

you can see that if you do a 100-degree-of-freedom test there is no power, so one natural way is to collapse them. How can we collapse them? We restrict to the rare variants. So,

29

00:06:58,933 --> 00:07:12,066

how do you define the rare variants? We select the variants based on some criterion, for example minor allele frequency less than 5%, and then we collapse all the rare variants. By doing that, you can see

30

00:07:12,066 --> 00:07:30,132

that if all the betas—beta 1 to beta 100—are the same and you take the common beta out, then what is left inside is basically a dose, that is, the total number of rare variants within that gene. So, this basically is the

31

00:07:30,133 --> 00:07:43,299

collapsing method: you assume the beta associated with each individual variant is the same, take it out, and calculate the dose. So, there are various versions of the collapsing method, as Suzanne

32

00:07:43,300 --> 00:07:57,166

mentioned, like the CMC method, the WSS method, the MZ method, or the variable threshold method. This also explains in what situations the collapsing method works. The collapsing and Burden methods are optimal

33

00:07:57,166 --> 00:08:10,499

when all the rare variants—all the 100 variants within this gene—are causal variants, so that means all the betas [---] to zero, and all the variants have effects in the same direction and of the same

34

00:08:10,500 --> 00:08:26,000

magnitude. In that situation the Burden test is going to be the best. However, the picture is not always that beautiful and it is rare that all the rare variants in this region are causal. Therefore, we want to account for

35

00:08:26,000 --> 00:08:40,666

the situation where a majority of the variants in this region are not causal and also some variants may be protective and some variants may be harmful. So, one approach is called the SKAT method we developed and

36

00:08:40,666 --> 00:08:54,366

the key feature is robustness. We don’t make any assumptions; we allow most of the variants to have no effect—lots of null variants—and the effects could be in the same direction or they could

37

00:08:54,366 --> 00:09:03,699

be in different directions. So, what is the basic idea? Recall that in the collapsing method, we collapse the variants first and calculate the total dose. Therefore, if among the effects

38

00:09:03,700 --> 00:09:22,400

some are positive and some negative, they cancel out. Now for the SKAT method, we don’t collapse the variants; we collapse the statistics. Suppose you do a single marker

39

00:09:22,400 --> 00:09:35,433

analysis: for each individual marker, as in GWAS, you calculate the [---] test or the likelihood ratio test. Here we don’t do the likelihood ratio, we don’t do the [----]; we calculate the score statistic,

40

00:09:35,433 --> 00:09:48,099

and the score statistic is what is called the Mantel-Haenszel statistic in epidemiology. We aggregate these individual Mantel-Haenszel statistics across markers, and then we put in a weight.
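The two aggregation strategies contrasted here—collapsing genotypes into a single dose versus aggregating weighted, squared per-variant score statistics—can be sketched in a few lines of numpy. This is a simplified illustration on simulated null data: covariate adjustment and the mixture-of-chi-square p-value of the actual SKAT method are omitted, and the Beta(1, 25) weight is the SKAT default up to a constant.

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 500, 20                      # subjects, variants in one region
maf = rng.uniform(0.001, 0.05, p)   # rare-variant allele frequencies
G = rng.binomial(2, maf, size=(n, p)).astype(float)  # genotypes 0/1/2
y = rng.normal(size=n)              # continuous trait under the null

# Burden/collapsing: sum the rare genotypes into one dose per subject
# and test a single regression coefficient for that dose.
burden_dose = G.sum(axis=1)

# SKAT-style: compute a per-variant score S_j = g_j' (y - y_bar) and
# aggregate the *squared*, weighted scores, so effects in opposite
# directions do not cancel the way they do in the burden dose.
resid = y - y.mean()                # residuals from the null model
scores = G.T @ resid                # one score statistic per variant
weights = (1 - maf) ** 24           # Beta(1, 25) density up to a constant;
                                    # heavily upweights the rarest variants
Q = float(np.sum((weights * scores) ** 2))
print(f"burden dose of first subject: {burden_dose[0]:.0f}, Q = {Q:.2f}")
```

Because only null-model residuals enter the score, the null model is fit once and reused for every region, which is what makes the genome-wide scan fast.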

41

00:09:48,100 --> 00:10:01,666

So what is the weight? The weight is basically how we incorporate biological knowledge into the analysis: we want to upweight the rare variants and downweight the common variants. You can see this SKAT statistic is

42

00:10:01,666 --> 00:10:17,432

very easy to calculate: we collapse the individual marker test statistics. Because it is a score statistic, like the Mantel-Haenszel test that people learn in graduate school, we only need the null

43

00:10:17,433 --> 00:10:30,266

model. So, this is a model without any genetic [---], without any SNPs, only covariates. This null model is the same across the genome for every single gene. Therefore, you just need to fit it once, so

44

00:10:30,266 --> 00:10:45,299

computationally this is very, very fast. The P value can be calculated very quickly and analytically using a mixture of chi-square distributions. The C-alpha test is a special

45

00:10:45,300 --> 00:11:01,666

case of SKAT. First, let me show you the weight function. In order to improve the power, we want to incorporate a weight when we collapse the individual test statistics. What we want

46

00:11:01,666 --> 00:11:20,866

to do is, rather than using no weight—treating every variant equally—we use a beta density as the weight, which covers a broad class of different shapes; no weight is a special case of the beta density. So, you

47

00:11:20,866 --> 00:11:33,866

can see how we weight the rare variants more and the common variants less. Based on population genetics, we believe that rare variants are more likely to be causal variants and to have a bigger

48

00:11:33,866 --> 00:11:51,832

effect, and so we upweight them more. To show you the speed: we can do this using the score test, and it’s really, really fast. If one uses a laptop to analyze the whole genome, taking a 300Kb

49

00:11:51,833 --> 00:12:07,433

region at a time, you will need about seven hours for a whole genome, so computationally this is attractive. I don’t need permutation and I can also control for the covariates. So, this

50

00:12:07,433 --> 00:12:20,899

comes to a question. In summary, the Burden test is more powerful when a large fraction of the variants within the region are causal and the effects are in the same direction. In this situation the Burden test is

51

00:12:20,900 --> 00:12:36,500

going to perform well and is going to be better than SKAT. SKAT is more powerful when a small fraction of the variants are causal in the region or the effects have mixed directions—some are positive, some negative. In

52

00:12:36,500 --> 00:12:53,700

reality, when we scan the genome both scenarios can happen and we don’t know which one is which. Therefore, a natural question is: which test to use? So, we developed this test called the Optimal Unified Test. What is

53

00:12:53,700 --> 00:13:09,366

the Optimal Unified Test? Let’s talk about the Unified Test first. The unified test is a weighted average of the Burden test and the SKAT test statistics, where rho is the weight. You can see that if I make

54

00:13:09,366 --> 00:13:22,099

rho=0, this becomes SKAT. If I make rho=1 then this becomes the Burden test. Then the question is: how can we choose rho? You

55

00:13:22,100 --> 00:13:30,400

can see that for this unified test both the Burden test and the SKAT test are special cases. So the question is: how can you choose rho? What

56

00:13:30,400 --> 00:13:49,500

we can do is use the data to estimate the optimal rho to maximize the power. It turns out this unified test is an extended SKAT test with the correlation of the betas equal to rho. So, if you think about the

57

00:13:49,500 --> 00:14:06,900

correlation of the betas…if you look here…if all the betas are the same, that means the correlation of the betas is 1, and then you collapse, right? If the betas are not related to each other, then that is SKAT. So

58

00:14:06,900 --> 00:14:30,800

therefore, you can see that rho basically measures the correlation among the betas. This is the extended SKAT family, and the P value of the optimal unified test, which we call SKAT-O—O stands for optimal—is quickly

59

00:14:30,800 --> 00:14:47,200

calculated analytically, so I don’t need to do permutation. One thing we need to worry about in sequencing—in whole genome sequencing—is the size of the test;

60

00:14:47,200 --> 00:15:00,600

whether we can make sure the type 1 error rate is right. So, suppose we do whole exome sequencing and in total there are 20,000 genes. Therefore, the alpha level—the genome-wide significance level—is

61

00:15:00,600 --> 00:15:23,500

about 10^-6. If you look here, you can see that at 10^-6, for the continuous trait the empirical sizes are all close to 10^-6. For the binary trait they are all close to 10^-6 except when the sample size is about 500, and for the case

62

00:15:23,500 --> 00:15:39,900

control trait SKAT is a little bit conservative; this is what we are going to fix later. So, to show you the power, first of all we assume all the causal variants

63

00:15:39,900 --> 00:15:59,933

are in the same direction. If I assume 10% of variants are causal—the blue is the collapsing method, the orange is SKAT, and the red is SKAT-O—you can see that if a majority of the variants are noise, SKAT is

64

00:15:59,933 --> 00:16:13,999

better than the collapsing method and the SKAT-O tries to approximate the SKAT. Now, if I make 50% of variants causal then you can see the collapsing method is better because all the causal variants are in the

65

00:16:14,000 --> 00:16:30,200

same direction and a majority of them are causal; then the collapsing method is better than the SKAT but then the SKAT-O tries to approach the collapsing method. Now, if you see the situation where the variants are in

66

00:16:30,200 --> 00:16:43,466

different directions—80% of the causal variants are positive, 20% are negative—then you can see that in all those situations SKAT is better than the collapsing method and SKAT-O tries to approximate SKAT.
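The unified statistic underlying SKAT-O can be sketched as the weighted average described earlier, Q_rho = rho·Q_Burden + (1−rho)·Q_SKAT, evaluated over a grid of rho. This is only the statistic family on simulated null data; the analytic p-value and data-driven choice of the optimal rho in the real method are omitted.

```python
import numpy as np

rng = np.random.default_rng(1)
n, p = 500, 20
maf = rng.uniform(0.001, 0.05, p)
G = rng.binomial(2, maf, size=(n, p)).astype(float)
y = rng.normal(size=n)              # trait simulated under the null

resid = y - y.mean()
w = (1 - maf) ** 24                 # Beta(1, 25) weights up to a constant
S = w * (G.T @ resid)               # weighted per-variant score statistics

q_skat = float(np.sum(S ** 2))      # rho = 0: square first, then sum
q_burden = float(np.sum(S)) ** 2    # rho = 1: sum first, then square

# The unified family interpolates between the two extremes; SKAT-O
# additionally turns the grid of statistics into a single p-value.
for rho in (0.0, 0.25, 0.5, 0.75, 1.0):
    q_rho = rho * q_burden + (1 - rho) * q_skat
    print(f"rho = {rho:.2f}: Q_rho = {q_rho:.2f}")
```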

67

00:16:43,466 --> 00:17:01,099

Therefore, SKAT-O adapts to different biological situations. So in reality, both scenarios can happen. This can be seen from this early gene sequencing study, the Dallas Heart Study. So in this

68

00:17:01,100 --> 00:17:17,633

study, they sequenced about 3,500 people in selected gene regions, and in total there are about 93 variants in the 3 genes. The majority of them are singletons; almost 50% are singletons. There are three ethnicity

69

00:17:17,633 --> 00:17:35,299

groups, and the trait is continuous; it’s TG. So here is the result. If I put all three groups together and analyze them simultaneously, then the Burden P value is about 10^-4 and

70

00:17:35,300 --> 00:17:51,533

the SKAT P value is about 10^-5. For SKAT-O you can see it tries to approximate SKAT. In this situation you can estimate rho, and it is pretty small, so that tells you SKAT should

71

00:17:51,533 --> 00:18:07,066

be better than the Burden test. Now, let’s look at a different scenario for ANGPTL3. The Burden test has a P value of about 2x10^-3, SKAT about 9x10^-3, and SKAT-O tries to approximate the Burden test; in this

72

00:18:07,066 --> 00:18:20,366

situation you can see the betas are more in the same direction. Therefore, SKAT-O looks more robust and picks up the more powerful test between these two scenarios. Similarly, for the next gene, you can see

73

00:18:20,366 --> 00:18:33,666

that rho is very small, which indicates that the variant effects may be in different directions, so SKAT is more powerful than the Burden test,

74

00:18:33,666 --> 00:18:49,832

and SKAT-O tries to approximate SKAT; you can see its P value is closer to SKAT’s. As I mentioned, in the simulation study we found that the SKAT test may be conservative when the sample size is small.

75

00:18:49,833 --> 00:19:04,733

Do we care? Yes, we do. The reason is, in the current exome sequencing study the sample sizes are pretty small because the cost is high; right now it’s about $1,500 a person, so one cannot sequence a lot

76

00:19:04,733 --> 00:19:18,499

of people. So, the large-sample P values for both the Burden test and the SKAT test are conservative. For example, in this simulation study we simulated 100 cases and 100 controls under the

77

00:19:18,500 --> 00:19:35,500

null—no genetic effect. You can see the Q-Q plot is off the 45-degree line, so we need to improve it to fix the size. So, how can we do it? It turns out we can calculate the small-sample adjusted P value analytically

78

00:19:35,500 --> 00:19:50,166

by more accurately approximating the distribution of the test statistic. You can see the green line here is before the correction and the black line here is after the correction. After the small-sample P value

79

00:19:50,166 --> 00:20:07,266

correction, they are all on the straight line, so I am able to correct the size for a small-sample exome sequencing study. To illustrate, this was motivated by the NHLBI Exome Sequencing Project,

80

00:20:07,266 --> 00:20:26,899

so we are part of the lung [---], and the particular phenotype we have is Acute Lung Injury, ALI. So, in this study the Utah sequencing center sequenced about 90 subjects, and about half were cases and

81

00:20:26,900 --> 00:20:41,000

half were controls, using whole exome sequencing technology. First, we needed to filter the SNPs, so we used the Broad GSA criteria—the criteria that Broad provided—and they seem to work

82

00:20:41,000 --> 00:21:01,766

quite well. After the filtering there were about 106K SNPs left for analysis. So, why is the filtering important? Here I will show you that if you don’t filter, there is likely to be a batch effect, so one needs to be

83

00:21:01,766 --> 00:21:16,866

very careful. This is before filtering. Batch 1 was sequenced at the Utah Sequencing Center using an older Illumina platform; in between, Illumina updated the platform, and the second batch used the new platform. You

84

00:21:16,866 --> 00:21:32,199

can see most of the controls were sequenced in the first batch and most of the cases were sequenced in the second batch. So, if you do the principal component analysis, you can clearly see this big principal

85

00:21:32,200 --> 00:21:44,300

component separating two clusters. So, we were really worried, but then after the filtering you can see they are all mixed up, so that is good. So, one thing that is very, very important to do is the filtering

86

00:21:44,300 --> 00:22:06,133

when you analyze your data. Also, when they did the sequencing, we shipped the samples to them, and the controls and the cases were probably not placed together; the

87

00:22:06,133 --> 00:22:13,966

sequencing center just sequenced the controls first and the cases later. So, I think if we had known better we would have mixed the samples before sending them to the sequencing center. This is the SKAT

88

00:22:13,966 --> 00:22:29,132

result, and this is the SKAT-O result—which, as you recall, is the optimal combination of the Burden and SKAT tests. So, before the small-sample correction you can see the Q-Q plot is off. After the small-sample correction you can

89

00:22:29,133 --> 00:22:46,533

see that all are straight lines, so that is good for this dataset. Okay. Now I will move on to the design. Several people discussed the designs of the sequencing study. So, the first question we answer is, suppose I know

90

00:22:46,533 --> 00:22:58,566

what design to use; I am going to have a case control design or I am going to have a continuous phenotype. Then, what are the key parameters which can be used to design the sequencing study to do the

91

00:22:58,566 --> 00:23:11,332

power calculation? There are a few things. First, we need to specify the percentage of causal variants. So remember, in my regression model I have 100 betas and I want to know what percentage of the betas are

92

00:23:11,333 --> 00:23:24,733

not equal to 0; those are the causal variants. Second, I need to know the direction of the causal variant effects. Suppose 20 betas are not equal to 0—20% of variants are causal. I want to know: among those 20 betas,

93

00:23:24,733 --> 00:23:38,633

what percentage are protective and what percentage are harmful? Also, I need to specify the effect size of the beta and so here we allow the effect size to be a decreasing function of minor allele frequency. So, if

94

00:23:38,633 --> 00:23:52,466

the minor allele frequency is low we assume the effect size is high, and if the minor allele frequency is high we assume the

95

00:23:52,466 --> 00:24:07,966

effect size is low. Also, for case-control sampling we need to have the prevalence as well. So, we can do the analytic power and sample size calculation for both SKAT and SKAT-O using the SKAT package. Also,

96

00:24:07,966 --> 00:24:23,766

to illustrate, here is just one example. Suppose I use a window size of 3Kb to mimic exome sequencing, and suppose our power target is 80% and the genome-wide significance level is 10^-6. So, if I assume a

97

00:24:23,766 --> 00:24:36,999

continuous trait with heritability of 6%—this is a little bit high; you can bring it down—then if 20% of variants are causal you need to sequence about 5,000 people. If 30% of variants are causal, you need to sequence about 2,000

98

00:24:37,000 --> 00:24:52,633

people. For a case-control trait, if the prevalence is 10%, you need to sequence about 9,000 people, and if 30% are causal variants then you need to sequence about 4,000 people. So, if the prevalence is lower, then

99

00:24:52,633 --> 00:25:05,399

you need to sequence fewer people. Here we assumed, as shown, an effect size that is a decreasing function of minor allele frequency, with an odds ratio of about 2.5 at minor allele frequency equal to 3%. So, you can do

100

00:25:05,400 --> 00:25:22,200

this sample size calculation any way you would like in the SKAT package. Okay, so this is another question we face, because right now exome sequencing is still pretty expensive, and

101

00:25:22,200 --> 00:25:33,833

suppose my cohort has 10,000 people and I can only afford to sequence 500 of them. What sampling strategy should I use to increase the analysis

102

00:25:33,833 --> 00:25:48,633

power to study the rare variant effects? Should I do random sampling and randomly select 500 people, or should I do extreme sampling and pick 250 people in the upper tail and 250 people in the lower tail? So,

103

00:25:48,633 --> 00:26:02,233

which sampling will give me better power? So, here are our choices. The first is random sampling: I have a continuous trait—for example, BMI—and I can use the continuous trait and apply the

104

00:26:02,233 --> 00:26:17,399

SKAT-O method. Or I can do extreme phenotype sampling, basically like this: think about the whole BMI distribution; I pick 250 here and 250 here. Then, after I do the extreme phenotype

105

00:26:17,400 --> 00:26:28,733

sampling, there are several analysis strategies. The first analysis strategy is to dichotomize—treat the high group as cases and the low group as controls. So, you treat the outcome as a yes/no outcome and do

106

00:26:28,733 --> 00:26:42,433

logistic regression. Or you can keep the phenotype continuous. But then you can see that if you use these 500 people, the distribution is not normal anymore; it is a truncated normal. So

107

00:26:42,433 --> 00:26:55,533

therefore in the analysis you can use the continuous phenotype, but you need to use a truncated normal model. So therefore, we have three analysis strategies. The first is to treat it as case-control data—high versus low—

108

00:26:55,533 --> 00:27:10,866

through logistic regression and do the Burden test; we call this an EDP. The “D” stands for dichotomous trait, and I do the Burden test. Second, I can treat it as a case-control analysis and apply SKAT-O, so this is a

109

00:27:10,866 --> 00:27:24,599

dichotomous trait with SKAT-O. The last strategy is: I observe the BMI and treat it as continuous, but the regression is not simple linear regression; one needs to use truncated normal linear regression, and I

110

00:27:24,600 --> 00:27:41,333

apply SKAT-O. We have developed this analysis strategy, and you will see that this is going to be the most powerful strategy. So, here’s a power comparison from the simulation, so

111

00:27:41,333 --> 00:27:58,866

let’s look at this. The sample size is 4,000, and the blue indicates random sampling using the continuous trait; the pink is the dichotomous analysis using the Burden test. The green is

112

00:27:58,866 --> 00:28:14,332

the dichotomous analysis using SKAT-O, and the last one is the continuous analysis using the truncated normal with SKAT-O. So, for the first scenario, I will assume 40% of variants are causal and

113

00:28:14,333 --> 00:28:28,966

some betas are positive, some betas are negative. Then you can see that if I use the truncated normal continuous SKAT-O strategy, it’s going to be the most powerful. Similarly, if I still have 40% of variants to be causal but

114

00:28:28,966 --> 00:28:45,099

all the effects are in the same direction, then you can see the collapsing method picks up, but SKAT-O using the continuous trait is still more powerful. So, the next question is: why does extreme phenotype sampling

115

00:28:45,100 --> 00:29:00,933

help improve the analysis power? The reason is that extreme phenotype sampling helps enrich causal variants. So, here on the x-axis is minus log 10 of the minor allele frequency. So this side is more

116

00:29:00,933 --> 00:29:15,299

rare, and this side is common. The y-axis is the observed minor allele frequency divided by the population minor allele frequency. You can see that if no variant is causal, there’s no enrichment. But if the heritability is at 2.6, then you can

117

00:29:15,300 --> 00:29:30,833

see the rare variants enriched by about two times, and that’s why it helps to improve the analysis power. So, this is if you pick the top 10% and bottom 10%; if you pick the top 20% and bottom 20%, then the

118

00:29:30,833 --> 00:29:49,766

enrichment is going to be less; that makes sense. Okay. The last part of this talk: many of us have GWAS data, so I want to know, if I use the GWAS data to impute exome data, how well does that work? Will that work, or

119

00:29:49,766 --> 00:30:03,466

will it be junk? So, the first question we want to answer is to evaluate the performance of imputing rare variants using the GWAS data and the 1000 Genomes reference against the exome data. We use the

120

00:30:03,466 --> 00:30:18,566

exome seq data as the gold standard; we impute the rare variants and then compare how good the imputation is. Second, after we impute the data, I want to see, if I do the association study using the

121

00:30:18,566 --> 00:30:36,966

imputed SNPs with either the Burden test or SKAT-O, whether the P value agrees with the P value from the exome seq data. So, the data set we have is these 89 ALI

122

00:30:36,966 --> 00:30:49,366

subjects. For these 89 subjects we have both exome seq data and GWAS data, because these 89 subjects are part of a larger GWAS; we have about 2,000 people in the GWAS study. Okay,

123

00:30:49,366 --> 00:31:06,266

so let’s see how that works. First of all, let’s look at the imputation. So, we used 1000 Genomes to impute; for the GWAS data we have the Illumina 610K chip. In total, using 1000 Genomes we can impute about 38

124

00:31:06,266 --> 00:31:28,466

million SNPs, and this is the overlap. You can see that the imputed SNPs and the GWAS SNPs have very, very high overlap; almost 99% of the GWAS SNPs were imputed. But if you look at the exome

125

00:31:28,466 --> 00:31:46,966

seq data against the imputed SNPs, you can see we have about 79,000 SNPs that can be imputed for the exome seq data. But about 29,000 SNPs are not in the imputed data; so what are

126

00:31:46,966 --> 00:31:59,366

these 29,000 SNPs? Most of them are singletons, so they appear only in the exome seq data. They are not in 1000 Genomes, so that’s why we cannot impute them. And also, looking at the agreement between the exome seq

127

00:31:59,366 --> 00:32:13,166

data and the GWAS data: only a very small fraction of the GWAS SNPs are on the exome, and among those 9,000 SNPs the agreement is almost 99.9%. So that means the exome

128

00:32:13,166 --> 00:32:29,966

sequencing genotype calling is very good; we have good agreement here. Okay. So, to look a little bit closer, I’m going to focus on chromosome 1. For chromosome 1, in total there are about 11,000 exome seq SNPs

129

00:32:29,966 --> 00:32:53,466

from the exome sequencing platform, and among these 11,000 exome seq SNPs about 8,000 were able to be imputed from the GWAS and the 1000 Genomes data. So, among these 8,000 SNPs, which

130

00:32:53,466 --> 00:33:09,866

are part of the exome seq data, you can see almost 4,000, about 50%, are common variants, about 2,000 have minor allele frequency between 1% and 5%, and about 1,700 have minor allele frequency of less than 1%.
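The minor allele frequency binning described here can be sketched in a few lines. This is an illustrative script, not the speaker's actual pipeline, assuming a subjects-by-SNPs genotype matrix coded as 0/1/2 minor allele counts and the bin edges from the talk (5% and 1%):

```python
import numpy as np

def maf_bins(genotypes):
    """Count SNPs that are common (MAF >= 5%), uncommon (1-5%), or rare (< 1%).

    genotypes: (n_subjects, n_snps) array of 0/1/2 minor allele counts.
    """
    freq = genotypes.mean(axis=0) / 2.0      # allele frequency per SNP
    maf = np.minimum(freq, 1.0 - freq)       # fold to the minor allele
    return {
        "common (>=5%)": int(np.sum(maf >= 0.05)),
        "uncommon (1-5%)": int(np.sum((maf >= 0.01) & (maf < 0.05))),
        "rare (<1%)": int(np.sum(maf < 0.01)),
    }

# Toy data: 100 subjects, 3 SNPs (one per frequency bin).
g = np.zeros((100, 3), dtype=int)
g[:30, 0] = 1   # SNP 0: 30 het carriers -> MAF 0.15 (common)
g[:4, 1] = 1    # SNP 1: 4 het carriers  -> MAF 0.02 (uncommon)
g[0, 2] = 1     # SNP 2: a singleton     -> MAF 0.005 (rare)
print(maf_bins(g))  # {'common (>=5%)': 1, 'uncommon (1-5%)': 1, 'rare (<1%)': 1}
```

Note that singletons, like SNP 2 here, fall in the rare bin; as discussed above, those are exactly the variants that cannot be imputed when they are absent from the reference panel.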

131

00:33:09,866 --> 00:33:29,066

Now, if I look at the imputation quality scores and keep only the imputed SNPs with a score greater than .3, I am basically focusing on the good quality imputed SNPs. You can see that for many of the common variants the

132

00:33:29,066 --> 00:33:42,766

imputation quality is good, but for the rare variants the imputation quality is not very good; that makes sense. To see how this works, here I plotted the imputed dose against the

133

00:33:42,766 --> 00:33:57,266

exome seq genotype, using the exome seq genotype as the gold standard. Let's look at the common variants first. The common variants have genotypes 0, 1, and 2; 0 stands for wild-type.
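What the box plots summarize can be computed directly: group the imputed dosages by the gold-standard exome seq genotype and look at the median and spread within each group. A minimal sketch, with made-up dosage values for illustration:

```python
import numpy as np

def dose_by_genotype(dose, genotype):
    """Summarize imputed dosage within each gold-standard genotype class.

    dose: array of imputed dosages in [0, 2]; genotype: array of 0/1/2 calls.
    Returns {genotype: (median, IQR)}. A well-imputed SNP has medians near
    0, 1, 2 and small IQRs, which is what the box plots show for common variants.
    """
    out = {}
    for g in (0, 1, 2):
        d = dose[genotype == g]
        if d.size == 0:
            continue
        q1, med, q3 = np.percentile(d, [25, 50, 75])
        out[g] = (float(med), float(q3 - q1))
    return out

# Illustrative data: dosages scattered around the true genotype.
genotype = np.array([0, 0, 0, 1, 1, 1, 2, 2])
dose = np.array([0.02, 0.10, 0.05, 0.90, 1.15, 1.00, 1.80, 1.95])
summary = dose_by_genotype(dose, genotype)
print(summary)
```

For rare variants the homozygous and even heterozygous groups may be empty or tiny, which is why their dosage distributions are noisier, as described next.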

134

00:33:57,266 --> 00:34:10,666

You can see the box plots of the dose agree almost exactly with 0, 1, and 2; for common variants the imputation worked very well. For uncommon variants, with minor allele frequency between 1% and 5%,

135

00:34:10,666 --> 00:34:25,566

the wild-type agreed very well, but for the heterozygous and homozygous genotypes, even though the medians of the imputed dose are close to 1 and 2, there is quite a bit of variability

136

00:34:25,566 --> 00:34:43,166

here. For the rare variants, the wild-type again agreed very well; for the heterozygous there is some variability, and some imputed values do not agree with the exome seq. So overall,

137

00:34:43,166 --> 00:34:58,966

the imputation worked pretty well. Okay. Now, the next question we asked ourselves is: suppose I use the imputed data to do the association study, and I also have the exome seq data to do the association study. So for

138

00:34:58,966 --> 00:35:14,166

each gene, I computed two P values: one P value using the exome seq data, the other using the imputed SNP data. How much do they agree? It turns out they are not bad. For example, if I use all SNPs and

139

00:35:14,166 --> 00:35:32,566

use SKAT-O, you can see these are the P values on the log scale; this axis is the imputed data, this is the P value using exome seq, and the correlation r is about .75. If I restrict to the rare variants, r is about .74, so not

140

00:35:32,566 --> 00:35:49,466

too bad using the imputed data. This is using the CMS method, and you can see the agreement is similar; if I use all the SNPs, the agreement between the P values using the imputed data and the exome seq data is about

141

00:35:49,466 --> 00:36:11,366

.9; if I restrict to rare variants, r is about .7, so it's not too bad. So in summary, the Burden test and SKAT work well in different situations, and SKAT-O is an optimal unified test that works well in

142

00:36:11,366 --> 00:36:26,199

both situations, so [---] in biology, which is often unknown. We have developed software to help people with the power and sample size calculations for designing sequencing studies; especially if you need to

143

00:36:26,200 --> 00:36:42,200

write a large grant, you need the power calculation, and you can use this package. We have found that extreme phenotype sampling can help enrich rare variants; if you use a continuous truncated normal

144

00:36:42,200 --> 00:36:59,200

outcome to do the extreme phenotype sampling, that can help increase the analysis power. Also, based on the limited data we have, we found that analysis of rare variants imputed using the GWAS+1000

145

00:36:59,200 --> 00:37:16,100

Genomes data gives promising results. This is really teamwork with many people, especially students and post-docs; these are the people involved in the original SKAT paper and the SKAT-O paper. This is my

146

00:37:16,100 --> 00:37:27,100

post-doc, Seunggeun Lee, and associates, the driving force for the SKAT package; for the imputation, Lin Li is my post-doc and he is the driving force for that. Ian Barnett is my student who helped with the

147

00:37:27,100 --> 00:37:46,400

extreme phenotype project, and for the ALI NHLBI Exome Sequencing Project, these are the investigators who helped with it. For the SKAT and SKAT-O package, you can find it

148

00:37:46,400 --> 00:37:59,600

at this website. For the other references: SKAT was published in 2011 in the American Journal of Human Genetics, SKAT-O is under review, and the C-alpha test is a special case of SKAT.

149

00:37:59,600 --> 00:38:14,100

There are multiple Burden test references; I think Suzanne mentioned them. There is also one very nice paper just published in PLoS Genetics this month that compared different association test methods, and if

150

00:38:14,100 --> 00:38:31,800

you are interested you can take a look at this paper. Thank you. MALE: One question concerning the use of GWAS data in imputation is

151

00:38:31,800 --> 00:38:48,900

that this GWAS was sort of the old version—the Illumina 660W, I guess—so you only had, like, 500,000 SNPs. If you were to use like a 2.5 million SNP GWAS, how much would that affect the imputation quality?

152

00:38:48,900 --> 00:38:58,000

XIHONG LIN: That's a good question. I don't know; we don't have that data, so I cannot answer that question. Probably other people who have that experience can help answer it.

153

00:38:58,000 --> 00:39:06,033

MALE: Thank you. PETER SONG: I’m Peter Song from Biostat, University of Michigan. I help a

154

00:39:06,033 --> 00:39:27,333

lot of my colleagues from nephrology design studies, so I'm very interested in your SKAT software. First of all, thank you for the interesting, informative talk and very nice work. So, in the design you told

155

00:39:27,333 --> 00:39:43,166

us that we need to specify the effect sizes of these SNPs, so I wonder whether the effect size should be the same in the design. Let's say you believe there are 20 causal SNPs; then in the design will you assign the same

156

00:39:43,166 --> 00:39:56,066

effect size or different effect sizes? XIHONG LIN: Nobody knows the best way to specify

157

00:39:56,066 --> 00:40:11,566

the effect size. We know that for the common variants, based on published GWAS, the effect size is small, about 1.2 or 1.3. So basically, we just assume it is decreasing; I think we

158

00:40:11,566 --> 00:40:23,966

assume it is a decreasing function, a log function of the minor allele frequency. But if more data and more results are published, then probably we will have a better understanding of what the true effect size

159

00:40:23,966 --> 00:40:30,666

could be. PETER SONG: Another thing is rho, because rho is a weighting that you

160

00:40:30,666 --> 00:40:37,166

use in the SKAT-O. In the design do I need to specify the rho coefficient—the weighting?

161

00:40:37,166 --> 00:40:48,466

XIHONG LIN: Yeah, rho you can estimate. Suppose, for example, that when you design the study you assume all the effects are positive, all in the same direction, and all the same size. In that situation you can estimate

162

00:40:48,466 --> 00:40:57,566

the rho from how you specify the effect sizes. So in that situation rho will be 1; whatever effects you specify, you can estimate the rho from them.

163

00:40:57,566 --> 00:41:10,066

PETER SONG: Okay, so essentially I don't need to specify rho in your software; I just input the number of causal SNPs and the effect sizes, and the software automatically estimates the rho?

164

00:41:10,066 --> 00:41:19,332

XIHONG LIN: Exactly. So as the investigator, you don't need to specify the rho; you specify the effect size and how the effect size changes with minor allele frequency, and then the software will automatically estimate the rho.
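One illustrative way a rho could be implied by the specified effect sizes (a sketch of the idea, not necessarily the SKAT software's exact rule): treat the betas as exchangeable and take their average pairwise product relative to their average square. Equal, same-sign effects then give rho = 1, the burden-test end of SKAT-O, while mixed signs pull rho toward 0, the SKAT end.

```python
import numpy as np

def implied_rho(beta):
    """Implied SKAT-O weight rho from a vector of hypothesized effect sizes.

    Estimates the pairwise 'correlation' of exchangeable betas as
    mean(b_i * b_j, i != j) / mean(b_j^2). Equal same-sign effects give
    rho = 1 (burden test); mixed signs pull rho toward 0 (SKAT).
    Illustrative only; the SKAT package's internal rule may differ.
    """
    b = np.asarray(beta, dtype=float)
    n = b.size
    off_diag_sum = b.sum() ** 2 - (b ** 2).sum()   # sum of b_i * b_j, i != j
    cross = off_diag_sum / (n * (n - 1))           # mean off-diagonal product
    rho = cross / np.mean(b ** 2)
    return float(np.clip(rho, 0.0, 1.0))

print(implied_rho([0.5, 0.5, 0.5, 0.5]))    # 1.0: all equal -> burden end
print(implied_rho([0.5, -0.5, 0.5, -0.5]))  # 0.0: mixed signs -> SKAT end
```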

165

00:41:19,333 --> 00:41:20,233

PETER SONG: Okay. Thanks.

Date Last Updated: 9/18/2012