Whole Genome Approaches to Complex Kidney Disease
February 11-12, 2012 Conference Videos

Group Discussion—Issues in Data Analysis: What are the Challenges?
Andrey Shaw, Washington University in St. Louis and Cheryl Winkler, National Cancer Institute

Video Transcript

1
00:00:00,000 --> 00:00:10,666
ANDREY SHAW: I’ll start the ball rolling. So, we had some discussions about QC. From my perspective as a practitioner, how do we

2
00:00:10,666 --> 00:00:28,066
implement QC? How do we decide what good QC is, and how do we make that a part of the standard review of data that’s going to be either

3
00:00:28,066 --> 00:00:33,999
uploaded to the Web or published? SUZANNE LEAL: I think right now that there is no

4
00:00:34,000 --> 00:00:44,533
standard and it’s something that’s still evolving, and there are different levels of QC, so what I spoke about was QC on the VCF file level, but

5
00:00:44,533 --> 00:00:56,833
really, to do proper QC, you have to really go back to the alignment and look at the alignments, too. So, I think this is something that’s still evolving

6
00:00:56,833 --> 00:01:08,066
and more thought has to be put into properly QCing data. I mean, there are clearly some things you should do, but I think there’s much more that we

7
00:01:08,066 --> 00:01:14,366
have to think about and develop. JOAN BAILEY-WILSON: And there’s some

8
00:01:14,366 --> 00:01:24,566
consortia that are starting to put together sort of checklists, much like GENEVA has done and published for GWAS QC where they’ve taken the

9
00:01:24,566 --> 00:01:33,532
things that lots of different groups have figured out are good practices and sort of made a standardized pipeline. I know some of the folks at

10
00:01:33,533 --> 00:01:42,233
the Broad are doing this for some of their studies. Gonçalo Abecasis’ consortia are doing that, this new Autism Consortium’s talking about making

11
00:01:42,233 --> 00:01:50,699
sure everybody is doing the same, but we’re not all…we’re not at the point where GENEVA finally published that paper, what, a year ago, two

12
00:01:50,700 --> 00:02:00,600
years ago at most? We’re struggling. Everybody’s telling each other what they’re finding and how to do things but we are still at that development

13
00:02:00,600 --> 00:02:09,733
stage, and there are some things that people are all starting to agree on like read depth less than 10, it’s probably garbage, and, you know, there

14
00:02:09,733 --> 00:02:19,666
are some rules like that, but it is still not final. Jamie’s saying “yes.”
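The read-depth rule of thumb mentioned here can be sketched as a simple genotype filter. A minimal illustration, assuming a DP threshold of 10 and simplified (genotype, depth) call records rather than a full VCF parser:

```python
# Sketch of the "read depth less than 10 is probably garbage" rule of thumb:
# mask genotype calls whose supporting read depth falls below the threshold.
MIN_DEPTH = 10

def filter_by_depth(calls, min_depth=MIN_DEPTH):
    """Replace (genotype, depth) calls below min_depth with a missing call './.'."""
    filtered = []
    for genotype, depth in calls:
        if depth < min_depth:
            filtered.append(("./.", depth))  # too few reads: mask as missing
        else:
            filtered.append((genotype, depth))
    return filtered

# One variant site genotyped in four samples
site = [("0/1", 35), ("1/1", 8), ("0/0", 12), ("0/1", 4)]
print(filter_by_depth(site))  # the depth-8 and depth-4 calls are masked
```

In a real pipeline the same check would be applied per sample from the VCF FORMAT/DP field.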

15
00:02:19,666 --> 00:02:29,399
XIHONG LIN: Yeah, I agree, and I think that the QC also depends on the sequencing depth and if the sequencing depth is shallow, then you

16
00:02:29,400 --> 00:02:37,766
probably need to have more stringent criteria and if the sequencing is deep…so right now, for example, for the Utah Sequencing Center—I think

17
00:02:37,766 --> 00:02:47,799
similar at Broad—they do at least 30X or 100X. So in that situation, it seems like the quality is pretty good. And also, the other way to think

18
00:02:47,800 --> 00:02:56,366
about a QC, if you have GWAS data, you can compare the overlap of the sequencing data with the GWAS data. So for example, in this dataset

19
00:02:56,366 --> 00:03:08,566
we have almost 10,000 SNPs in the overlap, so I can compare the correlation and, at least for our data, it seems like

20
00:03:08,566 --> 00:03:14,732
the correlation is very high. So, that is kind of reassuring.
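The check Dr. Lin describes—comparing sequencing calls against array genotypes at overlapping SNPs—can be sketched like this; the toy dosages (0/1/2 alternate-allele copies) and the choice of plain concordance plus Pearson correlation are illustrative assumptions:

```python
from math import sqrt

def pearson(a, b):
    """Plain Pearson correlation between two equal-length numeric lists."""
    n = len(a)
    ma, mb = sum(a) / n, sum(b) / n
    cov = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    va = sum((x - ma) ** 2 for x in a)
    vb = sum((y - mb) ** 2 for y in b)
    return cov / sqrt(va * vb)

def overlap_concordance(array_calls, seq_calls):
    """Compare genotype dosages at SNPs present in both datasets."""
    shared = sorted(set(array_calls) & set(seq_calls))
    a = [array_calls[s] for s in shared]
    b = [seq_calls[s] for s in shared]
    identical = sum(x == y for x, y in zip(a, b)) / len(shared)
    return identical, pearson(a, b)

array_calls = {"rs1": 0, "rs2": 1, "rs3": 2, "rs4": 1, "rs5": 0}
seq_calls = {"rs1": 0, "rs2": 1, "rs3": 2, "rs4": 2, "rs5": 0, "rs6": 1}
conc, r = overlap_concordance(array_calls, seq_calls)
print(f"concordance={conc:.2f}, correlation={r:.2f}")  # one mismatched call out of five
```

High concordance and correlation over the overlapping SNPs is the reassuring signal discussed above; a low value flags sample swaps or calling problems.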

21
00:03:14,733 --> 00:03:22,666
JOAN BAILEY-WILSON: Yeah, the CIDR lab is doing that and NISC is doing that now, too. If they bring in data they’re going to sequence, they just

22
00:03:22,666 --> 00:03:29,499
do a chip on the same sample so that they can do that kind of quality control comparison.

23
00:03:29,500 --> 00:03:37,033
MALE: First of all, I enjoyed this session very much and benefitted a lot from all the collective wisdom that you have accumulated at the

24
00:03:37,033 --> 00:03:50,366
forefront in this area, and my question is directly trying to tap into your collective wisdom. So, let’s assume or let’s pretend…or not pretend…but let’s

25
00:03:50,366 --> 00:04:02,899
take my…I have a cohort with 10,000 participants and interesting phenotypes and my dean says, “Here you go; next week I’ll give you a million to

26
00:04:02,900 --> 00:04:16,800
get us the hottest genomic data on your cohort,” and if I do exome sequencing I can probably, at this point in time, maybe sequence 1,000 of the

27
00:04:16,800 --> 00:04:23,466
10,000. If I do the exome array I can do all of them. Now, what should I do?

28
00:04:23,466 --> 00:04:39,566
SUZANNE LEAL: I think…well, first of all, doing the 10,000 on the exome array would probably be less money than sequencing the 1,000, but also you

29
00:04:39,566 --> 00:04:49,566
have to take into consideration: are these European Americans? Are these Europeans? If they’re not Europeans, then I would not go with

30
00:04:49,566 --> 00:04:58,832
the exome chip, so that’s one consideration. Now, I think it’s probably…what?

31
00:04:58,833 --> 00:05:03,166
MALE: Could you explain that? SUZANNE LEAL: Oh, it’s just because of how

32
00:05:03,166 --> 00:05:11,999
the exome chip…what was used to determine the SNPs that were put on the exome chip. So, the selection was done based on

33
00:05:12,000 --> 00:05:23,500
12,000 individuals and the vast majority of those individuals are European Americans. I think there were maybe 2,000 exomes that were African

34
00:05:23,500 --> 00:05:37,166
American and only maybe 500 Han Chinese and 500 Hispanic. Also, there was this criterion that the variants had to be seen at least in two

35
00:05:37,166 --> 00:05:48,299
studies, seen three times for most of the variants…three times in two studies and for, like, splice sites I believe it was only two times but in

36
00:05:48,300 --> 00:05:59,266
two studies. So in certain populations, that’s not going to work very well at all, but I think it’s hard to say because, you know, you’re going to lose

37
00:05:59,266 --> 00:06:11,899
some variants but you do have the trade-off where you can do a much larger sample size and I think we’ll start knowing a little bit better soon

38
00:06:11,900 --> 00:06:22,100
about that. You probably don’t have as much…it’s not as problematic calling the variants as with the sequencing, on top of

39
00:06:22,100 --> 00:06:34,133
being able to do a much, much larger sample size. So, I would say it’s even more than, you know, 10 times the sample size you could do for the…it’s

40
00:06:34,133 --> 00:06:43,733
much more than 10 times the sample size you could do for the same price. So, if you have a really large cohort, that might, right now, be the way to

41
00:06:43,733 --> 00:06:57,199
go, but I don’t know. I don’t think there are such good odds for choosing one over the other right now, but if you have a smaller sample size, if you

42
00:06:57,200 --> 00:07:09,466
only had 1,000 individuals and you’re doing exome sequencing, you’re probably going to be very underpowered.

43
00:07:09,466 --> 00:07:17,732
FEMALE: Many speakers today mentioned gene-gene/gene-environment interactions but the power picture that they painted was very

44
00:07:17,733 --> 00:07:29,199
pessimistic. They needed 6,000 more samples, and that was just the power for main effects, and usually for interaction effects we need, like,

45
00:07:29,200 --> 00:07:42,233
a 4-fold number of samples. What do you think is the next step to be able to even start exploring gene-gene/gene-environment interactions?

46
00:07:42,233 --> 00:07:49,566
JOAN BAILEY-WILSON: When I was talking about gene-gene interactions I was really talking about the more common variants coming out of

47
00:07:49,566 --> 00:08:02,232
GWASs, because at this point in many of the GWAS studies that are out there, people are just now getting into consortia that are big enough

48
00:08:02,233 --> 00:08:13,833
that they have the kind of the power they need to really start querying these interactions because you do need larger samples, and so a lot of folks

49
00:08:13,833 --> 00:08:24,166
who’ve been doing GWASs have done their meta-analyses for the marginal effects—the effect of each SNP—in larger and larger and

50
00:08:24,166 --> 00:08:33,799
larger samples and they’re sort of feeling like they’ve found what they can find of those individual SNP effects, now they’re starting to

51
00:08:33,800 --> 00:08:43,366
look at the interactions and they are thinking that some of the missing heritability will be due to interactions of those more common things. So, I

52
00:08:43,366 --> 00:08:53,899
was really talking about analysis of data we already have, and let’s not go haring off after rare variants that are going to explain everything

53
00:08:53,900 --> 00:09:04,033
and forget to look at data that we already have that may help to explain some of it. I don’t think it’ll explain all of it but I think it will explain, perhaps,

54
00:09:04,033 --> 00:09:11,999
more. So, that’s what I was talking about. Yeah, when you get into gene-gene interactions with rare variants, then the sample sizes are going to

55
00:09:12,000 --> 00:09:22,066
have to be really huge and, you know, maybe it’ll happen in my lifetime, but I’m not sure about that because, you know, I’m old.

56
00:09:22,066 --> 00:09:33,332
XIHONG LIN: For the screening stage, what I would suggest people do is, rather than fitting the main-effect model, fit the main effect and

57
00:09:33,333 --> 00:09:40,566
the gene-environment interaction, and you start with, like, testing the main effect and the gene-environment interaction separately, and when

58
00:09:40,566 --> 00:09:47,232
you do the screening you do a two-degree-of-freedom test. So therefore, at the screening stage you will be able to pick up both the main

59
00:09:47,233 --> 00:09:56,133
effect and also the interaction, and then in the validation phase you can fine-tune it. So, that will help you to improve the screening power. And

60
00:09:56,133 --> 00:10:09,133
also for gene-gene interaction, I think that would probably be SNP by SNP interaction, right? So in that situation, because for the rare variants we

61
00:10:09,133 --> 00:10:17,499
focus on gene-level analysis and so therefore, when you do the screening one can account for SNP by SNP interaction. For example, in the

62
00:10:17,500 --> 00:10:35,100
SKAT method we can do the screening by allowing for SNP by SNP interaction in the model. FEMALE: Hi. I had a question about defining

63
00:10:35,100 --> 00:10:43,366
subsets of SNPs for rare variant analysis. I feel like it’s more obvious, you know, when we have exome sequencing data or gene sequencing

64
00:10:43,366 --> 00:10:54,032
data, in terms of defining subsets based on a functional unit, but as we move towards sequencing contiguous genomic regions, I was

65
00:10:54,033 --> 00:11:03,533
wondering about, you know, how to collapse rare variants. I know Dr. Lin had mentioned sort of a moving window approach as a possibility,

66
00:11:03,533 --> 00:11:12,966
but I wondered what’s known or what type of literature is there out there on this right now, and you know, if moving windows is the optimal

67
00:11:12,966 --> 00:11:19,932
approach, what size of a moving window? I don’t know. I was just hoping you guys might be able to comment on that.

68
00:11:19,933 --> 00:11:29,866
JOAN BAILEY-WILSON: Well, in addition to moving windows, people are using these sorts of LD tiles. So, saying okay, if I’m going to go beyond

69
00:11:29,866 --> 00:11:38,299
genes where I have something where it does make sense that I could say, “I’m going to collapse within a gene because it’s a functional unit,” once

70
00:11:38,300 --> 00:11:46,433
you get out of it, you’re right. You don’t have something. And so, you can take just distance…so, like in SKAT you can say, “I want to do over

71
00:11:46,433 --> 00:11:58,133
3KB distance,” or something like that or you can do something where you’re collapsing based on the observed LD patterns in your data, and that’s

72
00:11:58,133 --> 00:12:11,499
another biologically attractive thing to do because then at least those things are perhaps close enough together that maybe there’s really one or

73
00:12:11,500 --> 00:12:22,900
two tagging SNPs in there that are really doing something, but it’s a difficult question and we don’t know. You may do better. Once we know

74
00:12:22,900 --> 00:12:31,333
more about the genome you might say, “All right, I know that this is a region that has a microRNA in it, so maybe I’m going to collapse there and here’s

75
00:12:31,333 --> 00:12:41,933
another region where I know there are some known microRNA seed sites, so maybe I want to say that any microRNA seed site that works

76
00:12:41,933 --> 00:12:52,899
with this microRNA across the genome, I’m going to collapse any variants in them.” That’s sort of the pathway thing. So, there are lots of different

77
00:12:52,900 --> 00:13:03,900
ways you can do it. I don’t think any of us know what’s going to be right. Someone was talking about…you…you were talking about, as you go

78
00:13:03,900 --> 00:13:12,933
across the genome, what’s right in this gene may not be what’s going on in this gene. In this gene, all the SNPs, all the rare variants may have a

79
00:13:12,933 --> 00:13:23,099
positive direction but in another gene maybe it is bidirectional; some increase risk, some decrease. We don’t know what it’s going to be and I suspect

80
00:13:23,100 --> 00:13:34,966
we’re all going to be experimenting for a while, but those are just some of my ideas of things you might collapse on. You guys probably have other ideas, too.
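One concrete reading of the moving-window idea raised in the question: assign rare variants to fixed-width, half-overlapping windows along a region. The 3 kb width and 50 percent overlap are arbitrary illustrative choices, not recommendations:

```python
def sliding_windows(positions, width=3000, step=1500):
    """Map each half-overlapping window (start, start+width) to the variant
    positions it contains; with step < width, variants can fall in more
    than one window."""
    if not positions:
        return {}
    windows = {}
    w = min(positions)
    last = max(positions)
    while w <= last:
        members = [p for p in positions if w <= p < w + width]
        if members:  # keep only windows that actually contain variants
            windows[(w, w + width)] = members
        w += step
    return windows

variant_positions = [100, 900, 3100, 3400, 6800]
for (lo, hi), members in sliding_windows(variant_positions).items():
    print(f"window [{lo}, {hi}): {members}")
```

Every retained window becomes one collapsed unit to test, which is exactly where the multiple-testing burden of trying many window sizes comes from.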

81
00:13:34,966 --> 00:13:42,866
SUZANNE LEAL: I think it’s a huge problem and I personally don’t have a solution. I’m kind of doubtful that a sliding window approach, just based on some, you know, criterion like size, is going to work because you wouldn’t know

82
00:13:42,866 --> 00:14:00,566
a priori what the size should be, and if you try many different sizes it’s going to be a huge multiple testing problem. So,

83
00:14:00,566 --> 00:14:07,032
my answer is: I don’t have a clue, basically, and it’s a very important area of research.

84
00:14:07,033 --> 00:14:14,733
XIHONG LIN: Yeah, I don’t have a good answer, either. So, one way to do the sliding window is this: instead of doing disjoint windows,

85
00:14:14,733 --> 00:14:23,466
you can do this overlapping way. So therefore, that can help you a little bit, and the other is, as you mentioned, you can use the haplotype block

86
00:14:23,466 --> 00:14:31,399
to define the block, and the other way is to use a recombination hotspot to define the block. So, there are multiple ways but really we don’t know

87
00:14:31,400 --> 00:14:40,333
what the good way or the best way should be. I think that, probably, will take some time for us to figure it out. Maybe the bioinformatics could help,

88
00:14:40,333 --> 00:14:49,766
you know? FEMALE: My first question is about…

89
00:14:49,766 --> 00:14:54,532
JOAN BAILEY-WILSON: You need to talk louder. FEMALE: Okay. My first question is about the

90
00:14:54,533 --> 00:15:06,433
Q-Q plot that Suzanne showed for the age of menarche in African Americans. How do you explain why you see some signals in the

91
00:15:06,433 --> 00:15:14,166
Caucasians but not in African Americans from the exome sequencing data?

92
00:15:14,166 --> 00:15:24,899
SUZANNE LEAL: Well, there’s many reasons. First of all, we do have a larger sample size for the European Americans. We might not have quite

93
00:15:24,900 --> 00:15:34,666
as much…it’s probably a little bit more homogeneous sample in that they’re all older women, there’s less heterogeneity in the

94
00:15:34,666 --> 00:15:43,166
variants, so there’s many reasons but I really don’t know what explains it. The only thing I do know is we don’t have any inflation of Type 1

95
00:15:43,166 --> 00:15:49,766
error in our African Americans. FEMALE: My second question is about the

96
00:15:49,766 --> 00:16:02,332
imputation that Dr. Lin did. So, when you use 1000 Genomes for imputation, which version of 1000 Genomes did you use, and to impute, is your

97
00:16:02,333 --> 00:16:08,766
sample all Caucasian or…? XIHONG LIN: We used the most recent version of

98
00:16:08,766 --> 00:16:12,166
1000 Genomes to do the imputation.

99
00:16:12,166 --> 00:16:18,166
FEMALE: So, do you have any comments about imputation against 1000 Genomes for African Americans?

100
00:16:18,166 --> 00:16:24,166
XIHONG LIN: No, because we did not try that because all of our subjects are Caucasian, so I cannot answer that question.

101
00:16:24,166 --> 00:16:33,232
FEMALE: Okay. Thank you. FEMALE: So, I have a question about family

102
00:16:33,233 --> 00:16:43,099
studies. We probably all have pedigrees of families from decades ago that we did microsatellite linkage studies on. How informative do you think

103
00:16:43,100 --> 00:16:55,533
those will be for choosing…for example, if you just wanted to look at a targeted area of the genome for a family, rather than look across the

104
00:16:55,533 --> 00:17:01,866
whole exome or whole genome for the entire pedigree.

105
00:17:01,866 --> 00:17:09,232
SUZANNE LEAL: I think it’s extremely informative. In fact, I also work in nonsyndromic hearing impairment, where we have very large Mendelian

106
00:17:09,233 --> 00:17:20,733
families, and that’s what we do. We first do a linkage panel on those family members, then we have a much smaller region to look for the causal

107
00:17:20,733 --> 00:17:32,799
variants. I’m also working on this Mendelian genome project with the University of Washington and that’s one of our strategies, is first we’ll do a

108
00:17:32,800 --> 00:17:41,433
SNP array on the family members and we’ll actually use that information to inform us of which other best family members to select for

109
00:17:41,433 --> 00:17:51,366
exome sequencing because it could tell us if somebody’s a phenocopy, it could help us select two individuals with the smallest overlap of

110
00:17:51,366 --> 00:17:59,632
their haplotypes, it also tells us in advance something about the DNA quality, which is, you know, you don’t want to push forward a sample

111
00:17:59,633 --> 00:18:09,533
that’s not good. So, I think it’s extremely useful. Most of the time you don’t have the luxury of having nice extended families, but if you have it

112
00:18:09,533 --> 00:18:14,266
you should definitely use it and it’s a very powerful tool.
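Dr. Leal’s point about choosing the two family members with the least sharing can be sketched as picking the relative pair with the smallest kinship coefficient. The pedigree and coefficients below are toy values (full sibs 1/4, avuncular 1/8, first cousins 1/16):

```python
def most_distant_pair(kinship):
    """Return the pair of relatives with the smallest kinship coefficient,
    i.e., the pair expected to share the least of their genome by descent."""
    return min(kinship, key=kinship.get)

# Expected kinship coefficients for a hypothetical pedigree
kinship = {
    ("A", "B"): 0.25,    # full siblings
    ("A", "C"): 0.125,   # uncle-nephew
    ("B", "C"): 0.125,
    ("A", "D"): 0.0625,  # first cousins
}
print(most_distant_pair(kinship))  # the first-cousin pair shares the least
```

In practice the kinship estimates would come from the SNP-array data mentioned above rather than from pedigree expectations alone.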

113
00:18:14,266 --> 00:18:19,266
FEMALE: And you think you need extended, large pedigrees rather than the small families?

114
00:18:19,266 --> 00:18:25,932
SUZANNE LEAL: No, no, that’s the beauty of it because a lot of families that we really couldn’t use very well at all previously, like families maybe

115
00:18:25,933 --> 00:18:36,466
we had only a LOD score of one. Usually you want a LOD score of 3.5 or greater in order to establish linkage. Before when we were using

116
00:18:36,466 --> 00:18:45,066
kind of the candidate gene approach, if we weren’t able to establish linkage, often you would have these peaks throughout the genome and

117
00:18:45,066 --> 00:18:53,899
there would be big regions with a lot of genes in them, and so that was very hard to follow up with Sanger sequencing. So now, even if you have

118
00:18:53,900 --> 00:19:02,066
smaller families, of course you won’t have just one region but you would definitely reduce the number of regions that you could follow up. So,

119
00:19:02,066 --> 00:19:10,632
at one time these families weren’t that useful when we were doing Sanger sequencing; unless you could combine families, you didn’t have a trait

120
00:19:10,633 --> 00:19:20,699
that was all that heterogeneous and you could combine families to have one region. Now all of a sudden, these families are becoming extremely

121
00:19:20,700 --> 00:19:28,566
useful because you’re still reducing a lot [---] out of the genome but not as much as in a family where you could establish linkage. So, they are still very

122
00:19:28,566 --> 00:19:33,532
useful, these types of families, smaller families. JOAN BAILEY-WILSON: And I’m part of a

123
00:19:33,533 --> 00:19:43,433
hereditary prostate cancer consortium and also the lung cancer, and in both of those we have multiple aggregated families and we have many

124
00:19:43,433 --> 00:19:54,766
linkage peaks that are significant because some of the families have a peak here, and adding them all up it turns out to be significant. And then, some

125
00:19:54,766 --> 00:20:04,866
of the families have a peak somewhere else. And again, with Sanger we couldn’t even afford to do good sequencing under one peak in the families

126
00:20:04,866 --> 00:20:14,799
that are linked there, much less sequence multiple things. Then there are some families that maybe have a signal at two peaks. Well, then you’re

127
00:20:14,800 --> 00:20:24,400
going to have to sequence 200 genes instead of 100. So, what these tools are doing for us in these sort of consortia looking at highly

128
00:20:24,400 --> 00:20:37,533
aggregated families is letting us, in one go, take a look at the entire genome, and generally what we’re doing is taking distant relatives—the most

129
00:20:37,533 --> 00:20:48,699
distant relative pair we have in each family—sequencing them, and we can then filter to what are they sharing under the regions where they

130
00:20:48,700 --> 00:20:58,366
have some linkage information and then decide which variants we’re going to genotype in the remainder of the family, and then we can just

131
00:20:58,366 --> 00:21:08,132
make a custom genotyping array and genotype the whole dataset and help us then see which of these rare variants actually are segregating in the

132
00:21:08,133 --> 00:21:18,733
entire family, which is just back to linkage analysis again. She was telling…tell them your title of…

133
00:21:18,733 --> 00:21:25,566
SUZANNE LEAL: Oh, I have a talk that I give at one of my courses and it’s called “From Linkage Analysis to Next Generation Sequencing and

134
00:21:25,566 --> 00:21:34,499
Back Again.” So, you know, I really feel like for a long time people weren’t very interested in learning about linkage analysis, which I learned in

135
00:21:34,500 --> 00:21:42,733
graduate school and did a lot of it in graduate school, and the numbers in this linkage analysis course that I had were really dwindling

136
00:21:42,733 --> 00:21:51,233
where everyone wanted to come to the GWAS course. But now, like, just last year, you know, we had a huge surge in the number of

137
00:21:51,233 --> 00:22:00,099
participants. So, there definitely is an interest and it is definitely a very useful tool.

138
00:22:00,100 --> 00:22:12,233
FEMALE: Okay. My second question is about filtering. Previously we filtered on dbSNP and now with the influx of variants into dbSNP—that’s not a

139
00:22:12,233 --> 00:22:26,399
good filtering device—but supposing that you have a common disease that has a 4%-10% frequency in the population, quite common. How

140
00:22:26,400 --> 00:22:35,366
would you go about filtering for a disease like that? Would you just filter out everything that looked benign?

141
00:22:35,366 --> 00:22:41,399
SUZANNE LEAL: You can’t filter if you have something like that. You have to do an association…you can’t use filtering strategies;

142
00:22:41,400 --> 00:22:46,533
you have to use association testing. JOAN BAILEY-WILSON: I would agree. Yeah, we

143
00:22:46,533 --> 00:22:49,466
all three agree. XIHONG LIN: Or if you have some kind of

144
00:22:49,466 --> 00:22:54,599
functional information and you can use that to screen.

145
00:22:54,600 --> 00:22:56,600
FEMALE: Okay, thank you. JOAN BAILEY-WILSON: But given we all worry

146
00:22:56,600 --> 00:23:05,766
about how good the functional predictions are at this point, even if you think it doesn’t look functional you probably are going to want to do

147
00:23:05,766 --> 00:23:09,666
the association anyway, but if you have a really cool functional one, you’ll probably maybe up-weight your…

148
00:23:09,666 --> 00:23:24,066
XIHONG LIN: Yeah, exactly. So for example, for the weight function we can use the functional score to up-weight the variants with more

149
00:23:24,066 --> 00:23:35,099
function. Yeah. Another thing…I wanted to respond about the family studies. One advantage of the family study is it can control for population

150
00:23:35,100 --> 00:23:43,133
stratification, which is more difficult to control for in case-control studies.
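The up-weighting idea the panel describes—weighting each variant by a functional score before testing—might look like this in its simplest burden-score form. The scores and genotypes are invented for illustration; in practice the weighted score would feed a regression or SKAT-style test:

```python
def weighted_burden(genotypes, weights):
    """Per-person burden score: sum over variants of weight * alt-allele count,
    so variants predicted to be functional contribute more."""
    return [sum(w * g for w, g in zip(weights, person)) for person in genotypes]

# Rows = individuals, columns = rare variants (alternate-allele counts 0/1/2)
genotypes = [
    [0, 1, 0],
    [1, 0, 1],
    [0, 0, 2],
]
func_scores = [0.9, 0.2, 0.6]  # hypothetical functional scores; higher = more likely functional
print(weighted_burden(genotypes, func_scores))
```

The resulting per-person scores would then be tested for association with the phenotype, for example as the predictor in a regression.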

151
00:23:43,133 --> 00:23:51,933
FEMALE: Okay and my third quick question is: what is your favorite tool for looking at functionality of SNPs? There are several of them

152
00:23:51,933 --> 00:23:55,299
out there and they seem to have little concordance.

153
00:23:55,300 --> 00:23:59,733
JOAN BAILEY-WILSON: That’s Jamie’s question. That’s not for us; that’s better for him.

154
00:23:59,733 --> 00:24:03,299
FEMALE: I think you have to look at a combination.

155
00:24:03,300 --> 00:24:13,466
JOAN BAILEY-WILSON: Yeah. I mean, we don’t just look at one; we look at several and sort of get a Gestalt. Is that how you do it? I mean, I

156
00:24:13,466 --> 00:24:26,832
know that’s what you guys do in VarSifter and I know that’s what NISC does.

157
00:24:26,833 --> 00:24:37,533
JAMIE TEER: I know some of the people in our group have looked at that and have found that the methods don’t overlap too well; they’re

158
00:24:37,533 --> 00:24:42,233
each…they’re good, but the overlap isn’t great. So, I think looking at all of them and kind of considering that is good, and it’s a prediction. So,

159
00:24:42,233 --> 00:24:47,666
the more…you know, it’s a prediction. JOAN BAILEY-WILSON: And of course, we all

160
00:24:47,666 --> 00:24:59,066
hope, as we understand more and more about the genome, maybe those predictions will get better but right now they’re still kind of “iffy.”

161
00:24:59,066 --> 00:25:05,999
CHRISTY CHANG: Christy Chang from the University of Maryland. Going back to the function prediction to start my question, I just wanted to

162
00:25:06,000 --> 00:25:13,866
add that you have to be really committed to a variant to truly annotate function and figure out what’s going on, especially when it’s not in the

163
00:25:13,866 --> 00:25:24,199
coding region. We often make the assumption that if we found something intronic that’s really tagging something in the intron that’s going to alter

164
00:25:24,200 --> 00:25:30,633
the protein structure, when it’s just as likely, when you found a coding variant, that it’s actually functionally neutral and tagging something in an

165
00:25:30,633 --> 00:25:38,666
intergenic or intronic region that has a regulatory function. A lot of people overlook that and then we found that to be true for monogenic

166
00:25:38,666 --> 00:25:47,566
disease over and over again. What we thought was an amino acid substitution that killed the protein actually regulates splicing, for instance. But my

167
00:25:47,566 --> 00:25:56,799
question is, I’m very concerned that some of the GWAS results or many of the GWAS results will end up in the same graveyard with our linkage

168
00:25:56,800 --> 00:26:06,166
studies: there are linkage peaks that showed up over and over again where we have never understood what biologically drove those linkage

169
00:26:06,166 --> 00:26:15,366
signals, and there are just as many GWAS results that have very compelling P values, but because they sit in intergenic regions or in the

170
00:26:15,366 --> 00:26:24,766
gene that’s poorly annotated or understood, we just don’t do anything with them. So, is there any way that, as a field, coming from clinicians,

171
00:26:24,766 --> 00:26:34,299
geneticists, epidemiologists, all the way to molecular biologists, that we will commit effort to understand those regions, to understand gene regulation

172
00:26:34,300 --> 00:26:47,033
and expression pathways and salvage all the biology from those signals before we jump in to sequence exomes and whole genomes and just

173
00:26:47,033 --> 00:26:53,999
end up with a bigger pile of things that we are not willing to understand at this point.

174
00:26:54,000 --> 00:27:03,700
JOAN BAILEY-WILSON: Well, I think this is one of the things Steve started talking about in his talk, was that one of the things they’re doing is they

175
00:27:03,700 --> 00:27:13,933
are following up their GWAS signals by doing really fine mapping in the regions and trying to identify what really is the functional variant that

176
00:27:13,933 --> 00:27:22,933
the GWAS SNPs tagged. Because what you have to remember is: what’s on those GWAS chips? They’re tag SNPs. The way they built

177
00:27:22,933 --> 00:27:32,466
those GWAS chips is they looked at the haplotype patterns, they went through and said, all right, in this haplotype block, what one or two

178
00:27:32,466 --> 00:27:45,366
or three SNPs do I need to recover all of the haplotype information? And so, those SNPs have very little likelihood of actually being the functional

179
00:27:45,366 --> 00:27:53,699
SNP that’s causing the association. I mean, probably in all the GWASs that have been done, yeah, probably some of them are, but most of

180
00:27:53,700 --> 00:28:02,733
them are just going to be in LD with functional variants. So, people are trying to find the functional variants that are responsible for those

181
00:28:02,733 --> 00:28:10,399
GWAS signals, and I agree with you, that’s critically important. And then also, I think if we’ve got something that we think is functional, we

182
00:28:10,400 --> 00:28:19,100
can’t just stop and say, “Oh, it looks cool, it’s a stop codon and it’s predicted to maybe be functional,” you’ve got to get out there and do the

183
00:28:19,100 --> 00:28:28,966
lab work. I mean, I’m not going to do it, we’re not going to do it, but our collaborators have to do the lab work. So if you’re doing cancer, well, you’re

184
00:28:28,966 --> 00:28:38,266
going to have mouse models and you’re going to look at zebrafish stuff. One of the faculty at NHGRI is just doing lots of really cool zebrafish

185
00:28:38,266 --> 00:28:48,932
mutations where he’s trying to, like, mutate everything and make a catalogue of what do all these things do. So, it’s going to take biology as

186
00:28:48,933 --> 00:28:55,766
well as statistics. Statistics takes you a certain way but then you’ve got to go to the biology.

187
00:28:55,766 --> 00:29:01,899
MATTHIAS KRETZLER: Matthias Kretzler, University of Michigan. I think I would strongly plug that same area: we have an

188
00:29:01,900 --> 00:29:09,300
opportunity. So far we have talked about genetics and the phenotype, but obviously there are multiple levels of regulation—we’re a lot closer to the

189
00:29:09,300 --> 00:29:17,500
genes and phenotypes we have—and we have opportunities to capture those in our patients, and Eric Schadt will actually lead the Keystone

190
00:29:17,500 --> 00:29:25,866
conference a week from now where the systems genetics approach will be discussed over five days to see how we can actually use these

191
00:29:25,866 --> 00:29:35,799
multidimensional data integrations to teach us which of these links might be those where it’s worthwhile to get into the real [---] of the

192
00:29:35,800 --> 00:29:43,800
coefficient mouse, which obviously is a lot of work and we will have to assemble very solid evidence before we engage our colleagues in

193
00:29:43,800 --> 00:29:53,833
that direction. And it’s also a plea for cohort design…if we build our cohorts from the get-go so that we have an opportunity to kind of have these

194
00:29:53,833 --> 00:30:02,566
additional scaffolds added to the genetic and phenotypic information we are currently focusing on.

195
00:30:02,566 --> 00:30:07,232
JOAN BAILEY-WILSON: I agree. GEORGIA DUNSTON: Quick question. I like the

196
00:30:07,233 --> 00:30:14,766
terminology. Joan, you mentioned it’s more like a treasure hunt and I’d like to know how…

197
00:30:14,766 --> 00:30:21,699
JOAN BAILEY-WILSON: The treasure hunt was in Genetic Analysis Workshop but it is a treasure hunt in our real stuff, too, right?

198
00:30:21,700 --> 00:30:25,366
GEORGIA DUNSTON: And I’m wondering…and I like the term but…

199
00:30:25,366 --> 00:30:30,899
JOAN BAILEY-WILSON: Actually, he’s the one that made up the term “treasure hunt.” I like that. I think I’m going to use it for…

200
00:30:30,900 --> 00:30:39,433
GEORGIA DUNSTON: I want to know how are we going to distinguish, when we try to get support for these areas of pursuit, how do we

201
00:30:39,433 --> 00:30:52,333
distinguish a treasure hunt from a fishing expedition?

202
00:30:52,333 --> 00:31:00,633
JOAN BAILEY-WILSON: I’ve never liked fishing ex…well, I like to fish—my dad loved to fish—and I’m not averse to a fishing expedition, but I really

203
00:31:00,633 --> 00:31:09,166
like to think of it in terms of hypothesis generation. I mean, that’s what we’re really doing when we’re doing a lot of these genome-wide things,

204
00:31:09,166 --> 00:31:16,732
and a lot of people say, “Well, when you’re doing these genome-wide tests, it is a fishing expedition and you’re not testing hypotheses,”

205
00:31:16,733 --> 00:31:25,333
whereas statisticians like us say, “Wait, this is where we really are testing hypotheses.” You go do a lab experiment and maybe you’re not

206
00:31:25,333 --> 00:31:37,099
actually testing a hypothesis. We ARE testing hypotheses. Our null hypothesis is: there is no variant in the genome that explains this and we

207
00:31:37,100 --> 00:31:46,866
go out and try to reject that null hypothesis. So, most statisticians get a little upset when we’re told it’s a fishing expedition, and I know you

208
00:31:46,866 --> 00:31:53,032
believe that too, Georgia. [inaudible comment from audience]

209
00:31:53,033 --> 00:32:00,066
JOAN BAILEY-WILSON: Really. I mean, that’s the statistical answer; that’s what a statistician will tell you. I actually have a null hypothesis that’s

210
00:32:00,066 --> 00:32:08,232
based in biology and I’m trying to reject it with my data.
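[The statistical framing Bailey-Wilson describes—a genome-wide scan as a formal test of the null hypothesis that no variant is associated—can be sketched in a few lines. This is an editor’s illustration, not part of the talk; the number of tests and the error rate are illustrative assumptions.]

```python
# Sketch: why a genome-wide scan is hypothesis testing, not a fishing
# expedition. Each variant is a test of H0: "this variant has no
# association with the trait." Correcting for the number of tests
# (Bonferroni) controls the family-wise error rate.

n_tests = 1_000_000   # illustrative: ~number of common variants on a GWAS array
alpha = 0.05          # desired family-wise error rate across the whole genome

bonferroni_threshold = alpha / n_tests
print(f"per-variant significance threshold: {bonferroni_threshold:.1e}")
# prints: per-variant significance threshold: 5.0e-08
```

[That 5 x 10^-8 figure is the conventional "genome-wide significance" cutoff for common-variant GWAS, which is one way the field formalizes the hypothesis test Bailey-Wilson is defending.]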

211
00:32:08,233 --> 00:32:17,633
FEMALE: I have two questions about indels. Many of you talked about two-step analyses with first GWAS and then next generation sequencing,

212
00:32:17,633 --> 00:32:27,399
but we know that indels are not covered by genotyping arrays, so probably some of them are in linkage disequilibrium with tag SNPs but not all of

213
00:32:27,400 --> 00:32:37,666
them. So, my question is more, for next generation sequencing, which tool is better to use to identify those indels, since we know that

214
00:32:37,666 --> 00:32:48,832
there are many problems with genotyping and calling them, especially…even in the last version of 1000 Genomes? So, that’s my first question and my

215
00:32:48,833 --> 00:32:59,033
second question is about imputation methods of indels.

216
00:32:59,033 --> 00:33:06,899
JOAN BAILEY-WILSON: I haven’t done that myself. I know that NISC at NHGRI has been working really hard on developing some Bayesian

217
00:33:06,900 --> 00:33:17,366
algorithms that are helping them align correctly and then call indels better. I know there are a lot of people, other people working on such

218
00:33:17,366 --> 00:33:27,132
improvements to aligners and callers so that they call the indels correctly rather than messing them up. Do you have any

219
00:33:27,133 --> 00:33:29,833
favorites? SUZANNE LEAL: No, I just know it from the

220
00:33:29,833 --> 00:33:40,799
literature, but I get the data from other people and I know it’s highly problematic, especially from exome sequence data—it’s quite, you know, you

221
00:33:40,800 --> 00:33:49,933
can call them but it is problematic. As far as imputation, I mean, I think first we have to work out the bugs of calling them and then I think that

222
00:33:49,933 --> 00:33:54,899
will come in the future, but we certainly aren’t there yet at all.

223
00:34:01,733 --> 00:34:06,866
JEFFREY KOPP: Are there no more questions? Thank you, Suzanne, Joan, and Xi.




Date Last Updated: 9/18/2012
