Whole Genome Approaches to Complex Kidney Disease
February 11-12, 2012 Conference Videos

Introduction to Exome Studies: Approaches, Analysis, and Problems
Andrey Shaw, Washington University in St. Louis

Video Transcript


1
00:00:00,000 --> 00:00:09,700
ANDREY SHAW: …and I wanted to, even though I am listed as a co-organizer, I really wanted to say that Jeffrey has done 99.99% of the work,

2
00:00:09,700 --> 00:00:25,000
and he deserves all the credit for a really great meeting. So I have been asked to start this afternoon’s session where we are going to talk

3
00:00:25,000 --> 00:00:33,566
about exome sequencing. I’ve got 15 minutes, so I am just going to quickly go through what I think the major issues are, and then I am going to use

4
00:00:33,566 --> 00:00:41,799
an example of basically the data that we are generating, and then the problems that I am having right now in trying to figure out what to do with

5
00:00:41,800 --> 00:00:53,233
this data, and hopefully that will be a good jumping off point for the rest of the afternoon. So many of us here are mainly interested in one

6
00:00:53,233 --> 00:01:01,466
disease in the kidney, and that’s FSGS. FSGS is one of the leading causes of nephrotic syndrome and chronic kidney disease in both children and

7
00:01:01,466 --> 00:01:11,099
adults, responsible for about 5-10% of end-stage renal failure, and, which is why it’s important to us today, it is thought to have a strong genetic component,

8
00:01:11,100 --> 00:01:22,633
especially in African Americans. Most of the progress, I would say, over the last 10 years has really established that FSGS is a disease of a

9
00:01:22,633 --> 00:01:31,099
single cell in the kidney called the podocyte. The podocyte has this really amazing architecture, where it’s an epithelial cell that basically coats the

10
00:01:31,100 --> 00:01:42,333
outside of the glomerular capillary, and is thought to play some role in glomerular filtration. So, I think many of us in this field still argue about this: is

11
00:01:42,333 --> 00:01:51,666
FSGS a Mendelian or a complex disease? I feel pretty sure that it’s a complex disease, but I know there are many in this audience who really think

12
00:01:51,666 --> 00:02:00,566
of it as a Mendelian disease, and how you approach this question, I think, really affects the way you analyze your data. So, we’ve talked a

13
00:02:00,566 --> 00:02:07,699
lot about this already, but I think the main points are that Mendelian diseases are going to be caused by highly penetrant alleles while complex

14
00:02:07,700 --> 00:02:17,200
diseases are going to be caused by very poorly penetrant alleles, and therefore will impart a low risk, which makes them much more difficult to find.

15
00:02:17,200 --> 00:02:26,900
Mendelian diseases…with a good pedigree you can easily find the gene, but if it is a complex disease we are going to require many, many

16
00:02:26,900 --> 00:02:34,766
large numbers of patients. So, what I think has really transformed this area is the development of this new technology, whole

17
00:02:34,766 --> 00:02:41,632
exome sequencing. I think most of us are aware of what it is. It’s basically a technique that focuses on only sequencing the 1% of the

18
00:02:41,633 --> 00:02:53,799
genome that’s exons, generally estimated at between 35 and 50 million base pairs. In comparison to whole genome sequencing, it actually has an

19
00:02:53,800 --> 00:03:03,866
extra labor step which makes it more expensive than it should be, and as sequencing costs drop that’s when whole genome sequencing will

20
00:03:03,866 --> 00:03:18,666
overtake whole exome sequencing, because it will eliminate this capture step. I think we should all admit that $1,000 is cheap, but still not cheap if

21
00:03:18,666 --> 00:03:26,166
the goal is to sequence thousands and thousands of patients that would be required if this was considered to be a complex disease. So

22
00:03:26,166 --> 00:03:33,666
one of the problems with whole exome sequencing is that we’re basically looking for variants, and I’ve just pulled this out of a paper. I

23
00:03:33,666 --> 00:03:42,366
think all of us have very different experiences here, but generally in the same area. But most people are going to have about 15,000-25,000

24
00:03:42,366 --> 00:03:54,899
variants, which will be divided into those variants that we call common, and those variants that we call rare or novel variants, and basically the big

25
00:03:54,900 --> 00:04:05,733
advantage of whole exome sequencing over GWAS is really this ability to focus on rare variants. So, why the focus on rare variants?

26
00:04:05,733 --> 00:04:12,633
We’ve talked about it this morning. The experience with GWAS suggests that there is significant missing heritability, and the current

27
00:04:12,633 --> 00:04:22,666
hypothesis, then, today is that missing heritability will be contained in the 400-800 rare variants that are present in all of us; and if that is what we are

28
00:04:22,666 --> 00:04:33,766
looking for then we basically can only see this today through exome sequencing. I think this raises another question that we can talk about, is

29
00:04:33,766 --> 00:04:44,999
what is a rare variant? We define it traditionally as not a common variant, and a common variant is defined as something that has an allele

30
00:04:45,000 --> 00:04:55,000
frequency of 1% or 2%, so anything less than 1% or 2% is a rare variant. Some rare variants are extremely, extremely rare; 1 in 10,000, 1 in

31
00:04:55,000 --> 00:05:04,800
50,000, are those still rare variants? I think there is a lot here that goes into deciding what we call rare variants. When the allele frequency gets

32
00:05:04,800 --> 00:05:15,733
very, very low, I think you really have to worry about whether you’re looking at a false negative, or false positive SNP call. So, as we accumulate

33
00:05:15,733 --> 00:05:25,733
greater and greater numbers of identified rare variants, I think it becomes a formidable task to think about how we will validate all these rare

34
00:05:25,733 --> 00:05:36,399
SNPs. Okay, so this is a slide that comes from this great Altshuler, Daly, and Lander paper that really just addresses what sample sizes we

35
00:05:36,400 --> 00:05:48,133
need to basically generate, let’s say in this case, 90% power. So, just taking something that would have an odds ratio of 2.5, an allele frequency of

36
00:05:48,133 --> 00:06:00,199
1%, so this would be a relatively common rare SNP, according to this chart to basically get 90% power, we would need about 6,600 patients. If

37
00:06:00,200 --> 00:06:13,433
we assume that rare variants are even rarer, let’s say an allele frequency of 0.3%, we’re looking at huge numbers of patients. So, I think the task in

38
00:06:13,433 --> 00:06:22,499
using whole exome sequencing, if the focus and the hypothesis of whole exome sequencing is to focus on rare variants, is: how do we generate

39
00:06:22,500 --> 00:06:30,033
the statistical power to actually make any of this data meaningful? And so one method that has been developed, that we are going to hear some

40
00:06:30,033 --> 00:06:38,433
more about today, is the collapsing method, where rather than consider each rare variant by itself, we basically take all the rare variants that are

41
00:06:38,433 --> 00:06:48,066
found in a particular gene and basically call that a single variant. The problem with this strategy is that many of these rare variants are probably

42
00:06:48,066 --> 00:06:56,432
benign and have no functional effect, and so clustering potentially significant variants with benign variants is probably going to decrease our

43
00:06:56,433 --> 00:07:06,099
association and power, and so that leads us into predictive tools like VAAST, SIFT, and PolyPhen that basically try to predict whether a rare variant

44
00:07:06,100 --> 00:07:18,400
has significance or not and whether these two techniques together can basically increase the statistical power of what we’re doing. Okay. So,

45
00:07:18,400 --> 00:07:26,466
our feeling was that if we are talking about sequencing thousands of patients to basically generate statistical power, that whole exome

46
00:07:26,466 --> 00:07:35,332
sequencing today is still out of our reach, and so we needed to develop a method that would be cheaper, allowing us to do this much more

47
00:07:35,333 --> 00:07:45,166
efficiently, and so our strategy was basically to sequence a smaller number of genes, basically targeted exome sequencing, and then to figure out

48
00:07:45,166 --> 00:07:53,932
a way to multiplex the samples. So, I am not going to go into this, but this seemed relatively straightforward to us because if FSGS is truly a

49
00:07:53,933 --> 00:08:01,266
podocyte disease, then theoretically we would only need to sequence the genes that are expressed in podocytes, and the microarray

50
00:08:01,266 --> 00:08:11,032
analyses suggest that that’s about 7,000. So, I have involved a lot of people in the selection of these genes, mostly Matthias Kretzler of

51
00:08:11,033 --> 00:08:20,499
Michigan, to really sit down and examine this list of genes, discard obvious housekeeping genes, and then using all of the human expression data

52
00:08:20,500 --> 00:08:27,266
that is out there try to make sure that every gene that is potentially a podocyte specific gene is included, and that basically gives us a list of

53
00:08:27,266 --> 00:08:35,599
2,400 genes; it’s about a tenth of the genome. We’ve developed a method that allows us to multiplex the captures that significantly reduces

54
00:08:35,600 --> 00:08:44,666
the labor, so instead of having to basically hybridize 100 samples, if we multiplex in groups of 10 we’d only be hybridizing 10 pooled samples, so

55
00:08:44,666 --> 00:08:55,466
that would significantly reduce the cost. So right now, we think we can do this targeted exome of 2,400 genes for about $150 a patient, and one of

56
00:08:55,466 --> 00:09:03,566
the things that I thought would be better about this method is by reducing the number of tests we would reduce the Bonferroni correction, and

57
00:09:03,566 --> 00:09:11,499
that would increase the statistical significance of anything that we would find. Okay. So, this work was done in my lab by Ghaidan Shamsan, a

58
00:09:11,500 --> 00:09:21,933
technician who is here, and another technician, Chris Stander. I told you about the choosing of the genes. The patients basically come from Jeffrey

59
00:09:21,933 --> 00:09:33,699
Kopp’s FSGS trial; basically, the cohort design would be the extremes: patients with HIV-associated nephropathy with kidney disease,

60
00:09:33,700 --> 00:09:44,466
patients with HIV without any signs of renal insufficiency, and then Michelle Winn has provided about 90 patients, all with family history,

61
00:09:44,466 --> 00:09:54,566
all with pedigrees, and then Ania Koziell has provided some samples of pediatric FSGS. Okay, so I am not going to go through the details of our

62
00:09:54,566 --> 00:10:01,866
approach, but we validated it by pooling HapMap patient samples, going back, sequencing, and making sure that we basically make the same

63
00:10:01,866 --> 00:10:12,166
calls that are actually in the database. So, we’ve done about 200 and really the data that I could provide to you today was only from 131 patients,

64
00:10:12,166 --> 00:10:20,332
but basically what this chart is showing you is that out of about 200 patients that we have sequenced to date, out of the 2,400 genes, we

65
00:10:20,333 --> 00:10:30,266
have identified rare variants in about 2,000 of them. What this chart is showing you is the number of patients that basically share a common

66
00:10:30,266 --> 00:10:40,832
variant in a specific gene. So, I told you about the patients; basically what we are seeing is a total of about 1,500 SNPs per person. We remove

67
00:10:40,833 --> 00:10:50,199
common SNPs, and there is a whole discussion point there. We’ve removed all synonymous variants, and then just for the sake of our

68
00:10:50,200 --> 00:10:59,566
analysis, we have used PolyPhen and deleted anything that looks benign, and that leaves us with about 60-80 of what we call rare, potentially deleterious

69
00:10:59,566 --> 00:11:09,732
SNPs. This just kind of blows that up a little bit. So, what that is saying here is like, 68 of the 131 patients shared a rare variant in the same gene,

70
00:11:09,733 --> 00:11:21,466
while over here it would be 32 patients shared rare variants in 5 different genes. Okay. So, here is a list of the candidate genes. Here is that one

71
00:11:21,466 --> 00:11:29,866
that is at 68, and the 3 that are sitting out there and basically so on and so forth, and if you just look at this list, there are some interesting genes.

72
00:11:29,866 --> 00:11:41,899
MYH9 which we all know about, ALMS1 that was picked up in the GWAS CKD screen, NOTCH1 that we all know is important in kidney development,

73
00:11:41,900 --> 00:11:52,566
and here is even NPHS1, or nephrin. So, is this significant? Is there anything here that tells us anything about the genetics of kidney disease? It

74
00:11:52,566 --> 00:12:01,999
really now depends on what are the controls. What is the frequency of rare variants in these genes in the general population? And so, this is

75
00:12:02,000 --> 00:12:10,066
where we have kind of been stuck, and so that is an issue that I wanted to talk about today: can we use the existing databases as

76
00:12:10,066 --> 00:12:18,699
controls? We could use 1000 Genomes, we could use all the genomes that have been sequenced at each of our private institutions, we

77
00:12:18,700 --> 00:12:26,666
can use the NHLBI GO ESP that we’ve heard about today. In the case of Michelle’s patients, they are all in pedigrees so she is going

78
00:12:26,666 --> 00:12:35,266
to take the rare variants and basically go through her pedigrees, and she has uncovered two or three genes that potentially are linked to FSGS. Or

79
00:12:35,266 --> 00:12:46,366
will we need to sequence matched controls? The issues that I see here are that all of these databases basically declare rare variants, or

80
00:12:46,366 --> 00:12:55,132
common variants, using different SNP callers and different base callers. Are we all going to need to establish the same protocols, the same cutoffs,

81
00:12:55,133 --> 00:13:04,099
the same base callers and SNP callers to establish what these numbers of rare variants are? If we have to control for ethnicity, then

82
00:13:04,100 --> 00:13:11,966
actually if we are only having to look at one ethnic population in 1000 Genomes, the size of that population now becomes very small, and

83
00:13:11,966 --> 00:13:20,466
that is actually not a large enough population for us to use as controls. So, that raises a different question: is the frequency of rare variants in a

84
00:13:20,466 --> 00:13:30,766
gene different in different ethnic populations, and when are we going to have to basically split out based on ethnicity? Even if we get targets, this is

85
00:13:30,766 --> 00:13:41,099
only going to be a statistical argument, probably a relatively low P value until we start reaching numbers of tens of thousands, and so really we

86
00:13:41,100 --> 00:13:53,100
are going to need new genetic and biological strategies to validate these genes, especially if we think they are multi-genic. I personally think a

87
00:13:53,100 --> 00:14:01,666
major part of the effort now should be trying to figure out ways to keep this data as open as possible so that as many people here in this room

88
00:14:01,666 --> 00:14:12,266
have an opportunity to try to interrogate this data and try to come up with new methods for trying to figure out how to analyze it. And then that asks

89
00:14:12,266 --> 00:14:20,232
another question, and it is when we think about phenotype refinement, we are going to need to have attached to this data enough clinical data

90
00:14:20,233 --> 00:14:29,833
that will allow people to try to cluster this, and try to refine the sequencing data to potentially increase statistical power. So, Jeffrey asked me

91
00:14:29,833 --> 00:14:41,766
to add this on at the last minute, just a discussion of this new Illumina exome BeadChip. So basically, this is potentially a method that

92
00:14:41,766 --> 00:14:50,832
would allow us to combine the two things that everybody has been doing: GWAS and exome. And basically what this Illumina exome chip is, it’s

93
00:14:50,833 --> 00:15:01,199
basically 250,000 SNPs that are only in the exons. So surprisingly, most of the GWAS chips have very, very little coverage of exons, and so

94
00:15:01,200 --> 00:15:10,900
the argument here is by having a new set of markers that is only in exons, that this will allow us to actually have better exonic coverage. The

95
00:15:10,900 --> 00:15:18,300
good thing about it is it is relatively cheap. The exciting part of it is that you can add up to 200,000 additional custom markers, but that

96
00:15:18,300 --> 00:15:28,366
actually doubles the cost, and I think cost becomes a major issue. As the cost of all these technologies that we are using changes, I think it

97
00:15:28,366 --> 00:15:38,666
changes what we want to do. Probably the most powerful part of this technology is to combine it with the current GWAS chips, giving you a total

98
00:15:38,666 --> 00:15:46,899
of about 5 million SNPs, but again, that is going to add to the total costs, and then the issue that I have is an issue that we talked about this

99
00:15:46,900 --> 00:15:56,233
morning. The focus of GWAS is primarily on intronic regions. Is that a totally different category of SNPs than the kind of SNPs that we are

100
00:15:56,233 --> 00:16:04,699
basically trying to see in exome sequencing, which are in coding regions? So, I think what I’ve tried to do just now is just kind of raise a bunch

101
00:16:04,700 --> 00:16:17,566
of issues that I hope will be answered in the next couple of sessions, and I look forward to it. So, thank you.
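
Note on the collapsing approach and the Bonferroni argument described in the talk: the following is a minimal Python sketch, not the speaker's actual pipeline. It assumes each qualifying rare variant has already been reduced to a (sample, gene) pair, collapses those pairs to "carrier of any rare variant in this gene", and compares carrier counts in cases and controls. The one-sided Fisher exact test is a stand-in chosen for illustration; the talk does not name a specific test. All function names and the toy sample IDs are hypothetical, and the figure of 24,000 genome-wide tests is only derived from the talk's remark that 2,400 genes is "about a tenth of the genome".

```python
from math import comb


def fisher_exact_greater(a, b, c, d):
    """One-sided Fisher exact p-value for the 2x2 table [[a, b], [c, d]].

    a = case carriers,    b = case non-carriers,
    c = control carriers, d = control non-carriers.
    Returns P(X >= a) under the hypergeometric null (fixed margins).
    """
    n_cases = a + b
    n_carriers = a + c
    n_total = a + b + c + d
    upper = min(n_cases, n_carriers)
    numerator = sum(
        comb(n_cases, k) * comb(n_total - n_cases, n_carriers - k)
        for k in range(a, upper + 1)
    )
    return numerator / comb(n_total, n_carriers)


def collapse_by_gene(carrier_pairs):
    """Collapse (sample_id, gene) pairs into {gene: set of carrier sample_ids}.

    A sample with several rare variants in one gene is counted once.
    """
    by_gene = {}
    for sample_id, gene in carrier_pairs:
        by_gene.setdefault(gene, set()).add(sample_id)
    return by_gene


def burden_test(carrier_pairs, cases, controls):
    """Per-gene p-value for an excess of rare-variant carriers among cases."""
    p_values = {}
    for gene, carriers in collapse_by_gene(carrier_pairs).items():
        a = len(carriers & cases)
        c = len(carriers & controls)
        p_values[gene] = fisher_exact_greater(
            a, len(cases) - a, c, len(controls) - c
        )
    return p_values


def bonferroni_threshold(alpha, n_tests):
    """Per-test significance threshold after Bonferroni correction."""
    return alpha / n_tests


# Toy data (hypothetical): 10 cases, 10 controls, one gene.
cases = {f"case{i}" for i in range(10)}
controls = {f"ctrl{i}" for i in range(10)}
pairs = [(f"case{i}", "GENE_A") for i in range(6)]
pairs.append(("case0", "GENE_A"))   # second variant, same person, same gene
pairs.append(("ctrl0", "GENE_A"))

p_values = burden_test(pairs, cases, controls)
targeted = bonferroni_threshold(0.05, 2_400)      # targeted panel
genome_wide = bonferroni_threshold(0.05, 24_000)  # assumed whole-exome gene count
```

Under these assumptions the per-gene threshold at alpha = 0.05 is ten times less stringent for the 2,400-gene panel than for 24,000 genes, which is the gain the talk attributes to reducing the number of tests. The sketch also makes the dilution problem visible: every benign variant that passes into `pairs` adds carriers to both groups, which is why the talk pairs collapsing with predictive filters.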

102
00:16:17,566 --> 00:16:28,166
MALE: Just a point of clarification since I was on the design team of the exome chip, we had 200,000 non-synonymous variants. There were

103
00:16:28,166 --> 00:16:40,932
about 14,000-15,000 splice variants, and 7,000 or so stop gain/loss of function variants. This came about because the exome sequencing project

104
00:16:40,933 --> 00:16:49,966
provided about 5,000 exomes, there was a type 2 diabetes project providing exomes, an autism project providing exomes, so the content comes

105
00:16:49,966 --> 00:16:59,199
from about 12,000 individual exomes; 9,000 of those are Caucasian, and about 2,000 African American. I think there were 500 Hispanic and

106
00:16:59,200 --> 00:17:12,933
500 Han Chinese. So, the distribution that you see is quite different. The other point is that the variants that are on the exome chip had to be

107
00:17:12,933 --> 00:17:22,866
seen at least twice or three times in two or more populations. So, in that sense they are not singletons or doubletons, and so you will get

108
00:17:22,866 --> 00:17:30,932
quite a different distribution. They are probably true variants, because they’ve had to be seen a couple of times, but they are not going to be the

109
00:17:30,933 --> 00:17:42,599
really rare variants, and as for the other content, because the chip can contain 300,000 variants, there are all the GWAS hits from the NHGRI

110
00:17:42,600 --> 00:17:53,000
catalog, as well as the HLA and other types of variants. So, it is a useful chip, and I think Illumina has sold a million of them so far.

111
00:17:53,000 --> 00:17:59,766
ANDREY SHAW: I tried to find somebody at Wash U who has used them, and they said that they had been back ordered and everybody is waiting

112
00:17:59,766 --> 00:18:01,799
for them to come in.

113
00:18:01,800 --> 00:18:11,066
MALE: Yes well, the other point is that because most of these are rare variants, the clustering of these could be an issue because unlike normal

114
00:18:11,066 --> 00:18:20,199
GWAS chips where you see 3 genotypes, typically clustered, and the clustering works well, here because they are rare, you have a glob of

115
00:18:20,200 --> 00:18:30,000
the homozygous wild type and maybe 10 or 12 heterozygotes, and typically no homozygous variant, and so the clustering algorithms, typically,

116
00:18:30,000 --> 00:18:39,966
don’t deal with that very well. We’ve run about 1,500 exome chips and they have performed remarkably well, and, knock on wood,

117
00:18:39,966 --> 00:18:50,199
but at least at this point they do well. The other point is that Affymetrix, just to be clear, Affymetrix has an exome chip as well, just so

118
00:18:50,200 --> 00:18:58,866
that not everyone goes rushing to Illumina. We try to have competition to keep the price down, but it doesn’t seem to work. If they get bought by

119
00:18:58,866 --> 00:19:02,532
Roche it probably will cause the price to go up even more.

120
00:19:02,533 --> 00:19:11,299
ANDREY SHAW: Well I just think an interesting point here is that as we get these new technologies, I was not aware of this technology

121
00:19:11,300 --> 00:19:18,200
at all until Jeffrey brought it up, and for me it was really just when do we use this technology over exome sequencing, or genome sequencing, and

122
00:19:18,200 --> 00:19:20,733
all of this for me is very confusing. Okay.
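
Note on the variant filtering described in the talk (remove common SNPs, remove synonymous variants, delete anything PolyPhen calls benign): a minimal Python sketch under stated assumptions, not the lab's actual code. The `Variant` record, its field names, the consequence and prediction labels, and the gene names are all hypothetical; real annotation tools use their own output formats. The 1% cutoff follows the talk's working definition of a rare variant (allele frequency below 1% or 2%).

```python
from collections import namedtuple

# Hypothetical annotated-variant record; field names are illustrative only.
Variant = namedtuple("Variant", ["gene", "allele_freq", "consequence", "polyphen"])


def filter_variants(variants, max_allele_freq=0.01):
    """Keep rare, non-synonymous variants that are not predicted benign.

    Steps mirror the order given in the talk:
      1. drop common variants (allele frequency at or above the cutoff),
      2. drop synonymous variants,
      3. drop variants whose prediction label is "benign".
    Variants with no prediction (empty label) are kept.
    """
    kept = []
    for v in variants:
        if v.allele_freq >= max_allele_freq:
            continue  # common variant
        if v.consequence == "synonymous":
            continue  # no amino acid change
        if v.polyphen == "benign":
            continue  # predicted to have no functional effect
        kept.append(v)
    return kept


# Toy input (hypothetical values).
example = [
    Variant("GENE_A", 0.20, "missense", "probably_damaging"),    # common
    Variant("GENE_B", 0.001, "synonymous", ""),                  # synonymous
    Variant("GENE_C", 0.002, "missense", "benign"),              # predicted benign
    Variant("GENE_D", 0.0005, "missense", "probably_damaging"),  # kept
    Variant("GENE_E", 0.0, "stop_gained", ""),                   # kept, novel
]

rare_candidates = filter_variants(example)
```

Changing `max_allele_freq` from 0.01 to 0.02 shows how sensitive the surviving list is to the definition of "rare", which is one of the discussion points raised in the talk.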




Date Last Updated: 9/18/2012
