Whole Genome Approaches to Complex Kidney Disease
February 11-12, 2012 Conference Videos

Filtering and Integration of Exome Data Sets
Jamie Teer, National Institutes of Health Intramural Sequencing Center

Video Transcript

1
00:00:01,900 --> 00:00:07,766
JAMIE TEER: Great. Well, I’d like to thank the organizing committee for inviting me to speak here. It’s really been a fun session. I’ve enjoyed

2
00:00:07,766 --> 00:00:18,066
myself and learned quite a bit. So to start off, just to give a brief introduction to my world view: I sort of have multiple roles at NHGRI. I work with

3
00:00:18,066 --> 00:00:26,732
Les Biesecker’s group and he is interested in the study of rare diseases in human genetics, and I also work with the folks at the NIH Intramural

4
00:00:26,733 --> 00:00:35,133
Sequencing Center led by Jim Mullikin, so I have sort of an analysis and technology interest. I’m supposed to say, as a government employee, that

5
00:00:35,133 --> 00:00:42,566
I can’t endorse any of the tools I’ll talk about, so now that’s out of the way. I’d also like to say that a lot of these tools are

6
00:00:42,566 --> 00:00:53,666
freely available and when they’re not I’ll try and point that out. So before I really get started, this is a session on genomes and yet I’m talking about

7
00:00:53,666 --> 00:01:00,799
exomes and I thought it might be worth just sort of pointing out where you might want to do one over the other. I probably won’t have many more

8
00:01:00,800 --> 00:01:07,600
future opportunities to use this slide because genomes really are becoming much, much less expensive. However, I would like to point out that

9
00:01:07,600 --> 00:01:14,200
currently, exomes, of course, require less sequencing, and so given a machine capacity, you can sequence many more exomes than you

10
00:01:14,200 --> 00:01:22,700
can genomes. We’re finding that to be around 10 exomes for every genome. So, if you’re interested in a lot of samples in a given capacity

11
00:01:22,700 --> 00:01:33,300
or amount of time, you may want to consider exomes. Certainly, exome sequencing targets the exons. It’s very focused targeting of what

12
00:01:33,300 --> 00:01:40,566
could be the better understood part of the human genome. Certainly in the study of rare diseases, many variants are found in the exons

13
00:01:40,566 --> 00:01:50,099
and so it is a relevant target in many cases. And finally, although it may seem like a bad thing that you have less data in an exome

14
00:01:50,100 --> 00:01:59,966
experiment, considering the “$1,000 genome, $100,000 analysis” problem, less data actually is a benefit: there are lower storage costs, faster

15
00:01:59,966 --> 00:02:09,899
analysis, and it’s easier to apply different tests because the analyses are quicker to perform. On the other side, of course, with genome

16
00:02:09,900 --> 00:02:15,800
sequencing, you do get the whole genome—“whole” of course in quotation marks—because really we do not yet have a

17
00:02:15,800 --> 00:02:24,733
whole human genome sequence, but certainly you do get regulatory regions and many other things that aren’t included in the exome,

18
00:02:24,733 --> 00:02:34,633
so of course this is much more comprehensive. Speaking just in terms of analytical capability, structural variation and copy number variation are

19
00:02:34,633 --> 00:02:43,833
really better done on whole genomes, just given the methods that are available. So, if that’s a part of human variation that’s of

20
00:02:43,833 --> 00:02:51,966
interest to you, then for larger-scale structural variation, genomes are really the way to go. And as people have mentioned, with genomes there is

21
00:02:51,966 --> 00:03:00,866
less handling of the sample so you don’t have to do these manipulations to do the targeting and so that may also be a consideration. So generally

22
00:03:00,866 --> 00:03:07,999
when you do a sequencing experiment, the workflow kind of looks like this. First you have to generate the sequence; this is often done by a

23
00:03:08,000 --> 00:03:15,166
sequencing provider. The sequence generated then goes through an alignment and genotype calling procedure, which is sometimes done by

24
00:03:15,166 --> 00:03:22,366
folks here, and sometimes by informatics experts. Once you have called genotypes you generally will want to annotate them and then

25
00:03:22,366 --> 00:03:30,732
finally to analyze them. Ideally, presumably, most people would like to analyze the data themselves, although now, as you can see,

26
00:03:30,733 --> 00:03:42,733
there is so much data that having experts in informatics and statistics can be very beneficial for analysis. So today I’ll just talk to you about a

27
00:03:42,733 --> 00:03:51,299
couple of things. Briefly I’ll introduce sort of a high level overview of tools for alignment and genotype calling or variant identification. I’ll spend

28
00:03:51,300 --> 00:03:59,433
the bulk of the time on tools to annotate variants and determine the context of your genomic variation, a little discussion on automation, how to

29
00:03:59,433 --> 00:04:05,899
sort of run these analyses, can you do them yourself, and I’ll finish up with visualization and introduce some tools to allow you to actually look

30
00:04:05,900 --> 00:04:15,466
at the data. As I talk about the different tools, some of them are easier to use and require less computational or informatics experience. They

31
00:04:15,466 --> 00:04:22,199
may be graphical in nature, so look for the little laptop icon down here; and the tools run the gamut up to those that are more

32
00:04:22,200 --> 00:04:30,666
challenging, really requiring command line or UNIX experience, and so look for the little picture of the server in the corner. So, we’ll start with

33
00:04:30,666 --> 00:04:38,066
sequence alignment. When you get raw sequence data it’s really just a bunch of letters and by itself not perhaps terribly interesting, and

34
00:04:38,066 --> 00:04:45,532
so the alignment is a process of taking that sequence of letters and determining where in a reference genome it actually comes from. There

35
00:04:45,533 --> 00:04:52,499
are various algorithms and approaches to handle this problem. Certainly with next generation sequencing machines, the problem is that there

36
00:04:52,500 --> 00:05:00,433
are so many reads, the old tools like BLAST aren’t fast enough to handle the data, so there’s been a lot of development for new tools that are very,

37
00:05:00,433 --> 00:05:08,833
very fast at handling this problem. Ideally, your alignment algorithm will be able to handle gaps, the small insertions and deletions in your

38
00:05:08,833 --> 00:05:15,299
sequence, and you really would want this for two reasons: one, so you can actually detect insertions and deletions, it’s an important class of

39
00:05:15,300 --> 00:05:23,833
variation; but perhaps more importantly, if you’ll imagine that an aligner does not even consider insertions or deletions, you almost have a frame

40
00:05:23,833 --> 00:05:30,933
shift problem where you insert or delete a base and then everything after that base is misaligned and looks like a variant, and so you can get

41
00:05:30,933 --> 00:05:40,533
tremendous false positives if you’re not properly handling gaps.
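
To make the frameshift problem concrete, here is a toy Python comparison—not a real aligner, and the repeat sequence is invented—showing how a single deleted base makes every downstream position look like a variant unless the alignment allows a gap:

    # One deleted base shifts everything downstream out of register.
    reference = "ACGTACGTACGTACGTACGT"
    read = "ACGTACGACGTACGTACGT"  # same sequence with the base at index 7 deleted

    # Ungapped comparison: position by position, no gaps allowed.
    ungapped = sum(r != q for r, q in zip(reference, read))

    # Gapped comparison: place a single gap character at the deletion point.
    gapped_read = read[:7] + "-" + read[7:]
    gapped = sum(r != q for r, q in zip(reference, gapped_read) if q != "-")

    print(ungapped)  # 12 -- every base after the deletion looks like a variant
    print(gapped)    # 0 -- the read matches perfectly once the gap is allowed

Alignment quality can be limited by read length, particularly on the human genome,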

42
00:05:40,533 --> 00:05:47,499
because we have repetitive sequence and if your read is not long enough to get out of the repetitive sequence into unique sequence, you

43
00:05:47,500 --> 00:05:55,200
don’t know if it belongs to this region or this region. So certainly, short-read technology is great because we can get so many reads but I

44
00:05:55,200 --> 00:06:05,466
think longer reads would really be necessary to improve alignment as it is now. And finally, we’re finding that a lot of the genotype errors we’re

45
00:06:05,466 --> 00:06:14,332
actually seeing are really due to alignment artifacts; the same problem, where a read is misplaced and it looks like you have a

46
00:06:14,333 --> 00:06:21,666
variant but it’s really artificial because the read has simply been misplaced. And so, these are some of the alignment tools that have been used,

47
00:06:21,666 --> 00:06:30,999
commonly: BWA, ELAND, and other tools. I should point out that Novoalign is not freely available, however we’ve done some evaluation

48
00:06:31,000 --> 00:06:42,066
of it and it’s worth investigating. So now you’ve got alignments, you presumably know where your reads belong on a reference genome, and

49
00:06:42,066 --> 00:06:51,899
the next goal is to accurately determine diploid genotypes. So what is the actual call, what is the base on each chromosome at a given position?

50
00:06:51,900 --> 00:06:58,933
Again, various algorithms. Early on, methods were generally filtering based on frequency, but now more sophisticated methods include

51
00:06:58,933 --> 00:07:08,666
Bayesian expectation maximization and these methods often are really just counting the bases at a given position. So, you line all your reads up

52
00:07:08,666 --> 00:07:18,532
and just look at a given position and see how many A’s, how many T’s do I have? If it’s about 50% you might expect a heterozygous position.
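
As a minimal sketch of that counting idea—the depth cutoff and the heterozygote band below are illustrative numbers, not taken from any particular caller, and real callers also weigh base and mapping qualities:

    from collections import Counter

    def naive_genotype(pileup_bases, min_depth=8, het_band=(0.3, 0.7)):
        # Count the bases observed at one position and call a diploid genotype.
        if len(pileup_bases) < min_depth:
            return "no-call"  # too little coverage to decide
        counts = Counter(pileup_bases).most_common(2)
        if len(counts) == 1:
            return counts[0][0] * 2  # only one allele seen: homozygous
        (a1, n1), (a2, n2) = counts
        fraction = n2 / (n1 + n2)
        if het_band[0] <= fraction <= het_band[1]:
            return a1 + a2  # roughly 50/50: heterozygous
        return a1 * 2  # rare second allele: likely sequencing noise

    print(naive_genotype("AAAAATAAAA"))  # AA: one T in ten looks like an error
    print(naive_genotype("AATATATAAT"))  # AT: about half the bases are non-reference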

53
00:07:18,533 --> 00:07:25,099
Some of the tools also can, by themselves, determine insertions and deletions. Sometimes this is a separate package, but again, an

54
00:07:25,100 --> 00:07:34,566
important class of variation and so something that’s worth dealing with and trying to discover. Some of the programs, in particular, the Genome

55
00:07:34,566 --> 00:07:43,699
Analysis Toolkit, use local reassembly to improve indel calls, and basically they identify a position at which there’s been an insertion, but

56
00:07:43,700 --> 00:07:49,900
understand that the alignment tool may not be powerful enough to properly align that. So, they’ll take all those reads and perform a de novo

57
00:07:49,900 --> 00:07:56,533
assembly—just line them up against each other—to get a better sense of what the actual sequence is, and that can be a very powerful

58
00:07:56,533 --> 00:08:05,099
way to improve the accuracy of your indel calls. And really, for all these methods, the quality is highly dependent on the alignment because the methods

59
00:08:05,100 --> 00:08:13,333
assume that the alignments are correct and I’ve told you just now that they’re not always so. So really, these methods themselves are quite good

60
00:08:13,333 --> 00:08:20,866
but it does depend on good alignments. I should also point out that while these methods—the genotype calling methods and the alignment

61
00:08:20,866 --> 00:08:28,599
methods—are all performing very similarly, though certainly each has strengths and weaknesses. So, when you are comparing datasets it’s

62
00:08:28,600 --> 00:08:36,000
probably a very good idea to use the same tools to align and to call genotypes so that you’re not just seeing differences in the individual

63
00:08:36,000 --> 00:08:46,433
approaches of each method. Okay, so now you have a list of variants, and we want to get some more information. What is the effect of this variant? We

64
00:08:46,433 --> 00:08:54,333
would like some contextual information about what a variant might actually be involved with. So, is it in a gene? Is it coding? Amino acid

65
00:08:54,333 --> 00:09:04,966
change, or detrimental—things like that? So just to highlight this again: if you have a variant here, it’s a variant. Okay, that’s not so much information, but

66
00:09:04,966 --> 00:09:11,666
if you do annotation, you can say well, maybe the variant actually falls within a gene, it falls in a coding region so it’s causing an amino acid

67
00:09:11,666 --> 00:09:19,566
change, it seems to be conserved and things like that, so now this may give you more information as to what the variant could actually be doing in a

68
00:09:19,566 --> 00:09:29,799
functional way. And so, software. These are just some examples of software that can do the gene annotation. Some of these run locally, so these

69
00:09:29,800 --> 00:09:38,366
two and this one you download the program and can do the analysis locally. Some of them you have to send your data off to other places and

70
00:09:38,366 --> 00:09:46,732
there can be issues involved with that. Generally they all will do gene annotation, so they will tell you if your variant is in a gene, if it’s coding; if it’s

71
00:09:46,733 --> 00:09:55,866
coding, is there an amino acid change? And then each one offers some additional features: using different formats, calculating distances to nearby

72
00:09:55,866 --> 00:10:04,966
genes, or other sorts of prediction methods. This one here uses PolyPhen, one of the common detrimental-effect prediction methods. This one

73
00:10:04,966 --> 00:10:15,732
integrates with more commonly used tools and easy-to-use wrappers that I’ll talk about, and can read and write common file formats. So in addition to

74
00:10:15,733 --> 00:10:22,966
annotation, you would also perhaps be interested in variant consequence and several groups have looked at how to predict consequence. Generally,

75
00:10:22,966 --> 00:10:31,899
these methods are all based on amino acids so they are very exon-centric. In general, they’re using conservation, in different ways: this one

76
00:10:31,900 --> 00:10:39,566
uses PSI-BLAST; PolyPhen does use conservation but also sequence features and, when available, structural features to attempt to

77
00:10:39,566 --> 00:10:47,266
say a given amino acid change could be having an effect on the function of the protein; and this one here uses the Conserved Domains Database.
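
To make the conservation idea concrete, here is a toy scorer over multiple-alignment columns; the 0.9 cutoff and the columns themselves are invented, and real tools like SIFT and PolyPhen use much richer models:

    def conservation(column):
        # Fraction of sequences in one alignment column carrying the most
        # common residue.
        return max(column.count(aa) for aa in set(column)) / len(column)

    # One invented column per protein position (human residue listed first).
    for col in ["GGGGGGGG", "GGAGGSGG", "ALIVMLIV"]:
        score = conservation(col)
        verdict = "change likely damaging" if score > 0.9 else "change may be tolerated"
        print(col, round(score, 2), verdict)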

78
00:10:47,266 --> 00:10:56,599
So these methods are predictive, and because they’re predictions it’s important to remember that they’re not actual observations but predictions based

79
00:10:56,600 --> 00:11:06,833
on other data sets. Down here we have sort of curated databases like the Human Gene Mutation Database where people have gone through the

80
00:11:06,833 --> 00:11:17,066
literature looking for variants associated with given diseases or functional aspects, and so this is more based on observation; however,

81
00:11:17,066 --> 00:11:26,966
it’s a manual curation. In this case, this one is subscription-based. If you’re at NHGRI…at NIH, we have a site license. It’s

82
00:11:26,966 --> 00:11:34,399
important to remember with tools like this, though, that it’s a manual curation and folks in our group have found that perhaps your interpretation of

83
00:11:34,400 --> 00:11:42,133
the primary literature linking a gene to a disease might be different from the curators’, so it’s very, very important with these types of things to

84
00:11:42,133 --> 00:11:54,866
actually go back and check the primary literature just to make sure that you agree with the analysis that the curators made. And finally,

85
00:11:54,866 --> 00:12:03,666
there are tools now just coming out that really attempt to prioritize your variants ab initio. Two of these examples include VAAST, a program

86
00:12:03,666 --> 00:12:13,599
that prioritizes variants using a probabilistic approach incorporating amino acid substitution in an aggregative way, as well as inheritance.

87
00:12:13,600 --> 00:12:24,100
This program is free for academic research use; if you’re not an academic researcher then Omicia will happily sell it to you. VarMD, again, prioritizes

88
00:12:24,100 --> 00:12:33,533
variants using inheritance models. So, these tools try not to assume anything—but they are models—and then return a list of variants that could be

89
00:12:33,533 --> 00:12:43,399
most interesting for a given trait. And so just to give you an example of what an annotation kind of looks like, here you would have different rows

90
00:12:43,400 --> 00:12:49,533
for each variant and columns for each type of data so you can know a variant could be synonymous, this one non-synonymous, this is

91
00:12:49,533 --> 00:13:00,533
the gene it’s in, here’s the exon, the gene change, the amino acid change, some sort of predictive score of how detrimental the mutation may be, and

92
00:13:00,533 --> 00:13:12,266
perhaps some information about known reported associations with disease.
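
In code, one such annotated row might look like the following dictionary; every field name and value here is invented for illustration, since each annotation tool uses its own format:

    variant = {
        "chrom": "chr7", "position": 123456, "ref": "G", "alt": "A",
        "gene": "GENE1", "type": "non-synonymous",
        "aa_change": "p.Ala123Thr",        # the amino acid change
        "prediction_score": 0.02,          # e.g. a SIFT-like detrimental score
        "known_disease_association": True, # from a curated database
    }

So now you might want to know: how common is a variant? Is it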

93
00:13:12,266 --> 00:13:19,866
common at all? Is it seen only in certain populations or really has it been observed in a disease cohort? We have heard a lot today

94
00:13:19,866 --> 00:13:26,799
about these databases so I won’t go into great detail, just to say that I think many of these numbers are out of date and I apologize for that.

95
00:13:26,800 --> 00:13:35,433
Certainly there are really kind of two classes: some that really don’t have phenotype, like these, which are more population surveys of variants and

96
00:13:35,433 --> 00:13:45,033
frequencies; and then studies like, certainly, the NHLBI Exome Sequencing Project and ClinSeq—our own internal NHGRI effort—where there is

97
00:13:45,033 --> 00:13:54,933
phenotype information to be deposited or already deposited in dbGaP along with the sequencing information. So as you’ve heard today, applying

98
00:13:54,933 --> 00:14:03,599
these filters or applying these databases in a filtering type of way can be very dangerous, particularly for complex diseases where you

99
00:14:03,600 --> 00:14:12,700
might expect a variant to be in the population. Certainly, for rare diseases, where this might be more tractable, I’ll show you an anecdote about why

100
00:14:12,700 --> 00:14:21,433
you might want to be careful. So, this is the same variant I showed you before, and it actually was a good candidate for a rare disease we were studying in

101
00:14:21,433 --> 00:14:30,133
the lab: Proteus syndrome, the disease that the Elephant Man is thought to have been afflicted with. It’s very, very rare. We had identified a

102
00:14:30,133 --> 00:14:37,466
coding variant. The variant had actually been observed in cancer. Cancer is an overgrowth disorder, as is Proteus syndrome, so things

103
00:14:37,466 --> 00:14:44,899
were looking good. However, in the latest build of dbSNP the exact variant is there. So, had we filtered on dbSNP we would not have seen this

104
00:14:44,900 --> 00:14:52,400
variant, and it turns out the variant was probably in dbSNP because it had been observed in cancer. So, that’s just an anecdote, but I think either sort

105
00:14:52,400 --> 00:15:03,333
of setting frequency limits—to say that a variant is too common—or even more statistical approaches would be necessary when using these

106
00:15:03,333 --> 00:15:13,133
databases, although I should point out that they are quite powerful and useful.
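
A minimal sketch of that frequency-limit idea—the 1% cutoff and the field names are illustrative: rather than dropping everything ever seen in a database, drop only what is actually common there. Under that rule, the Proteus variant above would have survived the filter.

    MAX_POP_FREQ = 0.01  # illustrative ceiling for a rare, highly penetrant allele

    def keep(variant):
        freq = variant.get("population_freq")  # None if never observed
        return freq is None or freq < MAX_POP_FREQ

    variants = [
        {"id": "var1", "population_freq": 0.35},    # common: drop it
        {"id": "var2", "population_freq": 0.0002},  # in dbSNP, but rare: keep it
        {"id": "var3", "population_freq": None},    # never observed: keep it
    ]
    print([v["id"] for v in variants if keep(v)])   # ['var2', 'var3']

All right. So, I’ve kind of run through a lot of different analyses.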

107
00:15:13,133 --> 00:15:21,533
How can you run these programs? How do you tie it all together and do an analysis? Can they run automatically? Are there graphical interfaces?

108
00:15:21,533 --> 00:15:29,699
So, the more difficult answer is what we do at the Sequencing Center. You start, you identify genotypes, you go back for every variant you

109
00:15:29,700 --> 00:15:37,266
have and determine what that genotype is in other samples so that you can know if somebody is homozygous non-reference or just with

110
00:15:37,266 --> 00:15:46,866
missing data. As you’ve heard before, that’s an important step; this is what we call back genotyping.
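
A minimal sketch of that back-genotyping step, with invented data structures: the point is that “no variant call” splits into “covered and reference” versus “no data at all.”

    def back_genotype(site, sample):
        # Revisit one sample at a site that was discovered in another sample.
        depth = sample["depth"].get(site, 0)
        if depth == 0:
            return "./."  # missing data: no reads cover the site
        return sample["calls"].get(site, "0/0")  # covered but no variant: hom ref

    sample = {"depth": {"chr1:1000": 25}, "calls": {}}
    print(back_genotype("chr1:1000", sample))  # 0/0 -- covered, reference
    print(back_genotype("chr1:2000", sample))  # ./. -- no data at all

Then you add your annotations, you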

111
00:15:46,866 --> 00:15:54,966
add your frequencies and generate some sort of output file, and at the Sequencing Center this is all custom built using scripts, UNIX tools, databases,

112
00:15:54,966 --> 00:16:02,799
and cluster control. I mean, the advantage of this is that it’s absolutely customized to do exactly what we want. The disadvantage is we had to

113
00:16:02,800 --> 00:16:13,066
write it, and it’s actually a lot of effort to write and maintain. So, is there a way to do this if you don’t have a lot of informatics resources? So, I’d like to

114
00:16:13,066 --> 00:16:22,732
just discuss briefly the Galaxy tool, sort of an attempt—and a good one, I think—to do this in a graphical manner to allow you to build analysis

115
00:16:22,733 --> 00:16:32,366
pipelines where you can load data files and specify a string of analyses to run on them, almost in an automatic way. The tool is designed,

116
00:16:32,366 --> 00:16:41,932
as you can see, to be sort of graphical. It’s a Web interface, so it runs on the Web. In this particular case you would load your data here. This is a list

117
00:16:41,933 --> 00:16:49,066
of some of the available human variation analyses, so in this one, amino acid changes, you’d set some parameters here in a graphical

118
00:16:49,066 --> 00:17:00,899
manner. It’s generally well documented with examples, allowing you an easier way to run these analyses in a reproducible way. They

119
00:17:00,900 --> 00:17:08,700
do have next generation sequencing tools shown here—some of the tools to handle the alignment formats that are becoming more popular—and

120
00:17:08,700 --> 00:17:16,333
also text manipulation tools. Certainly one of the informatics challenges is just converting one type of text file to another type of text file for a given

121
00:17:16,333 --> 00:17:24,133
program, and with some programing knowledge that’s not a huge burden, but Galaxy does provide some tools to do this kind of thing if you

122
00:17:24,133 --> 00:17:31,933
don’t have a lot of programming experience. So, now you’ve got the data and you’d like to look at it. So, what are some of the tools you can use to

123
00:17:31,933 --> 00:17:39,433
visualize the data to really get your hands dirty and understand what’s going on? So in this case, tools can be useful both to evaluate the data for

124
00:17:39,433 --> 00:17:53,599
quality, just in a qualitative way, as well as to then identify biologically interesting variants. And so, starting from the sort of simple and plain

125
00:17:53,600 --> 00:18:02,133
display, this is a program, tview, part of the SAMtools package, just to view the alignments. And so what’s shown here: each of these sort of strings

126
00:18:02,133 --> 00:18:09,966
of characters is a sequence read, the 70 or 100 bases. Commas and dots mean it’s reference and when you see something that’s not a comma or a

127
00:18:09,966 --> 00:18:18,599
dot, it’s a variant. Colors indicate qualities; you can see maybe around here the quality is not so good. Here is a non-reference base, but at this

128
00:18:18,600 --> 00:18:26,033
position, and here is the reference genome, at this particular position it’s the only one, so this is an example of what’s likely a sequencing error. At

129
00:18:26,033 --> 00:18:33,699
this position you can see that there is more than just one variant base—there’s about half of them—and so this would be a heterozygous position.

130
00:18:33,700 --> 00:18:42,533
You can see one here as well. So, this tool is fast, it’s text based, but I’ve pretty much shown you all the functionality; it’s pretty basic.
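
The comma/dot convention is the same one used in samtools’ text pileup output, so a toy reader of one pileup base column might look like this (real pileup strings also encode indels and read starts and ends, which this sketch ignores):

    def allele_fraction(bases, ref):
        # "." and "," mean the read matched the reference (forward/reverse
        # strand); letters are mismatches.
        calls = [ref if b in ".," else b.upper() for b in bases]
        return sum(1 for c in calls if c != ref) / len(calls)

    print(allele_fraction("..,,.t,..,", "A"))  # 0.1 -- a lone mismatch, likely an error
    print(allele_fraction("..TT,.T,TT", "A"))  # 0.5 -- about half, likely heterozygous

So, if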

131
00:18:42,533 --> 00:18:52,499
you want a more fully featured way to view your alignment information, IGV, the Integrative Genome Viewer, written by folks at the Broad, is

132
00:18:52,500 --> 00:19:01,000
another option. You can see here this one is much more graphical. You have all your reads here in gray, you can see the direction with little

133
00:19:01,000 --> 00:19:07,900
arrows. These are the same two variants I showed you before, so you can again see about the 50% frequency. Here are two more, and if you

134
00:19:07,900 --> 00:19:15,866
can see this dot, this is actually an insertion in about half the reads. Down here you’ve got your gene models, you have a cytoband here, and the

135
00:19:15,866 --> 00:19:23,832
neat thing about this program is that it’s designed to read many different types of genomic file information such as SNP chips, arrays, and

136
00:19:23,833 --> 00:19:33,533
things like that. Many different types of data can be loaded in here and investigated at the same time. So this one allows zooming. You can

137
00:19:33,533 --> 00:19:40,599
highlight the reads to get more information. Many features. It can be run locally on your computer or even started with a Web launcher, so it’s

138
00:19:40,600 --> 00:19:50,200
designed to be fairly easy to use. So, you’ve looked at the raw data and now you have your final data, and this is sort of an example of the

139
00:19:50,200 --> 00:19:58,900
problem. You’ve done a sequencing study. The sequencing team says, “Wow, it worked great! We’ve got so much data here!” and it’s a lot, right?

140
00:19:58,900 --> 00:20:06,533
I mean, 50 to 1,000 columns, perhaps, depending on how many samples you have. In an exome experiment we’re finding for one sample it’s

141
00:20:06,533 --> 00:20:13,099
around 100,000 variants; certainly, with more samples, that increases. If you’re looking at whole genome you’re starting at around three

142
00:20:13,100 --> 00:20:22,166
million. Maybe you can load this into Excel, but even in Excel, this is a lot of data, so this is really a problem, and what I wanted to do is address

143
00:20:22,166 --> 00:20:30,566
this problem by creating a tool designed for genetic variation data that would allow people in a graphical manner to analyze and really get their

144
00:20:30,566 --> 00:20:38,199
hands dirty with the data. So this is a tool I’ve written called VarSifter. It allows for the viewing, sorting and filtering of variants, and in the

145
00:20:38,200 --> 00:20:47,100
remaining time I’ll just walk you through the program and show you what it can do. So first, you can see the main view here. You have your

146
00:20:47,100 --> 00:20:54,466
annotation information. Every row here is a different variant. Every column is a piece of information, so here you have your location, here

147
00:20:54,466 --> 00:21:03,232
you have gene name, the mutation type, the base change, the amino acid change, is it in a database of known variation? Clicking on any one of these

148
00:21:03,233 --> 00:21:10,733
rows brings up the genotypes for all your samples, so you can see what each sample is at that given position with scores and coverages.

149
00:21:10,733 --> 00:21:19,166
Each of the columns is sortable, so clicking a column allows you to prioritize your variants by gene name, by location, by SIFT score, or any type of

150
00:21:19,166 --> 00:21:27,899
annotation information that’s in the file. And so you probably can’t see it, but the number of variants in this file is about 76,000. Okay, that’s a

151
00:21:27,900 --> 00:21:37,266
lot of variants, so how can you sort of reduce the list to those that might be most interesting for your disease or what it is you’re studying? And

152
00:21:37,266 --> 00:21:44,666
so, you can start just by filtering based on mutation type. Let’s, for instance, click Splice-site and Stop and look at just those variants. So, you

153
00:21:44,666 --> 00:21:51,566
click the boxes, you click Apply Filter down here and now we’re down to 133 variants, so we’re getting more reasonable. Can we reduce the list

154
00:21:51,566 --> 00:21:59,332
more? Even though everybody here has told you not to, you can exclude variants in a known database. I’ll probably have to take that out at

155
00:21:59,333 --> 00:22:06,899
some point, but you can do it. You click the box, you click Apply Filters, and now we’re down to around 25 variants. So, in that way you can

156
00:22:06,900 --> 00:22:15,666
very rapidly dive in and look at a very limited list of variants to see if anything here could be interesting for further study. If not, the program is

157
00:22:15,666 --> 00:22:25,466
really designed to allow you to back out and dive in a different way: you just clear all your filters, apply filters again, and you’re back

158
00:22:25,466 --> 00:22:34,766
to where you started. So, you can tell the program that a sample is part of an affected-normal pair, perhaps like a tumor-

159
00:22:34,766 --> 00:22:44,299
normal pair, and so once that’s been done you can click the Affected Different from Normal box and this will show you variants that are different

160
00:22:44,300 --> 00:22:52,566
in at least some number of pairs; a very simple filter. In this case this is simulated data, so there is only one, but it’s a quick way to look at

161
00:22:52,566 --> 00:23:04,699
differences in samples that really should be the same. You can do this manually for each sample by checking boxes whether it’s part of an

162
00:23:04,700 --> 00:23:10,100
affected-normal pair or a case/control. Case/control then opens up a very simple filter that just allows you to see variants that are seen in X or

163
00:23:10,100 --> 00:23:19,466
more cases and Y or fewer controls. So, it’s a very simple way to look at the data, but it can be useful.
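
A sketch of that filter, with illustrative thresholds and a simple 0/1 genotype encoding (0 = reference, 1 = non-reference):

    def passes(variant, min_cases=3, max_controls=0):
        return (sum(variant["case_genotypes"]) >= min_cases
                and sum(variant["control_genotypes"]) <= max_controls)

    v = {"case_genotypes": [1, 1, 1, 0], "control_genotypes": [0, 0, 0, 0]}
    print(passes(v))  # True: seen in three cases and no controls

Gene name. Certainly, many people have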

164
00:23:19,466 --> 00:23:27,366
genes of interest that they want to look at quickly. You just type the name in the box. If you’re familiar with regular expression syntax, which is

165
00:23:27,366 --> 00:23:34,099
a programmatic way to define searches, this tool accepts that. If you’re not familiar with what regular expressions are I encourage you to

166
00:23:34,100 --> 00:23:42,866
check that out on the Web because it’s a very, very powerful way to do a textual search and it can be very, very useful. So, you apply the filter

167
00:23:42,866 --> 00:23:53,966
and here are all the variants; in this case, in the CFTR gene.
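
For instance, with an invented gene list, a couple of regular-expression searches might look like this:

    import re

    genes = ["CFTR", "CFTR-AS1", "PKD1", "PKD2", "PKHD1", "COL4A5"]

    exact_family = re.compile(r"^PKD\d+$")  # "PKD" followed only by digits
    print([g for g in genes if exact_family.search(g)])  # ['PKD1', 'PKD2']

    prefix = re.compile(r"^CFTR")  # any name starting with "CFTR"
    print([g for g in genes if prefix.search(g)])  # ['CFTR', 'CFTR-AS1']

Some other filters…if you have a whole list of genes you’re interested in—a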

168
00:23:53,966 --> 00:24:01,299
pathway or perhaps a region—you can enter a list of those genes and the program will then filter, giving you the variants that are in that list of

169
00:24:01,300 --> 00:24:08,800
genes. If you have a region of interest, perhaps GWAS peaks or linkage regions, you can identify only the variants that fall within that region, which

170
00:24:08,800 --> 00:24:19,700
you can give the program as a BED file.
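
A minimal sketch of interval filtering in the BED convention (0-based starts, end-exclusive; all coordinates invented):

    regions = [("chr16", 2088000, 2185000),
               ("chr4", 88000000, 89000000)]

    def in_regions(chrom, pos):
        # pos is 1-based, so a BED interval [start, end) contains it exactly
        # when start < pos <= end.
        return any(c == chrom and start < pos <= end for c, start, end in regions)

    variants = [("chr16", 2134000), ("chr16", 9000000), ("chr4", 88500123)]
    print([v for v in variants if in_regions(*v)])  # keeps the first and the last

So finally, the program also includes a custom filter builder. If the filter you need is not built in, you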

171
00:24:19,700 --> 00:24:27,833
can really filter on any of the annotation columns you have, as well as do sample comparisons, to allow the use of all the tools that are still

172
00:24:27,833 --> 00:24:37,233
being developed and will continue to be important and useful. And so, you open a custom query window and, for instance, let’s say we want to

173
00:24:37,233 --> 00:24:45,066
filter on this type annotation, so we click it, and now in here we see actually all the values that are in that column. So, we can say Exactly

174
00:24:45,066 --> 00:24:52,832
Matches. We can either pick one of the values that’s in that column or type in a search text. You probably can’t see it, but we’ll click this that says

175
00:24:52,833 --> 00:25:01,366
Stop, and so now we have a little query box that says Type Equals Stop. Okay, now we can do a sample comparison, so we can click Affected. In

176
00:25:01,366 --> 00:25:08,166
this case we’ll click “Does Not Match” and now we can either pick a genotype so we can say this sample does not match homozygous

177
00:25:08,166 --> 00:25:15,766
reference, we want to identify variants, or say “Does not match this normal” so we can click that, and now we have another query. So, we

178
00:25:15,766 --> 00:25:22,232
can draw a box around these and link them together with logical statements. In this case we will say “and,” so both of these would have to be

179
00:25:22,233 --> 00:25:31,566
true, and in this way you can build up a really custom sort of filter.
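
In code, the query just assembled amounts to two predicates joined by a logical AND (field names and genotype strings are invented for illustration):

    def type_is_stop(v):
        return v["type"] == "Stop"

    def affected_differs_from_normal(v):
        return v["genotypes"]["affected"] != v["genotypes"]["normal"]

    def query(v):
        # The "draw a box and link with AND" step: both conditions must hold.
        return type_is_stop(v) and affected_differs_from_normal(v)

    variant = {"type": "Stop", "genotypes": {"affected": "AT", "normal": "AA"}}
    print(query(variant))  # True

Actually, this one’s pretty simple. I’ve seen users build these with hundreds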

180
00:25:31,566 --> 00:25:42,532
of nodes, and it actually still performs quite quickly, so that’s a very useful way to dive into the type of data you might want to find. Then you

181
00:25:42,533 --> 00:25:50,099
just finalize it and you’re good to go. So, this tool was really designed to allow end users without informatics experience to really dig into the data

182
00:25:50,100 --> 00:26:01,100
in a visual, filter-based kind of way. It’s been quite successful and it’s been used to identify variants in a number of rare diseases, so people

183
00:26:01,100 --> 00:26:09,333
tend to find it very useful. Actually, even though I’ve got a lot of programming experience now, I find this easier to use than just writing a custom

184
00:26:09,333 --> 00:26:17,633
script every time I want to find something. So in summary, there are really several approaches to the initial alignment and genotype calling. The

185
00:26:17,633 --> 00:26:24,833
methods—on average, the more recent ones all perform similarly, with slight differences between them—although many of them share the

186
00:26:24,833 --> 00:26:32,599
same challenges due to things like read-length and alignment artifacts. Annotation gives you context to get more information about what a

187
00:26:32,600 --> 00:26:41,600
particular change could actually mean biologically. Consequence prediction and population analysis can help guide which variants might be more

188
00:26:41,600 --> 00:26:49,466
interesting to you. These tools require varying experience but I think there is sort of more of an effort now to make things easier to use so that

189
00:26:49,466 --> 00:26:58,799
more people can use them, certainly something that’s of interest to many. And finally, visualization is useful for the quality evaluation and just sort of

190
00:26:58,800 --> 00:27:09,666
understanding the data in a qualitative way, as well as prioritization of your variants for further studies. And with that, there are a lot of people to

191
00:27:09,666 --> 00:27:17,766
thank. Genomics is quite a collaborative field: the folks at the Sequencing Center, my mentors Drs. Mullikin and Biesecker, and some of the

192
00:27:17,766 --> 00:27:25,166
collaborators, and thank you. Oh, and I should also point out that in the folder there are links and names of a lot of the programs I list, so

193
00:27:25,166 --> 00:27:38,332
thanks to the organizers for including that. So, if I went too fast, the names of all the programs and their links are in the folder you got. Thanks.

194
00:27:45,433 --> 00:27:51,466
MALE: We’ve talked about the importance of phasing today. Do any of these programs in the pipeline phase the sequencing?

195
00:27:51,466 --> 00:28:03,232
JAMIE TEER: Most of them do not, generally due to limitations of short reads. In some cases, in one of the viewing tools I actually picked two

196
00:28:03,233 --> 00:28:10,933
reads that were close enough that you could get phased and they weren’t phased, but in general the distance is just too great. So, the tools aren’t

197
00:28:10,933 --> 00:28:18,599
really handling that yet. But I think as the technology for sequencing improves, and even if you don’t get longer reads, if there are ways to

198
00:28:18,600 --> 00:28:27,000
get phasing, I think that will be important to include because it really is greatly important.
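
As a rough sketch of that constraint—read length as the only parameter, which ignores paired-end fragments and other tricks:

    READ_LENGTH = 100  # bases per read; illustrative

    def directly_phaseable(pos1, pos2, read_len=READ_LENGTH):
        # Two heterozygous sites can be phased by a single read only if one
        # read can span both positions.
        return abs(pos2 - pos1) < read_len

    print(directly_phaseable(10100, 10180))  # True: 80 bases apart
    print(directly_phaseable(10100, 11500))  # False: too far for one 100-base read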

199
00:28:27,000 --> 00:28:38,200
MALE: Is there a certain minimum where it will work to get phased or is it just completely specific per probe?

200
00:28:38,200 --> 00:28:45,466
JAMIE TEER: It depends mostly on the sequencing length, and so given 100-base reads, you cannot really detect things 100 bases apart

201
00:28:45,466 --> 00:28:54,599
but maybe up to 80 or 90. It’s something we’ve definitely thought about looking into, but not too many variants fall in that range, so I think more

202
00:28:54,600 --> 00:29:04,166
clever approaches are going to be needed or longer reads. Great. Thank you very much.



