Whole Genome Approaches to Complex Kidney Disease
February 11-12, 2012 Conference Videos

Data Sharing: Access and Confidentiality
Laura Rodriguez, NHGRI

Video Transcript

1
00:00:00,833 --> 00:00:09,399
SARA HULL: Hi, my name is Sara Hull. I have a joint appointment with the National Human Genome Research Institute and with NIH’s Clinical

2
00:00:09,400 --> 00:00:20,733
Center’s Department of Bioethics, and it’s a real privilege to have worked with Dr. Kopp and Dr. Shaw and the rest of the planning committee as

3
00:00:20,733 --> 00:00:31,566
part of this meeting. We are now going to turn our focus to paying explicit attention to the ethical, legal, and social implications of the very important

4
00:00:31,566 --> 00:00:42,499
research that we’ve been discussing today, and it’s heartening to hear just how much of the discourse has actually, at least implicitly, focused

5
00:00:42,500 --> 00:00:54,800
on these issues. This morning we heard quite a bit about the need for collaboration, transparency, sharing and that’s a nice transition into the

6
00:00:54,800 --> 00:01:02,133
presentation that we’re going to hear this evening. I do want to point out there is one talk in this area this evening and then three more

7
00:01:02,133 --> 00:01:09,199
excellent talks tomorrow morning and then the breakout session as well, so we are going to carry this through. So, it’s my pleasure to

8
00:01:09,200 --> 00:01:19,133
introduce Dr. Laura Lyman-Rodriguez who directs our Office of Policy, Communications and Education at NHGRI. She oversees the

9
00:01:19,133 --> 00:01:27,933
development of the institute’s policy positions on the ethical, legal, and social implications of human genome research. She did her doctorate in cell

10
00:01:27,933 --> 00:01:35,999
biology at the Baylor College of Medicine and then transitioned into policy work as a congressional science fellow, and then at the Institute of

11
00:01:36,000 --> 00:01:48,433
Medicine before coming to the NIH in 2002. And now, I’m quoting from our boss, Eric Green, who is the Director of the institute. He points out that

12
00:01:48,433 --> 00:01:56,099
“Laura’s science background provides a strong foundation for understanding the implications of genomic advances. Her extensive experience in

13
00:01:56,100 --> 00:02:04,666
science policy allows her to provide key leadership in a rapidly changing and complex area of research,” and her credibility—and this is

14
00:02:04,666 --> 00:02:16,366
me talking, which I know very well, personally—serves to galvanize widespread respect for her by staff within and outside NHGRI. Laura was

15
00:02:16,366 --> 00:02:26,266
integrally involved with the development of the NIH’s GWAS data sharing policy as well as what’s going to be happening next. I couldn’t

16
00:02:26,266 --> 00:02:38,732
have come up with anybody more qualified to talk to us today about data sharing, access, and confidentiality.

17
00:02:38,733 --> 00:02:43,666
LAURA LYMAN-RODRIGUEZ: Okay. Thank you and I think we should all just go home now with that lovely introduction from Sara so that I don’t

18
00:02:43,666 --> 00:02:52,699
do anything to counteract that within the next few minutes. I am happy to be here to talk to everyone today and to begin the exploration that

19
00:02:52,700 --> 00:03:01,066
you all will have tomorrow through what does look like a great series of talks in the breakout groups. To think about the activities around data

20
00:03:01,066 --> 00:03:09,599
sharing, the principles underlying how to share data in a responsible way that is respectful of the human participants who give their time and

21
00:03:09,600 --> 00:03:18,933
their samples and information to our work and in a way that maintains the public trust going forward. So, I’m going to talk a lot about the

22
00:03:18,933 --> 00:03:25,499
GWAS policy, but really try to think about that, too, in terms of what are the questions to ask ourselves as we go forward in science as our

23
00:03:25,500 --> 00:03:34,633
science changes and as we just try to work together with the various variables, as I’ve called them here; a different kind of variant than what

24
00:03:34,633 --> 00:03:40,066
you’ve been talking about for most of the day, but definitely things that we don’t always know exactly what they mean as we’re working on

25
00:03:40,066 --> 00:03:51,066
them. The principle of data sharing is not new. Certainly, for NIH, it is something that the agency has believed strongly in for a long time, has

26
00:03:51,066 --> 00:04:01,299
implemented through various policies for many years, and the quote that is on here—that’s on this slide—is from the 2003 Data Sharing Policy. It

27
00:04:01,300 --> 00:04:10,900
emphasizes here that this really is trying to move research forward in a way that attains the greatest public benefit from the federal

28
00:04:10,900 --> 00:04:20,966
investment, from the investment of the, again, research participants who join with us in the pursuit of the research and the questions to bring

29
00:04:20,966 --> 00:04:29,099
about better health. When we’re trying to think about any research program there are always, regardless of our perspective, there are always

30
00:04:29,100 --> 00:04:37,866
lots of different elements which need to come together. The ones that I try to look at when I’m thinking from a policy perspective are, first of all,

31
00:04:37,866 --> 00:04:46,966
what is the scientific design, what are the scientific aims that we’re trying to accomplish in the research project, and the program priorities

32
00:04:46,966 --> 00:04:54,399
specifically separate from the technical issues of how the science is going to be done. Because if you can’t achieve those scientific aims, it’s not

33
00:04:54,400 --> 00:05:01,466
going to matter what policies you might create because it’s not going to be a successful project. You do have to think about the practicalities of

34
00:05:01,466 --> 00:05:11,866
existing policies, practices, and procedures that may be in place that you’re going to have to conform to as you pursue the research, and also

35
00:05:11,866 --> 00:05:19,399
the guiding principles again. What do you want to achieve and not just from a scientific perspective always but from an ethical perspective? How do

36
00:05:19,400 --> 00:05:28,000
you want to conduct the research? What is really important to the aims of the research and to the aims of the research enterprise, more globally

37
00:05:28,000 --> 00:05:35,200
sometimes? And then, getting down to brass tacks, how are you going to make it happen? What are the policy and ethics that you are going

38
00:05:35,200 --> 00:05:41,933
to build around your particular project or initiative? How are you going to maintain transparency so that there’s credibility in what

39
00:05:41,933 --> 00:05:49,033
you’re doing, both with the research community and the stakeholders of, again, any particular initiative, but also in terms of thinking broader at

40
00:05:49,033 --> 00:05:57,733
the more public trust level and sustaining that trust that we have? And then also from a practical perspective, how are you going to

41
00:05:57,733 --> 00:06:05,566
share the responsibilities of accomplishing all of these different aims and objectives with those who are doing it—because we do operate in a

42
00:06:05,566 --> 00:06:13,766
distributed model and so we as funders are working with everyone’s institutions and with the investigators—and so how do we work together

43
00:06:13,766 --> 00:06:24,132
to make things happen? So, what are the questions? Again, thinking about this and things that I think it sounds like you’ve already been

44
00:06:24,133 --> 00:06:28,799
talking about today, what are the questions to ask ourselves as we go forward and try to put this together? For the things that I was asked to talk

45
00:06:28,800 --> 00:06:35,566
about, we’re thinking about access and confidentiality, so jumping right to the heart of some of these issues around genomic research

46
00:06:35,566 --> 00:06:47,299
is: can we calculate the risks to privacy? What might they be? What do they mean exactly? How do we calibrate, too, the protections and oversight

47
00:06:47,300 --> 00:06:56,333
that we are going to put into place in a risk-based model, which is what we try and put forward because there is always risk? How are we going

48
00:06:56,333 --> 00:07:04,666
to manage the risk? Can we quantify it? And again, an important question, particularly for me in an area of research that I’m very interested in is:

49
00:07:04,666 --> 00:07:14,532
what are the participant perceptions of risk and their tolerance for this risk if we can articulate what it is? How do we respect the wishes of the

50
00:07:14,533 --> 00:07:22,266
participants? As we go forward, research is a never-ending process; we always have another question. How do we respect what that individual

51
00:07:22,266 --> 00:07:32,099
in a particular study wanted to have done by participating in the research by donating their time and their information and specimens? And then,

52
00:07:32,100 --> 00:07:38,000
of course, since science is ever moving and technology with it, how do we reconcile what the current research might be and the current

53
00:07:38,000 --> 00:07:47,900
technology and parameters with what past expectations may have been for science, for the participants, for the public, etc.? So, we can’t

54
00:07:47,900 --> 00:07:55,666
always think just about going forward, but how do we achieve our principles also for those studies that may have been done before but

55
00:07:55,666 --> 00:08:07,532
where we’d like to continue to learn from them if we have collections of specimens again, now that we can ask new questions? So, just to start

56
00:08:07,533 --> 00:08:15,399
with identifiability again, and the questions of privacy, this is now—I realized as I was looking at this—almost an eight-year-old slide, so it’s

57
00:08:15,400 --> 00:08:23,133
really outdated and that’s okay. This slide is looking actually at single nucleotide polymorphisms and trying to first quantify how

58
00:08:23,133 --> 00:08:29,666
much you have to have before you have a unique pattern. And what’s important for my point now is that, of course, when we’re talking

59
00:08:29,666 --> 00:08:35,266
about whole genomics, we have unique patterns. We had it when we were looking at GWAS studies and as we move into whole genome

60
00:08:35,266 --> 00:08:42,499
sequence information, of course we do as well. And so, it’s unique but again, how do we calculate the risk, how do we calculate the

61
00:08:42,500 --> 00:08:52,400
tolerances of participants. How do we manage those risks? It really is about balance, as with everything. Again, this figure is from a paper that

62
00:08:52,400 --> 00:09:02,566
is now almost five years old from Francis Collins and Bill Lowrance, trying to ask the questions of: what does it mean to be de-identified in terms of

63
00:09:02,566 --> 00:09:10,666
genomic data? How do you manage it? What systems can you put in place to look at the risks, to be able to share the data and achieve the

64
00:09:10,666 --> 00:09:20,032
scientific aims while respecting the participant protection issues? My main reason for showing this slide is that it was complicated then and it’s

65
00:09:20,033 --> 00:09:29,333
complicated now. It’s not really important what the different boxes on this slide say; there are many different ways to make data de-identified or

66
00:09:29,333 --> 00:09:38,433
render it identifiable. There are many different ways to backtrack it if you wanted to try and backtrack it to a particular individual and have the

67
00:09:38,433 --> 00:09:46,466
means and enough information to do so, and we’re still at a place where we have uncertain and debatable risk, and we’re still at a place

68
00:09:46,466 --> 00:09:53,499
where we don’t know exactly what the participants think about this debatable risk. But what we do know is that we need to balance

69
00:09:53,500 --> 00:10:05,966
these risks, these unknown and ill-formed risks in some places, or at least it’s very subjective and so it varies from person to person what that risk

70
00:10:05,966 --> 00:10:15,032
is and what their tolerance is. So, how do we balance it on a large scale across the population to go forward? Those questions are exactly the

71
00:10:15,033 --> 00:10:25,799
same today as they were in 2007. Mixed into all of these different questions is the different way that we’re doing our science today. When our

72
00:10:25,800 --> 00:10:34,333
current participant protection system was designed, most everything was done through single projects with very specific consents at a

73
00:10:34,333 --> 00:10:42,366
single institution, many times, and for a defined timeline or a defined period of time, and that’s just not what’s happening anymore. We have many

74
00:10:42,366 --> 00:10:50,632
projects that are being done on the same datasets. We still sometimes have very specific consents that we need to deal with but we do, in

75
00:10:50,633 --> 00:10:58,399
some cases, have broader consents because again, with time, there’s a pendulum that seems to go with what’s believed to be appropriate

76
00:10:58,400 --> 00:11:05,233
consent. We also have the entrance of governance models coming into how we’re sharing data and how we’re thinking about

77
00:11:05,233 --> 00:11:14,433
overseeing the sharing of data. But certainly now, we’re at a point where we have undefined timelines going forward. These data exist, once

78
00:11:14,433 --> 00:11:23,499
the specimens are there you can generate genomic information and that genomic information is not degradable and so it’s going to stay there

79
00:11:23,500 --> 00:11:32,400
and we’ll be able to ask questions of it for a very long time. And so, how does that change the nature of how you design the oversight systems

80
00:11:32,400 --> 00:11:40,833
to protect the interests of the participants, of those individuals who originally were in the studies? Oftentimes what I feel, and what I’ve heard

81
00:11:40,866 --> 00:11:48,199
other people say they feel, is that you just don’t know which way to go. You feel like every which way you turn there is another question, another problem,

82
00:11:48,200 --> 00:11:56,800
another roadblock. That’s just not what we want and not what we are trying to accomplish, so we need to keep looking for ways to untangle this

83
00:11:56,800 --> 00:12:06,866
mess and not end up running into the brick wall on the other side. So of course, that brings me to GWAS. We won’t ask whether or not there is a

84
00:12:06,866 --> 00:12:17,366
brick wall on the other side of this right now but I think all of these questions that I’ve talked about so far are what we were thinking about in trying

85
00:12:17,366 --> 00:12:22,066
to put together the GWAS policy originally, and even more so, what we have been thinking about as we’ve gone forward the last four years trying

86
00:12:22,066 --> 00:12:29,366
to implement the policy, trying to maintain the currency of the policy and its oversight mechanisms. The guiding principle that we

87
00:12:29,366 --> 00:12:37,199
established when we were trying to build this was, again, building directly from the NIH statements on data sharing before, where we

88
00:12:37,200 --> 00:12:45,666
wanted to attain maximum public benefit by making these data available to as many investigators for as many appropriate research

89
00:12:45,666 --> 00:12:58,066
questions as was possible in a timely way. We wanted to do this in a way that was true to our aims which was, of course, respect for the

90
00:12:58,066 --> 00:13:05,232
participants involved in the primary studies that promoted data sharing and that also enabled a freedom to operate with some of the very basic

91
00:13:05,233 --> 00:13:12,566
information that would be found through GWAS studies, so that these data or these initial association findings would never be tied up

92
00:13:12,566 --> 00:13:20,099
in intellectual property challenges, and again, could be available for innovation by as many different people as had good and appropriate

93
00:13:20,100 --> 00:13:28,833
ideas. You all have been talking about the scientific aspects underlying GWAS and what you wanted to attain, so there is really no need to

94
00:13:28,833 --> 00:13:37,299
go through this, but of course it is one of my three essential elements and so I just wanted to come up here and again make the point that we

95
00:13:37,300 --> 00:13:45,500
are talking about, of course, very large datasets involving very many people with lots of different genomic information and phenotypic information

96
00:13:45,500 --> 00:13:54,600
when we are trying to share that in context of this policy as broadly as is appropriate and through a government database. This begins to

97
00:13:54,600 --> 00:14:02,866
come into our implementation questions, where we’re having to deal with the realities of what our existing situation is, and being a government

98
00:14:02,866 --> 00:14:13,432
entity, this added something new and completely unrelated to the science that still had to be accommodated. The GWAS research process,

99
00:14:13,433 --> 00:14:27,066
as you know of course, begins with the individual study. There is an informed consent process—an agreement, a discussion, a document—that occurs in the

100
00:14:27,066 --> 00:14:33,099
beginning and it represents the agreement between the individual research participant and the primary investigator, and that informed

101
00:14:33,100 --> 00:14:41,900
consent was, we thought, fundamental to all of the future research going forward and so we built that in as an underlying tenet for everything

102
00:14:41,900 --> 00:14:53,433
that would be done in the future. I’ll show you how we implemented that in a moment. When genomic studies were conceived by the

103
00:14:53,433 --> 00:15:00,466
investigators—again, this could have been an intent of the study from the beginning, or they could be deciding to go back to

104
00:15:00,466 --> 00:15:06,866
data sets in their freezer or sample sets in their freezer and doing this retrospectively, but when they determined that they wanted to do a GWAS

105
00:15:06,866 --> 00:15:14,532
study and requested support from the NIH and received the support, that is when the policy became effective. There was an expectation

106
00:15:14,533 --> 00:15:23,899
then to submit the data—the genomic data as well as the phenotypic data—to a central repository at the NIH and that all of that data would come in an

107
00:15:23,900 --> 00:15:32,833
encoded way with all of the identifying information removed, again being practical, with many different ways to de-identify information.

108
00:15:32,833 --> 00:15:44,099
We had to pick a standard and the standard that we chose was the 18 identifiers from the HIPAA Privacy Rule and that is what we defined within the policy

109
00:15:44,100 --> 00:15:56,433
as representing de-identified data. Going back to the informed consent issues, for setting up our parameters for how data would be shared for

110
00:15:56,433 --> 00:16:05,566
future use, we asked the submitting institutions to determine, based on a review of the informed consent, what data use limitations should be in

111
00:16:05,566 --> 00:16:14,432
place for all secondary research. So it’s the PIs with their IRBs that work together to define what is an appropriate use in the future based on the

112
00:16:14,433 --> 00:16:22,499
informed consent. If the informed consent had statements about only cancer research being done then the NIH wants to know about that

113
00:16:22,500 --> 00:16:31,500
because we want to honor those original intentions and understanding of the participants, and we use that in making all access decisions

114
00:16:31,500 --> 00:16:41,566
as data are handed out in the electronic and encrypted way to secondary investigators through dbGaP which is the name, as you all

115
00:16:41,566 --> 00:16:52,332
hopefully know, of the GWAS data repository at the NIH. Coming back to our shared responsibilities, we do have to rely on

116
00:16:52,333 --> 00:17:01,099
investigators and local institutions in, not only implementing, but in helping us to get to this point where we have the de-identified data coming into

117
00:17:01,100 --> 00:17:10,400
the NIH in an appropriate way. So, we ask the institutions to certify their approval for any submission. They stipulate in that certification that

118
00:17:10,400 --> 00:17:19,933
the data have all been collected and transferred in accord with applicable law, that an IRB has looked at the particulars of the informed consent

119
00:17:19,933 --> 00:17:28,599
process and documentation for a given study, and that they have used that information to establish the data use limitations that they are

120
00:17:28,600 --> 00:17:37,066
going to provide to the NIH. They also assure that the PI has done their part to remove all of the HIPAA identifiers and that they are retaining the

121
00:17:37,066 --> 00:17:47,266
key code to the data so that all data come into the NIH without any identifiers. Moving to data access then, there are, of course, also other

122
00:17:47,266 --> 00:17:56,899
issues to consider in terms of our goals and the goals were, again, immediate or at least rapid access to all qualified users so that there’s a

123
00:17:56,900 --> 00:18:04,333
maximum opportunity for scientific progress, but we did need to protect the confidentiality of the participants and respect their consent provisions,

124
00:18:04,333 --> 00:18:11,199
and also questions about recognizing the needs of the investigators for academic research through the sharing since they’ve invested a lot

125
00:18:11,200 --> 00:18:18,633
of time in building these cohorts and in doing the original data collection. And then, of course, we want to preserve all of the basic knowledge for

126
00:18:18,633 --> 00:18:27,666
downstream development getting to our freedom to operate principle for the basic knowledge coming out of these studies. So, the model that

127
00:18:27,666 --> 00:18:37,032
was developed was a two-tiered data access model with controlled access data through a Data Access Request and review procedure where

128
00:18:37,033 --> 00:18:44,699
people could get access to all of the genomic information, the phenotype information, as well as some pre-computed information which was

129
00:18:44,700 --> 00:18:54,500
done—some basic statistics information to put out there—to try and make sure that the data that were there…that no one was able to claim just a

130
00:18:54,500 --> 00:19:04,666
basic association from the data and tie that up in intellectual property claims. By making it available to everyone we hope to circumvent that. From

131
00:19:04,666 --> 00:19:12,666
the open access perspective we wanted to let people know—again, transparency—what was in the database. We wanted people to know as

132
00:19:12,666 --> 00:19:20,599
much about the individual studies as possible, so we included the protocol descriptions about the population, the variables that were included in the

133
00:19:20,600 --> 00:19:27,366
dataset, the study instruments that were used to collect the data and this was both for transparency to the public but also transparency

134
00:19:27,366 --> 00:19:34,032
to investigators who might want to use the data so that they would know before they ever requested the data whether it was going to help

135
00:19:34,033 --> 00:19:40,433
them. Were they going to be able to compare the blood pressures in their study with the blood pressures in the study that they did in their own

136
00:19:40,433 --> 00:19:49,266
lab? So, they didn’t ask for data that they didn’t need because, again, that’s a participant protection issue as well. The other part of that

137
00:19:49,266 --> 00:19:58,266
process is that anyone that is approved for access, their name and approved research use statement goes on to the public access site so

138
00:19:58,266 --> 00:20:05,166
you can also see everyone who has been approved to work with the dataset, and our hope was not only transparency there but also to

139
00:20:05,166 --> 00:20:12,432
promote collaboration so that you can see who else was asking questions of the same data that you were interested in and if they might be doing

140
00:20:12,433 --> 00:20:21,933
something you might want to work together on. The actual request process works through a Web-based system where an individual

141
00:20:21,933 --> 00:20:30,233
investigator goes into dbGaP and they identify the particular datasets, by their consent group and their data use limitations, that they would like to

142
00:20:30,233 --> 00:20:37,733
use. They submit to the NIH a Research Use Statement, which is basically an abstract describing the questions they would like to

143
00:20:37,733 --> 00:20:47,566
answer. They then submit that proposal to their own institution, which must co-sign the request before it comes to the NIH so that the institution is

144
00:20:47,566 --> 00:20:55,199
involved in the conduct of the research, again, trying to get to our concept of shared responsibilities. And in so doing, both the

145
00:20:55,200 --> 00:21:05,433
investigator and the institution agree to particular terms of use for the dataset and that the investigator will abide by a code of conduct for

146
00:21:05,433 --> 00:21:12,566
using the genomic data that they are going to receive access to. The terms of use and the code of conduct, not surprisingly, overlap rather

147
00:21:12,566 --> 00:21:21,032
closely in terms of what we’re trying to achieve. The highlights are some things that, one, we might expect but that we also wanted to be very

148
00:21:21,033 --> 00:21:27,799
clear about, and that is, that they won’t attempt to identify the individuals within the dataset, they won’t sell or transfer the data to anyone not

149
00:21:27,800 --> 00:21:35,633
approved on their application, that they will only use the data for the approved research use. So, they can’t investigate the data, find something

150
00:21:35,633 --> 00:21:44,599
interesting and decide on their own to go ahead and do something different because that may not be consistent with the data use limitations, and so

151
00:21:44,600 --> 00:21:54,133
they need to come back to the NIH and let the Data Access Committee, who oversees the individual access to each dataset, look at their

152
00:21:54,133 --> 00:22:02,499
proposed questions and make sure that it’s consistent with what the data use limitations are before they go and do their research. One of the

153
00:22:02,500 --> 00:22:09,966
roles of the Data Access Committee is to monitor conduct and progress on research through annual reports just as another way to check and

154
00:22:09,966 --> 00:22:17,832
make sure investigators really do limit themselves to what they said they are going to do and what they’ve been approved to do. So, the Data

155
00:22:17,833 --> 00:22:26,099
Access Committee, again, reviewing annual reports is one thing that they do once they…and everything is done in the context of the data use

156
00:22:26,100 --> 00:22:37,366
limitations. Just to know what it looks like, I’m not sure if everyone knows how the DACs work, but as I’ve already mentioned there’s a process

157
00:22:37,366 --> 00:22:44,666
where the investigator’s request goes to the institution—their home institution—before it ever comes into NCBI for dbGaP. Once that does come

158
00:22:44,666 --> 00:22:52,366
in there, there’s an initial staff review to make sure everything is in place and then the DACs take a look at the proposed Research Use

159
00:22:52,366 --> 00:23:00,366
Statement relative to the data use limitations and make an initial decision: either “yes” the uses are consistent and it may be approved, or “no”

160
00:23:00,366 --> 00:23:10,399
there’s something inconsistent or inappropriate about how this investigator claims to do the work and the request is disapproved. If it’s approved,

161
00:23:10,400 --> 00:23:18,966
notification goes to dbGaP and the staff are notified with their passwords and instructions for downloading the encrypted data, and likewise, if

162
00:23:18,966 --> 00:23:27,999
it’s disapproved the requester is notified by the DAC and often given a reason as to why it was disapproved. With this rationale, if this was a

163
00:23:28,000 --> 00:23:35,133
simple mistake or they didn’t understand something they can resubmit the request after making revisions. Sometimes it’s not as obvious

164
00:23:35,133 --> 00:23:41,966
what they’re planning to do in the research statement and so the DAC, the Data Access Committee, will have questions and so there can

165
00:23:41,966 --> 00:23:49,932
be some back-and-forth between the DAC Chairs or the DAC staff and the individual investigators

167
00:23:58,266 --> 00:24:02,432
to make sure again that the DACs and the investigators are on the same page before the DAC makes its review. Then the DACs are also

168
00:24:02,433 --> 00:24:08,833
responsible—and again this speaks to transparency—for being timely in their work. And so, there is a semi-annual reporting process for

169
00:24:08,833 --> 00:24:15,433
all of the Data Access Committees to report to the NIH how they are doing the work. What is their…and I’m going to show you some of the

170
00:24:15,433 --> 00:24:21,833
data that comes out of that later. How many requests are they approving? How many are they disapproving? How long is it taking them to do

171
00:24:21,833 --> 00:24:29,799
this? So, we can look at the system and make sure it’s efficient. If we want to think about how this access process is working, who is asking

172
00:24:29,800 --> 00:24:37,433
for the data, and what are they using it for, the requestors are coming from across the research community, so we’re getting investigators from

173
00:24:37,433 --> 00:24:45,033
the private sector, from academic sectors, from non-profits, and from all over the world, which is exactly what we wanted to happen. We really

174
00:24:45,033 --> 00:24:52,766
wanted this to be a community resource that anyone could use if they had an appropriate research question for the data they were

175
00:24:52,766 --> 00:25:01,266
requesting. In terms of the kinds of work they’re doing, most of the work is looking at the etiology of the particular disease or related conditions that

176
00:25:01,266 --> 00:25:08,499
the dataset was originally collected to study, but there are also a lot of methodology questions that are being asked: trying to learn how to work

177
00:25:08,500 --> 00:25:17,466
with these kinds of data, how to work with very large datasets, etc. Getting to some of the statistics I thought it might be interesting for the

178
00:25:17,466 --> 00:25:24,666
group just to look at how much activity there has been. These numbers aren’t perfect. They don’t add up, so don’t worry about that when you try

179
00:25:24,666 --> 00:25:32,932
to look at it. But if we look at this in terms of project requests…because you can request multiple datasets within any single project. So,

180
00:25:32,933 --> 00:25:46,299
since dbGaP opened through the middle of this week when we pulled the data, there have been nearly 2,500 research projects approved or

181
00:25:46,300 --> 00:25:53,600
nearly 2,100, I guess…nearly 2,500 that have been submitted. About 2,100 of those have been approved, about 500 rejected. Again, these

182
00:25:53,600 --> 00:26:03,300
numbers don’t add up for reasons that are technical that dbGaP sometimes has to remind me about, and then also a significant number of

183
00:26:03,300 --> 00:26:11,066
revisions requested, a lot of back-and-forth between the Data Access Committees and the investigators. If we look at this in terms of

184
00:26:11,066 --> 00:26:20,599
consent groups, because again, these are set up with data use limitations and in any given dataset 80% of the study population may have said, “My

185
00:26:20,600 --> 00:26:28,100
data can be used for any general research that you’d like to conduct,” while 20% may have said, “I only want my data used for diabetes.” And so,

186
00:26:28,100 --> 00:26:35,600
data are requested by a consent group, and if we look at these numbers, they’re of course that much larger because there are a lot more

187
00:26:35,600 --> 00:26:41,133
transactions that are taking place. But clearly, what’s important from here is that these data are being used, many investigators are getting

188
00:26:41,133 --> 00:26:49,133
access to the data and asking questions, and a lot of publications have resulted. Unfortunately, it’s a lot harder for us to track the publications

189
00:26:49,133 --> 00:27:00,633
coming out of it, but we are starting to get a little bit of data on that front. If we look at how this maps out to the various datasets that we have as

190
00:27:00,633 --> 00:27:08,399
they are at least identified or categorized according to the Data Access Committee that they belong to, we can see that it varies based on the

191
00:27:08,400 --> 00:27:18,633
Data Access Committee. Some of this has to do with the number of datasets that they have, some of it has to do with the size of those datasets that

192
00:27:18,633 --> 00:27:25,766
they have. So, you see that NHLBI is very large; they have a lot of datasets but they also have Framingham, which is incredibly popular for

193
00:27:25,766 --> 00:27:35,599
everyone to look at. The Cancer Genome Atlas is another one; it actually has the highest usage of information out of the dataset. So, this is really

194
00:27:35,600 --> 00:27:45,466
more just to show the breadth of interest and the number of people looking at particular datasets. Another key point within the policy, and again,

195
00:27:45,466 --> 00:27:55,066
going back to trying to respect participant wishes and our principles of respect for participants that we use in putting together the policy, is to note

196
00:27:55,066 --> 00:28:02,732
that from the beginning it was expected that there would be some datasets where it simply wasn’t appropriate to share the data through dbGaP in

197
00:28:02,733 --> 00:28:10,833
this broad way as we wanted to do to build the community resource. So, it was conceived from the very beginning that exceptions to the data

198
00:28:10,833 --> 00:28:19,833
sharing expectation would be possible. This is something that, because different institutes have different program priorities in putting together

199
00:28:19,833 --> 00:28:28,399
their initiatives, it needs to be agreed upon with the program staff prior to funding so that everyone is on the same page and there are no

200
00:28:28,400 --> 00:28:40,300
surprises down the road, where an institute whose primary intent is about data sharing, or about sharing a particular kind of data within their dataset, doesn’t

201
00:28:40,300 --> 00:28:48,533
find out until the point that it’s time to put the data in dbGaP that the IRB said “no” because the consent explicitly says there can’t be any

202
00:28:48,533 --> 00:28:55,799
sharing. That’s a perfectly appropriate thing for the consent to have said and the NIH wants to honor that. The NIH just wants to know what

203
00:28:55,800 --> 00:29:05,600
those restrictions are before anything is funded. We have had several exceptions that have been granted and these have, not surprisingly, been

204
00:29:05,600 --> 00:29:13,333
due to limited consent language where the dataset was still felt to be very scientifically important and so the institute felt it was within the

205
00:29:13,333 --> 00:29:22,833
interest of their program priorities to fund the study but to have an alternative data sharing plan besides working through dbGaP. There

206
00:29:22,833 --> 00:29:28,299
have also been a few cases where there have been legal restrictions so that data couldn’t leave—a particular kind of data couldn’t leave—a given

207
00:29:28,300 --> 00:29:33,700
state or data couldn’t leave a given country, and also some cases where the geographic representation of the study population has been

208
00:29:33,700 --> 00:29:42,366
felt by the institution to be too localized to be appropriate to share the data. And so institutes, again because of the scientific merits of a given

209
00:29:42,366 --> 00:29:49,766
study, have wanted to fund that and have granted an exception to the data sharing expectation. We are in the process of developing

210
00:29:49,766 --> 00:29:59,066
points to consider on how to write your data sharing plans and how to write requests for exceptions so that institutions understand more

211
00:29:59,066 --> 00:30:08,566
what the NIH is looking for and even so that our own staff understand more of what the expectations of the policy are with regard to

212
00:30:08,566 --> 00:30:20,666
granting exceptions. Again, coming back to transparency and looking at the concept of governance, which I mentioned earlier as

213
00:30:20,666 --> 00:30:29,832
something that we’re seeing more and more of in the oversight of data access, the GWAS policy also, from the very beginning, set up a system of

214
00:30:29,833 --> 00:30:38,333
governance committees so that the policy and the implementation of the policy could be fluid over time and could be responsive to changes either in

215
00:30:38,333 --> 00:30:46,833
technology, changes in our scientific understanding, or changes also in the public or societal conversations that were going on around

216
00:30:46,833 --> 00:30:56,499
this science. And so in terms of accountability, the governance system and all of the conduct of the GWAS policy and the infrastructure that was

217
00:30:56,500 --> 00:31:05,333
created to oversee it is directly responsible to the NIH Director. He is immediately informed by a Senior Oversight Committee which consists of

218
00:31:05,333 --> 00:31:16,799
several IC directors as well as some of his most senior staff, and they meet at least quarterly to oversee how the policy implementation is

219
00:31:16,800 --> 00:31:27,300
happening to make decisions on policy questions that are being brought before the committee, or they meet whenever there’s some urgent issue

220
00:31:27,300 --> 00:31:34,700
which needs to be attended to, to make sure that the policy can be as responsive as possible. The Senior Oversight Committee is staffed in some

221
00:31:34,700 --> 00:31:42,566
ways by two steering committees which are constituted with senior staff from across the NIH. The Participant Protection and Data Management

222
00:31:42,566 --> 00:31:49,366
Steering Committee is made up of all of the Chairs of the Data Access Committees, and they come together monthly to meet and talk about their

223
00:31:49,366 --> 00:31:57,866
experiences, develop programs and procedures so that they operate consistently across the agency, and similarly, the Technical Standards

224
00:31:57,866 --> 00:32:04,866
and Data Submission Steering Committee is made up of the lead scientific staff for each institute or center that is conducting a GWAS study or a

225
00:32:04,866 --> 00:32:14,566
genomic data sharing program, and they also come together once a month to talk about issues around scientific quality of data that are being

226
00:32:14,566 --> 00:32:20,666
submitted, policy questions around data submission, again, with the aim of sharing experience and building common practices so

227
00:32:20,666 --> 00:32:28,332
that we can be as transparent and consistent as possible. Something to keep in mind as we go forward is always what’s happening in the public

228
00:32:28,333 --> 00:32:36,333
conversation. There have been many things of note lately that are influencing what the public thinks about genomics research and also what

229
00:32:36,333 --> 00:32:47,266
they think about research with their data or their samples going forward, and of course that’s important for research involving a community

230
00:32:47,266 --> 00:32:54,866
resource and a data bank like dbGaP where we are holding genomic information for a very long time. This one has several headlines from the

231
00:32:54,866 --> 00:33:04,899
very well-known Havasupai case, the book regarding the story of Henrietta Lacks and the creation of HeLa cells and recently, too, all of the

232
00:33:04,900 --> 00:33:12,800
information and court cases around newborn screening and the holding of blood spots and what the public attitudes are about the ownership

233
00:33:12,800 --> 00:33:20,333
of the information that is contained within those blood spots and how that’s changing the conversation. Again, this speaks to the need for

234
00:33:20,333 --> 00:33:32,699
transparency and the need for the policy conversation to be ongoing and not static over time. We are hoping very soon to release an

235
00:33:32,700 --> 00:33:42,633
extended data sharing policy that would include whole genome sequence information. It would include epigenomic information, expression level

236
00:33:42,633 --> 00:33:49,433
data. Obviously, in the five years since the policy has been in development we’ve moved on from the basic GWAS study, but there are all the

237
00:33:49,433 --> 00:33:56,633
questions that we need to ask ourselves in putting together this policy that we asked ourselves the first time for these new data types

238
00:33:56,633 --> 00:34:05,299
and also looking at the study designs for these new data types that are different than GWAS and there are some different issues with regard

239
00:34:05,300 --> 00:34:11,700
to data quality and data submission that need to be addressed. So, some of the questions that we’ve been considering are, again, what data

240
00:34:11,700 --> 00:34:20,400
and project types should be included? Are these only community resource projects at the whole level or is it every R01 and where do you draw

241
00:34:20,400 --> 00:34:31,400
the line in between? What is the data submission process and timeline going to look like for this broader array of genomic data types? Are there

242
00:34:31,400 --> 00:34:36,500
options now for Open Access data release of genomic information? At the moment, no genomic data are available or very, very minimal genomic

243
00:34:36,500 --> 00:34:44,200
data are available through open access. What should the data release process and timeline be? What have we learned from our GWAS data

244
00:34:44,200 --> 00:34:52,966
release policy that we should integrate into what we’re doing now? Again, all of the participant protection issues that we considered

245
00:34:52,966 --> 00:35:00,866
before…informed consent…are the questions different when we’re talking about whole genome information relative to GWAS information? And I

246
00:35:00,866 --> 00:35:08,232
know you’re going to consider some of this in greater depth tomorrow. And as I already mentioned, what have we learned from GWAS

247
00:35:08,233 --> 00:35:19,499
and the policy implementation and infrastructure development for that purpose that we can integrate into what we’re going to do in the future

248
00:35:19,500 --> 00:35:26,233
to make it better, to make it more streamlined, to make it more responsive to investigator needs, to participant concerns, etc.? And of course, how

249
00:35:26,233 --> 00:35:41,499
are we going to implement it and oversee it? And with that, I will stop and take any questions. [applause] Everyone is just ready to be done for

250
00:35:41,500 --> 00:35:46,666
the day, which I completely understand. It’s been a long day from looking at your agenda.

251
00:35:46,666 --> 00:35:54,066
JEFFREY KOPP: Maybe you said it and I missed it. Are whole exomes at this point required to be placed in dbGaP? You had a slide where it’s

252
00:35:54,066 --> 00:35:57,966
under discussion.

253
00:35:57,966 --> 00:36:02,832
LAURA LYMAN-RODRIGUEZ: That depends on…yes. It’s under discussion at an agency level, so it’s one of those things where you really need to

254
00:36:02,833 --> 00:36:06,766
talk to your program director because some ICs are expecting it now, and it’s following the basic GWAS data sharing policy, but not every institute or center.

255
00:36:06,766 --> 00:36:14,899
JEFFREY KOPP: So as consents are written, they should include the appropriate language to let…

256
00:36:14,900 --> 00:36:20,000
LAURA LYMAN-RODRIGUEZ: They should include the appropriate language because that is the direction that we expect it to go, but it’s still

257
00:36:20,000 --> 00:36:25,700
under discussion. FEMALE: Thanks for a great presentation. It was

258
00:36:25,700 --> 00:36:36,266
interesting to see the data from the experience with GWAS and dbGaP to date. Then you listed some of the data per consent group and I want to

259
00:36:36,266 --> 00:36:47,466
talk this through to make sure my interpretation is correct. You talked about a significant number—6,500 plus—of projects that were submitted by

260
00:36:47,466 --> 00:36:57,832
consent group, but 1,755 that were rejected. Does that mean that 27% of the requests that came in from scientists to access data by

261
00:36:57,833 --> 00:37:06,599
consent group were rejected because of data use limitations that were described and recorded from the consent?

262
00:37:06,600 --> 00:37:11,933
LAURA LYMAN-RODRIGUEZ: It means that about that percentage were rejected on their first submission, and so sometimes those came back

263
00:37:11,933 --> 00:37:22,699
in and were amended, because sometimes investigators are not super-careful when they’re checking off boxes for the consent groups for

264
00:37:22,700 --> 00:37:27,866
the data that they want and this is one of the arguments that we still have for a manual data access process, because they will say very

265
00:37:27,866 --> 00:37:35,266
specifically in their abstract that they’re going to do research on cancer and then they will request a diabetes-only dataset. And so, the Data Access

266
00:37:35,266 --> 00:37:43,199
Committee will approve those datasets that are fine for cancer and disapprove those consent groups that have that restriction on it.

267
00:37:43,200 --> 00:37:50,666
FEMALE: So, that’s a good protection but also may be a lesson about the way in which prospective consents are constructed and

268
00:37:50,666 --> 00:37:58,699
whether they should be narrowly tailored versus broader and how that might affect the science.

269
00:37:58,700 --> 00:38:03,000
LAURA LYMAN-RODRIGUEZ: Yes, it does speak to that issue.

270
00:38:03,000 --> 00:38:05,566
FEMALE: Okay. Thanks. JOAN BAILEY-WILSON: I won’t be here

271
00:38:05,566 --> 00:38:13,032
tomorrow but some of the things that I’ve been discussing with various institute program staff about how we are going to do whole exome

272
00:38:13,033 --> 00:38:19,999
sequencing have been really interesting and pertinent to what you were talking about, because what we’ve been doing so far has

273
00:38:20,000 --> 00:38:32,100
mostly been in families and, of course, the GWAS policy never applied to linkage studies. It didn’t apply really to family studies because there were

274
00:38:32,100 --> 00:38:45,700
considered to be more risks with family studies because these people that gave you their DNA in the family have consented, but perhaps there are

275
00:38:45,700 --> 00:38:52,900
several other important affecteds in the family that you know are affected but they refuse to give you DNA and they haven’t specifically

276
00:38:52,900 --> 00:39:06,566
consented. But if I sequence their relatives I now know quite a lot about their genome, etc. So, there are all kinds of interesting policy and

277
00:39:06,566 --> 00:39:17,432
privacy and IRB questions that, as NIH is developing this policy they are thinking about and they’re talking about and trying to figure out

278
00:39:17,433 --> 00:39:30,566
where do we draw the line and how do we protect the human subjects but still get the data shared. For example, NICHD is doing a cleft

279
00:39:30,566 --> 00:39:40,966
sequencing study that some of my samples are part of and these come from Syria. They are extended pedigrees with many affected

280
00:39:40,966 --> 00:39:52,532
individuals, and when we set up this study in collaboration we had to really be careful in our consents for the Syrian government to agree to

281
00:39:52,533 --> 00:40:05,733
work with us. And so, NICHD has really worked with us to make it so that we can deposit the data but have protections for the access that satisfied

282
00:40:05,733 --> 00:40:17,533
the consents—the Syrian investigators—but more importantly, the Syrian ethics boards and the Syrian government. It was quite a process,

283
00:40:17,533 --> 00:40:25,633
so those things are going on, but those are kinds of things that are really going to have to be thought about and figured out. So, I’m sure you

284
00:40:25,633 --> 00:40:29,566
guys will enjoy discussing it tomorrow. Since I won’t be here I thought I’d…

285
00:40:29,566 --> 00:40:35,632
LAURA LYMAN-RODRIGUEZ: Preview. FEMALE: Laura, thank you very much. It was

286
00:40:35,633 --> 00:40:41,066
really interesting to see some of the new data. I have two questions for you. One, I hope, is an easy question; maybe they’re both really easy

287
00:40:41,066 --> 00:40:46,032
questions. LAURA LYMAN-RODRIGUEZ: I hope they’re both

288
00:40:46,033 --> 00:40:47,599
easy. FEMALE: It’s a question that Gail Jarvik asked

289
00:40:47,600 --> 00:40:54,500
Mike Fiello when he gave a presentation to the EMERGE Steering Committee earlier this week and it had to do with the implications of the proposed

290
00:40:54,500 --> 00:41:01,733
changes to the Common Rule and whether, in particular, a proposal in the advance notice of proposed rule-making, which would make all

291
00:41:01,733 --> 00:41:08,833
research with biospecimens require informed consent, would have any implications for the data that are already shared

292
00:41:08,833 --> 00:41:17,466
in dbGaP. My assumption is that’s a “no” but I wasn’t sure and Mike couldn’t answer her question, so I now ask you.

293
00:41:17,466 --> 00:41:27,166
LAURA LYMAN-RODRIGUEZ: So, in the ANPRM there is also a question about whether or not prior studies should be grandfathered into this

294
00:41:27,166 --> 00:41:35,899
new expectation for informed consent on any study. So, I think whether or not that particular clause goes forward in any notice or any final

295
00:41:35,900 --> 00:41:45,233
rule will determine what happens to the studies that are already in dbGaP where there isn’t explicit consent or whatever it may be for this

296
00:41:45,233 --> 00:41:49,899
process. So, I can’t…there is no answer to that question yet.

297
00:41:49,900 --> 00:41:57,066
FEMALE: Okay. All right, so maybe that was easy. The second one is it’s really nice to see the numbers on how many data access requests

298
00:41:57,066 --> 00:42:04,932
were refused because, as far as I was aware, those data were not readily available until quite recently. Another issue that I know the bioethics

299
00:42:04,933 --> 00:42:13,199
community is sometimes concerned about is the degree to which there have been frank research misuses of data that have been deposited in

300
00:42:13,200 --> 00:42:25,200
dbGaP, and there was a poster at ASHG last fall on the GAIN DAC, giving some limited information about examples of

301
00:42:25,200 --> 00:42:35,100
research misuses. Do you have any numbers on that either for GAIN or for any of the other data access committees?

302
00:42:35,100 --> 00:42:41,666
LAURA LYMAN-RODRIGUEZ: I was looking…I do have a slide on that that I didn’t include for this particular talk because I knew I was already

303
00:42:41,666 --> 00:42:57,966
going to be over time. Let’s see if it’s here. No, it’s not here. So, at the moment I did look at our information this morning and we have 14 data

304
00:42:57,966 --> 00:43:06,732
management incidents that have taken place. We can put that over the total of studies that have been submitted, so the number is fairly small,

305
00:43:06,733 --> 00:43:20,033
relatively speaking, and those have ranged in infraction, I guess. There have been a variety of infractions that have been represented. Some

306
00:43:20,033 --> 00:43:28,499
have been because of computer glitches where something went out either from dbGaP or from the institution to us that should not have gone out,

307
00:43:28,500 --> 00:43:39,600
and so we’ve corrected those. We’ve had three issues where investigator misconduct was determined to be the issue and those were…two

308
00:43:39,600 --> 00:43:49,033
were publication issues prior to an embargo end date, and so the sanction that was issued was that they lost access to dbGaP and they had to

309
00:43:49,033 --> 00:43:57,233
destroy all of the data and cease all activity on anything that they already had and were currently working on and we worked with the

310
00:43:57,233 --> 00:44:06,066
institutions, again, shared responsibility. So, whenever there is a data management incident, we contact the dean and investigators once we

311
00:44:06,066 --> 00:44:11,966
know that there’s been an issue. Investigators are sort of cc’d rather than being the primary party we are working with, because we really are

312
00:44:11,966 --> 00:44:19,566
working with the institution, and ask the institution to tell us what they’re going to do to prevent it from happening again. So again, the investigators

313
00:44:19,566 --> 00:44:25,532
in those two cases lost access for six months, and that included everyone on their research team and they weren’t able to do anything to

314
00:44:25,533 --> 00:44:32,266
continue their work on any of their dbGaP datasets, whether it was involved in that particular project or not. The other one was

315
00:44:32,266 --> 00:44:37,432
someone who asked a different question than what was in their research use statement, and through the annual report, we realized this. It did

316
00:44:37,433 --> 00:44:49,533
not violate the use parameters, so we didn’t have any consent violations in that case, but it wasn’t what they said they were going to do, and so

317
00:44:49,533 --> 00:44:58,133
they were sanctioned for three months, I believe, from dbGaP and had to cease use of all of the

318
00:44:58,133 --> 00:45:05,033
datasets that they had in that case. So we are, again, tracking it and putting it out there and we are looking at publishing some global analysis, not

319
00:45:05,033 --> 00:45:11,799
just the GAIN paper. I think NHLBI also has a paper in development to talk about their DACs’ experience, but we are hoping to put together

320
00:45:11,800 --> 00:45:15,666
something across the board to look at the entire agency’s perspective.

321
00:45:15,666 --> 00:45:21,199
FEMALE: That would be really wonderful for those of us who talk to research participants and who have questions about these sorts of things.



