Whole Genome Approaches to Complex Kidney Disease
February 11-12, 2012 Conference Videos

Data Sharing: Access and Confidentiality
Laura Rodriguez, NHGRI

Video Transcript

00:00:00,833 --> 00:00:09,399
SARA HULL: Hi, my name is Sara Hull. I have a joint appointment with the National Human Genome Research Institute and with NIH’s Clinical

00:00:09,400 --> 00:00:20,733
Center’s Department of Bioethics, and it’s a real privilege to have worked with Dr. Kopp and Dr. Shaw and the rest of the planning committee as

00:00:20,733 --> 00:00:31,566
part of this meeting. We are now going to turn our focus to pay explicit attention to the ethical, legal, and social implications of the very important

00:00:31,566 --> 00:00:42,499
research that we’ve been discussing today, and it’s heartening to hear just how much of the discourse has actually, at least implicitly, focused

00:00:42,500 --> 00:00:54,800
on these issues. This morning we heard quite a bit about the need for collaboration, transparency, sharing and that’s a nice transition into the

00:00:54,800 --> 00:01:02,133
presentation that we’re going to hear this evening. I do want to point out there is one talk in this area this evening and then three more

00:01:02,133 --> 00:01:09,199
excellent talks tomorrow morning and then the breakout session as well, so we are going to carry this through. So, it’s my pleasure to

00:01:09,200 --> 00:01:19,133
introduce Dr. Laura Lyman-Rodriguez who directs our Office of Policy, Communications and Education at NHGRI. She oversees the

00:01:19,133 --> 00:01:27,933
development of the institute’s policy positions on the ethical, legal, and social implications of human genome research. She did her doctorate in cell

00:01:27,933 --> 00:01:35,999
biology at the Baylor College of Medicine and then transitioned into policy work as a congressional science fellow, and then at the Institute of

00:01:36,000 --> 00:01:48,433
Medicine before coming to the NIH in 2002. And now, I’m quoting from our boss, Eric Green, who is the Director of the institute. He points out that

00:01:48,433 --> 00:01:56,099
“Laura’s science background provides a strong foundation for understanding the implications of genomic advances. Her extensive experience in

00:01:56,100 --> 00:02:04,666
science policy allows her to provide key leadership in a rapidly changing and complex area of research,” and her credibility—and this is

00:02:04,666 --> 00:02:16,366
me talking, which I know very well, personally—serves to galvanize widespread respect for her by staff within and outside NHGRI. Laura was

00:02:16,366 --> 00:02:26,266
integrally involved with the development of the NIH’s GWAS data sharing policy as well as what’s going to be happening next. I couldn’t

00:02:26,266 --> 00:02:38,732
have come up with anybody more qualified to talk to us today about data sharing, access, and confidentiality.

00:02:38,733 --> 00:02:43,666
LAURA LYMAN-RODRIGUEZ: Okay. Thank you and I think we should all just go home now with that lovely introduction from Sara so that I don’t

00:02:43,666 --> 00:02:52,699
do anything to counteract that within the next few minutes. I am happy to be here to talk to everyone today and to begin the exploration that

00:02:52,700 --> 00:03:01,066
you all will have tomorrow through what does look like a great series of talks in the breakout groups. To think about the activities around data

00:03:01,066 --> 00:03:09,599
sharing, the principles underlying how to data share in a responsible way that is respectful of the human participants that give their time and

00:03:09,600 --> 00:03:18,933
their samples and information to our work and in a way that maintains the public trust going forward. So, I’m going to talk a lot about the

00:03:18,933 --> 00:03:25,499
GWAS policy, but really try to think about that, too, in terms of what are the questions to ask ourselves as we go forward in science as our

00:03:25,500 --> 00:03:34,633
science changes and as we just try to work together with the various variables, as I’ve called them here; a different kind of variant than what

00:03:34,633 --> 00:03:40,066
you’ve been talking about for most of the day, but definitely things that we don’t always know exactly what they mean as we’re working on

00:03:40,066 --> 00:03:51,066
them. The principle of data sharing is not new. Certainly, for NIH, it is something that the agency has believed strongly in for a long time, has

00:03:51,066 --> 00:04:01,299
implemented through various policies for many years, and the quote that is on here—that’s on this slide—is from the 2003 Data Sharing Policy. It

00:04:01,300 --> 00:04:10,900
emphasizes here that this really is trying to move research forward in a way that attains the greatest public benefit from the federal

00:04:10,900 --> 00:04:20,966
investment, from the investment of the, again, research participants who join with us in the pursuit of the research and the questions to bring

00:04:20,966 --> 00:04:29,099
about better health. When we’re trying to think about any research program there are always, regardless of our perspective, there are always

00:04:29,100 --> 00:04:37,866
lots of different elements which need to come together. The ones that I try to look at when I’m thinking from a policy perspective is, first of all,

00:04:37,866 --> 00:04:46,966
what is the scientific design, what are the scientific aims that we’re trying to accomplish in the research project, and the program priorities

00:04:46,966 --> 00:04:54,399
specifically separate from the technical issues of how the science is going to be done. Because if you can’t achieve those scientific aims, it’s not

00:04:54,400 --> 00:05:01,466
going to matter what policies you might create because it’s not going to be a successful project. You do have to think about the practicalities of

00:05:01,466 --> 00:05:11,866
existing policies, practices, and procedures that may be in place that you’re going to have to conform to as you pursue the research, and also

00:05:11,866 --> 00:05:19,399
the guiding principles again. What do you want to achieve and not just from a scientific perspective always but from an ethical perspective? How do

00:05:19,400 --> 00:05:28,000
you want to conduct the research? What is really important to the aims of the research and to the aims of the research enterprise, more globally

00:05:28,000 --> 00:05:35,200
sometimes? And then, getting down to brass tacks, how are you going to make it happen? What are the policy and ethics that you are going

00:05:35,200 --> 00:05:41,933
to build around your particular project or initiative? How are you going to maintain transparency so that there’s credibility in what

00:05:41,933 --> 00:05:49,033
you’re doing, both with the research community and the stakeholders of, again, any particular initiative, but also in terms of thinking broader at

00:05:49,033 --> 00:05:57,733
the more public trust level and sustaining that trust that we have? And then also from a practical perspective, how are you going to

00:05:57,733 --> 00:06:05,566
share the responsibilities of accomplishing all of these different aims and objectives with those who are doing it—because we do operate in a

00:06:05,566 --> 00:06:13,766
distributed model and so we as funders are working with everyone’s institutions and with the investigators—and so how do we work together

00:06:13,766 --> 00:06:24,132
to make things happen? So, what are the questions? Again, thinking about this and things that I think it sounds like you’ve already been

00:06:24,133 --> 00:06:28,799
talking about today, what are the questions to ask ourselves as we go forward and try to put this together? For the things that I was asked to talk

00:06:28,800 --> 00:06:35,566
about, we’re thinking about access and confidentiality, so jumping right to the heart of some of these issues around genomic research

00:06:35,566 --> 00:06:47,299
is: can we calculate the risks to privacy? What might it be? What does it mean exactly? How do we calibrate, too, the protections and oversight

00:06:47,300 --> 00:06:56,333
that we are going to put into place in a risk-based model, which is what we try and put forward because there is always risk? How are we going

00:06:56,333 --> 00:07:04,666
to manage the risk? Can we quantify it? And again, an important question, particularly for me in an area of research that I’m very interested in is:

00:07:04,666 --> 00:07:14,532
what are the participant perceptions of risk and their tolerance for this risk if we can articulate what it is? How do we respect the wishes of the

00:07:14,533 --> 00:07:22,266
participants? As we go forward, research is a never-ending process; we always have another question. How do we respect what that individual

00:07:22,266 --> 00:07:32,099
in a particular study wanted to have done by participating in the research by donating their time and their information and specimens? And then,

00:07:32,100 --> 00:07:38,000
of course, since science is ever moving and technology with it, how do we reconcile what the current research might be and the current

00:07:38,000 --> 00:07:47,900
technology and parameters with what past expectations may have been for science, for the participants, for the public, etc.? So, we can’t

00:07:47,900 --> 00:07:55,666
always think just about going forward, but how do we achieve our principles also for those studies that may have been done before but

00:07:55,666 --> 00:08:07,532
where we’d like to continue to learn from them if we have collections of specimens again, now that we can ask new questions? So, just to start

00:08:07,533 --> 00:08:15,399
with identifiability again, and the questions of privacy, this is now—I realized as I was looking at this—almost an eight-year-old slide, so it’s

00:08:15,400 --> 00:08:23,133
really outdated and that’s okay. This slide is looking actually at single nucleotide polymorphisms and trying to first quantify how

00:08:23,133 --> 00:08:29,666
much is it that you have to have before you have a unique pattern. And what’s important from my point now is that, of course, when we’re talking

00:08:29,666 --> 00:08:35,266
about whole genomics, we have unique patterns. We had it when we were looking at GWAS studies and as we move into whole genome

00:08:35,266 --> 00:08:42,499
sequence information, of course we do as well. And so, it’s unique but again, how do we calculate the risk, how do we calculate the

00:08:42,500 --> 00:08:52,400
tolerances of participants. How do we manage those risks? It really is about balance, as with everything. Again, this figure is from a paper that

00:08:52,400 --> 00:09:02,566
is now almost five years old from Francis Collins and Bill Lowrance, trying to ask the questions of: what does it mean to be de-identified in terms of

00:09:02,566 --> 00:09:10,666
genomic data? How do you manage it? What systems can you put in place to look at the risks, to be able to share the data and achieve the

00:09:10,666 --> 00:09:20,032
scientific aims while respecting the participant protection issues? My main reason for showing this slide is that it was complicated then and it’s

00:09:20,033 --> 00:09:29,333
complicated now. It’s not really important what the different boxes on this slide say; there are many different ways to make data de-identified or

00:09:29,333 --> 00:09:38,433
render it identifiable. There are many different ways to backtrack it if you wanted to try and backtrack it to a particular individual and have the

00:09:38,433 --> 00:09:46,466
means and enough information to do so, and we’re still at a place where we have uncertain and debatable risk, and we’re still at a place

00:09:46,466 --> 00:09:53,499
where we don’t know exactly what the participants think about this debatable risk. But what we do know is that we need to balance

00:09:53,500 --> 00:10:05,966
these risks, these unknown and ill-formed risks in some places, or at least it’s very subjective and so it varies from person to person what that risk

00:10:05,966 --> 00:10:15,032
is and what their tolerance is. So, how do we balance it on a large scale across the population to go forward? Those questions are exactly the

00:10:15,033 --> 00:10:25,799
same today as they were in 2007. Mixed into all of these different questions is the different way that we’re doing our science today. When our

00:10:25,800 --> 00:10:34,333
current participant protection system was designed, most everything was done through single projects with very specific consents at a

00:10:34,333 --> 00:10:42,366
single institution, many times, and for a defined timeline or a defined period of time, and that’s just not what’s happening anymore. We have many

00:10:42,366 --> 00:10:50,632
projects that are being done on the same datasets. We still sometimes have very specific consents that we need to deal with but we do, in

00:10:50,633 --> 00:10:58,399
some cases, have broader consents because again, with time, there’s a pendulum that seems to go with what’s believed to be appropriate

00:10:58,400 --> 00:11:05,233
consent. We also have the entrance of governance models coming into how we’re sharing data and how we’re thinking about

00:11:05,233 --> 00:11:14,433
overseeing the sharing of data. But certainly now, we’re at a point where we have undefined timelines going forward. These data exist, once

00:11:14,433 --> 00:11:23,499
the specimens are there you can generate genomic information and that genomic information is not degradable and so it’s going to stay there

00:11:23,500 --> 00:11:32,400
and we’ll be able to ask questions of it for a very long time. And so, how does that change the nature of how you design the oversight systems

00:11:32,400 --> 00:11:40,833
to protect the interests of the participants of those individuals who originally were in the studies? Oftentimes I feel like and what I’ve heard

00:11:40,866 --> 00:11:48,199
other people feel like is you just don’t know which way to go. You feel like every which way you turn there is another question, another problem,

00:11:48,200 --> 00:11:56,800
another roadblock. That’s just not what we want and not what we are trying to accomplish, so we need to keep looking for ways to untangle this

00:11:56,800 --> 00:12:06,866
mess and not end up running into the brick wall on the other side. So of course, that brings me to GWAS. We won’t ask whether or not there is a

00:12:06,866 --> 00:12:17,366
brick wall on the other side of this right now but I think all of these questions that I’ve talked about so far are what we were thinking about in trying

00:12:17,366 --> 00:12:22,066
to put together the GWAS policy originally, and even more so, what we have been thinking about as we’ve gone forward the last four years trying

00:12:22,066 --> 00:12:29,366
to implement the policy, trying to maintain the currency of the policy and its oversight mechanisms. The guiding principle that we

00:12:29,366 --> 00:12:37,199
established when we were trying to build this was, again, building directly from the NIH statements on data sharing before, where we

00:12:37,200 --> 00:12:45,666
wanted to attain maximum public benefit by making these data available to as many investigators for as many appropriate research

00:12:45,666 --> 00:12:58,066
questions as was possible in a timely way. We wanted to do this in a way that was true to our aims which was, of course, respect for the

00:12:58,066 --> 00:13:05,232
participants involved in the primary studies that promoted data sharing and that also enabled a freedom to operate with some of the very basic

00:13:05,233 --> 00:13:12,566
information that would be found through GWAS studies, so that there weren’t…these data or these initial association findings would never be tied up

00:13:12,566 --> 00:13:20,099
in intellectual property challenges, and again, could be available for innovation by as many different people as had good and appropriate

00:13:20,100 --> 00:13:28,833
ideas. You all have been talking about the scientific aspects underlying GWAS and what you wanted to attain, so there is really no need to

00:13:28,833 --> 00:13:37,299
go through this, but of course it is one of my three essential elements and so I just wanted to come up here and again make the notion that we

00:13:37,300 --> 00:13:45,500
are talking about, of course, very large datasets involving very many people with lots of different genomic information and phenotypic information

00:13:45,500 --> 00:13:54,600
when we are trying to share that in context of this policy as broadly as is appropriate and through a government database. This begins to

00:13:54,600 --> 00:14:02,866
come into our implementation questions, where we’re having to deal with the realities of what our existing situation is, and being a government

00:14:02,866 --> 00:14:13,432
entity, this added something new and completely unrelated to the science that still had to be accommodated. The GWAS research process,

00:14:13,433 --> 00:14:27,066
as you know of course, begins with the individual study. There is an informed consent agreement process discussion document that occurs in the

00:14:27,066 --> 00:14:33,099
beginning and it represents the agreement between the individual research participant and the primary investigator, and that informed

00:14:33,100 --> 00:14:41,900
consent was, we thought, fundamental to all of the future research going forward and so we built that in as an underlying tenet for everything

00:14:41,900 --> 00:14:53,433
that would be done in the future. I’ll show you how we implemented that in a moment. When genomic studies were conceived by the

00:14:53,433 --> 00:15:00,466
investigators—again, these could have been from the beginning—an intent of the study…they could be deciding to go back to

00:15:00,466 --> 00:15:06,866
data sets in their freezer or sample sets in their freezer and doing this retrospectively, but when they determined that they wanted to do a GWAS

00:15:06,866 --> 00:15:14,532
study and requested support from the NIH and received the support, that is when the policy became effective. There was an expectation

00:15:14,533 --> 00:15:23,899
then to submit the data—the genomic data as well as the phenotypic data—to a central repository at the NIH and that all of that data would come in an

00:15:23,900 --> 00:15:32,833
encoded way with all of the identifying information removed, again being practical, with many different ways to de-identify information.

00:15:32,833 --> 00:15:44,099
We had to pick a standard and the standard that we chose was the 18 identifiers from the privacy rule and that is what we defined within the policy

00:15:44,100 --> 00:15:56,433
as representing de-identified data. Going back to the informed consent issues, for setting up our parameters for how data would be shared for

00:15:56,433 --> 00:16:05,566
future use, we asked the submitting institutions to determine, based on a review of the informed consent, what data use limitations should be in

00:16:05,566 --> 00:16:14,432
place for all secondary research. So it’s the PIs with their IRBs that work together to define what is an appropriate use in the future based on the

00:16:14,433 --> 00:16:22,499
informed consent. If the informed consent had statements about only cancer research being done then the NIH wants to know about that

00:16:22,500 --> 00:16:31,500
because we want to honor those original intentions and understanding of the participants, and we use that in making all access decisions

00:16:31,500 --> 00:16:41,566
as data are handed out in the electronic and encrypted way to secondary investigators through dbGaP which is the name, as you all

00:16:41,566 --> 00:16:52,332
hopefully know, of the GWAS data repository at the NIH. Coming back to our shared responsibilities, we do have to rely on

00:16:52,333 --> 00:17:01,099
investigators and local institutions in, not only implementing, but in helping us to get to this point where we have the de-identified data coming into

00:17:01,100 --> 00:17:10,400
the NIH in an appropriate way. So, we ask the institutions to certify their approval for any submission. They stipulate in that certification that

00:17:10,400 --> 00:17:19,933
the data have all been collected and transferred in accord with applicable law, that an IRB has looked at the particulars of the informed consent

00:17:19,933 --> 00:17:28,599
process and documentation for a given study, and that they have used that information to establish the data use limitations that they are

00:17:28,600 --> 00:17:37,066
going to provide to the NIH. They also assure that the PI has done their part to remove all of the HIPAA identifiers and that they are retaining the

00:17:37,066 --> 00:17:47,266
key code to the data so that all data come into the NIH without any identifiers. Moving to data access then, there are, of course, also other

00:17:47,266 --> 00:17:56,899
issues to consider in terms of our goals and the goals were, again, immediate or at least rapid access to all qualified users so that there’s a

00:17:56,900 --> 00:18:04,333
maximum opportunity for scientific progress, but we did need to protect the confidentiality of the participants and respect their consent provisions,

00:18:04,333 --> 00:18:11,199
and also questions about recognizing the needs of the investigators for academic research through the sharing since they’ve invested a lot

00:18:11,200 --> 00:18:18,633
of time in building these cohorts and in doing the original data collection. And then, of course, we want to preserve all of the basic knowledge for

00:18:18,633 --> 00:18:27,666
downstream development getting to our freedom to operate principle for the basic knowledge coming out of these studies. So, the model that

00:18:27,666 --> 00:18:37,032
was developed was a two-tiered data access model with controlled access data through a Data Access Request and review procedure where

00:18:37,033 --> 00:18:44,699
people could get access to all of the genomic information, the phenotype information, as well as some pre-computed information which was

00:18:44,700 --> 00:18:54,500
done—some basic statistics information to put out there—to try and make sure that the data that were there…that no one was able to claim just a

00:18:54,500 --> 00:19:04,666
basic association from the data and tie that up in intellectual property claims. By making it available to everyone we hope to circumvent that. From

00:19:04,666 --> 00:19:12,666
the open access perspective we wanted to let people know—again, transparency—what was in the database. We wanted people to know as

00:19:12,666 --> 00:19:20,599
much about the individual studies as possible, so we included the protocol descriptions about the population, the variables that were included in the

00:19:20,600 --> 00:19:27,366
dataset, the study instruments that were used to collect the data and this was both for transparency to the public but also transparency

00:19:27,366 --> 00:19:34,032
to investigators who might want to use the data so that they would know before they ever requested the data whether it was going to help

00:19:34,033 --> 00:19:40,433
them. Were they going to be able to compare the blood pressures in their study with the blood pressures in the study that they did in their own

00:19:40,433 --> 00:19:49,266
lab? So, they didn’t ask for data that they didn’t need because, again, that’s a participant protection issue as well. The other part of that

00:19:49,266 --> 00:19:58,266
process is that anyone that is approved for access, their name and approved research use statement goes on to the public access site so

00:19:58,266 --> 00:20:05,166
you can also see everyone who has been approved to work with the dataset, and our hope was not only transparency there but also to

00:20:05,166 --> 00:20:12,432
promote collaboration so that you can see who else was asking questions of the same data that you were interested in and if they might be doing

00:20:12,433 --> 00:20:21,933
something you might want to work together on. The actual request process works through a Web-based system where an individual

00:20:21,933 --> 00:20:30,233
investigator goes into dbGaP and they identify the particular datasets by their consent group by their data use limitations that they would like to

00:20:30,233 --> 00:20:37,733
use. They submit to the NIH a Research Use Statement, which is basically an abstract describing the questions they would like to

00:20:37,733 --> 00:20:47,566
answer. They then submit that proposal to their own institution who must co-sign the request before it comes to the NIH so that the institution is

00:20:47,566 --> 00:20:55,199
involved in the conduct of the research, again, trying to get to our concept of shared responsibilities. And in so doing, both the

00:20:55,200 --> 00:21:05,433
investigator and the institution agree to particular terms of use for the dataset and that the investigator will abide by a code of conduct for

00:21:05,433 --> 00:21:12,566
using the genomic data that they are going to receive access to. The terms of use in the code of conduct, not surprisingly, overlap rather

00:21:12,566 --> 00:21:21,032
closely in terms of what we’re trying to achieve. The highlights are some things that, one, we might expect but that we also wanted to be very

00:21:21,033 --> 00:21:27,799
clear about, and that is, that they won’t attempt to identify the individuals within the dataset, they won’t sell or transfer the data to anyone not

00:21:27,800 --> 00:21:35,633
approved on their application, that they will only use the data for the approved research use. So, they can’t investigate the data, find something

00:21:35,633 --> 00:21:44,599
interesting and decide on their own to go ahead and do something different because that may not be consistent with the data use limitations, and so

00:21:44,600 --> 00:21:54,133
they need to come back to the NIH and let the Data Access Committee, who oversees the individual access to each dataset, look at their

00:21:54,133 --> 00:22:02,499
proposed questions and make sure that it’s consistent with what the data use limitations are before they go and do their research. One of the

00:22:02,500 --> 00:22:09,966
roles of the Data Access Committee is to monitor conduct and progress on research through annual reports just as another way to check and

00:22:09,966 --> 00:22:17,832
make sure investigators really do limit themselves to what they said they are going to do and what they’ve been approved to do. So, the Data

00:22:17,833 --> 00:22:26,099
Access Committee, again, reviewing annual reports is one thing that they do once they…and everything is done in the context of the data use

00:22:26,100 --> 00:22:37,366
limitations. Just to know what it looks like, I’m not sure if everyone knows how the DACs work, but as I’ve already mentioned there’s a process

00:22:37,366 --> 00:22:44,666
where the investigator’s request goes to the institution—their home institution—before it ever comes into NCBI for dbGaP. Once that does come

00:22:44,666 --> 00:22:52,366
in there, there’s an initial staff review to make sure everything is in place and then the DACs take a look at the proposed Research Use

00:22:52,366 --> 00:23:00,366
Statement relative to the data use limitations and make an initial decision: either “yes” the uses are consistent and it may be approved, or “no”

00:23:00,366 --> 00:23:10,399
there’s something inconsistent or inappropriate about how this investigator claims to do the work and the request is disapproved. If it’s approved,

00:23:10,400 --> 00:23:18,966
notification goes to dbGaP and the staff are notified with their passwords and instructions for downloading the encrypted data, and likewise, if

00:23:18,966 --> 00:23:27,999
it’s disapproved the requester is notified by the DAC and often given a reason as to why it was disapproved. With this rationale, if this was a

00:23:28,000 --> 00:23:35,133
simple mistake or they didn’t understand something they can resubmit the request after making revisions. Sometimes it’s not as obvious

00:23:35,133 --> 00:23:41,966
what they’re planning to do in the research statement and so the DAC, the Data Access Committee, will have questions and so there can

00:23:41,966 --> 00:23:49,932
be some back-and-forth between the DAC Chairs or the DAC staff and the individual investigators

00:23:50,000 --> 00:23:58,266
Thank you.

00:23:58,266 --> 00:24:02,432
to make sure again that the DACs and the investigators are on the same page before the DAC makes a review. Then the DACs are also

00:24:02,433 --> 00:24:08,833
responsible—and again this speaks to transparency—for being timely in their work. And so, there is a semi-annual reporting process for

00:24:08,833 --> 00:24:15,433
all of the Data Access Committees from NIH to report to NIH how they are doing the work. What is their…and I’m going to show you some of the

00:24:15,433 --> 00:24:21,833
data that comes out of that later. How many reports are they approving? How many are they disapproving? How long is it taking them to do

00:24:21,833 --> 00:24:29,799
this? So, we can look at the system and make sure it’s efficient. If we want to think about how is this access process working, who is asking

00:24:29,800 --> 00:24:37,433
for the data, and what are they using it for, the requestors are coming from across the research community, so we’re getting investigators from

00:24:37,433 --> 00:24:45,033
the private sector, from academic sectors, from non-profits, and from all over the world, which is exactly what we wanted to happen. We really

00:24:45,033 --> 00:24:52,766
wanted this to be a community resource that anyone could use if they had an appropriate research question for the data they were

00:24:52,766 --> 00:25:01,266
requesting. In terms of the kinds of work they’re doing, most of the work is looking at the etiology of the particular disease or related conditions that

00:25:01,266 --> 00:25:08,499
the dataset was originally collected to study, but there are also a lot of methodology questions that are being asked: trying to learn how to work

00:25:08,500 --> 00:25:17,466
with these kinds of data, how to work with very large datasets, etc. Getting to some of the statistics I thought it might be interesting for the

00:25:17,466 --> 00:25:24,666
group just to look at how much activity there has been. These numbers aren’t perfect. They don’t add up, so don’t worry about that when you try

00:25:24,666 --> 00:25:32,932
to look at it. But if we look at this in terms of project requests…because you can request multiple datasets within any single project. So,

00:25:32,933 --> 00:25:46,299
since dbGaP opened through the middle of this week when we pulled the data, there have been nearly 2,500 research projects approved or

00:25:46,300 --> 00:25:53,600
nearly 2,100, I guess…nearly 2,500 that have been submitted. About 2,100 of those have been approved, about 500 rejected. Again, these

00:25:53,600 --> 00:26:03,300
numbers don’t add up for reasons that are technical that dbGaP sometimes has to remind me about, and then also a significant number of

00:26:03,300 --> 00:26:11,066
revisions requested, a lot of back-and-forth between the Data Access Committees and the investigators. If we look at this in terms of

00:26:11,066 --> 00:26:20,599
consent groups, because again, these are set up with data use limitations and in any given dataset 80% of the study population may have said, “My

00:26:20,600 --> 00:26:28,100
data can be used for any general research that you’d like to conduct,” while 20% may have said, “I only want my data used for diabetes.” And so,

00:26:28,100 --> 00:26:35,600
data are requested by a consent group, and if we look at these numbers, they’re of course that much larger because there are a lot more

00:26:35,600 --> 00:26:41,133
transactions that are taking place. But clearly, what’s important from here is that these data are being used, many investigators are getting

00:26:41,133 --> 00:26:49,133
access to the data and asking questions, and a lot of publications have resulted. Unfortunately, it’s a lot harder for us to track the publications

00:26:49,133 --> 00:27:00,633
coming out of it, but we are starting to get a little bit of data on that front. If we look at how this maps out to the various datasets that we have as

00:27:00,633 --> 00:27:08,399
they are at least identified or categorized according to the Data Access Committee that they belong to, we can see that it varies based on the

00:27:08,400 --> 00:27:18,633
Data Access Committee. Some of this has to do with the number of datasets that they have, some of it has to do with the size of those datasets that

00:27:18,633 --> 00:27:25,766
they have. So, you see that NHLBI is very large; they have a lot of datasets but they also have Framingham, which is incredibly popular for

00:27:25,766 --> 00:27:35,599
everyone to look at. The Cancer Genome Atlas is another one; it actually has the highest usage of information out of the dataset. So, this is really

00:27:35,600 --> 00:27:45,466
more just to show the breadth of interest and the number of people looking at particular datasets. Another key point within the policy, and again,

00:27:45,466 --> 00:27:55,066
going back to trying to respect participant wishes and our principles of respect for participants that we use in putting together the policy, is to note

00:27:55,066 --> 00:28:02,732
that from the beginning it was expected that there would be some datasets where it simply wasn’t appropriate to share the data through dbGaP in

00:28:02,733 --> 00:28:10,833
this broad way as we wanted to do to build the community resource. So, it was conceived from the very beginning that exceptions to the data

00:28:10,833 --> 00:28:19,833
sharing expectation would be possible. This is something that, because different institutes have different program priorities in putting together

00:28:19,833 --> 00:28:28,399
their initiatives, it needs to be agreed upon with the program staff prior to funding so that everyone is on the same page and there are no

00:28:28,400 --> 00:28:40,300
surprises down the road where their primary intent is about data sharing or to share a particular kind of data within their dataset, doesn’t

00:28:40,300 --> 00:28:48,533
find out until the point that it’s time to put the data in dbGaP that the IRB said “no” because the consent explicitly says there can’t be any

00:28:48,533 --> 00:28:55,799
sharing. That’s a perfectly appropriate thing for the consent to have said and the NIH wants to honor that. The NIH just wants to know what

00:28:55,800 --> 00:29:05,600
those restrictions are before anything is funded. We have had several exceptions that have been granted and these have, not surprisingly, been

00:29:05,600 --> 00:29:13,333
due to limited consent language where the dataset was still felt to be very scientifically important and so the institute felt it was within the

00:29:13,333 --> 00:29:22,833
interest of their program priorities to fund the data but to have an alternative data sharing plan besides working through dbGaP. There

00:29:22,833 --> 00:29:28,299
have also been a few cases where there have been legal restrictions so that data couldn’t leave—a particular kind of data couldn’t leave—a given

00:29:28,300 --> 00:29:33,700
state or data couldn’t leave a given country, and also some cases where the geographic representation of the study population has been

00:29:33,700 --> 00:29:42,366
felt by the institution to be too localized to be appropriate to share the data. And so institutes, again because of the scientific merits of a given

00:29:42,366 --> 00:29:49,766
study, have wanted to fund that and have granted an exception to the data sharing expectation. We are in the process of developing

00:29:49,766 --> 00:29:59,066
points to consider on how to write your data sharing plans and how to write requests for exceptions so that institutions understand more

00:29:59,066 --> 00:30:08,566
what the NIH is looking for and even so that our own staff understand more of what the expectations of the policy are with regard to

00:30:08,566 --> 00:30:20,666
granting exceptions. Again, coming back to transparency and looking at the concept of governance, which I mentioned earlier as

00:30:20,666 --> 00:30:29,832
something that we’re seeing more and more of in the oversight of data access, the GWAS policy also, from the very beginning, set up a system of

00:30:29,833 --> 00:30:38,333
governance committees so that the policy and the implementation of the policy could be fluid over time and could be responsive to changes either in

00:30:38,333 --> 00:30:46,833
technology, changes in our scientific understanding, or changes also in the public or societal conversations that were going on around

00:30:46,833 --> 00:30:56,499
this science. And so in terms of accountability, the governance system and all of the conduct of the GWAS policy and the infrastructure that was

00:30:56,500 --> 00:31:05,333
created to oversee it is directly responsible to the NIH Director. He is immediately informed by a senior oversight committee which consists of

00:31:05,333 --> 00:31:16,799
several IC directors as well as some of his most senior staff, and they meet at least quarterly to oversee how the policy implementation is

00:31:16,800 --> 00:31:27,300
happening to make decisions on policy questions that are being brought before the committee, or they meet whenever there’s some urgent issue

00:31:27,300 --> 00:31:34,700
which needs to be attended to, to make sure that the policy can be as responsive as possible. The Senior Oversight Committee is staffed in some

00:31:34,700 --> 00:31:42,566
ways by two steering committees which are constituted with senior staff from across the NIH. The Participant Protection and Data Management

00:31:42,566 --> 00:31:49,366
Steering Committee is made up of all of the Chairs of the Data Access Committees, and they come together monthly to meet and talk about their

00:31:49,366 --> 00:31:57,866
experiences, develop programs and procedures so that they operate consistently across the agency, and similarly, the Technical Standards

00:31:57,866 --> 00:32:04,866
and Data Submission Steering Committee is made up of the lead scientific staff for each institute or center that is conducting a GWAS study or a

00:32:04,866 --> 00:32:14,566
genomic data sharing program, and they also come together once a month to talk about issues around scientific quality of data that are being

00:32:14,566 --> 00:32:20,666
submitted, policy questions around data submission, again, with the aim of sharing experience and building common practices so

00:32:20,666 --> 00:32:28,332
that we can be as transparent and consistent as possible. Something to keep in mind as we go forward is always what’s happening in the public

00:32:28,333 --> 00:32:36,333
conversation. There have been many things of note lately that are influencing what the public thinks about genomics research and also what

00:32:36,333 --> 00:32:47,266
they think about research with their data or their samples going forward, and of course that’s important for research involving a community

00:32:47,266 --> 00:32:54,866
resource and a data bank like dbGaP where we are holding genomic information for a very long time. This one has several headlines from the

00:32:54,866 --> 00:33:04,899
very well-known Havasupai case, the book regarding the story of Henrietta Lacks and the creation of HeLa cells and recently, too, all of the

00:33:04,900 --> 00:33:12,800
information and court cases around newborn screening and the holding of blood spots and what the public attitudes are about the ownership

00:33:12,800 --> 00:33:20,333
of the information that is contained within those blood spots and how that’s changing the conversation. Again, this speaks to the need for

00:33:20,333 --> 00:33:32,699
transparency and the need for the policy conversation to be ongoing and not static over time. We are hoping very soon to release an

00:33:32,700 --> 00:33:42,633
extended data sharing policy that would include whole genome sequence information. It would include epigenomic information, expression level

00:33:42,633 --> 00:33:49,433
data. Obviously, in the five years since the policy has been in development we’ve moved on from the basic GWAS study, but there are all the

00:33:49,433 --> 00:33:56,633
questions that we need to ask ourselves in putting together this policy that we asked ourselves the first time for these new data types

00:33:56,633 --> 00:34:05,299
and also looking at the study designs for these new data types that are different than GWAS and there are some different issues with regard

00:34:05,300 --> 00:34:11,700
to data quality and data submission that need to be addressed. So, some of the questions that we’ve been considering are, again, what data

00:34:11,700 --> 00:34:20,400
and project types should be included? Are these only community resource projects at the whole level or is it every R01 and where do you draw

00:34:20,400 --> 00:34:31,400
the line in between? What is the data submission process and timeline going to look like for this broader array of genomic data types? Are there

00:34:31,400 --> 00:34:36,500
options now for Open Access data release of genomic information? At the moment, no genomic data are available or very, very minimal genomic

00:34:36,500 --> 00:34:44,200
data are available through open access. What should the data release process and timeline be? What have we learned from our GWAS data

00:34:44,200 --> 00:34:52,966
release policy that we should integrate into what we’re doing now? Again, all of the participant protection issues that we considered

00:34:52,966 --> 00:35:00,866
before…informed consent…are the questions different when we’re talking about whole genome information relative to GWAS information? And I

00:35:00,866 --> 00:35:08,232
know you’re going to consider some of this in greater depth tomorrow. And as I already mentioned, what have we learned from GWAS

00:35:08,233 --> 00:35:19,499
and the policy implementation and infrastructure development for that purpose that we can integrate into what we’re going to do in the future

00:35:19,500 --> 00:35:26,233
to make it better, to make it more streamlined, to make it more responsive to investigator needs, to participant concerns, etc.? And of course, how

00:35:26,233 --> 00:35:41,499
are we going to implement it and oversee it? And with that, I will stop and take any questions. [applause] Everyone is just ready to be done for

00:35:41,500 --> 00:35:46,666
the day, which I completely understand. It’s been a long day from looking at your agenda.

00:35:46,666 --> 00:35:54,066
JEFFREY KOPP: Maybe you said it and I missed it. Are whole exomes at this point required to be placed in dbGaP? You had a slide where it’s

00:35:54,066 --> 00:35:57,966
under discussion.

00:35:57,966 --> 00:36:02,832
LAURA LYMAN-RODRIGUEZ: That depends on…yes. It’s under discussion at an agency level, so it’s one of those things where you really need to

00:36:02,833 --> 00:36:06,766
talk to your program director because some ICs are expecting it now, and it’s following the basic GWAS data sharing policy, but not every institute or center.

00:36:06,766 --> 00:36:14,899
JEFFREY KOPP: So as consents are written, they should include the appropriate language to let…

00:36:14,900 --> 00:36:20,000
LAURA LYMAN-RODRIGUEZ: They should include the appropriate language because that is the direction that we expect it to go, but it’s still

00:36:20,000 --> 00:36:25,700
under discussion.
FEMALE: Thanks for a great presentation. It was

00:36:25,700 --> 00:36:36,266
interesting to see the data from the experience with GWAS and dbGaP to date. Then you listed some of the data per consent group and I want to

00:36:36,266 --> 00:36:47,466
talk this through to make sure my interpretation is correct. You talked about a significant number—6,500 plus—of projects that were submitted by

00:36:47,466 --> 00:36:57,832
consent group, but 1,755 that were rejected. Does that mean that 27% of the requests that came in from scientists to access data by

00:36:57,833 --> 00:37:06,599
consent group were rejected because of data use limitations that were described and recorded from the consent?

00:37:06,600 --> 00:37:11,933
LAURA LYMAN-RODRIGUEZ: It means that about that percentage were rejected on their first submission, and so sometimes those came back

00:37:11,933 --> 00:37:22,699
in and were amended, because sometimes investigators are not super-careful when they’re checking off boxes for the consent groups for

00:37:22,700 --> 00:37:27,866
the data that they want and this is one of the arguments that we still have for a manual data access process, because they will say very

00:37:27,866 --> 00:37:35,266
specifically in their abstract that they’re going to do research on cancer and then they will request a diabetes-only dataset. And so, the Data Access

00:37:35,266 --> 00:37:43,199
Committee will approve those datasets that are fine for cancer and disapprove those consent groups that have that restriction on it.

00:37:43,200 --> 00:37:50,666
FEMALE: So, that’s a good protection but also may be a lesson about the way in which prospective consents are constructed and

00:37:50,666 --> 00:37:58,699
whether they should be narrowly tailored versus broader and how that might affect the science.

00:37:58,700 --> 00:38:03,000
LAURA LYMAN-RODRIGUEZ: Yes, it does speak to that issue.

00:38:03,000 --> 00:38:05,566
FEMALE: Okay. Thanks.
JOAN BAILEY-WILSON: I won’t be here

00:38:05,566 --> 00:38:13,032
tomorrow but some of the things that I’ve been discussing with various institute program staff about how are we going to do whole exome

00:38:13,033 --> 00:38:19,999
sequencing has been really interesting and pertinent to what you were talking about, because what we’ve been doing so far has

00:38:20,000 --> 00:38:32,100
mostly been in families and, of course, the GWAS policy never applied to linkage studies. It didn’t apply really to family studies because there were

00:38:32,100 --> 00:38:45,700
considered to be more risks with family studies because these people that gave you their DNA in the family have consented, but perhaps there are

00:38:45,700 --> 00:38:52,900
several other important affecteds in the family that you know are affected but they refuse to give you DNA and they haven’t specifically

00:38:52,900 --> 00:39:06,566
consented. But if I sequence their relatives I now know quite a lot about their genome, etc. So, there are all kinds of interesting policy and

00:39:06,566 --> 00:39:17,432
privacy and IRB questions that, as NIH is developing this policy they are thinking about and they’re talking about and trying to figure out

00:39:17,433 --> 00:39:30,566
where do we draw the line and how do we protect the human subjects but still get the data shared. For example, NICHD is doing a cleft

00:39:30,566 --> 00:39:40,966
sequencing study that some of my samples are part of and these come from Syria. They are extended pedigrees with many affected

00:39:40,966 --> 00:39:52,532
individuals, and when we set up this study in collaboration we had to really be careful in our consents for the Syrian government to agree to

00:39:52,533 --> 00:40:05,733
work with us. And so, NICHD has really worked with us to make it so that we can deposit the data but have protections for the access that satisfied

00:40:05,733 --> 00:40:17,533
the consents—the Syrian investigators—but more importantly, the Syrian ethics boards and the Syrian government. It was quite a process,

00:40:17,533 --> 00:40:25,633
so those things are going on, but those are kinds of things that are really going to have to be thought about and figured out. So, I’m sure you

00:40:25,633 --> 00:40:29,566
guys will enjoy discussing it tomorrow. Since I won’t be here I thought I’d…

00:40:29,566 --> 00:40:35,632
LAURA LYMAN-RODRIGUEZ: Preview.
FEMALE: Laura, thank you very much. It was

00:40:35,633 --> 00:40:41,066
really interesting to see some of the new data. I have two questions for you. One, I hope, is an easy question; maybe they’re both really easy

00:40:41,066 --> 00:40:46,032
questions.
LAURA LYMAN-RODRIGUEZ: I hope they’re both

00:40:46,033 --> 00:40:47,599
easy.
FEMALE: It’s a question that Gail Jarvik asked

00:40:47,600 --> 00:40:54,500
Mike Feolo when he gave a presentation to the eMERGE Steering Committee earlier this week and it had to do with the implications of the proposed

00:40:54,500 --> 00:41:01,733
changes to the Common Rule and whether in particular a proposal in the Advance Notice of Proposed Rulemaking which would make all

00:41:01,733 --> 00:41:08,833
research with biospecimens require informed consent; whether that would have any implications for the data that are already shared

00:41:08,833 --> 00:41:17,466
in dbGaP. My assumption is that’s a “no” but I wasn’t sure and Mike couldn’t answer her question, so I now ask you.

00:41:17,466 --> 00:41:27,166
LAURA LYMAN-RODRIGUEZ: So, in the ANPRM there is also a question about whether or not prior studies should be grandfathered into this

00:41:27,166 --> 00:41:35,899
new expectation for informed consent on any study. So, I think whether or not that particular clause goes forward in any notice or any final

00:41:35,900 --> 00:41:45,233
rule will determine what happens to the studies that are already in dbGaP where there isn’t explicit consent or whatever it may be for this

00:41:45,233 --> 00:41:49,899
process. So, I can’t…there is no answer to that question yet.

00:41:49,900 --> 00:41:57,066
FEMALE: Okay. All right, so maybe that was easy. The second one is it’s really nice to see the numbers on how many data access requests

00:41:57,066 --> 00:42:04,932
were refused because, as far as I was aware, those data were not readily available until quite recently. Another issue that I know the bioethics

00:42:04,933 --> 00:42:13,199
community is sometimes concerned about is the degree to which there have been frank research misuses of data that have been deposited in

00:42:13,200 --> 00:42:25,200
dbGaP, and there was a poster at ASHG last fall for giving some data on the GAIN DAC, giving some limited information about examples of

00:42:25,200 --> 00:42:35,100
research misuses. Do you have any numbers on that either for GAIN or for any of the other data access committees?

00:42:35,100 --> 00:42:41,666
LAURA LYMAN-RODRIGUEZ: I was looking…I do have a slide on that that I didn’t include for this particular talk because I knew I was already

00:42:41,666 --> 00:42:57,966
going to be over time. Let’s see if it’s here. No, it’s not here. So, at the moment I did look at our information this morning and we have 14 data

00:42:57,966 --> 00:43:06,732
management incidents that have taken place. We can put that over the total of studies that have been submitted, so the number is fairly small,

00:43:06,733 --> 00:43:20,033
relatively speaking, and those have ranged in infraction, I guess. There have been a variety of infractions that have been represented. Some

00:43:20,033 --> 00:43:28,499
have been because of computer glitches where something went out either from dbGaP or from the institution to us that should not have gone out,

00:43:28,500 --> 00:43:39,600
and so we’ve corrected those. We’ve had three issues where investigator misconduct was determined to be the issue and those were…two

00:43:39,600 --> 00:43:49,033
were publication issues prior to an embargo end date, and so the sanction that was issued was that they lost access to dbGaP and they had to

00:43:49,033 --> 00:43:57,233
destroy all of the data and cease all activity on anything that they already had and were currently working on and we worked with the

00:43:57,233 --> 00:44:06,066
institutions, again, shared responsibility. So, whenever there is a data management incident, we contact the dean and investigators once we

00:44:06,066 --> 00:44:11,966
know that there’s been an issue. Investigators are sort of cc’d versus being the primary we were working with because we really are

00:44:11,966 --> 00:44:19,566
working with the institution, and ask the institution to tell us what they’re going to do to prevent it from happening again. So again, the investigators

00:44:19,566 --> 00:44:25,532
in those two cases lost access for six months, and that included everyone on their research team and they weren’t able to do anything to

00:44:25,533 --> 00:44:32,266
continue their work on any of their dbGaP datasets, whether it was involved in that particular project or not. The other one was

00:44:32,266 --> 00:44:37,432
someone who asked a different question than what was in their research use statement, and through the annual report, we realized this. It did

00:44:37,433 --> 00:44:49,533
not violate the use parameters, so we didn’t have any consent violations in that case, but it wasn’t what they said they were going to do, and so

00:44:49,533 --> 00:44:58,133
they were sanctioned for three months, I believe, from dbGaP and had to cease use of all of the

00:44:58,133 --> 00:45:05,033
datasets that they had in that case. So we are, again, tracking it and putting it out there and we are looking at publishing some global analysis, not

00:45:05,033 --> 00:45:11,799
just the GAIN paper. I think NHLBI also has a paper in development to talk about their DAC’s experience, but we are hoping to put together

00:45:11,800 --> 00:45:15,666
something across the board to look at the entire agency’s perspective.

00:45:15,666 --> 00:45:21,199
FEMALE: That would be really wonderful for those of us who talk to research participants and who have questions about these sorts of things.

Date Last Updated: 9/18/2012
