
Introduction to data management

Contents
1. Contents
2. Introduction to data management
3. Aims
4. The data life cycle
5. The data life cycle: Design
6. The data life cycle: Collect
7. The data life cycle: Curate
1. Data standards
1. Common data standards
8. Variability among datasets (1)
9. Variability among datasets (2)
10. Predictable datasets
11. Metadata
12. The data dictionary
13. Use of the data dictionary
14. Exercise 1
15. Exercise 1: Discussion
16. The data life cycle: Store
1. Cloud computing
2. Factors to consider in choosing storage solution
17. The data life cycle: Archive
18. Data management plan
1. DMP structure (1)
2. DMP structure (2)
19. Data sharing plan
1. Example data sharing plan
20. Summary
21. Quiz

Introduction to data management


The objective of this module is to provide researchers with the basics of, and best practices for, data management across data collection, storage, processing and sharing.

Source: Data sharing and management Snafu in three short acts by Karen Hanson, Alisa Surkis
& Karen Yacobucci, NYU Health Sciences Libraries, 3 August 2012

This video highlights the challenges that researchers often encounter in data sharing, which
include:

Understanding the data sharing requirements of funders and journals

Inappropriate data storage

Insufficient data documentation

Data interoperability and uncertainty in the reuse of shared data

Aims
In this module, we look at the following areas in data management and sharing:
The data life cycle, which includes design, collection, curation, storage and archiving of
research data

Contents and importance of a data management plan in research

Contents and importance of a data sharing plan

The data life cycle


The data life cycle comprises six key iterative steps, providing a high-level overview of the
stages involved in successful management and preservation of data for use and reuse.

The data life cycle: Design


The first phase of the data life cycle is design. Start by defining WHAT data points will be used to measure an outcome, including the key data points that will serve as the study end points.

Using appropriate statistical methods, define the sample size, i.e. HOW many observations will be made. Then write a plan for how these data will be analysed.

Identify the possible data sources, i.e. WHERE these data will be collected from. Depending on the data sources identified, determine the appropriate data collection methods to capture the data effectively.

In order to answer the research question(s), the data variables need to be defined and translated into conversational questions.

Data Dictionary

Once the questionnaire is ready, it is time to derive variables from the questions and develop a data dictionary. A data dictionary describes the properties of each variable, such as its data type (e.g. string, number, date), size, coding, and the constraints and validations attached to it. The data dictionary is a critical component in building the database, as it forms the metadata, i.e. data about the database. The database designer will use the data dictionary to program the database and build in validation checks.
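To make this concrete, here is a minimal sketch of how a fragment of a data dictionary could be represented in code; the variable names, codes and limits are hypothetical.

```python
# Hypothetical data dictionary fragment for a demographics form.
# Each entry records the properties described above: data type, size,
# coding, and the constraints and validations attached to the variable.
data_dictionary = {
    "SUBJID": {
        "description": "Unique subject identifier",
        "type": "string",
        "max_length": 10,
        "required": True,
    },
    "SEX": {
        "description": "Sex of the participant",
        "type": "string",
        "codelist": {"M": "Male", "F": "Female"},
        "required": True,
    },
    "AGE": {
        "description": "Age at enrolment, in years",
        "type": "integer",
        "range": (0, 120),  # validation: reject implausible values
        "required": True,
    },
}
```

A database designer could read a structure like this to generate the database schema and its validation checks.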

To ensure the study personnel understand the research flow and how to use the study data collection tools, we also need to develop training materials, user manuals and standard operating procedures (SOPs). Undertake study-specific training and document every process of the study to ensure appropriate data collection.

The data life cycle: Collect


Data collection is the process of creating data values that help answer the research question. Observations are made from the defined study data sources, e.g. lab machines, manual readings, audio tapes, etc. The process for capturing data from the source must be defined. For instance, data might be transcribed manually onto your questionnaire or entered directly into the electronic database. Database entry can either be done manually or, if the data are already in electronic format, the data can be imported directly into the database. Ensure an audit trail is maintained for all data imports.

To ensure research data attains the desired quality, develop a data quality plan to validate the research data. This could involve source data verification, edit checks, double data entry, and so on.
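For instance, an edit check can be driven directly by the data dictionary. The sketch below is a hypothetical illustration; the field names and limits are assumptions, not part of any particular system.

```python
def check_record(record, dictionary):
    """Return a list of edit-check failures for one data record."""
    errors = []
    for field, rules in dictionary.items():
        value = record.get(field)
        if value is None:
            if rules.get("required"):
                errors.append(f"{field}: required field is missing")
            continue
        low, high = rules.get("range", (None, None))
        if low is not None and not (low <= value <= high):
            errors.append(f"{field}: value {value} outside range {low}-{high}")
    return errors

# Flag an implausible age before the record enters the database.
dictionary = {"AGE": {"required": True, "range": (0, 120)}}
print(check_record({"AGE": 150}, dictionary))
# -> ['AGE: value 150 outside range 0-120']
```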

The data life cycle: Curate


Data curation is the consolidation, organization and integration of data collected from multiple sources. It involves quality assurance and standardization of data. Unique identifiers are used to link data together to form a story.

For example, a patient enrolled in a TB study will have medical notes at admission, chest X-ray results, microbiology culture results and genotyping data. These data are collected by different people and, very likely, in different geographical regions. Ultimately the data will all be brought together to build a story: participant 766 was diagnosed with TB based on the chest X-ray; the TB was found to be multi-drug resistant based on the lab report; and this seems to be the same strain of bacterium that was reported in location XXX, based on the genome sequencing results.
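A minimal sketch of this kind of linkage, assuming pandas and using hypothetical dataset and column names:

```python
import pandas as pd

# Hypothetical records about the same participant, captured by different teams.
admissions = pd.DataFrame({"USUBJID": ["766"], "admission_note": ["suspected TB"]})
xray = pd.DataFrame({"USUBJID": ["766"], "xray_result": ["consistent with TB"]})
lab = pd.DataFrame({"USUBJID": ["766"], "culture_result": ["multi-drug resistant"]})

# The shared unique identifier lets the pieces be joined into one story.
story = admissions.merge(xray, on="USUBJID").merge(lab, on="USUBJID")
print(story)
```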

Data standards
Data standards are the rules by which data are described and recorded. Using standards makes things easier. For example, say you need an AAA battery for your flashlight. You don't need to worry about the make or brand of the battery, since all AAA batteries are the same size – because they are produced to a standard. Therefore, any AAA battery will work in your flashlight.

Standards provide data integrity, accuracy and consistency, clarify ambiguous meanings and minimize data redundancy. Use of commonly agreed standards facilitates sharing and understanding of data.

Common data standards


Examples of common data standards in biomedical research include:

Clinical Data Interchange Standards Consortium (CDISC): Focuses primarily on regulated studies, supporting medical research from protocol through analysis and reporting of results.

SDTM (Study Data Tabulation Model): Defines a standard structure for human clinical trial (study) data tabulations and for nonclinical study data tabulations that are to be submitted as part of a product application to a regulatory authority such as the United States Food and Drug Administration (FDA). SDTM is defined by the Submission Data Standards team of CDISC.

International Classification of Diseases (ICD-11): Diagnostic classification standard for diseases, disorders, injuries and other related health conditions.

Systematized Nomenclature of Medicine (SNOMED): General terminology for electronic health records that includes clinical findings, symptoms, diagnoses, procedures, body structures, organisms and other etiologies, substances, pharmaceuticals, devices and specimens.

WHODRUG: Classification of medicines created by the WHO and used for identifying drug names.

Unified Medical Language System (UMLS): Thesaurus and ontology of biomedical concepts that supports natural language processing and is used mainly by developers in medical informatics.

Variability among datasets (1)


Have a look at these four data tables. What do you see?

Adapted from a slide courtesy of Armando Oliva and Amy Malia, FDA

Variability among datasets (2)


The data tables look as though they might be from the same study conducted in different places. However, they vary as follows:

The file name for the demographics dataset varies

Column headings vary for both subject identifiers (SUBJID, PTID, USUBID, ID) and sex/gender

Gender/sex is denoted by Male or Female, M or F, 1 or 2, or 0 or 1

Subject identifiers look different in every study – it is not clear if the same person was included in more than one study

Predictable datasets
By contrast, here we observe that all four datasets share the same data standards and are thus predictable:

The column header (the variable) for subject ID is always the same

The column ‘Sex’ always has the same heading, and the sex data always use the same terminology (codelist)

There is a consistent format and standard definition for subject ID (USUBJID), which makes it clear which subjects were in more than one study

The name of the demographics dataset is always the same
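A sketch of how a non-standard dataset might be mapped onto such a standard; the local column names and code mappings below are assumptions for illustration.

```python
import pandas as pd

# A hypothetical raw dataset using local conventions.
raw = pd.DataFrame({"PTID": ["001", "002"], "gender": [1, 2]})

# Rename columns and recode values to the agreed standard (USUBJID, SEX as M/F).
standard = raw.rename(columns={"PTID": "USUBJID", "gender": "SEX"})
standard["SEX"] = standard["SEX"].map({1: "M", 2: "F"})
print(standard)
```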

Metadata
Metadata is structured, minimal information that describes, explains, locates or otherwise makes it easier to retrieve, use or manage an information resource. Metadata summarizes basic information about data, which can make tracking and working with specific data easier. Some examples of basic metadata are author, date created, date modified, and file size.
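Some of this basic file-level metadata can be read programmatically; a minimal sketch, with a placeholder file name:

```python
import os
from datetime import datetime

path = "demographics.csv"  # placeholder file name
info = os.stat(path)
print("File size (bytes):", info.st_size)
print("Last modified:", datetime.fromtimestamp(info.st_mtime))
```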

The importance of metadata

It enables data transparency and understanding, making it possible to share and compare health information between hospitals, regions and countries

It allows data comparisons for the same location across different time periods

Use of a common language/data protocols and a shared set of expectations enables interoperability between systems, thus promoting system automation

Metadata improve data reusability, thus reducing redundancy and data collection costs

The data dictionary


In order to understand the data dictionary, imagine you have been asked
to describe a tea mug.

In describing the tea mug, here are a few questions to guide you:

1. What material is the cup made of?

2. What size is the cup?

3. What colour is the cup?

4. What shape is the cup?

If you answer these questions, then you will have built a dictionary that describes the tea mug.

Likewise, in data management, a data dictionary is a set of information describing the contents,
format, and structure of a database, and the relationship between its elements, used to control
access to and manipulation of the database.

For a clinical trial, the data dictionary will be made up of the following:

Name of the form/CRF section, e.g. demographics
A listing of data objects (names and definitions)
A description of each data element in natural language
Detailed properties of data elements (data type, size, nullability, optionality, indexes)
Response options, e.g. checkbox, radio button, text
The validation rule(s), e.g. required field, range check
Relationships of the data items to others
Details of any privacy and security constraints that should be associated with the item, e.g. Protected Health Information status

Data dictionary template

The template below shows how the data dictionary might look.
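As a minimal illustrative sketch (the variables and values are hypothetical):

Variable   Description                 Type     Size  Codelist               Validation
USUBJID    Unique subject identifier   String   10    -                      Required
SEX        Sex of the participant      String   1     M = Male, F = Female   Required
AGE        Age at enrolment (years)    Integer  3     -                      Range 0-120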
Use of the data dictionary
(Source: https://www.usgs.gov/products/data-and-tools/data-management/data-dictionaries )

The data dictionary is used in a variety of ways.

Documentation – It provides data structure details for users, developers, and other
stakeholders

Communication – It equips users with a common vocabulary and definitions for shared
data

Application design – It helps application developers create forms and reports with proper
data types and controls, and ensures that navigation is consistent with data relationships

Systems analysis – It enables analysts to understand overall system design and data
flow, and to find where data interact with various processes or components

Data integration – Clear definitions of data elements provide the contextual understanding needed when deciding how to map one data system to another, or whether to subset, merge, stack, or transform a dataset for a specific use

Decision making – The dictionary assists in planning data collection, project development,
and other collaborative efforts

Exercise 1
Using the exercise provided and this template, develop a data dictionary. (Please click the image
to download a copy of the document)
Exercise 1: Discussion
See the table here for the filled out data dictionary. (Please click the image to download the full
table)

The data life cycle: Store


Once research data has been created, it needs to be stored and protected, with the appropriate level of security applied. A robust backup and recovery process should also be implemented to ensure retention of data throughout the life cycle.

All source documents need to be stored, including paper records such as patient questionnaires, lab printouts, drug records, etc. They should be stored in a safe place with controlled access.

Electronic records should be encrypted and stored in a database or on electronic devices such as USB drives, hard drives, DVDs, etc. that allow future access to the data.

Develop a data backup plan to mitigate the risk of data loss. Data loss can be caused by many things, for instance computer viruses, hardware failure, file corruption, fire, flood, or theft.

Secure all data storage media against inappropriate access, tampering and destruction.
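As one element of such a backup plan, a scripted copy with checksum verification might look like this minimal sketch; the paths are placeholders, and a real plan would also cover offsite copies and encryption.

```python
import hashlib
import os
import shutil
from datetime import date

src = "study_database.db"  # placeholder: the live database file
backup_dir = "backup"      # placeholder: ideally a second, offsite location
os.makedirs(backup_dir, exist_ok=True)
dst = os.path.join(backup_dir, f"study_database_{date.today()}.db")

shutil.copy2(src, dst)  # copy the file, preserving timestamps

def sha256(path):
    """Checksum used to confirm the copy matches the original."""
    with open(path, "rb") as f:
        return hashlib.sha256(f.read()).hexdigest()

assert sha256(src) == sha256(dst), "backup checksum mismatch"
```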

Cloud computing
Cloud computing is the delivery of computing services (servers, storage, databases, networking, software, analytics, and intelligence) over the Internet (“the cloud”), offering faster innovation, flexible resources, and economies of scale.
Source: https://commons.wikimedia.org/wiki/File:Cloud_computing.svg

Overview of cloud computing, with typical types of applications supported. © Sam Johnston. CC
BY-SA 3.0.

Data storage solutions range from onsite to offsite storage. A hybrid solution, i.e. a mix of onsite and offsite data storage, is highly recommended to mitigate the risk of data loss. Offsite data storage can serve as a data backup.

Factors to consider in choosing storage solution


Cost
Data storage needs vary depending on an organisation's needs and data security requirements. For onsite storage, the equipment cost and software licences for a stack of servers are some of the aspects you may need to budget for. Additional budget provisions will be required for maintaining and servicing the servers and renewing software licences.

For offsite storage of small volumes of data, online file storage services allow free backup of around 2 GB. If you need more, most services charge about $10 (USD) per month for 50 GB.

To be cost effective, understand your data storage needs and make sure there is provision in your budget for your chosen onsite and offsite data storage solution.

Uptime
Go for solutions that allow you to back up your data or access your stored data with minimal or no downtime. Check the quality of the service agreement of the online storage providers you are investigating, and make sure they offer an uptime guarantee.

Security
Speed and convenience are one thing; security is another. You don't want someone looking at your files, do you? Aside from requiring a user name and password, ensure the transmission channel is secure and uses secure data transfer protocols.

Interfacing with the service


Moving your data between your computers and an online storage service should be easy. The service should provide client software that integrates with your desktop or your operating system’s file manager.

Level and responsiveness of support


At some point, you’ll run into a problem with a storage solution. Find out how support is offered
for your data storage solution.

The data life cycle: Archive


Archiving data means moving data that is no longer actively used to a separate storage location for long-term retention. This is distinct from data backup, which involves making a copy of actively used data and storing it in a separate location for restoration in the event of loss of or damage to the original data.

When archiving your data, you need to ensure that you, or your
future colleagues, will be able to understand the archive and find
what they need. Think back to the video at the start of the
module – make sure you would be able to deal with a similar
request for your data.
You will need to keep the database, the master file (paper and/or electronic), and the original audit trail. Ensure that you have controlled access to maintain authenticity, and that access to documents and data is maintained for the entire archiving period.

This will mean assessing the hardware and software used to access the data in its original archived format, and possibly putting in place a new system to emulate the old software, or migrating the data into a new format to ensure continued access with new software.

The retention period for the data is based on the applicable regulations. For example:

Clinical trials under Directive 2001/20/EC: data needs to be kept for 5 years

Studies supporting marketing authorization: data needs to be kept for 15 years

Studies sponsored by Oxford involving children: data needs to be kept for 20 years

Data management plan


A data management plan (DMP) is a written document that describes the data you expect to acquire or generate in a research project; how you will collect, describe, clean and store those data; and what mechanisms you will use at the end of your project to share and preserve your data. It covers the entire data life cycle and all the processes involved in it.

DMP structure (1)


A generic data management plan has the following sections:

1. Description of the data


This section covers the study protocol/research proposal: study name and design, research question, sample size, study site location and study duration.

This is where you describe the types of data to be collected (e.g. qualitative or quantitative),
data sources and formats.

2. Data generation and collection


This describes how data will be generated, and defines which instruments will be used for data
generation and measurements. For example, will the results be generated by a particular
machine, or are they manual measurements like vital signs?

You also need to define the data collection tool. For example, will you use paper questionnaires, or electronic data capture tools like a mobile app or a specific spreadsheet? State any expected external data imports, their sources and timelines.

3. Capture, documentation & curation


This section is probably the most complex. You will need to define all of the following:

Study database: Define the electronic database you will use to collect your data. This needs to
include the software name and version.

Data entry: Define where data entry will be done, at the site or in the office. Describe the data entry model that will be used to enter the data into the database: single entry or double entry.

Data quality: Define the data quality measures to be applied to ensure the research data are valid, reliable and accurate. State the frequency with which data quality checks are done and the supporting documentation.

Metadata: Define the metadata that will be released to describe your data to the data users.

Archiving and preservation: Describe how your data will be preserved for future use and how
long the research data will be retained.

Training: Describe the documentation and training materials available to help the research team
in understanding the research procedures.

DMP structure (2)


4. Security
This section describes the measures that will be in place to protect your data in case of disaster, indicating the reference documents applicable to backup and disaster recovery.

You need to state the data security and privacy protection policies applicable to your data and, if required, the applicable data standards.

5. Analysis and reporting


Describe any analysis that will take place, with reference to the statistical analysis plan and the Data Safety and Monitoring Board (DSMB). Indicate any reports that will be generated and their frequency. State any data sharing and external data transfers expected (see the Data sharing plan section).
6. Roles and responsibilities
This provides a brief summary of the key project team leaders, their roles and their contacts.

Resources to help create a Data Management Plan

https://dmptool.org

https://dmponline.dcc.ac.uk/

Data sharing plan


A generic data sharing plan takes the following structure.

1. What data will be shared?

Define the data collected in the study, e.g. demographics, genotyping, bioassays

Will data be shared in aggregate form or as original raw data points?

What metadata will be shared along with the dataset?

Will any coding standards be applied?

Are there additional documents to be shared, e.g. protocol, questionnaires?

2. Who will access the data?

Will all data or some of the data be accessible publicly?

Will access to certain components of the data be restricted? If so, why?

Will people accessing data need to comply with any conditions?

Will data agreements or contracts be necessary before sharing the data?

3. Where will the data be stored?

Where will the final dataset be stored? E.g. a repository or an enclave

Who maintains the data store?

Are the data files backed up regularly? Are there replicas in different locations? Are older
versions of the data kept?

Are the system and storage that will be used secure?

What plans do you have to archive the data and other research products?

4. When will data be shared?

Will data be released before, during or after publication?

For how long will the data be maintained and access provided?

For long-term studies, will new data components be made incrementally available?

5. How will the data be found?


How will other researchers become aware of the existence of your data?

How will other researchers and the public gain access to where your data is stored?

Example data sharing plan


What data will be shared:
I will share aggregated PK data associated with the collected samples by depositing these data at XXXX, which is an XXX-funded repository. Additional data documentation and de-identified data, including demographics and medical history, will be deposited for sharing along with the PK data, consistent with applicable laws and regulations. I will comply with XXX funding policies on sharing data on XXXXXX disease. Submitted data will conform to relevant data and terminology standards.

Who will have access to the data:


Data will be shared under an open access license and could be used for secondary study purposes.

Where will the data be available:


I agree to deposit and maintain the PK data, and any secondary analyses of the data, at the XXXX repository. The repository has data access policies and procedures consistent with XXX institutional data sharing policies.

When will the data be shared:


I agree to deposit the PK data into the XXXX repository as soon as possible, and no later than one year after completion of the project, upon acceptance of the data for publication, or upon public disclosure of a submitted patent application, whichever is earliest.

How will researchers locate and access the data:


I agree to identify the repositories where the data will be available under an open access license, and how to access the data, in any publications and presentations that I author or co-author about these data, and to acknowledge the repository and funding source in any such publications and presentations.

Summary
In order to execute your research effectively, you need a data management and sharing plan. Train the study team on all the research procedures to ensure accurate data are collected. In collecting and manipulating your data, adopt commonly agreed data documentation standards to help other researchers understand your data. A robust data storage solution will keep your data secure and help mitigate the risk of data loss. Preserve data for long-term use and sharing as per your institution’s policy on data retention. Review your study documents to ensure they provide clear guidelines on study procedures.

Quiz

1. The data life cycle comprises how many iterative steps?
5
4
6 (correct answer)
3
2. Which of the following is not part of the data life cycle’s iterative steps?
Design
Collect
Store
Analysis (correct answer)
3. Which of the following does not describe metadata? Pick all that apply
Structured minimal information that describes, explains, locates or otherwise makes it easier to retrieve, use or manage an information resource
Summary of basic information about data which can make tracking and working with specific data easier
A set of information describing the contents, format, and structure of a database (correct answer)
A set of specifications (or requirements) for how some sets of data should be made publicly available (correct answer)
4. The data management and sharing plan should be designed alongside:
The study handbook
The investigators’ brochure
The standard operating procedures
The protocol and funder requirements (correct answer)
5. Which of the following topics is NOT an aspect of a data management plan?
Designing the CRF
Planning data collection
Drafting data staff contracts (correct answer)
Designing the statistical analysis plan
6. Which of the following is NOT a factor to consider in choosing a data storage solution? Pick all that apply
Security
Brand of the server (correct answer)
Number of users (correct answer)
Cost
7. The data dictionary is used for all of the following except:
Documentation of data structure for users and developers
In application design to help designers create forms and reports
In developing the data sharing plan (correct answer)
In decision making to plan data collection and project management
8. A generic data sharing plan has the following structure: What variable will be collected? What data will be shared? Who will have access to th…
True
False (correct answer)
9. An objective of data standards is to enable reusability of data elements and their metadata, thus reducing redundancy.
True (correct answer)
False
10. An objective of good clinical data management is to ensure the database accurately reflects the data collected in the study.
True (correct answer)
False
