Introduction to Data Management
Contents
1. Contents
2. Introduction to data management
3. Aims
4. The data life cycle
5. The data life cycle: Design
6. The data life cycle: Collect
7. The data life cycle: Curate
1. Data standards
1. Common data standards
8. Variability among datasets (1)
9. Variability among datasets (2)
10. Predictable datasets
11. Metadata
12. The data dictionary
13. Use of the data dictionary
14. Exercise 1
15. Exercise 1: Discussion
16. The data life cycle: Store
1. Cloud computing
2. Factors to consider in choosing storage solution
17. The data life cycle: Archive
18. Data management plan
1. DMP structure (1)
2. DMP structure (2)
19. Data sharing plan
1. Example data sharing plan
20. Summary
21. Quiz
Source: Data sharing and management Snafu in three short acts by Karen Hanson, Alisa Surkis
& Karen Yacobucci, NYU Health Sciences Libraries, 3 August 2012
This video highlights the challenges that researchers often encounter in data sharing, which
include:
Understanding the data sharing requirements from the funders and journals
Aims
In this module, we look at the following areas in data management and sharing:
The data life cycle, which includes design, collection, curation, storage and archiving of
research data
Identify the possible data sources, i.e. WHERE these data will be collected from. Depending on the data sources identified, determine the appropriate data collection methods to collect the data effectively.
In order to answer the research question(s), the data variables need to be defined and translated into conversational questions.
Data Dictionary
Once the questionnaire is ready, it is time to derive variables from the questions and develop a data dictionary. A data dictionary describes the properties of each variable, such as its data type (e.g. string, number, date), size, data coding, and any constraints and validations attached to it. The data dictionary is a critical component in building the database, as it
forms the metadata, which is data about the database. The database designer will use the data
dictionary to program the database and build in validation checks.
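As an illustrative sketch (the variable names, sizes, and codelists below are hypothetical examples, not taken from any particular study), a data dictionary can be represented in code as a simple mapping from variable names to their properties:

```python
# A minimal sketch of data dictionary entries for a hypothetical study.
# Variable names, sizes, codelists, and ranges are illustrative assumptions.
data_dictionary = {
    "SUBJID": {"type": "string", "size": 10, "required": True,
               "description": "Unique subject identifier"},
    "SEX":    {"type": "string", "size": 1, "required": True,
               "codelist": {"M": "Male", "F": "Female", "U": "Unknown"},
               "description": "Sex of the subject"},
    "AGE":    {"type": "integer", "min": 0, "max": 120, "required": True,
               "description": "Age in years at enrolment"},
}

def describe(variable: str) -> str:
    """Return a one-line summary of a variable from the dictionary."""
    entry = data_dictionary[variable]
    return f"{variable}: {entry['type']} - {entry['description']}"
```

A database designer could read constraints such as `min`, `max`, and `codelist` from these entries to generate the corresponding validation checks automatically.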
To ensure the study personnel understand the research flow and how to use the study data
collection tools, we also need to develop training materials, user manuals and standard
operating procedures (SOPs). Undertake study-specific training and document every process of the study to ensure appropriate data collection.
Data standards
Data standards are the rules by which data are described and
recorded. Using standards makes things easier. For example, suppose you need a AAA battery for your flashlight. You
don't need to worry about the make or brand of the battery,
since all AAA batteries are the same size – because they are
produced to a standard. Therefore, all AAA batteries will work in
your flashlight.
Clinical Data Interchange Standards Consortium (CDISC): Focuses primarily on regulated studies
to support medical research from protocol through analysis and reporting of results.
SDTM (Study Data Tabulation Model) defines a standard structure for human clinical trial (study)
data tabulations and for nonclinical study data tabulations that are to be submitted as part of a
product application to a regulatory authority such as the United States Food and Drug
Administration (FDA). The Submission Data Standards team of Clinical Data Interchange
Standards Consortium (CDISC) defines SDTM.
WHODRUG: classification of medicines created by the WHO and used for identifying drug
names.
Unified Medical Language System: thesaurus and ontology of biomedical concepts that supports
natural language processing, and is used mainly by developers in medical informatics.
Adapted from a slide courtesy of Armando Oliva and Amy Malia, FDA
Column headings vary for both subject identifiers (SUBJID, PTID, USUBID, ID) and
sex/gender
Subject identifiers look different in every study – it is not clear whether the same person was included in more than one study
Predictable datasets
By contrast, here we observe that all four datasets share the same data standards and thus are
predictable as follows:
The column ‘Sex’ always has the same heading, and the sex data always uses the same terminology (codelist)
There is a consistent format and standard definition for the subject ID (USUBJID), which makes it clear which subjects were in more than one study
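One simple way to achieve this predictability is to map each study's heading variants onto the agreed standard headings before datasets are combined. The alias table below is a hypothetical sketch; in practice the mapping would come from your data dictionary and the chosen standard:

```python
# Sketch: mapping study-specific column headings onto a common standard.
# The alias table is a hypothetical example, not an official mapping.
COLUMN_ALIASES = {
    "SUBJID": "USUBJID", "PTID": "USUBJID", "USUBID": "USUBJID", "ID": "USUBJID",
    "GENDER": "SEX", "SEX": "SEX",
}

def standardise_headings(headings):
    """Rename known aliases to the standard heading; leave others unchanged."""
    return [COLUMN_ALIASES.get(h.upper(), h) for h in headings]
```

For example, `standardise_headings(["PTID", "GENDER", "VISIT"])` would return `["USUBJID", "SEX", "VISIT"]`, so all four datasets end up with the same subject identifier and sex columns.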
Metadata
Metadata is structured minimal information that describes, explains, locates or otherwise makes
it easier to retrieve, use or manage an information resource. Metadata is used to summarize
basic information about data which can make tracking and working with specific data easier.
Some examples of basic metadata are author, date created, date modified, and file size.
It enables data transparency and understanding. It is possible to share and compare health
information between hospitals, regions and countries
It allows data comparisons for the same location across different time periods
Metadata improves data reusability, thus reducing redundancy and data collection costs
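The basic file metadata mentioned above (size, modification date) can be read directly with the standard library; this short sketch shows the idea:

```python
import os
import tempfile

# Sketch: reading basic file metadata (size, modification time) with the
# standard library - the same kind of information listed above.
def file_metadata(path):
    info = os.stat(path)
    return {"file_size": info.st_size, "date_modified": info.st_mtime}

# Example: create a small temporary file and inspect its metadata.
with tempfile.NamedTemporaryFile(delete=False) as f:
    f.write(b"example data")
meta = file_metadata(f.name)
```

Richer descriptive metadata (author, study title, variable definitions) is not stored by the filesystem and must be recorded deliberately, for example in the data dictionary.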
In describing the tea mug, here are a few questions to guide you:
If you answer these questions, then you will have built a dictionary that describes the tea mug.
Likewise, in data management, a data dictionary is a set of information describing the contents,
format, and structure of a database, and the relationship between its elements, used to control
access to and manipulation of the database.
For a clinical trial, the data dictionary will be made up of the following:
The template below shows how the data dictionary might look.
Use of the data dictionary
(Source: https://www.usgs.gov/products/data-and-tools/data-management/data-dictionaries )
Documentation – It provides data structure details for users, developers, and other
stakeholders
Communication – It equips users with a common vocabulary and definitions for shared
data
Application design – It helps application developers create forms and reports with proper
data types and controls, and ensures that navigation is consistent with data relationships
Systems analysis – It enables analysts to understand overall system design and data
flow, and to find where data interact with various processes or components
Decision making – The dictionary assists in planning data collection, project development,
and other collaborative efforts
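To make the "application design" use concrete, here is a hedged sketch of how a developer might drive validation checks from data dictionary entries. The dictionary, field names, and rules are hypothetical examples:

```python
# Sketch: using data dictionary entries to validate an incoming record.
# The dictionary, field names, and rules are hypothetical examples.
DICTIONARY = {
    "SEX": {"type": str, "codelist": {"M", "F", "U"}},
    "AGE": {"type": int, "min": 0, "max": 120},
}

def validate(record):
    """Return a list of validation errors for a record; empty if valid."""
    errors = []
    for field, rules in DICTIONARY.items():
        value = record.get(field)
        if value is None:
            errors.append(f"{field}: missing")
            continue
        if not isinstance(value, rules["type"]):
            errors.append(f"{field}: wrong type")
            continue
        if "codelist" in rules and value not in rules["codelist"]:
            errors.append(f"{field}: not in codelist")
        if "min" in rules and value < rules["min"]:
            errors.append(f"{field}: below minimum")
        if "max" in rules and value > rules["max"]:
            errors.append(f"{field}: above maximum")
    return errors
```

Because the rules live in one place, the same dictionary can drive data entry forms, edit checks, and documentation, keeping them consistent.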
Exercise 1
Using the exercise provided and this template, develop a data dictionary. (Please click the image
to download a copy of the document)
Exercise 1: Discussion
See the table here for the filled out data dictionary. (Please click the image to download the full
table)
Secure all data storage media from inappropriate access, tampering and destruction.
Cloud computing
Cloud computing is the delivery of computing services, including servers, storage, databases,
networking, software, analytics, and intelligence over the Internet (“the cloud”) to offer faster
innovation, flexible resources, and economies of scale.
Source: https://commons.wikimedia.org/wiki/File:Cloud_computing.svg
Overview of cloud computing, with typical types of applications supported. © Sam Johnston. CC
BY-SA 3.0.
There are diverse data storage solutions, ranging from onsite to offsite storage. A hybrid solution, a mix of onsite and offsite data storage, is highly recommended to mitigate the risk of data loss. Offsite data storage can serve as a data backup.
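When copying data to offsite backup, it is good practice to verify that the copy is intact. One common approach (sketched here with the standard library; the file name is an illustrative example) is to compare checksums before and after transfer:

```python
import hashlib

# Sketch: computing a SHA-256 checksum so a file can be verified after
# transfer to offsite storage. Chunked reading keeps memory use low.
def sha256_of(path, chunk_size=1 << 16):
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()

# Example: write a small file and checksum it; the same checksum computed
# on the offsite copy confirms the backup is identical.
with open("backup_example.bin", "wb") as f:
    f.write(b"study data")
checksum = sha256_of("backup_example.bin")
```

If the checksum of the offsite copy matches, the backup was transferred without corruption.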
For offsite storage of small volumes of data, online file storage services typically allow free backup of about 2 GB. If you need more, most services charge about US$10 per month for 50 GB.
To be cost-effective, understand your data storage needs and make sure there is provision in your budget for your chosen onsite and offsite data storage solutions.
Uptime
Go for solutions that will allow you to back up your data or get to your stored data with minimal
or no downtime. Check the quality of the service agreement for the online storage providers
that you are investigating, and make sure that they offer an uptime guarantee.
Security
Speed and convenience are one thing; security is something entirely different. You don’t want
someone looking at your files, do you? Aside from requiring a user name and password, ensure the transmission channel is protected by secure data transfer protocols.
When archiving your data, you need to ensure that you, or your
future colleagues, will be able to understand the archive and find
what they need. Think back to the video at the start of the
module – make sure you would be able to deal with a similar
request for your data.
You will need to keep the database, master file (paper and/or electronic), and original audit trail. Ensure that you have
controlled access to maintain authenticity, and that access to documents and data is maintained
for the entire archiving period.
This will mean assessing the hardware and software used to access the data in its original archived format, and possibly putting in place a new system to emulate the old software, or migrating the data into a new format to ensure continued access with new software.
The retention period for the data is based on the applicable regulations. For example:
Clinical trials under Directive 2001/20/EC: data needs to be kept for 5 years
Studies sponsored by Oxford involving children: data needs to be kept for 20 years
This is where you describe the types of data to be collected (e.g. qualitative or quantitative),
data sources and formats.
You also need to define the data collection tool. For example, will you use paper questionnaires, or electronic data capture tools like a mobile app or a specific spreadsheet? State any expected external data imports, their sources and timelines.
Study database: Define the electronic database you will use to collect your data. This needs to
include the software name and version.
Data entry: Define where data entry will be done, at site or in the office. Describe the data entry model that will be used (single entry or double entry).
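Where double entry is used, the two passes are compared and discrepancies resolved against the source documents. A minimal sketch of that comparison (the records below are hypothetical):

```python
# Sketch: comparing first-pass and second-pass (double) data entry to
# flag discrepancies for resolution. The records are hypothetical examples.
def compare_entries(first, second):
    """Return fields whose values differ between the two data entries."""
    discrepancies = {}
    for field in first.keys() | second.keys():
        if first.get(field) != second.get(field):
            discrepancies[field] = (first.get(field), second.get(field))
    return discrepancies

entry1 = {"SUBJID": "001", "AGE": 34, "SEX": "F"}
entry2 = {"SUBJID": "001", "AGE": 43, "SEX": "F"}
```

Here `compare_entries(entry1, entry2)` flags the `AGE` field (34 vs 43), a typical transposition error that double entry is designed to catch.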
Data quality: Define the data quality measures to be applied to ensure research data is valid,
reliable and accurate. State the frequency with which data quality checks are done and the
supporting documentation.
Metadata: Define the metadata that will be released to describe your data to the data users.
Archiving and preservation: Describe how your data will be preserved for future use and how
long the research data will be retained.
Training: Describe the documentation and training materials available to help the research team
in understanding the research procedures.
You need to state data security and privacy protection policies applicable to your data, and if
required, state the applicable data standards.
https://dmptool.org
https://dmponline.dcc.ac.uk/
Are the data files backed up regularly? Are there replicas in different locations? Are older
versions of the data kept?
What plans do you have to archive the data and other research products?
For long-term studies, will new data components be made incrementally available?
How will other researchers and the public gain access to where your data is stored?
Summary
In order to execute your research effectively, you need to have a data management and
sharing plan. Train the study team on all the research procedures to ensure accurate data is
collected. In collecting and manipulating your data, adopt commonly agreed data
documentation standards to help other researchers understand your data. A robust data
storage solution will keep your data secure and help mitigate the risk of data loss. Preserve data
for long-term use and sharing as per your institution’s policy on data retention. Review your
study documents to ensure they provide clear guidance on study procedures.
Quiz
1. The data life cycle comprises how many iterative steps?
5
4
6 (correct)
3
2. Which of the following is not part of the data life cycle's iterative steps?
Design
Collect
Store
Analysis (correct)
3. Which of the following does not describe metadata? Pick all that apply.
Structured minimal information that describes, explains, locates or otherwise makes it easier to retrieve, use or manage an information resource
Summary of basic information about data which can make tracking and working with specific data easier
A set of information describing the contents, format, and structure of a database (correct)
A set of specifications (or requirements) for how some sets of data should be made publicly available (correct)
4. The data management and sharing plan should be designed alongside:
The study handbook
The investigators’ brochure
The standard operating procedures
The protocol and funder requirements (correct)
5. Which of the following topics is NOT an aspect of a data management plan?
Designing the CRF
Planning data collection
Drafting data staff contracts (correct)
Designing the statistical analysis plan
6. Which of the following is NOT a factor to consider in choosing a data storage solution? Pick all that apply.
Security
Brand of the server (correct)
Number of users (correct)
Cost
7. The data dictionary is used for all of the following except:
Documentation of data structure for users and developers
In application design to help designers create forms and reports
In developing a data sharing plan (correct)
In decision making to plan data collection and project management
8. A generic data sharing plan has the following structure: What variable will be collected? What data will be shared? Who will have access to th…
True
False (correct)
9. An objective of data standards is to enable reusability of data elements and their metadata, thus reducing redundancy.
True (correct)
False
10. An objective of good clinical data management is to ensure the database accurately reflects the data collected in the study.
True (correct)
False