SummaryStats is an R package for exploratory summary and quality control of electronic health record (EHR) data. It processes structured healthcare data such as diagnoses, medications, procedures, labs, and concept identifiers (CUIs) into intermediary files that support downstream analysis. Users can extract frequency counts of these codes across patients and time periods and visualize trends through customizable summary plots.
This package has two major function modules:
- Quality Control (QC) Pipelines: Process both codified and NLP-derived features, cleaning and formatting them to facilitate downstream feature harmonization and similarity-based validation.
- Summary Statistics Functions:
  - Main Summary Function: Aggregates raw data, generates total and patient-specific counts, maps medical codes to their ontology descriptions, and visualizes these counts.
  - Code Over Time Function: Extends the main summary function by capturing and visualizing data over specified timeframes, enabling trend analysis.
- Aggregates data into structured, analyzable intermediary files.
- Provides summary statistics on medications, lab tests, procedures, diagnoses, and CUIs.
- Maps medical codes to standardized ontology descriptions.
- Customizable visualizations.
- Supports SQLite for efficient data storage and retrieval.
Install directly from GitHub:
devtools::install_github("celehs/SummaryStats")
Load the package in R:
library(SummaryStats)
- Description: Prepares a code-to-description mapping from a user-provided dictionary CSV. Cleans and deduplicates descriptions for consistency.
- Parameters:
  - dictionary_path: File path to the data dictionary.
- Output: A cleaned data dictionary with two columns: feature_id and description.
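The cleaning step described above can be pictured with a small base-R sketch. This is not the package's implementation: the helper name `clean_dictionary_sketch()` and the raw column names `code` and `desc` are assumptions made for illustration. Only the two-column output (`feature_id`, `description`) comes from the description above.

```r
# Hypothetical helper illustrating cleaning and deduplicating a dictionary.
# Assumes the raw CSV has columns named `code` and `desc` (an assumption).
clean_dictionary_sketch <- function(dictionary_path) {
  raw <- read.csv(dictionary_path, stringsAsFactors = FALSE)
  out <- data.frame(
    feature_id  = trimws(raw$code),
    # Collapse repeated whitespace and lowercase for consistency
    description = tolower(gsub("\\s+", " ", trimws(raw$desc))),
    stringsAsFactors = FALSE
  )
  # Keep the first description seen for each feature_id
  out <- out[!duplicated(out$feature_id), ]
  rownames(out) <- NULL
  out
}

# Toy dictionary written to a temporary file
tmp <- tempfile(fileext = ".csv")
write.csv(
  data.frame(
    code = c("PheCode:335", "PheCode:335 ", "RXNORM:123"),
    desc = c("Multiple  Sclerosis", "multiple sclerosis", " Acetaminophen"),
    stringsAsFactors = FALSE
  ),
  tmp, row.names = FALSE
)

dictionary <- clean_dictionary_sketch(tmp)
print(dictionary)
```

Keeping the first description per `feature_id` is one simple deduplication rule; the package may resolve conflicting descriptions differently.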
- Description: Loads ONCE-generated codified and NLP feature dictionaries based on a target phenotype. Used in Module 5 but initialized here for consistency.
- Parameters:
  - target_code: The codified feature ID of interest (e.g., "PheCode:335").
  - O2: Logical flag for whether the script is running on O2 (default: TRUE).
  - path_code, path_nlp: (Optional) Manual file paths for the ONCE codified and NLP CSVs if not using O2.
- Output: A list of two data frames: one for codified (code) and one for NLP (nlp), each including feature similarity scores.
- Description: Visualizes the annual trends and overall patient counts for a selected target code and CUI. Returns combined line and bar plots for both NLP and codified features.
- Parameters:
  - data_inputs: A named list of cleaned codified and NLP datasets from clean_data().
  - target_code: The codified feature ID of interest (e.g., "PheCode:335").
  - target_cui: The NLP-derived concept ID of interest (e.g., "C0026769").
  - sample_labels: A named list labeling each sample (e.g., "1" = "Site A", "2" = "Site B").
- Output: A list containing three elements: sample_sizes (summary table), nlp_plot (CUI trends), and codified_plot (code trends).
- Description: Evaluates hierarchical consistency for selected Phecodes by comparing trends for parent and child codes. Supports automatic or user-defined child relationships.
- Parameters:
  - data_inputs: A named list of cleaned codified datasets.
  - dictionary: Cleaned feature dictionary.
  - sample_labels: A named list labeling each sample.
  - phecodes: A character vector of parent Phecodes to analyze.
  - custom_children: (Optional) Named list specifying child codes for each parent.
- Output: A list of plots for each parent Phecode, including rate_plot (line chart of prevalence) and combined_plot (line + bar charts of patient counts).
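A parent/child comparison of the kind described above might look like the following base-R sketch. It rests on two assumptions that this document does not confirm: that child codes extend the parent code with a decimal suffix, and that parent counts are rolled up to include their children. Under the second assumption, a child exceeding its parent signals an inconsistency. The data, the helper `find_children()`, and the flag column are all hypothetical.

```r
# Toy yearly patient counts for a parent Phecode and two children
counts <- data.frame(
  year       = rep(2018:2020, each = 3),
  feature_id = rep(c("PheCode:411", "PheCode:411.1", "PheCode:411.2"), times = 3),
  patients   = c(100, 40, 30,  120, 50, 45,  90, 95, 20),
  stringsAsFactors = FALSE
)

# Assumption: children extend the parent code with a decimal suffix
find_children <- function(parent, codes) {
  unique(codes[startsWith(codes, paste0(parent, "."))])
}

parent   <- "PheCode:411"
children <- find_children(parent, counts$feature_id)

# Put the parent count next to each child count for the same year
parent_counts <- counts[counts$feature_id == parent, c("year", "patients")]
names(parent_counts)[2] <- "parent_patients"
child_counts <- counts[counts$feature_id %in% children, ]
cmp <- merge(child_counts, parent_counts, by = "year")

# One possible consistency flag: a child exceeding its parent
cmp$child_exceeds_parent <- cmp$patients > cmp$parent_patients
print(cmp[order(cmp$year, cmp$feature_id), ])
```

When the automatic rule does not match a site's coding scheme, the custom_children argument listed above is the documented way to supply the relationships by hand.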
- Description: Analyzes agreement between a codified code and its corresponding CUI over time. Produces prevalence trends, patient count plots, and intra-patient correlation metrics.
- Parameters:
  - data_inputs: Cleaned codified and NLP datasets from Module 1.
  - target_code: The Phecode of interest.
  - target_cui: The CUI of interest.
  - dictionary: Cleaned feature dictionary.
  - sample_labels: A named list labeling the samples.
- Output: A list with three plots: rates_plot, counts_plot, and correlation_plot, each comparing trends and alignment across samples.
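To make the idea of code/CUI agreement concrete, here is a minimal base-R sketch. It reads "intra-patient correlation" as a Spearman correlation between each patient's total codified count and total NLP count. That reading is an assumption, and the package may compute the metric differently. All data and column names are invented for illustration.

```r
# Toy per-patient, per-year counts for one code and its matching CUI
codified <- data.frame(
  patient = c(1, 1, 2, 3, 4),
  year    = c(2019, 2020, 2019, 2020, 2020),
  count   = c(2, 3, 1, 4, 1)
)
nlp <- data.frame(
  patient = c(1, 2, 2, 3, 5),
  year    = c(2020, 2019, 2020, 2020, 2019),
  count   = c(5, 1, 1, 6, 2)
)

# Total mentions per patient in each source
code_tot <- aggregate(count ~ patient, data = codified, FUN = sum)
cui_tot  <- aggregate(count ~ patient, data = nlp, FUN = sum)

# Outer join so patients missing from one source get a zero count
both <- merge(code_tot, cui_tot, by = "patient", all = TRUE,
              suffixes = c("_code", "_cui"))
both[is.na(both)] <- 0

# Rank-based correlation is less sensitive to a few heavy utilizers
rho <- cor(both$count_code, both$count_cui, method = "spearman")
print(both)
print(rho)
```

Treating a missing source as zero is a deliberate choice here: a patient with a code but no CUI mention is evidence of disagreement and should not be dropped from the comparison.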
- Description: Identifies and visualizes trends for ONCE-derived features most similar to the target code and CUI, across multiple domains (e.g., diagnoses, labs, medications).
- Parameters:
  - data_inputs: Cleaned codified and NLP datasets from Module 1.
  - target_code: Target Phecode.
  - target_cui: Target CUI.
  - sample_labels: A named list labeling each sample.
  - O2: Logical flag for whether ONCE dictionaries are loaded from the O2 cluster.
  - manual_ONCE_path_code: File path to the ONCE codified dictionary (if O2 = FALSE).
  - manual_ONCE_path_nlp: File path to the ONCE NLP dictionary (if O2 = FALSE).
  - types: Character vector of clinical domains to evaluate (e.g., "Diagnosis", "Lab").
  - type_dict: A named list mapping domains to one or more feature prefixes.
- Output: A list of plot objects for each domain, including rate and count line plots and total patient bar plots for the most related features.
Below are example outputs from each module of the QC pipeline, generated using data from a single institution with multiple sclerosis as the target feature. Each function visualizes trends in codified and/or NLP-derived clinical features and supports quality control through interpretable plots.
# Generate module 2 results
results_module2 <- plot_target_prevalence(data_inputs, target_code, target_cui, sample_labels)
# Display plots
print(results_module2$nlp_plot)
print(results_module2$codified_plot)
NLP Plot:
Codified Plot:
# Generate module 3 results
results_module3 <- analyze_code_hierarchy(data_inputs, dictionary, sample_labels,
                                          phecodes = c("PheCode:411"),
                                          custom_children = NULL)
# Display plots
for (res in results_module3) {
  print(res$rate_plot)
  print(res$combined_plot)
}
# Generate module 4 results
results_module4 <- code_cui_alignment(data_inputs, target_code, target_cui, dictionary, sample_labels)
# Print plots
print(results_module4$rates_plot)
# Generate module 5 results
results_module5 <- plot_related_features(data_inputs, target_code, target_cui, sample_labels, O2,
                                         manual_ONCE_path_code = manual_ONCE_path_code,
                                         manual_ONCE_path_nlp = manual_ONCE_path_nlp,
                                         types = c("Diagnosis"),
                                         type_dict = list("Diagnosis" = "PheCode"))
# Print plots
for (res in results_module5) {
  if (!is.null(res$line_plot)) print(res$line_plot)
  if (!is.null(res$combined_plot)) print(res$combined_plot)
}
- Description: Aggregates data from an SQLite database into a processed format.
- Parameters:
  - con: Connection to the source SQLite database.
  - output_sqlite_path: Path to save the aggregated intermediary SQLite database.
  - time_column (optional): Column name indicating time data for aggregation by year.
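The difference between a total count and a patient count can be shown with a small base-R sketch. It only illustrates the two quantities; the package's actual aggregation logic is not shown in this document. The input columns (`Year`, `Patient`, `Parent_Code`, `Count`) follow the simulated data used in the examples of this README, and the output column names `Total_Count` and `Patient_Count` follow the names used by the plotting function.

```r
# Toy EHR records: one row per patient, code, and year
df_ehr <- data.frame(
  Year        = c(2019, 2019, 2020, 2020, 2020),
  Patient     = c(1, 1, 1, 2, 3),
  Parent_Code = c("RXNORM:123", "RXNORM:123", "RXNORM:123",
                  "RXNORM:123", "RXNORM:456"),
  Count       = c(2, 1, 4, 3, 5),
  stringsAsFactors = FALSE
)

# Total count: sum of all occurrences of each code
total_counts <- aggregate(Count ~ Parent_Code, data = df_ehr, FUN = sum)
names(total_counts)[2] <- "Total_Count"

# Patient count: number of distinct patients with each code
patient_counts <- aggregate(Patient ~ Parent_Code, data = df_ehr,
                            FUN = function(p) length(unique(p)))
names(patient_counts)[2] <- "Patient_Count"

print(total_counts)
print(patient_counts)
```

Adding `Year` to the right-hand side of either formula (for example `Count ~ Parent_Code + Year`) gives the per-year breakdown that a time column makes possible.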
- Description: Extracts and processes data from the SQLite database for visualization.
- Parameters:
  - sqlite_file: Path to the intermediary SQLite file.
  - prefix: Medical code prefix (e.g., "RXNORM:").
  - top_n: Number of top occurrences to select (default 20).
  - additional_vars (optional): Additional variables to include.
  - wanted_items_df (optional): Specific items to extract.
  - manual_replacement_bank (optional): List for renaming codes.
  - dict_prefix: Dictionary prefix for mapping.
  - dictionary_mapping: Dictionary used for mapping codes.
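How prefix, top_n, and dictionary_mapping might interact is sketched below in base R. This is an illustration under assumptions, not the function's real internals. The dictionary column names `Group_Code` and `Group_Description` come from the example dictionary in this README. The toy counts and the decision to join on `Group_Code` are assumptions.

```r
# Toy patient counts across two coding systems
patient_counts <- data.frame(
  Parent_Code   = c("RXNORM:123", "RXNORM:456", "RXNORM:789", "LOINC:111"),
  Patient_Count = c(40, 75, 10, 99),
  stringsAsFactors = FALSE
)
dictionary_mapping <- data.frame(
  Group_Code        = c("RXNORM:123", "RXNORM:456"),
  Group_Description = c("Pain Relievers", "NSAIDs"),
  stringsAsFactors = FALSE
)
prefix <- "RXNORM:"
top_n  <- 2

# Keep only codes from the requested coding system
sel <- patient_counts[startsWith(patient_counts$Parent_Code, prefix), ]

# Take the top_n codes by patient count
sel <- head(sel[order(-sel$Patient_Count), ], top_n)

# Left join so codes without a dictionary entry are kept (description is NA)
top_codes <- merge(sel, dictionary_mapping,
                   by.x = "Parent_Code", by.y = "Group_Code", all.x = TRUE)
top_codes <- top_codes[order(-top_codes$Patient_Count), ]
rownames(top_codes) <- NULL
print(top_codes)
```

A left join is used so that a frequent code with no dictionary entry still shows up in the result instead of silently disappearing.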
- Description: Creates histograms for patient and total counts.
- Parameters:
  - data: List containing Total_Counts and Patient_Counts.
  - count_column (optional): Specify "Total_Count" or "Patient_Count".
  - prefix: Medical code prefix.
  - description_label: Dataset description label.
  - output_path: Path for saving plots.
  - save_plots: Logical flag to save plots (default FALSE).
  - log_scale: Logical flag for logarithmic scale on y-axis (default FALSE).
- Description: Extracts yearly patient counts for specified codes.
- Parameters:
  - sqlite_file: Path to intermediary SQLite database.
  - codes_of_interest: Codes to extract.
  - dictionary_mapping: Dictionary for code descriptions.
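The result can be pictured as a list with one table per code of interest plus a combined table, as sketched below in base R. The data are invented and the column name `Patient_Count` is an assumption. The real function reads from the intermediary SQLite file and attaches dictionary descriptions, which this sketch leaves out.

```r
# Toy records: which patient had which code in which year
df_ehr <- data.frame(
  Year        = c(2019, 2019, 2020, 2020, 2020, 2020),
  Patient     = c(1, 1, 1, 2, 3, 3),
  Parent_Code = c("RXNORM:123", "RXNORM:123", "RXNORM:123",
                  "RXNORM:123", "RXNORM:456", "RXNORM:789"),
  stringsAsFactors = FALSE
)
codes_of_interest <- c("RXNORM:123", "RXNORM:456")

# Restrict to the requested codes
sub <- df_ehr[df_ehr$Parent_Code %in% codes_of_interest, ]

# Distinct patients per code and year
combined <- aggregate(Patient ~ Year + Parent_Code, data = sub,
                      FUN = function(p) length(unique(p)))
names(combined)[3] <- "Patient_Count"

# One table per code, plus the combined table
result <- split(combined, combined$Parent_Code)
result$combined <- combined

print(names(result))
print(result$combined)
```

Counting distinct patients, not summing rows, keeps a patient with several records in the same year from being counted more than once.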
- Description: Line plots for patient counts over time.
- Parameters:
  - data: Data from extract_patient_counts_over_years.
  - title: Plot title.
  - year_range (optional): Year range for x-axis.
  - output_path: Directory for saving plot.
  - save_plots: Logical flag to save plot.
  - auto_breaks: Logical flag for non-uniform y-axis breaks.
  - log_scale: Logical flag for logarithmic y-axis scale.
library(RSQLite)
# Dictionary mapping
dictionary_mapping <- data.frame(
  Group_Code = c("RXNORM:123", "RXNORM:456"),
  Common_Ontology_Code = c("RXNORM:001", "RXNORM:002"),
  Common_Ontology_Description = c("Acetaminophen", "Ibuprofen"),
  Group_Description = c("Pain Relievers", "NSAIDs")
)
# Simulated EHR data
df_ehr <- data.frame(
  Year = sample(1995:2020, 1000, replace = TRUE),
  Patient = sample(1:500, 1000, replace = TRUE),
  Parent_Code = sample(c("RXNORM:123", "RXNORM:456", "RXNORM:789"), 1000, replace = TRUE),
  Count = sample(1:10, 1000, replace = TRUE)
)
# Creating temporary SQLite database
test_db_path <- tempfile()
test_db <- dbConnect(SQLite(), test_db_path)
dbWriteTable(test_db, 'df_monthly', df_ehr, overwrite = TRUE)
# Generate the intermediary database without a time column
intermediary_test_db_path <- tempfile()
generate_intermediary_sqlite(test_db, output_sqlite_path = intermediary_test_db_path)
# Generate the intermediary database aggregated by year
intermediary_test_db_path <- tempfile()
generate_intermediary_sqlite(test_db, output_sqlite_path = intermediary_test_db_path, time_column = "Year")
- Datafile contains two tables: Total_Counts and Patient_Counts
data_null <- extract_data_for_visualization(
  sqlite_file = intermediary_test_db_path,
  prefix = "RXNORM:",
  top_n = 20,
  dictionary_mapping = dictionary_mapping
)
head(data_null$Patient_Counts, 3)
- Datafile contains n + 1 tables, where n is the number of codes of interest: one table for each code (here n = 3) and one combined table. All counts are patient counts.
data_test <- extract_patient_counts_over_years(
  sqlite_file = intermediary_test_db_path,
  codes_of_interest = c("RXNORM:123", "RXNORM:456", "RXNORM:789"),
  dictionary_mapping = dictionary_mapping
)
head(data_test$`RXNORM:123`, 3)
head(data_test$combined, 6)
Here we use a real rheumatoid arthritis (RA) dataset of monthly counts covering 6,131 patients. We also load a comprehensive mapping dictionary that supplies the Common Ontology Description and Group Description for each code.
data_rxnorm <- extract_data_for_visualization(
  sqlite_file = "./RA_intermediary.sqlite",
  prefix = "RXNORM:",
  top_n = 20,
  dictionary_mapping = dictionary_mapping
)
plot_visualized_data(
  data = data_rxnorm,
  count_column = "Patient_Count",
  prefix = "RXNORM:",
  description_label = "RXNORM Medications"
)
additional_vars <- c("Erythrocyte sedimentation rate", "C reactive protein",
                     "Rheumatoid factor")
data_add <- extract_data_for_visualization(
  sqlite_file = "./RA_intermediary.sqlite",
  prefix = "LOINC:",
  dictionary_mapping = dictionary_mapping,
  additional_vars = additional_vars
)
plot_visualized_data(
  data = data_add,
  count_column = "Patient_Count",
  prefix = "LOINC:",
  description_label = "Lab Tests"
)
wanted_items <- data.frame(
  Type = rep("MED", 26),
  Category = c(rep("Non-biologics DMARDs", 10),
               rep("Biologics DMARDs", 10),
               rep("Targeted synthetic DMARDs", 3),
               rep("Glucocorticoids", 3)),
  Name = c('Azathioprine', 'Cyclophosphamide', 'Hydroxychloroquine',
           'Leflunomide', 'Methotrexate', 'Sulfasalazine', 'Gold',
           'Penicillamine', 'Chloroquine', 'Cyclosporine',
           'Adalimumab', 'Certolizumab pegol', 'Etanercept',
           'Golimumab', 'Infliximab', 'Abatacept', 'Anakinra',
           'Rituximab', 'Sarilumab', 'Tocilizumab',
           'Baricitinib', 'Tofacitinib', 'Upadacitinib',
           'Methylprednisolone', 'Prednisone', 'Dexamethasone')
)
data_want <- extract_data_for_visualization(
  sqlite_file = "./RA_intermediary.sqlite",
  prefix = "RXNORM:",
  dictionary_mapping = dictionary_mapping,
  wanted_items_df = wanted_items
)
plot_visualized_data(
  data = data_want,
  prefix = "RXNORM:",
  description_label = "RA Medications"
)
Here we again use the real rheumatoid arthritis (RA) dataset of monthly counts together with a comprehensive dictionary.
output_data <- extract_patient_counts_over_years(
  sqlite_file = "./RA_intermediary_time.sqlite",
  codes_of_interest = c("PheCode:714.1", "RXNORM:5487", "RXNORM:6851"),
  dictionary_mapping = dictionary_mapping
)
# Plot the combined table returned by the extraction step above
plot_patient_counts_over_time(
  data = output_data$combined,
  title = "RA Trends (2000–2020)",
  year_range = c(2000, 2020),
  auto_breaks = FALSE,
  log_scale = TRUE,
  save_plots = FALSE,
  output_path = tempdir()
)
- R (>= 3.5.0)
- Essential R libraries: data.table, dplyr, ggplot2, Matrix, RSQLite, and more.
License: GPL-3
The Shiny app requires a dictionary mapping file to function properly. Due to its large size (>200MB), this file is not included in the repository.
- Download the dictionary mapping file (dictionary_mapping_v3.4.tsv) from one of these sources:
  - O2 Server: /n/data1/hsph/biostat/celehs/lab/datasets/dictionaries/dictionary_mapping_v3.4.tsv
  - Dropbox: Download dictionary_mapping_v3.4.tsv.gz
  Note: If downloading from Dropbox, you'll need to decompress the .gz file after downloading.
- Place the file in the shiny-server directory of your cloned repository:
      SummaryStats/
      └── shiny-server/
          ├── app.R
          └── dictionary_mapping_v3.4.tsv  # Place the downloaded file here
- This location is required for the Docker setup to work correctly; the app will not function without the dictionary file in this exact location.
Once you have placed the dictionary file in the correct location, you can run the app using Docker:
docker compose up --build shiny
The app will be available at http://localhost:3838
Note: The Docker setup will only work if the dictionary file is present in the correct location (shiny-server/dictionary_mapping_v3.4.tsv).












