michakinyemi/Immune-Cell-Datamining

Immune Cell Datamining


This pipeline currently supports only scRNA-Seq datasets in the 10X Genomics format.

All pre-processing steps are fully automated; the hands-on analysis steps are in the DataVisualization.Rmd file.

Additional Notes:

  • Supports automatic conversion of ENSEMBL IDs to gene symbols.

  • Filtering metrics can be configured in the DataVisualization.Rmd file.

Instructions


  1. Source all R script files and load packages/config variables through the DataVisualization.Rmd file.

  2. Place 10X Genomics formatted scRNA-Seq data files into the datasets/ directory.

  3. Read the data and store each sample as its own Seurat object:

    • samples <- generateSampleList(dataID)

  4. Perform batch correction with the integrateData() function, or simply merge the samples with mergeData().

    • Not necessary if only one sample is present in the dataset.

    • TODO: Create metrics for determining if batch correction is necessary

  5. Use the runDimReduction() function to perform dimensionality reduction on the processed dataset.

  6. Use the code blocks in the DataVisualization.Rmd file to generate figures and visualize clustering/markers.

    • TODO: Better support image file generation for results


(TODO) Usage instructions for all custom functions created for the pipeline:

Additional Notes

  • The "Immune-Cell-Datamining" folder should always be your current working directory (cwd).

Advice

  • When in doubt, use help(foo) in R to get quick documentation on the function foo.

Pipeline Planning


Implementation/Design


  • When a SeuratObject is generated, most sample-specific information is stored in its @misc slot.

Planned Features


Analysis Techniques:

  • Weighted Gene Co-Expression Network Analysis (WGCNA)

  • Clonality Trees

  • Trajectory Analysis/Pseudotime

  • Copy-Number Variations (CNVs)

Pipeline Improvements:

  • Bring expression-level violin plots more in line with past lab papers
  • Change the symbol conversion function to update the dataset files rather than running during the pipeline
  • Support integration of separate datasets while retaining sample characteristics/identifiers
  • Automatically classify each dataset as a mouse or human model (with manual override) and update gene references if necessary

ATAC-Seq


filter_gtf.py - Generates a subset GTF file containing only the annotations for the genes of interest. This prevents non-target genes from being included in the tracks.
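
As a rough illustration of the idea, and not the repository's actual filter_gtf.py, the sketch below keeps only GTF rows whose gene_name attribute is in a list of genes of interest. It assumes a standard 9-column, tab-separated GTF with gene_name "..." entries in the attribute column; the function name filter_gtf_lines, the choice to match on gene_name rather than gene_id, and the toy records are all hypothetical.

```python
import re

# Matches the gene_name attribute in the ninth GTF column, e.g. gene_name "GENE_A";
GENE_NAME_RE = re.compile(r'gene_name "([^"]+)"')


def filter_gtf_lines(lines, genes_of_interest):
    """Yield header lines plus annotation lines for the wanted genes.

    Hypothetical helper: the real filter_gtf.py may select genes differently.
    """
    wanted = set(genes_of_interest)
    for line in lines:
        if line.startswith("#"):
            yield line  # keep header/comment lines unchanged
            continue
        fields = line.rstrip("\n").split("\t")
        if len(fields) != 9:
            continue  # skip rows that are not valid 9-column GTF records
        match = GENE_NAME_RE.search(fields[8])
        if match and match.group(1) in wanted:
            yield line


# Toy records with made-up gene names and coordinates, for illustration only.
example = [
    "#!toy-header\n",
    'chr1\ttoy\tgene\t100\t900\t.\t+\t.\tgene_id "G1"; gene_name "GENE_A";\n',
    'chr1\ttoy\texon\t100\t300\t.\t+\t.\tgene_id "G1"; gene_name "GENE_A";\n',
    'chr2\ttoy\tgene\t500\t800\t.\t-\t.\tgene_id "G2"; gene_name "GENE_B";\n',
]
subset = list(filter_gtf_lines(example, ["GENE_A"]))
```

Here subset holds the header line and the two GENE_A rows, so a track built from it would show no GENE_B annotations.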

Additional Information


Quick Notes

Different databases are unlikely to store the same information for each entry.

  • We can use marker genes (those that are highly associated with specific cell types) to differentiate between cell types

  • Papers involving sequencing data analysis often require you to report the steps you took along the way (filtering process, etc.)

    • Seurat does a lot of the heavy lifting by storing data transformations
    • We could still make this part of the script's job by saving the most relevant parameters and results (e.g., number of clusters) and generating an associated output file
  • Global Assignment (within functions): variable <<- data
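
The note above about saving the most relevant parameters and results could take a shape like the following. The pipeline itself is written in R, so this Python sketch only illustrates one possible record layout; the function names, field names, and values are all hypothetical placeholders, not recommended thresholds or actual pipeline output.

```python
import json


def build_run_record(dataset_id, parameters, results):
    """Bundle the settings and headline results of one run into a dict."""
    return {
        "dataset_id": dataset_id,
        "parameters": dict(parameters),
        "results": dict(results),
    }


def write_run_record(record, path):
    """Write the record as indented JSON so it is easy to diff and cite."""
    with open(path, "w", encoding="utf-8") as handle:
        json.dump(record, handle, indent=2, sort_keys=True)


# Placeholder values for illustration only.
record = build_run_record(
    dataset_id="example_dataset",
    parameters={"min_features": 200, "max_percent_mito": 10},
    results={"n_clusters": 12},
)
serialized = json.dumps(record, indent=2, sort_keys=True)
```

Calling write_run_record(record, "run_record.json") would save the same content next to the figures, which gives a methods section something concrete to quote.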

Future Plans

  • Sub Cluster Discovery Pipeline

    • Support "zooming in" on clusters we have already generated to investigate them further

Questions:

  • Most documentation/papers suggest using a high percentage of mitochondrial gene expression as a filtering metric. How does this impact the study of genes such as TFAM (Mitochondrial Transcription Factor A)?

Resources
