Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Added a SILVA parseFunction. #854

Open
wants to merge 16 commits into
base: master
Choose a base branch
from
Open

Added a SILVA parseFunction. #854

wants to merge 16 commits into from

Conversation

grabear
Copy link

@grabear grabear commented Dec 4, 2017

Added a function to parse SILVA formatted taxonomy strings in the BIOM formatted files.

Overview

  • Added parse_silva_taxonomy function
    • Added comments and docstrings
    • Formats unassigned taxa
    • Formats ambiguous taxa
    • Returns parsed SILVA taxonomy vector

The following information can be found in the SILVA-qiime notes file.

Consensus and Majority Taxonomies

Reason for these alternative taxonomy string files:

A user of the Silva119 data pointed out that the taxonomy with the SILVA119 release is based only upon the taxonomy string of the representative sequence for the cluster of reads, which could lead to incorrect confidence in taxonomy assignments at the fine level (genus/species). To address this, I have endeavored to create taxonomy strings that are either consensus (all taxa strings must match for every read that fell into the cluster) or majority (greater than or equal to 90% of the taxonomy strings for a given cluster). If a taxonomy string fails to be consensus or majority, then it becomes ambiguous, moving up the levels of taxonomy until consensus/majority taxonomy strings are met.

For example, if a cluster had two reads, and one taxonomy string was:

D_0__Archaea;D_1__Euryarchaeota;D_2__Methanobacteria;D_3__Methanobacteriales;D_4__Methanobacteriaceae;D_5__Methanobrevibacter;D_6__Methanobrevibacter sp. HW3

and the second taxonomy string was:

D_0__Archaea;D_1__Euryarchaeota;D_2__Methanobacteria;D_3__Methanobacteriales;D_4__Methanobacteriaceae;D_5__Methanobrevibacter;D_6__Methanobrevibacter smithii

Then for either consensus or majority strings, the level 7 (0 is the first level, the domain) data would become ambiguous, as the species levels do not match. The above string for the representative sequence taxonomy mapping file becomes:

D_0__Archaea;D_1__Euryarchaeota;D_2__Methanobacteria;D_3__Methanobacteriales;D_4__Methanobacteriaceae;D_5__Methanobrevibacter;Ambiguous_taxa

Because the taxonomy strings are not perfectly matched in terms of names/depths across all of the SILVA data, this can lead to some taxonomies being more ambiguous with my approach (exact string matches) than they actually are, particularly for the eukaryotes. There are over 1.5 million taxonomy strings in the non-redundant SILVA 119 release (even more in later releases), so I can’t fault the maintainers of SILVA for these taxonomy strings being imperfect from a parsing/bioinformatics perspective.

The scripts used to create the consensus and 90% majority taxonomy strings, create_consensus_taxonomy.py and create_majority_taxonomy.py, are located here (the OTU mapping files used with these scripts were generated during the “creation of representative sequence files� section):
https://gist.github.com/walterst/bd69a19e75748f79efeb
https://gist.github.com/walterst/f6f08f6583bb320bb10d

@grabear
Copy link
Author

grabear commented Dec 4, 2017

#162

@joey711
Copy link
Owner

joey711 commented Dec 4, 2017

Thanks, this sounds good. Before I review, do you have test file/function for testing the parsing behavior? Better to do that now while it is fresh, than later after there is a problem.

Why was the pycharm .gitignore change required? Hopefully there are no external calls to python in this PR...

You refer to python scripts in that gist to generate an intermediate version of the silva output. However, we won't want to ask phyloseq users to go find and execute this script, which means the input to parse_taxonomy_silva() should be the unadulterated silva taxonomy output files rather than some intermediate. In general, the phyloseq package should not have any implicit dependencies to external scripts, including parsing tools. All this means is that the logic to arrive at those files should be encoded in R within the parse_taxonomy_silva function, or if that is too restrictive for the transformation needed here, then encoded in a wrapping import_taxonomy_silva() function where the logic to generate the representation of the intermediate table is executed prior to parse_taxonomy_silva.

Make sense? Let me know if you have questions. Thanks for the PR. I agree this is a useful feature to add.

@joey711 joey711 added the Feature label Dec 4, 2017
@grabear
Copy link
Author

grabear commented Dec 4, 2017

Thanks, this sounds good. Before I review, do you have test file/function for testing the parsing behavior? Better to do that now while it is fresh, than later after there is a problem.

I don't yet, but that was my next step pending your thoughts on this PR.

Why was the pycharm .gitignore change required? Hopefully there are no external calls to python in this PR...

Ahh sorry. I didn't think about those implications. I use PyCharm along with the R language plugin for the version control (I'm not used to R-Studio's VCS setup). However, those commits can be removed if necessary.

You refer to python scripts in that gist to generate an intermediate version of the silva output. However, we won't want to ask phyloseq users to go find and execute this script, which means the input to parse_taxonomy_silva() should be the unadulterated silva taxonomy output files rather than some intermediate. In general, the phyloseq package should not have any implicit dependencies to external scripts, including parsing tools. All this means is that the logic to arrive at those files should be encoded in R within the parse_taxonomy_silva function, or if that is too restrictive for the transformation needed here, then encoded in a wrapping import_taxonomy_silva() function where the logic to generate the representation of the intermediate table is executed prior to parse_taxonomy_silva.

There are NO external calls to Python.

It will not be necessary for anyone to call these scripts. The SILVA database system has already generated these files for the user. I will however explain this in more detail tomorrow, when I get the chance.

Thanks for your input @joey711. Glad to help contribute to this package. For my current project I'm using Nephele for data generation, and phyloseq/ampvis2 /ggtree for data visualization.

I've been using phyloseq to format the data, and then I use ampvis2 for data visualization with ggplot2 and ggtree.

@grabear
Copy link
Author

grabear commented Dec 5, 2017

The parser I made is for the SILVA128 Qiime release. The standard release is a bit different.
The silva_databases/release_128/Exports/taxonomy/ files look like so this:

Archaea; 2 domain
Archaea;Aenigmarchaeota; 11084 phylum 123
Archaea;Aenigmarchaeota;Aenigmarchaeota Incertae Sedis; 11085 class 123
Archaea;Aenigmarchaeota;Aenigmarchaeota Incertae Sedis;Unknown Order; 11086 order 123
Archaea;Aenigmarchaeota;Aenigmarchaeota Incertae Sedis;Unknown Order;Unknown Family; 11087 family 123
Archaea;Aenigmarchaeota;Aenigmarchaeota Incertae Sedis;Unknown Order;Unknown Family;Candidatus Aenigmarchaeum; 11088 genus 123
Archaea;Aenigmarchaeota;Deep Sea Euryarchaeotic Group(DSEG); 11089 class a 123
Archaea;Aigarchaeota; 11090 phylum 123
Archaea;Aigarchaeota;Aigarchaeota Incertae Sedis; 11091 class 123
Archaea;Aigarchaeota;Aigarchaeota Incertae Sedis;Unknown Order; 11092 order 123
Archaea;Aigarchaeota;Aigarchaeota Incertae Sedis;Unknown Order;Unknown Family; 11093 family 123
Archaea;Aigarchaeota;Aigarchaeota Incertae Sedis;Unknown Order;Unknown Family;Candidatus Caldiarchaeum; 11094 genus 123

And I'm not sure what parses that at the moment.

The SILVA_128_QIIME_release/taxonomy directories are set up like this:

  • taxonomy_all
    • 99
      • taxonomy_all_levels.txt
      • taxonomy_7_levels.txt

Taxonomy_all, 162_only, and 18s_only contain the directories 99, 97, 94, and 90 for the various sequence cluster similarity percentages. Each contain a group of .txt files that are based on the full taxonomy, majority taxonomy, and the consensus taxonomy at 7 levels of taxonomic rank or 15 levels of taxonomic rank.

Regardless, they are all formatted the same...

(for all_levels)

KF494428.1.1396 D_0__Bacteria;D_1__Proteobacteria;D_2__Epsilonproteobacteria;D_3__Campylobacterales;D_4__Helicobacteraceae;D_5__Sulfuricurvum;D_6__Sulfuricurvum sp. EW1;D_7__;D_8__;D_9__;D_10__;D_11__;D_12__;D_13__;D_14__

(for 7 levels)

KF494428.1.1396 D_0__Bacteria;D_1__Proteobacteria;D_2__Epsilonproteobacteria;D_3__Campylobacterales;D_4__Helicobacteraceae;D_5__Sulfuricurvum;D_6__Sulfuricurvum sp. EW1

@grabear
Copy link
Author

grabear commented Dec 5, 2017

@joey711 So in that spirit, I'll make a function called parse_silva-qiime-release_taxonomy.

Or would you rather I do something different here?

I can set up a top level function called

parse_silva128_taxonomy(release="qiime_7levels"){
      if(release=="qiime_7levels"){
          return(parse_sqiime7_taxonomy)
    } else if(release=="all"){
          return(parse_sqiimeall_taxonomy)
    }
}

@joey711
Copy link
Owner

joey711 commented Dec 5, 2017

Is that output specific to QIIME? or is it a format put out by SILVA that other software can also use? If so, the qiime mention is unnecessary.

Along the lines of names, you have both qiime and silva releases to track. I suggest the top-level wrapper function name be something that will persist even as new releases appear.

parse_taxonomy_silva = function(File, silva="128", qiime="7", ...){
  # dispatch
}

Better yet, those release versions can be read from the files, and so normal user does not need to specify them directly, either as part of function name nor parameter argument.

For development, an internal function/method that is specific to a version is still useful (but not seen by most users), and I prefer a name convention that is most general to most specific, even if that doesn't fit the normal grammatical usage, hence:

parse_taxonomy_qiime_7_silva_128(...)
# or, if qiime not relevant
parse_taxonomy_silva_128(...)

Again, this version-soup should mostly be shielded from normal users if at all possible, and if so, these would not be exported

@grabear
Copy link
Author

grabear commented Jan 9, 2018

So I've simply changed the function name here. Everything seems to be in order as far as I can tell. I can give you access to one of my private repositories with data on it for testing this parsing function. @joey711

@grabear
Copy link
Author

grabear commented Jan 9, 2018

On second thought, I'll work out the testing tomorrow and add it to the package.

@grabear
Copy link
Author

grabear commented Jan 11, 2018

@joey711 I've added some lines to the test-IO.R file, and I've added a separate test-silva.R file. They build has passed, but what do you think?

@grabear
Copy link
Author

grabear commented Jan 11, 2018

I also added a .biom in extdata file for test-silva.R. @joey711

@grabear
Copy link
Author

grabear commented Mar 19, 2018

@joey711 Any thought on merging this? I'd also love to contribute more. In particular I could work on answering issues that I know how to deal with.

@joey711
Copy link
Owner

joey711 commented Apr 1, 2018

Thanks @grabear I'll take a look. Sorry for the delay. It sounds good, and thank you for adding tests and such.

@russellj7
Copy link

I'm curious as to when this SILVA taxonomy parse function might be implemented in a new version of Phyloseq? I would be interested in using it for convenience.

@grabear
Copy link
Author

grabear commented May 31, 2018

@russellj7 I know the phyloseq team is busy with various projects, so until then you can use the gist I made for this very purpose. https://gist.github.com/018e86413b19b62a6bb8e72a9adba349

Corrected comment in parse_taxonomy_silva_128 function.
Some references to the function "parse_taxonomy_silva_128" did not contain "_128".
pooranibcbb added a commit to pooranibcbb/datavis16s_refactor that referenced this pull request May 23, 2023
kind of testing with this: joey711/phyloseq#854


Former-commit-id: 2859b8cf722aaaf261083bf1f17cfcd20b74d39c
Former-commit-id: 8494fcfd59de161a84d7da197efa27e3fee8ad3b
Former-commit-id: 01edb447063238d4d3bf3de94dd40bfeb832b12e
pooranibcbb added a commit to pooranibcbb/datavis16s_refactor that referenced this pull request May 23, 2023
kind of testing with this: joey711/phyloseq#854


Former-commit-id: 278b3e9df2f32d8435128309ef0c7df8309304cf
Former-commit-id: 491f223576bb2ca41e9ba2692b137d16d0d3f77d
Former-commit-id: de434c8ac7d0d4476515589eb1bab2e75145b2fc
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants