Added a SILVA parseFunction. #854

grabear · 2017-12-04T18:45:28Z

Added a function to parse SILVA-formatted taxonomy strings in BIOM-formatted files.

Overview

Added parse_silva_taxonomy function
- Added comments and docstrings
- Formats unassigned taxa
- Formats ambiguous taxa
- Returns parsed SILVA taxonomy vector
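The steps in the list above can be illustrated with a minimal base-R sketch. This is not the function added in this PR: the name `sketch_parse_silva`, the rank names, and the choice to map empty, "Unassigned", and "Ambiguous_taxa" entries to `NA` are all assumptions made for illustration.

```r
# Hypothetical sketch, not the PR's implementation.
sketch_parse_silva <- function(char.vec) {
  # Rank names assumed for the 7-level files (D_0 .. D_6).
  ranks <- c("Kingdom", "Phylum", "Class", "Order", "Family", "Genus", "Species")
  # Accept either one ";"-joined string or one element per rank.
  parts <- unlist(strsplit(char.vec, ";", fixed = TRUE))
  # Remove the "D_<n>__" rank prefix and surrounding whitespace.
  values <- trimws(sub("^D_[0-9]+__", "", trimws(parts)))
  # Assumption: empty, unassigned, and ambiguous entries become missing values.
  values[values %in% c("", "Unassigned", "Ambiguous_taxa")] <- NA_character_
  # Pad with NA or truncate to seven ranks, then name the result.
  values <- values[seq_along(ranks)]
  names(values) <- ranks
  values
}

sketch_parse_silva("D_0__Archaea;D_1__Euryarchaeota;Ambiguous_taxa")
```

Returning a named character vector mirrors what the overview calls the "parsed SILVA taxonomy vector"; how the real function labels unassigned and ambiguous taxa may differ.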

The following information can be found in the SILVA-qiime notes file.

Consensus and Majority Taxonomies

Reason for these alternative taxonomy string files:

A user of the Silva119 data pointed out that the taxonomy with the SILVA119 release is based only upon the taxonomy string of the representative sequence for the cluster of reads, which could lead to incorrect confidence in taxonomy assignments at the fine level (genus/species). To address this, I have endeavored to create taxonomy strings that are either consensus (all taxa strings must match for every read that fell into the cluster) or majority (greater than or equal to 90% of the taxonomy strings for a given cluster). If a taxonomy string fails to be consensus or majority, then it becomes ambiguous, moving up the levels of taxonomy until consensus/majority taxonomy strings are met.

For example, if a cluster had two reads, and one taxonomy string was:

D_0__Archaea;D_1__Euryarchaeota;D_2__Methanobacteria;D_3__Methanobacteriales;D_4__Methanobacteriaceae;D_5__Methanobrevibacter;D_6__Methanobrevibacter sp. HW3

and the second taxonomy string was:

D_0__Archaea;D_1__Euryarchaeota;D_2__Methanobacteria;D_3__Methanobacteriales;D_4__Methanobacteriaceae;D_5__Methanobrevibacter;D_6__Methanobrevibacter smithii

Then for either consensus or majority strings, the seventh level (D_6; D_0 is the first level, the domain) would become ambiguous, as the species-level entries do not match. The taxonomy string in the representative sequence taxonomy mapping file becomes:

D_0__Archaea;D_1__Euryarchaeota;D_2__Methanobacteria;D_3__Methanobacteriales;D_4__Methanobacteriaceae;D_5__Methanobrevibacter;Ambiguous_taxa
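The consensus rule in this example can be sketched in base R as follows. This is only an illustration, not the create_consensus_taxonomy.py script from the SILVA notes: the function name `sketch_consensus_taxonomy` is hypothetical, and it assumes that every string in a cluster has the same number of levels and that all levels below the first mismatch are also marked ambiguous.

```r
# Hypothetical sketch of the consensus rule (exact string match per level).
sketch_consensus_taxonomy <- function(tax_strings) {
  # One row per read, one column per taxonomic level (equal depth assumed).
  levels_mat <- do.call(rbind, strsplit(tax_strings, ";", fixed = TRUE))
  # A level agrees only if every read carries the same entry.
  agree <- apply(levels_mat, 2, function(col) length(unique(col)) == 1)
  # Keep a level only while it and all higher levels agree.
  keep <- cumsum(!agree) == 0
  consensus <- levels_mat[1, ]
  consensus[!keep] <- "Ambiguous_taxa"
  paste(consensus, collapse = ";")
}

reads <- c(
  "D_0__Archaea;D_1__Euryarchaeota;D_2__Methanobacteria;D_3__Methanobacteriales;D_4__Methanobacteriaceae;D_5__Methanobrevibacter;D_6__Methanobrevibacter sp. HW3",
  "D_0__Archaea;D_1__Euryarchaeota;D_2__Methanobacteria;D_3__Methanobacteriales;D_4__Methanobacteriaceae;D_5__Methanobrevibacter;D_6__Methanobrevibacter smithii"
)
sketch_consensus_taxonomy(reads)
```

A majority variant would replace the `length(unique(col)) == 1` test with a check that the most common entry covers at least 90% of the reads.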

Because the taxonomy strings are not perfectly matched in terms of names/depths across all of the SILVA data, this can lead to some taxonomies being more ambiguous with my approach (exact string matches) than they actually are, particularly for the eukaryotes. There are over 1.5 million taxonomy strings in the non-redundant SILVA 119 release (even more in later releases), so I can't fault the maintainers of SILVA for these taxonomy strings being imperfect from a parsing/bioinformatics perspective.

The scripts used to create the consensus and 90% majority taxonomy strings, create_consensus_taxonomy.py and create_majority_taxonomy.py, are located here (the OTU mapping files used with these scripts were generated during the "creation of representative sequence files" section):
https://gist.github.com/walterst/bd69a19e75748f79efeb
https://gist.github.com/walterst/f6f08f6583bb320bb10d

grabear · 2017-12-04T18:48:32Z

#162

joey711 · 2017-12-04T19:10:30Z

Thanks, this sounds good. Before I review, do you have a test file/function for testing the parsing behavior? Better to do that now, while it is fresh, than later after there is a problem.

Why was the pycharm .gitignore change required? Hopefully there are no external calls to python in this PR...

You refer to python scripts in that gist to generate an intermediate version of the silva output. However, we won't want to ask phyloseq users to go find and execute this script, which means the input to parse_taxonomy_silva() should be the unadulterated silva taxonomy output files rather than some intermediate. In general, the phyloseq package should not have any implicit dependencies to external scripts, including parsing tools. All this means is that the logic to arrive at those files should be encoded in R within the parse_taxonomy_silva function, or if that is too restrictive for the transformation needed here, then encoded in a wrapping import_taxonomy_silva() function where the logic to generate the representation of the intermediate table is executed prior to parse_taxonomy_silva.

Make sense? Let me know if you have questions. Thanks for the PR. I agree this is a useful feature to add.

grabear · 2017-12-04T20:18:06Z

> Thanks, this sounds good. Before I review, do you have a test file/function for testing the parsing behavior? Better to do that now, while it is fresh, than later after there is a problem.

I don't yet, but that was my next step pending your thoughts on this PR.

> Why was the pycharm .gitignore change required? Hopefully there are no external calls to python in this PR...

Ahh sorry. I didn't think about those implications. I use PyCharm along with the R language plugin for the version control (I'm not used to RStudio's VCS setup). However, those commits can be removed if necessary.

> You refer to python scripts in that gist to generate an intermediate version of the silva output. However, we won't want to ask phyloseq users to go find and execute this script, which means the input to parse_taxonomy_silva() should be the unadulterated silva taxonomy output files rather than some intermediate. In general, the phyloseq package should not have any implicit dependencies to external scripts, including parsing tools. All this means is that the logic to arrive at those files should be encoded in R within the parse_taxonomy_silva function, or if that is too restrictive for the transformation needed here, then encoded in a wrapping import_taxonomy_silva() function where the logic to generate the representation of the intermediate table is executed prior to parse_taxonomy_silva.

There are NO external calls to Python.

It will not be necessary for anyone to call these scripts. The SILVA database system has already generated these files for the user. I will however explain this in more detail tomorrow, when I get the chance.

Thanks for your input @joey711. Glad to help contribute to this package. For my current project I'm using Nephele for data generation, and phyloseq/ampvis2/ggtree for data visualization.

I've been using phyloseq to format the data, and then I use ampvis2 for data visualization with ggplot2 and ggtree.

grabear · 2017-12-05T17:33:42Z

The parser I made is for the SILVA128 QIIME release. The standard release is a bit different.
The silva_databases/release_128/Exports/taxonomy/ files look like this:

Archaea; 2 domain
Archaea;Aenigmarchaeota; 11084 phylum 123
Archaea;Aenigmarchaeota;Aenigmarchaeota Incertae Sedis; 11085 class 123
Archaea;Aenigmarchaeota;Aenigmarchaeota Incertae Sedis;Unknown Order; 11086 order 123
Archaea;Aenigmarchaeota;Aenigmarchaeota Incertae Sedis;Unknown Order;Unknown Family; 11087 family 123
Archaea;Aenigmarchaeota;Aenigmarchaeota Incertae Sedis;Unknown Order;Unknown Family;Candidatus Aenigmarchaeum; 11088 genus 123
Archaea;Aenigmarchaeota;Deep Sea Euryarchaeotic Group(DSEG); 11089 class a 123
Archaea;Aigarchaeota; 11090 phylum 123
Archaea;Aigarchaeota;Aigarchaeota Incertae Sedis; 11091 class 123
Archaea;Aigarchaeota;Aigarchaeota Incertae Sedis;Unknown Order; 11092 order 123
Archaea;Aigarchaeota;Aigarchaeota Incertae Sedis;Unknown Order;Unknown Family; 11093 family 123
Archaea;Aigarchaeota;Aigarchaeota Incertae Sedis;Unknown Order;Unknown Family;Candidatus Caldiarchaeum; 11094 genus 123

And I'm not sure what parses that at the moment.
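For reference, lines in the standard export format shown above could be split with a short base-R sketch like the one below. The function name `sketch_parse_silva_export` is hypothetical, and the sketch assumes each line is a taxonomy path ending in ";" followed by whitespace-separated taxid and rank fields, as in the excerpt; any trailing fields are ignored.

```r
# Hypothetical sketch for the standard (non-QIIME) SILVA export lines.
sketch_parse_silva_export <- function(lines) {
  # Groups: taxonomy path (without the final ";"), numeric taxid, rank name.
  pattern <- "^(.*);[[:space:]]+([0-9]+)[[:space:]]+([a-z_]+)"
  fields <- regmatches(lines, regexec(pattern, lines))
  # Drop lines that do not match (full match plus three groups expected).
  fields <- fields[lengths(fields) == 4]
  data.frame(
    path  = vapply(fields, `[`, character(1), 2),
    taxid = as.integer(vapply(fields, `[`, character(1), 3)),
    rank  = vapply(fields, `[`, character(1), 4),
    stringsAsFactors = FALSE
  )
}

sketch_parse_silva_export(c(
  "Archaea; 2 domain",
  "Archaea;Aenigmarchaeota; 11084 phylum 123"
))
```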

The SILVA_128_QIIME_release/taxonomy directories are set up like this:

taxonomy_all
    99
        taxonomy_all_levels.txt
        taxonomy_7_levels.txt

The taxonomy_all, 16S_only, and 18S_only directories each contain the subdirectories 99, 97, 94, and 90, one per sequence cluster similarity percentage. Each subdirectory contains a group of .txt files holding the full, majority, and consensus taxonomies at either 7 or 15 levels of taxonomic rank.

Regardless, they are all formatted the same...

(for all_levels)

KF494428.1.1396 D_0__Bacteria;D_1__Proteobacteria;D_2__Epsilonproteobacteria;D_3__Campylobacterales;D_4__Helicobacteraceae;D_5__Sulfuricurvum;D_6__Sulfuricurvum sp. EW1;D_7__;D_8__;D_9__;D_10__;D_11__;D_12__;D_13__;D_14__

(for 7 levels)

KF494428.1.1396 D_0__Bacteria;D_1__Proteobacteria;D_2__Epsilonproteobacteria;D_3__Campylobacterales;D_4__Helicobacteraceae;D_5__Sulfuricurvum;D_6__Sulfuricurvum sp. EW1
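Since the two layouts differ only in the empty trailing levels, a single reader can handle both. The sketch below is an assumption-laden illustration: the name `sketch_read_silva_qiime_line` is hypothetical, and it assumes the sequence ID is separated from the taxonomy string by whitespace, as in the lines above.

```r
# Hypothetical sketch: split one mapping line into its ID and taxonomy levels.
sketch_read_silva_qiime_line <- function(line) {
  # Everything before the first whitespace is the sequence ID.
  id <- sub("[[:space:]].*$", "", line)
  # The remainder is the taxonomy string, which may itself contain spaces.
  tax <- sub("^[^[:space:]]+[[:space:]]+", "", line)
  levels <- strsplit(tax, ";", fixed = TRUE)[[1]]
  # Drop levels that consist only of a prefix, e.g. "D_7__".
  levels <- levels[!grepl("^D_[0-9]+__$", levels)]
  list(id = id, levels = levels)
}

sketch_read_silva_qiime_line(
  "KF494428.1.1396 D_0__Bacteria;D_1__Proteobacteria;D_2__Epsilonproteobacteria;D_3__Campylobacterales;D_4__Helicobacteraceae;D_5__Sulfuricurvum;D_6__Sulfuricurvum sp. EW1;D_7__;D_8__"
)
```

With the empty levels removed, the all_levels and 7-level lines for this sequence yield the same seven entries.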

grabear · 2017-12-05T17:38:38Z

@joey711 So in that spirit, I'll make a function called parse_silva-qiime-release_taxonomy.

Or would you rather I do something different here?

I can set up a top level function called

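# Top-level dispatcher: returns the parser matching the requested release.
parse_silva128_taxonomy <- function(release = "qiime_7levels") {
  if (release == "qiime_7levels") {
    return(parse_sqiime7_taxonomy)
  } else if (release == "all") {
    return(parse_sqiimeall_taxonomy)
  }
  # Fail loudly instead of silently returning NULL.
  stop("Unknown release: ", release)
}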

joey711 · 2017-12-05T18:53:22Z

Is that output specific to QIIME, or is it a format put out by SILVA that other software can also use? If the latter, the qiime mention is unnecessary.

Along the lines of names, you have both qiime and silva releases to track. I suggest the top-level wrapper function name be something that will persist even as new releases appear.

parse_taxonomy_silva = function(File, silva="128", qiime="7", ...){
  # dispatch
}

Better yet, those release versions can be read from the files, so a normal user does not need to specify them directly, either as part of the function name or as a parameter argument.

For development, an internal function/method that is specific to a version is still useful (but not seen by most users), and I prefer a name convention that is most general to most specific, even if that doesn't fit the normal grammatical usage, hence:

parse_taxonomy_qiime_7_silva_128(...)
# or, if qiime not relevant
parse_taxonomy_silva_128(...)

Again, this version-soup should mostly be shielded from normal users if at all possible, and if so, these would not be exported.
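One way to picture the dispatch idea is the base-R sketch below, which picks a parser from the data instead of from a user-supplied argument. Everything in it is hypothetical: the helper names, the `parsers` list, and the choice to infer the format from the number of ";"-separated levels (the thread does not say how release versions would actually be read from the files).

```r
# Hypothetical helper: count the ";"-separated entries in one taxonomy string.
sketch_count_silva_levels <- function(tax_string) {
  length(strsplit(tax_string, ";", fixed = TRUE)[[1]])
}

# Hypothetical dispatcher: 'parsers' is a named list keyed by level count.
sketch_dispatch_silva <- function(tax_string, parsers) {
  n_levels <- as.character(sketch_count_silva_levels(tax_string))
  if (!n_levels %in% names(parsers)) {
    stop("No parser registered for ", n_levels, " levels")
  }
  parsers[[n_levels]](tax_string)
}

# Stand-in parsers that only report which branch was taken.
toy_parsers <- list(
  "7"  = function(x) "7-level parser",
  "15" = function(x) "all-level parser"
)
seven <- "D_0__Bacteria;D_1__Proteobacteria;D_2__Epsilonproteobacteria;D_3__Campylobacterales;D_4__Helicobacteraceae;D_5__Sulfuricurvum;D_6__Sulfuricurvum sp. EW1"
sketch_dispatch_silva(seven, toy_parsers)
```

In a package, the version-specific parsers would stay internal and only the dispatcher would be exported, in line with the naming suggestion above.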

grabear · 2018-01-09T22:52:55Z

So I've simply changed the function name here. Everything seems to be in order as far as I can tell. I can give you access to one of my private repositories with data on it for testing this parsing function. @joey711

grabear · 2018-01-09T23:02:32Z

On second thought, I'll work out the testing tomorrow and add it to the package.

grabear · 2018-01-11T23:24:28Z

@joey711 I've added some lines to the test-IO.R file, and I've added a separate test-silva.R file. The build has passed, but what do you think?

grabear · 2018-01-11T23:25:17Z

I also added a .biom file in extdata for test-silva.R. @joey711

grabear · 2018-03-19T03:29:52Z

@joey711 Any thoughts on merging this? I'd also love to contribute more. In particular, I could work on answering issues that I know how to deal with.

joey711 · 2018-04-01T18:18:30Z

Thanks @grabear I'll take a look. Sorry for the delay. It sounds good, and thank you for adding tests and such.

russellj7 · 2018-05-31T20:53:12Z

I'm curious as to when this SILVA taxonomy parse function might be implemented in a new version of Phyloseq? I would be interested in using it for convenience.

grabear · 2018-05-31T22:46:53Z

@russellj7 I know the phyloseq team is busy with various projects, so until then you can use the gist I made for this very purpose. https://gist.github.com/018e86413b19b62a6bb8e72a9adba349

grabear added 4 commits December 4, 2017 10:06

Added Pycharm items to .gitignore.

0dea41d

Initial addition of "parse_taxonomy_silva" function.

de106d8

Added documentation for the SILVA parse function.

bbc5993

Updated function comments.

2c0dcb2

grabear mentioned this pull request Dec 4, 2017

Importing Silva taxonomy data #162

Closed

joey711 added the Feature label Dec 4, 2017

Added the silva db version to the parsing function name.

e0bff71

grabear added 8 commits January 11, 2018 13:48

Initial commit for silva test.

554d64f

Added .biom file with SILVA formatted taxonomy strings.

c7b421c

Added lines for creating a phyloseq object with new SILVA parsing function.

ff481af

Added parse function tests for the new SILVA function.

8c10b1a

Added testing functions to the test file for SILVA strings.

cf8bb30

Merge pull request #1 from grabear/silva_test_branch

4228057

Silva test branch

Fixed bad function call in test for silva.

526c6e4

Merge pull request #2 from grabear/silva_test_branch

b992c4b

Fixed bad function call in test for silva.

grabear mentioned this pull request Mar 19, 2018

Merging OTU table and phylogenetic tree #889

Closed

grabear mentioned this pull request Apr 27, 2018

Phyloseq and metacoder grunwaldlab/metacoder#141

Closed

grabear added 3 commits September 10, 2018 19:54

Corrected comment in parse_taxonomy_silva_128 function.

cd77232

Merge pull request #3 from grabear/silva_test_branch

5ce6fe3

Corrected comment in parse_taxonomy_silva_128 function.

Added "_128" to documentation.

5dbb4d3

Some references to the function "parse_taxonomy_silva_128" did not contain "_128".

This was referenced Sep 28, 2018

tax_glom question for finding genus foun in all samples of a given group #957

Open

Removing Ranks from taxa table #762

Open

Unable to import #758

Open

tip_glom and annotations? #877

Open
