-
Notifications
You must be signed in to change notification settings - Fork 188
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Added a SILVA parseFunction. #854
base: master
Are you sure you want to change the base?
Conversation
Thanks, this sounds good. Before I review, do you have test file/function for testing the parsing behavior? Better to do that now while it is fresh, than later after there is a problem. Why was the pycharm .gitignore change required? Hopefully there are no external calls to python in this PR... You refer to python scripts in that gist to generate an intermediate version of the silva output. However, we won't want to ask phyloseq users to go find and execute this script, which means the input to Make sense? Let me know if you have questions. Thanks for the PR. I agree this is a useful feature to add. |
I don't yet, but that was my next step pending your thoughts on this PR.
Ahh sorry. I didn't think about those implications. I use PyCharm along with the R language plugin for the version control (I'm not used to R-Studio's VCS setup). However, those commits can be removed if necessary.
There are NO external calls to Python. It will not be necessary for anyone to call these scripts. The SILVA database system has already generated these files for the user. I will however explain this in more detail tomorrow, when I get the chance. Thanks for your input @joey711. Glad to help contribute to this package. For my current project I'm using Nephele for data generation, and phyloseq/ampvis2 /ggtree for data visualization. I've been using phyloseq to format the data, and then I use ampvis2 for data visualization with ggplot2 and ggtree. |
The parser I made is for the SILVA128 Qiime release. The standard release is a bit different.
And I'm not sure what parses that at the moment. The SILVA_128_QIIME_release/taxonomy directories are set up like this:
Taxonomy_all, 162_only, and 18s_only contain the directories 99, 97, 94, and 90 for the various sequence cluster similarity percentages. Each contain a group of .txt files that are based on the full taxonomy, majority taxonomy, and the consensus taxonomy at 7 levels of taxonomic rank or 15 levels of taxonomic rank. Regardless, they are all formatted the same... (for all_levels)
(for 7 levels)
|
@joey711 So in that spirit, I'll make a function called Or would you rather I do something different here? I can set up a top level function called parse_silva128_taxonomy(release="qiime_7levels"){
if(release=="qiime_7levels"){
return(parse_sqiime7_taxonomy)
} else if(release=="all"){
return(parse_sqiimeall_taxonomy)
}
} |
Is that output specific to QIIME? or is it a format put out by SILVA that other software can also use? If so, the qiime mention is unnecessary. Along the lines of names, you have both qiime and silva releases to track. I suggest the top-level wrapper function name be something that will persist even as new releases appear.
Better yet, those release versions can be read from the files, and so normal user does not need to specify them directly, either as part of function name nor parameter argument. For development, an internal function/method that is specific to a version is still useful (but not seen by most users), and I prefer a name convention that is most general to most specific, even if that doesn't fit the normal grammatical usage, hence:
Again, this version-soup should mostly be shielded from normal users if at all possible, and if so, these would not be exported |
So I've simply changed the function name here. Everything seems to be in order as far as I can tell. I can give you access to one of my private repositories with data on it for testing this parsing function. @joey711 |
On second thought, I'll work out the testing tomorrow and add it to the package. |
Silva test branch
Fixed bad function call in test for silva.
@joey711 I've added some lines to the test-IO.R file, and I've added a separate test-silva.R file. They build has passed, but what do you think? |
I also added a .biom in extdata file for test-silva.R. @joey711 |
@joey711 Any thought on merging this? I'd also love to contribute more. In particular I could work on answering issues that I know how to deal with. |
Thanks @grabear I'll take a look. Sorry for the delay. It sounds good, and thank you for adding tests and such. |
I'm curious as to when this SILVA taxonomy parse function might be implemented in a new version of Phyloseq? I would be interested in using it for convenience. |
@russellj7 I know the phyloseq team is busy with various projects, so until then you can use the gist I made for this very purpose. https://gist.github.com/018e86413b19b62a6bb8e72a9adba349 |
Corrected comment in parse_taxonomy_silva_128 function.
Some references to the function "parse_taxonomy_silva_128" did not contain "_128".
kind of testing with this: joey711/phyloseq#854 Former-commit-id: 2859b8cf722aaaf261083bf1f17cfcd20b74d39c Former-commit-id: 8494fcfd59de161a84d7da197efa27e3fee8ad3b Former-commit-id: 01edb447063238d4d3bf3de94dd40bfeb832b12e
kind of testing with this: joey711/phyloseq#854 Former-commit-id: 278b3e9df2f32d8435128309ef0c7df8309304cf Former-commit-id: 491f223576bb2ca41e9ba2692b137d16d0d3f77d Former-commit-id: de434c8ac7d0d4476515589eb1bab2e75145b2fc
Added a function to parse SILVA formatted taxonomy strings in the BIOM formatted files.
Overview
The following information can be found in the SILVA-qiime notes file.
Consensus and Majority Taxonomies
Reason for these alternative taxonomy string files:
A user of the Silva119 data pointed out that the taxonomy with the SILVA119 release is based only upon the taxonomy string of the representative sequence for the cluster of reads, which could lead to incorrect confidence in taxonomy assignments at the fine level (genus/species). To address this, I have endeavored to create taxonomy strings that are either consensus (all taxa strings must match for every read that fell into the cluster) or majority (greater than or equal to 90% of the taxonomy strings for a given cluster). If a taxonomy string fails to be consensus or majority, then it becomes ambiguous, moving up the levels of taxonomy until consensus/majority taxonomy strings are met.
For example, if a cluster had two reads, and one taxonomy string was:
and the second taxonomy string was:
Then for either consensus or majority strings, the level 7 (0 is the first level, the domain) data would become ambiguous, as the species levels do not match. The above string for the representative sequence taxonomy mapping file becomes:
Because the taxonomy strings are not perfectly matched in terms of names/depths across all of the SILVA data, this can lead to some taxonomies being more ambiguous with my approach (exact string matches) than they actually are, particularly for the eukaryotes. There are over 1.5 million taxonomy strings in the non-redundant SILVA 119 release (even more in later releases), so I can’t fault the maintainers of SILVA for these taxonomy strings being imperfect from a parsing/bioinformatics perspective.
The scripts used to create the consensus and 90% majority taxonomy strings, create_consensus_taxonomy.py and create_majority_taxonomy.py, are located here (the OTU mapping files used with these scripts were generated during the “creation of representative sequence files� section):
https://gist.github.com/walterst/bd69a19e75748f79efeb
https://gist.github.com/walterst/f6f08f6583bb320bb10d