genefetch is a Python-based tool for fetching, processing, and locally saving profiles from the NCBI database. Given a search term, it fetches the matching profiles, filters them according to configurable rules, extracts metadata and FASTA files, and stores everything in a convenient directory structure.
- Fetch sequences using customizable NCBI search queries.
- Support for parallel and sequential processing.
- Metadata extraction with publication information, assembly and sequencing technology, and geographic information.
- Automated handling of sequence updates and filtering based on custom criteria.
- Post-processing checks for data consistency and integrity.
- Python 3.8 or higher
- NCBI API key (optional, for increased rate limits)
- Clone the repository:

      git clone https://github.com/forensicgenomics/genefetch.git
      cd genefetch

- Install dependencies:

      pip install -r requirements.txt

- Add your email (and optionally your NCBI API key):
  - Save your email in secrets/ncbi_email.txt.
  - Save your NCBI API key in secrets/ncbi_api_key.txt.
- Adjust defaults:
  - Change global variables in fetcher/global_defaults.py if desired.
  - Add txt files of accession numbers to be excluded to exclusions/.
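The credential files described above can be read with a few lines of standard-library Python. The following is a minimal sketch, not the project's actual loader: the function names `read_secret` and `load_ncbi_credentials` are hypothetical, and treating a missing email as an error is an assumption based on the email being listed as required and the API key as optional.

```python
from pathlib import Path
from typing import Optional, Tuple


def read_secret(path: Path) -> Optional[str]:
    """Return the stripped contents of a secret file, or None if it is missing or empty."""
    if not path.is_file():
        return None
    text = path.read_text(encoding="utf-8").strip()
    return text or None


def load_ncbi_credentials(secrets_dir: str = "secrets") -> Tuple[str, Optional[str]]:
    """Load the email (assumed required) and the API key (optional)."""
    base = Path(secrets_dir)
    email = read_secret(base / "ncbi_email.txt")
    if email is None:
        # Assumption: the tool cannot run without an email address.
        raise FileNotFoundError(f"No email found in {base / 'ncbi_email.txt'}")
    api_key = read_secret(base / "ncbi_api_key.txt")  # None when no key was provided
    return email, api_key
```

With Biopython, the two values would typically be assigned to `Bio.Entrez.email` and `Bio.Entrez.api_key` before any query is made.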
You can use genefetch as a command-line tool or integrate it into a larger pipeline.
Run the fetcher with customizable options:
python run_fetcher.py \
--search-term "mitochondrion complete genome AND Homo Sapiens[Organism]" \
--max-num 1000 \
--batch-size 100 \
--fetch-parallel \
--num-workers 20 \
--soft-restart

or as a module:

python -m fetcher.main \
--search-term "mitochondrion complete genome AND Homo Sapiens[Organism]" \
--max-num 1000 \
...

| Argument | Description |
|---|---|
| --search-term | Search term passed to NCBI ESearch to select the profiles to fetch. See the NCBI nuccore documentation for how to adjust it. |
| --max-num | Maximum number of profiles to fetch (at most 10k, as per Entrez). |
| --batch-size | (optional, defaults to 100) Number of profiles to process before saving. |
| --num-workers | (optional, defaults to 16) Number of workers for parallel fetching. |
| --fetch-parallel | (flag) Enable parallel fetching. |
| --soft-restart | (flag) Restart with previously processed profiles. Use this if you are building upon the last run with the same search term and don't want to fetch all profiles at once. If set, the profiles from the previous fetch are not checked for updates! |
| --clean-dir | (flag) Removes profile data from all files (ids_list / removed_ids / metadata / fasta_files) if the profiles are not returned by the database query with the current search parameters. This query uses LIMIT_NUM as the max-num value. Set this flag if, for example, you made a mistake in the search term of the last fetch and don't want to start over or remove the profiles manually. Be careful though, as this can remove a lot of data if your current search term is incorrect! |
| --update-exclusions | (flag) Use this flag if the exclusions directory may have changed since the last fetch. Profiles that are new in the exclusions are removed from (ids_list / metadata / fasta_files) and added to removed_ids. Profiles in removed_ids that are no longer found in the exclusions directory are removed from removed_ids. Be careful though, this currently also removes profiles that were filtered out dynamically by filters. |
| --y | (flag) Automatic 'yes' to all user questions. Use when integrating into other scripts / workflows. |
You cannot use soft-restart in combination with clean-dir or update-exclusions.
Using soft-restart after either of them is also not advised, because the cleaning process does not update the tracker that the soft-restart mode relies on.
In general, use soft-restart only when you are building your database and want to accumulate it in smaller batches, or when testing something.
When simply updating or cleaning your data, run without the soft-restart option.
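The options and the compatibility rule above can be illustrated with `argparse`. This is only a sketch of how such a command line could be declared, not the project's own parser: `build_parser` and `parse_args` are hypothetical names, and leaving `--search-term` and `--max-num` without defaults is an assumption (the real defaults live in the global defaults file).

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Declare the documented options; batch size and worker defaults follow the table."""
    parser = argparse.ArgumentParser(description="Sketch of the genefetch options")
    parser.add_argument("--search-term", type=str)
    parser.add_argument("--max-num", type=int)
    parser.add_argument("--batch-size", type=int, default=100)
    parser.add_argument("--num-workers", type=int, default=16)
    parser.add_argument("--fetch-parallel", action="store_true")
    parser.add_argument("--soft-restart", action="store_true")
    parser.add_argument("--clean-dir", action="store_true")
    parser.add_argument("--update-exclusions", action="store_true")
    parser.add_argument("--y", action="store_true")
    return parser


def parse_args(argv=None) -> argparse.Namespace:
    """Parse arguments and reject the documented incompatible combination."""
    parser = build_parser()
    args = parser.parse_args(argv)
    if args.soft_restart and (args.clean_dir or args.update_exclusions):
        # parser.error prints a usage message and exits with status 2.
        parser.error("--soft-restart cannot be combined with "
                     "--clean-dir or --update-exclusions")
    return args
```

Checking the combination right after parsing means an invalid invocation fails before any network request or file change is made.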
Default values are stored in fetcher/global_defaults.py.
- Fetch Profiles: Fetch profiles using the specified search term.
- Process Profiles:
  - Filter out unneeded profiles.
  - Fetch metadata for each sequence.
  - Save processed sequences and metadata.
- Post-Processing:
  - Remove duplicates.
  - Validate metadata and sequence files.
  - Clean up old logs and intermediate files.
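The processing stage above, combined with the batch-size, num-workers, and fetch-parallel options, can be pictured as a batched loop. The sketch below is an illustration under assumptions, not the project's code: `batched` and `process_profiles` are hypothetical names, and `fetch_one`, `keep`, and `save_batch` are caller-supplied stand-ins for the real fetching, filtering, and saving steps.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Any, Callable, Iterator, List


def batched(ids: List[str], batch_size: int) -> Iterator[List[str]]:
    """Yield consecutive slices of at most batch_size accession numbers."""
    for start in range(0, len(ids), batch_size):
        yield ids[start:start + batch_size]


def process_profiles(
    ids: List[str],
    fetch_one: Callable[[str], Any],           # stand-in: fetch one profile
    keep: Callable[[Any], bool],               # stand-in: filter rule
    save_batch: Callable[[List[Any]], None],   # stand-in: persist a batch
    batch_size: int = 100,
    num_workers: int = 16,
    parallel: bool = False,
) -> int:
    """Fetch, filter, and save profiles batch by batch; return the number processed."""
    pool = ThreadPoolExecutor(max_workers=num_workers) if parallel else None
    processed = 0
    try:
        for batch in batched(ids, batch_size):
            mapper = pool.map if pool is not None else map
            records = list(mapper(fetch_one, batch))  # input order is preserved
            save_batch([rec for rec in records if keep(rec)])
            processed += len(batch)
    finally:
        if pool is not None:
            pool.shutdown()
    return processed
```

Saving after every batch keeps the work done so far on disk if a later request fails, which is the kind of behavior a restart mode can build on.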
genefetch uses a global_defaults.py file to define global settings like directories, rate limits, and search terms.
Update this file to customize your fetcher.
If you have any profiles you want to exclude during fetching, you can add any number of txt files to exclusions/.
Filters for these will be created dynamically.
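As an illustration of how exclusion lists could be turned into a filter, the sketch below collects accession numbers from every txt file in a directory and splits a list of IDs accordingly. The file format is an assumption (one accession number per line, blank lines and lines starting with `#` ignored), and `load_exclusions` and `apply_exclusions` are hypothetical names rather than functions from the project.

```python
from pathlib import Path
from typing import Iterable, List, Set, Tuple


def load_exclusions(exclusions_dir: str = "exclusions") -> Set[str]:
    """Collect accession numbers from all .txt files in the exclusions directory."""
    excluded: Set[str] = set()
    for txt_file in sorted(Path(exclusions_dir).glob("*.txt")):
        for line in txt_file.read_text(encoding="utf-8").splitlines():
            accession = line.strip()
            # Assumption: blank lines and '#' comment lines carry no accession.
            if accession and not accession.startswith("#"):
                excluded.add(accession)
    return excluded


def apply_exclusions(ids: Iterable[str], excluded: Set[str]) -> Tuple[List[str], List[str]]:
    """Split ids into (kept, removed) while preserving the input order."""
    kept: List[str] = []
    removed: List[str] = []
    for accession in ids:
        (removed if accession in excluded else kept).append(accession)
    return kept, removed
```

Returning the removed IDs alongside the kept ones makes it straightforward to record them in a file such as removed_ids.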
genefetch/
│
├── fetcher/
│   ├── __init__.py
│   ├── fetch.py                # Main fetcher script
│   ├── fetch_tools.py          # Helper functions for Entrez queries
│   ├── file_io.py              # File handling utilities
│   ├── metadata_tools.py       # Metadata extraction functions
│   ├── logger_setup.py         # Logger setup
│   ├── global_defaults.py      # Global configuration
│   ├── filter_tools.py         # Dynamic filtering rules
│   └── post_process_check.py   # Post-processing validation
│
├── exclusions/                 # Manual exclusion lists
├── secrets/                    # NCBI API keys and email
│
├── data/
│   ├── logs/                   # Logs for debugging and auditing
│   │   └── debug_info          # Additional files that may be useful for debugging
│   ├── processed_ids/          # Processed ID lists for soft restarting
│   ├── seqs/                   # Sequence FASTA files
│   ├── ids_list.txt            # Fetched profile accession numbers
│   ├── removed_ids.csv         # Filtered-out profiles
│   ├── metadata.csv            # Metadata table of fetched profiles
│   └── last_run_date.txt       # Date of the last run
│
├── tests/                      # Unit tests
├── requirements.txt            # Python dependencies
├── README.md
└── LICENSE
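Given the data/ layout above, a consistency check between the ID list and the stored sequences could look like the following sketch. It is not the project's post_process_check.py: `check_data_dir` is a hypothetical name, and the code assumes that ids_list.txt holds one accession number per line and that each sequence is stored as `<accession>.fasta` inside seqs/.

```python
from pathlib import Path
from typing import Dict, List


def check_data_dir(data_dir: str = "data") -> Dict[str, List[str]]:
    """Report duplicate IDs and mismatches between ids_list.txt and seqs/."""
    base = Path(data_dir)
    lines = (base / "ids_list.txt").read_text(encoding="utf-8").splitlines()
    ids = [line.strip() for line in lines if line.strip()]

    # Assumption: FASTA files are named '<accession>.fasta'.
    fasta_ids = {path.stem for path in (base / "seqs").glob("*.fasta")}

    seen = set()
    duplicates: List[str] = []
    for accession in ids:
        if accession in seen:
            duplicates.append(accession)
        seen.add(accession)

    return {
        "duplicates": duplicates,                      # listed more than once
        "missing_fasta": sorted(seen - fasta_ids),     # listed but no sequence file
        "orphan_fasta": sorted(fasta_ids - seen),      # sequence file but not listed
    }
```

An empty list under every key would indicate that the ID list and the sequence directory agree.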
This project is licensed under the MPL 2.0 License. See the LICENSE file for details.
genefetch leverages the NCBI Entrez API and Biopython for efficient sequence and metadata management.
genefetch was created as part of the mitoTree Project.