Some python functions to deal with DNA alignments.
To install the development version from github, clone the repository to a local directory using something like:
git clone https://github.com/sherrillmix/dnapy.git
and run setup.py from the resulting directory (the --user installs it locally and doesn't require root access):
cd dnapy python setup.py install --user python setup.py test
To use the scripts directly from command line, e.g. countbases file.bam, (and to pass all tests above), you will need to make sure that your PATH contains the bin directory installed in by setup.py above. On Linux, this would mean putting something like:
if [ -d "$HOME/.local/bin" ] ; then
PATH="$HOME/.local/bin:$PATH"
fi
in your .bashrc or .profile.
The package provides the scripts:
usage: abitotrace [-h] abiFile A program to take an ABI .ab1 file from Sanger sequencing and output the traces to standard out as four columns. positional arguments: abiFile an ABI .ab1 file containing the sequence read optional arguments: -h, --help show this help message and exit
usage: bamtoalign [-h] -s REFSEQ [-q MINQUALITY] [-v] [-r REGION] [-e ENDSPAN]
bamFile
A program to convert a bam file into an aligned fasta file. The command
generates fasta formatted output (two lines for each sequence: a name line
prepended by > and a line containing the aligned sequence) to standard out.
positional arguments:
bamFile a bam file containing the alignment
optional arguments:
-h, --help show this help message and exit
-s REFSEQ, --refseq REFSEQ
fasta file giving the reference sequence of interest
-q MINQUALITY, --minQuality MINQUALITY
don't count alignments with a mapping quality less
than this
-v, --verbose increase output verbosity to stderr
-r REGION, --region REGION
the region to pull reads from (note that the
underlying pysam does not like single base regions
like ch1:25. These instead be specified as
chr1:25-25.)
-e ENDSPAN, --endSpan ENDSPAN
ignore spans of matches at the start or end of a read
less than this cutoff
usage: countbarcodes [-h] [-s START] [-e END] [-w WHITELIST] [-d DOTS] [-a]
fastqFiles [fastqFiles ...]
A program to take a fastq file and count barcodes (potentially only including
those found in a whitelist). The script takes a standard fastq read file and
an optional barcode whitelist file containing one barcode per line and outputs
a header-less csv with columns barcode, count for each barcode to standard out
positional arguments:
fastqFiles a fastq file(s) (potentially gzipped) containing the
sequence reads
optional arguments:
-h, --help show this help message and exit
-s START, --start START
Start position of barcode in reads (1-based)
-e END, --end END End position of barcode in reads (1-based)
-w WHITELIST, --whitelist WHITELIST
a (potentially gzipped) file containing whitelisted
barcodes one to a line with no header
-d DOTS, --dots DOTS output dot to stderr every X reads. Input a negative
number to suppress output (default:-1)
-a, --all if set output 0 for all barcodes in whitelist not
appearing
usage: countbases [-h] [-v] [-r REGION] [-s] [-q MINQUALITY] bamFile
A program to count the number of bases at each position in a region. The
command generates standard output with columns referenceName, position,
numberOfReads, and numbers of A, C, G, T (or A+, A-, C+, C-, G+, G-, T+, T- if
--strand).
positional arguments:
bamFile a bam file containing the alignment
optional arguments:
-h, --help show this help message and exit
-v, --verbose increase output verbosity to stderr
-r REGION, --region REGION
the region to count in
-s, --strand break base counts into positive and negative strand
alignments
-q MINQUALITY, --minQuality MINQUALITY
don't count bases with a quality less than this
usage: countkmers [-h] [-k KMERLENGTH] [-t NTHREADS]
fastqFiles [fastqFiles ...]
A program to take a fastq file(s) and count the total k-mers across all reads
in each file. Note that partial kmers are discarded e.g. the last 3 reads of a
23 base read will be ignored. Return a comma separated file with a header row
then a row for each file and a file column then a column for each kmer
positional arguments:
fastqFiles a fastq file(s) (potentially gzipped) containing the
sequence reads
optional arguments:
-h, --help show this help message and exit
-k KMERLENGTH, --kmerLength KMERLENGTH
the lengh of kmer to be used. Be careful with values
larger than 20.
-t NTHREADS, --nThreads NTHREADS
the number of threadss to use for processing. Should
be less than or equal the number of threads on
computer.
usage: getstartends [-h] [-v] [-g MAXGAPS] [-r REGION] [-f FILE] [-n] [-c]
bamFile
A program to pull start and end positions in a region. The command generates
standard output with columns referenceName, start (1-based), end (1-based),
strand
positional arguments:
bamFile a bam file containing the alignment
optional arguments:
-h, --help show this help message and exit
-v, --verbose increase output verbosity to stderr
-g MAXGAPS, --maxGaps MAXGAPS
maximum allowed insertions or deletions in a read.
Otherwise discard
-r REGION, --region REGION
the region to count in
-f FILE, --file FILE a text file specifying several regions to count where
each line gives a region e.g. chr1:1-100
-n, --noHeader suppress the initial header on csv output
-c, --regionColumn specify target region in first column (default: don't
show column)
usage: removereads [-h] [-d DOTS] [-o [OUTPUTFILES ...]] -f FILTERFILE [-k]
fastqFiles [fastqFiles ...]
A program to filter reads by name from a single/set of fastq file(s). The
script looks for reads which have a name line where the string before a space
exactly matches a pattern. If multiple files are passed in, then they are
processed in sync and if any name matches that read is discarded (or kept)
from all files.
positional arguments:
fastqFiles a (potentially gzipped) fastq file(s) containing the
reads with the order of reads the same in all files
optional arguments:
-h, --help show this help message and exit
-d DOTS, --dots DOTS output dot to stderr every X reads. Input a negative
number to suppress output (default:-1)
-o [OUTPUTFILES ...], --outputFiles [OUTPUTFILES ...]
an output file(s) (one for each input fastq file).
default(out1.fastq.gz ... outn.fastq.gz where n is the
number of fastqFiles)
-f FILTERFILE, --filterFile FILTERFILE
a (potentially gzipped) file containing the names of
reads to be filtered one per line
-k, --keep keep reads matching the filter file and filter all
nonmatching reads
usage: removeshort [-h] [-d DOTS] [-l MINLENGTH] [-n] [-p] [-b BADOUT]
fastqFile
A program to remove short reads from a fastq file.
positional arguments:
fastqFile a (potentially gzipped) fastq file containing the
sequence data
optional arguments:
-h, --help show this help message and exit
-d DOTS, --dots DOTS output dot to stderr every X reads. Input a negative
number to suppress output (default:-1)
-l MINLENGTH, --minLength MINLENGTH
minimum length read to output (default:15)
-n, --removeN remove reads which contain anything other than A, C, T
or G
-p, --removePoor remove reads with different length sequence and
qualities. Note this requires assuming that all reads
are 4 lines each
-b BADOUT, --badOut BADOUT
a file path in which to save the first 10000 malformed
reads with different length sequence and qualities.
Note this requires assuming that all reads are 4 lines
each
usage: splitbarcodes [-h] [-i INDEXFILES [INDEXFILES ...]] [-d DOTS] -b
BARCODEFILE [-o OUTPUTPATH] [-u]
fastqFiles [fastqFiles ...]
A program to take a list of barcodes and one or more fastq reads and one or
two index reads and output reads matching the barcodes into a seperate file
for each barcode. The script takes read files and index files where the reads
and indexs are in the same order and outputs reads which match the appropriate
barcodes into separate files.
positional arguments:
fastqFiles a fastq file(s) (potentially gzipped) containing the
sequence reads
optional arguments:
-h, --help show this help message and exit
-i INDEXFILES [INDEXFILES ...], --indexFiles INDEXFILES [INDEXFILES ...]
a fastq file(s) (potentially gzipped) containing the
index reads
-d DOTS, --dots DOTS output dot to stderr every X reads. Input a negative
number to suppress output (default:-1)
-b BARCODEFILE, --barcodeFile BARCODEFILE
a (potentially gzipped) file containing comma
separated sample names, first barcode and second
barcode (with no header and no commas in the sample
names)
-o OUTPUTPATH, --outputPath OUTPUTPATH
a string giving the desired output directory
-u, --unassigned if set then store unassigned reads to
{outputPath}/__UNASSIGNED__R#.fastq.gz with their
corresponding barcodes in
{outputPath}/__UNASSIGNED__I#.fastq.gz
usage: splitreadsbyname [-h] -s SPLITFILE [-d DOTS] [-o OUTPUTPATH] [-u] [-a]
fastqFile
A program to take a list of read names and desired files and a fastq file
containing various reads and output reads to their assigned file locations.
The script takes a standard fastq read file and a csv separated file
containing read identifiers and file locations and outputs reads which match
the appropriate read names into their assigned files.
positional arguments:
fastqFile a fastq file (potentially gzipped) containing the
sequence reads
optional arguments:
-h, --help show this help message and exit
-s SPLITFILE, --splitFile SPLITFILE
a (potentially gzipped) file containing comma
separated reads names and file names (with no header
and no commas in the sample names)
-d DOTS, --dots DOTS output dot to stderr every X reads. Input a negative
number to suppress output (default:-1)
-o OUTPUTPATH, --outputPath OUTPUTPATH
a string giving the desired output directory
-u, --unassigned if set then store unassigned reads to
{outputPath}/__UNASSIGNED__.fastq.gz
-a, --append if set then append to already existing output files
0.1.9 (2022-03-20)
- Add helper class Trie
0.1.8 (2023-03-06)
- Add countbarcodes
0.1.7 (2023-02-11)
- Add splitreadsbyname
0.1.6 (2021-09-20)
- Fix code rot from python changes
0.1.5 (2018-12-21)
- Add abitotrace function
0.1.4 (2018-02-20)
- Adjust to changes in pysam v1.4
- Add option to only count bases with a quality greater above a specified limit
0.1.3 (2017-12-22)
- Add kmer counter
- Add option to filter reads containing N or unequal length sequence-qualities to removeshort
0.1.2 (2016-10-25)
- Add barcode splitter script
0.1.1 (2016-10-12)
- Add read filter script
0.1.0 (2016-01-20)
- Initial public release