Fastcov - Fast Multiple Covariance Detector v1.03
Usage
Name:
  fastcov V1.03 -- Fast Multiple Covariance Detector
  http://yanlilab.github.io/fastcov
Authors:
  Yan Li <liyan.com@gmail.com>
  Wei Shen <shenwei356@gmail.com>
Usage:
  fastcov [options] inputfile
Available Options:
  -p FLOAT    minimum pairing purity of two sites [0.7]
  -r FLOAT    minimum matching ratio of to the pattern [0.45]
  -n INT      minimum residue number at each site [5]
  -c FLOAT    minimum proportion of any sequence identical to the
              consensus [0.33]
  -o STRING   prefix of output files [inputfile]
  -j INT      CPU number [CPU number of your computer]
  -h, --help  show this help message
Copyright:
  Copyright © 2015-2016, All Rights Reserved
  This software is free to distribute for academic research.
Positional arguments
- inputfile should contain aligned protein sequences in FASTA format, as produced by multiple sequence alignment software. Sequences are case-insensitive. A one-sequence-per-line file can be converted to FASTA format with:
for f in *.aln; do cat -n "$f" | awk '{print ">"$1"\n"$2}' > "$f.fas"; done
Options
Main algorithm parameters
-p defines the minimum pairing purity of two sites. Default is 0.7.
-r defines the minimum matching ratio to the pattern at the clustering stage. Default is 0.45.
Sequences filter criteria
-n is the minimum residue number at each site. Default value is 5.
-c is the minimum proportion of a sequence that must be identical to the consensus. Default value is 0.33, i.e. the number of residues identical to the consensus residue at the same position should be at least one third of the consensus length. Sequences that fail to meet this criterion are discarded.
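To make the -c criterion concrete, here is a minimal Python sketch of such a filter. It is not fastcov's implementation: the function names `consensus` and `filter_by_identity` are hypothetical, and the consensus is assumed to be the most frequent character in each alignment column, which the description above does not specify.

```python
from collections import Counter


def consensus(seqs):
    """Most frequent character in each column (an assumed definition of the consensus)."""
    return "".join(Counter(col).most_common(1)[0][0] for col in zip(*seqs))


def filter_by_identity(records, min_identity=0.33):
    """Keep (name, seq) records whose identity to the consensus is >= min_identity.

    Identity = positions matching the consensus / consensus length,
    following the description of -c.
    """
    seqs = [seq.upper() for _, seq in records]  # case is not significant
    cons = consensus(seqs)
    kept = []
    for (name, _), seq in zip(records, seqs):
        matches = sum(a == b for a, b in zip(seq, cons))
        if matches / len(cons) >= min_identity:
            kept.append((name, seq))
    return cons, kept


records = [
    ("s1", "MKVLA"),
    ("s2", "MKVLA"),
    ("s3", "mkvla"),
    ("s4", "TRQWA"),  # only the last position matches the consensus
]
cons, kept = filter_by_identity(records, min_identity=0.33)
print(cons)                        # MKVLA
print([name for name, _ in kept])  # ['s1', 's2', 's3']
```

Here s4 matches the consensus at 1 of 5 positions (0.2), which is below 0.33, so it is dropped.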
Output
-o defines the prefix of the output files. The default value is the input file name. E.g., for an input file test.aligned.fa, the output files will be:
test.aligned.fa.pairs.txt
test.aligned.fa.clusters.txt
test.aligned.fa.patterns.txt
test.aligned.fa.seq2patterns.txt
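When scripting around fastcov, the naming rule can be written down as a small helper, for example to check that all four result files exist after a run. This is only a sketch: `expected_outputs` is a hypothetical name, and it assumes that a prefix given with -o is combined with the same four suffixes as the default prefix.

```python
# Suffixes taken from the output file names listed in this README.
SUFFIXES = ("pairs", "clusters", "patterns", "seq2patterns")


def expected_outputs(input_file, prefix=None):
    """Return the four output file names for a run.

    Without an explicit prefix, the input file name is used as the prefix.
    """
    if prefix is None:
        prefix = input_file
    return [f"{prefix}.{suffix}.txt" for suffix in SUFFIXES]


for name in expected_outputs("ABCD_RT_M.aligned.fas"):
    print(name)
```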
Performance
-j is the number of CPUs to use. fastcov detects the number of CPUs on your computer and uses it as the default value. Larger values generally make fastcov run faster, up to the number of CPUs available.
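If fastcov is launched from a script, for instance on a shared machine where it should not take every CPU, the command line can be assembled as in the sketch below. It uses only options listed in the usage text; `build_command` is a hypothetical helper, and capping the -j value at the CPU count seen by Python is a choice made for this example, not documented fastcov behaviour.

```python
import os
import shlex


def build_command(input_file, threads=None, prefix=None):
    """Assemble a fastcov command line from options listed in the usage text."""
    cmd = ["fastcov"]
    if threads is not None:
        # os.cpu_count() may return None, so fall back to 1.
        available = os.cpu_count() or 1
        cmd += ["-j", str(max(1, min(threads, available)))]
    if prefix is not None:
        cmd += ["-o", prefix]
    cmd.append(input_file)
    return cmd


cmd = build_command("ABCD_RT_M.aligned.fas", threads=1, prefix="run1")
print(shlex.join(cmd))  # fastcov -j 1 -o run1 ABCD_RT_M.aligned.fas
```

The resulting list can be passed to `subprocess.run(cmd, check=True)`, provided the fastcov executable is on your PATH.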
Examples
Taking examples/ABCD_RT_M.aligned.fas as an example.
Quick run:
fastcov ABCD_RT_M.aligned.fas
Terminal output:
Input: ABCD_RT_M.aligned.fas
Step 1/5: Reading sequences
Done
Step 2/5: Searching candidate sites
Done
Step 3/5: Searching independent pairs
21115 / 21115 [===================================================================================] 100.00 % 28s
Covariant site pairs saved to file: ABCD_RT_M.aligned.fas.pairs
Done
Step 4/5: Searching covariant patterns
52 / 52 [===========================================================================================] 100.00 % 0
Covariant patterns saved to file: ABCD_RT_M.aligned.fas.patterns
Done
Step 5/5: Clustering by covariant patterns
Covariant patterns assigned to sequences: ABCD_RT_M.aligned.fas.seq2patterns
Sequences clustered by covariant patterns: ABCD_RT_M.aligned.fas.clusters
The most time-consuming stage is step 3, so a progress bar is shown for it.
Output files:
ABCD_RT_M.aligned.fas.pairs.txt # covariant pairs information, table file, can be imported into MS Excel
ABCD_RT_M.aligned.fas.patterns.txt # covariant patterns, table file, can be imported into MS Excel
ABCD_RT_M.aligned.fas.clusters.txt # sequence clusters by covariant patterns
ABCD_RT_M.aligned.fas.seq2patterns.txt # covariant patterns of every sequence, table file, can be imported into MS Excel
Note: Windows users should view the result files in a modern text editor. Notepad is not recommended; Notepad++ is a better choice.
More examples: fastcov-examples.tar.gz
Errors and Solutions
- No input file given. Please give fastcov a file of aligned amino acid sequences in FASTA format.
$ fastcov
[Error] no input file (aligned amino acids sequences in FASTA format) given. type "fastcov -h" for help
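- Input file is not aligned. All sequences must have the same length, so align them with multiple sequence alignment software before running fastcov.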
[Error] sequence length not equal: 343 (AB014392_Pol-C) != 344. input file should be aligned amino acids sequences in FASTA format
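To locate the offending records before running fastcov, a small pre-flight check along the following lines may help. It is a sketch under stated assumptions: `read_fasta` and `find_length_mismatches` are hypothetical helpers, and the first sequence is taken as the reference length, which is not necessarily how fastcov compares lengths internally.

```python
def read_fasta(lines):
    """Yield (name, sequence) pairs from an iterable of FASTA lines."""
    name, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if name is not None:
                yield name, "".join(chunks)
            name, chunks = line[1:].strip(), []
        else:
            chunks.append(line)
    if name is not None:
        yield name, "".join(chunks)


def find_length_mismatches(records):
    """Return (name, length) for every record whose length differs from the first one."""
    records = list(records)
    if not records:
        return []
    expected = len(records[0][1])
    return [(name, len(seq)) for name, seq in records if len(seq) != expected]


fasta = """>seq1
MKV-LA
>seq2
MKVQLA
>seq3
MKVLA
"""
bad = find_length_mismatches(read_fasta(fasta.splitlines()))
print(bad)  # [('seq3', 5)]
```

For a real file, pass the open file object instead: `find_length_mismatches(read_fasta(open("test.fa")))`.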
- Illegal characters in sequence. The FASTA parsing module of fastcov checks the sequences strictly, so check your input sequences against the IUPAC amino acid codes (http://www.bioinformatics.org/sms2/iupac.html). This error may also be caused by a mismatch between the expected sequence type (PROTEIN) and the actual sequence type (DNA) in the FASTA file.
Input: test.fa
Step 1/5: Reading sequences
error when reading AB014367_Pol-C: invalid Protein sequence: AB014367_Pol-C
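A quick way to narrow down this kind of error is to list the unexpected characters in a sequence and to test whether it looks like DNA. The sketch below is illustrative only: the `ALLOWED` alphabet is an assumption (fastcov's parser may accept a different set), and `illegal_characters` and `looks_like_dna` are hypothetical helpers, the latter a simple heuristic.

```python
# Assumed alphabet: the 20 standard amino acids, the ambiguity codes B, Z and X,
# "-" for gaps and "*" for stop. fastcov's parser may accept a different set.
ALLOWED = set("ACDEFGHIKLMNPQRSTVWYBZX-*")


def illegal_characters(seq):
    """Return the sorted list of characters in seq that are not in ALLOWED."""
    return sorted(set(seq.upper()) - ALLOWED)


def looks_like_dna(seq, threshold=0.9):
    """Heuristic: True if at least `threshold` of the non-gap characters are A, C, G, T, U or N."""
    residues = [c for c in seq.upper() if c not in "-."]
    if not residues:
        return False
    return sum(c in "ACGTUN" for c in residues) / len(residues) >= threshold


print(illegal_characters("MKV1LA?"))   # ['1', '?']
print(looks_like_dna("ATGGCCATTGTA"))  # True
print(looks_like_dna("MKVLAWYQ"))      # False
```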
FAQ
Please don't hesitate to email us.
Q: The result files look like a mess when I open them!
A: Microsoft Windows users may be opening the result files with Notepad, which is provided by the operating system. Please choose a modern text editor such as Notepad++.
Authors
Yan Li liyan.com@gmail.com, Wei Shen shenwei356@gmail.com
Citation
Wei Shen, Yan Li*. A novel algorithm for detecting multiple covariance and clustering of biological sequences. Sci. Rep. 6, 30425; doi:10.1038/srep30425 (2016).
Copyright
Copyright © 2015-2016, All Rights Reserved.
This software is free to distribute for academic research.