bfast_wrapper.py
    --numThreads="4" ##HACK: hardcode numThreads for now, should come from a location file
    --fastq="$input1"
    #if $input1.extension.startswith( "fastqcs" ):
        ##if extension starts with fastqcs, then we have a color space file
        --space="1" ##color space
    #else
        --space="0"
    #end if
    --output="$output"
    $suppressHeader
    #if $refGenomeSource.refGenomeSource_type == "history":
        ##build indexes on the fly
        --buildIndex
        --ref="${refGenomeSource.ownFile}"
        --indexMask="${",".join( [ "%s:%s" % ( str( custom_index.get( 'mask' ) ).strip(), str( custom_index.get( 'hash_width' ) ).strip() ) for custom_index in $refGenomeSource.custom_index ] )}"
        ${refGenomeSource.indexing_repeatmasker}
        #if $refGenomeSource.indexing_option.indexing_option_selector == "contig_offset":
            --indexContigOptions="${refGenomeSource.indexing_option.start_contig},${refGenomeSource.indexing_option.start_pos},${refGenomeSource.indexing_option.end_contig},${refGenomeSource.indexing_option.end_pos}"
        #elif $refGenomeSource.indexing_option.indexing_option_selector == "exons_file":
            --indexExonsFileName="${refGenomeSource.indexing_option.exons_file}"
        #end if
    #else:
        ##use precomputed indexes
        --ref="${ filter( lambda x: str( x[0] ) == str( $refGenomeSource.indices ), $__app__.tool_data_tables[ 'bfast_indexes' ].get_fields() )[0][-1] }"
    #end if
    #if $params.source_select == "full":
        --offsets="$params.offsets"
        --keySize="$params.keySize"
        --maxKeyMatches="$params.maxKeyMatches"
        --maxNumMatches="$params.maxNumMatches"
        --whichStrand="$params.whichStrand"
        #if str( $params.scoringMatrixFileName ) != 'None':
            --scoringMatrixFileName="$params.scoringMatrixFileName"
        #end if
        ${params.ungapped}
        ${params.unconstrained}
        --offset="${params.offset}"
        --avgMismatchQuality="${params.avgMismatchQuality}"
        --algorithm="${params.localalign_params.algorithm}"
        ${params.unpaired}
        ${params.reverseStrand}
        #if $params.localalign_params.algorithm == "3":
            ${params.localalign_params.pairedEndInfer}
            ${params.localalign_params.randomBest}
        #end if
    #end if

**What it does**

BFAST facilitates the fast and accurate mapping of short reads to reference sequences. Some advantages of BFAST include:

* Speed: enables billions of short reads to be mapped quickly.
* Accuracy: A priori probabilities for mapping reads with a defined set of variants.
* An easy way to measurably tune accuracy at the expense of speed.

Specifically, BFAST was designed to facilitate whole-genome resequencing, where mapping billions of short reads with variants is of utmost importance.

BFAST supports both Illumina and ABI SOLiD data, as well as any other Next-Generation Sequencing Technology (454, Helicos), with particular emphasis on sensitivity towards errors, SNPs and especially indels. Other algorithms take short-cuts by ignoring errors, certain types of variants (indels), and even require further alignment, all to be the "fastest" (but still not complete). BFAST is able to be tuned to find variants regardless of the error-rate, polymorphism rate, or other factors.

------

Please cite the website "http://bfast.sourceforge.net" as well as the accompanying papers:

Homer N, Merriman B, Nelson SF. BFAST: An alignment tool for large scale genome resequencing. PMID: 19907642 PLoS ONE. 2009 4(11): e7767. http://dx.doi.org/10.1371/journal.pone.0007767

Homer N, Merriman B, Nelson SF. Local alignment of two-base encoded DNA sequence. BMC Bioinformatics. 2009 Jun 9;10(1):175. PMID: 19508732 http://dx.doi.org/10.1186/1471-2105-10-175

------

**Know what you are doing**

.. class:: warningmark

There is no such thing (yet) as an automated gearshift in short read mapping. It is all like stick-shift driving in San Francisco. In other words, running this tool with default parameters will probably not give you meaningful results. A way to deal with this is to **understand** the parameters by carefully reading the `documentation`__ and experimenting. Fortunately, Galaxy makes experimenting easy.

.. __: http://bfast.sourceforge.net/

------

**Input formats**

BFAST accepts files in Sanger FASTQ format. Use the FASTQ Groomer to prepare your files.

------

**Outputs**

The output is in SAM format, and has the following columns::

     Column  Description
   --------  --------------------------------------------------------
    1 QNAME  Query (pair) NAME
    2 FLAG   bitwise FLAG
    3 RNAME  Reference sequence NAME
    4 POS    1-based leftmost POSition/coordinate of clipped sequence
    5 MAPQ   MAPping Quality (Phred-scaled)
    6 CIGAR  extended CIGAR string
    7 MRNM   Mate Reference sequence NaMe ('=' if same as RNAME)
    8 MPOS   1-based Mate POSition
    9 ISIZE  Inferred insert SIZE
   10 SEQ    query SEQuence on the same strand as the reference
   11 QUAL   query QUALity (ASCII-33 gives the Phred base quality)
   12 OPT    variable OPTional fields in the format TAG:VTYPE:VALUE

The flags are as follows::

    Flag  Description
  ------  -------------------------------------
  0x0001  the read is paired in sequencing
  0x0002  the read is mapped in a proper pair
  0x0004  the query sequence itself is unmapped
  0x0008  the mate is unmapped
  0x0010  strand of the query (1 for reverse)
  0x0020  strand of the mate
  0x0040  the read is the first read in a pair
  0x0080  the read is the second read in a pair
  0x0100  the alignment is not primary

It looks like this (scroll sideways to see the entire example)::

  QNAME FLAG RNAME POS MAPQ CIGAR MRNM MPOS ISIZE SEQ QUAL OPT
  HWI-EAS91_1_30788AAXX:1:1:1761:343 4 * 0 0 * * 0 0 AAAAAAANNAAAAAAAAAAAAAAAAAAAAAAAAAAACNNANNGAGTNGNNNNNNNGCTTCCCACAGNNCTGG hhhhhhh;;hhhhhhhhhhh^hOhhhhghhhfhhhgh;;h;;hhhh;h;;;;;;;hhhhhhghhhh;;Phhh
  HWI-EAS91_1_30788AAXX:1:1:1578:331 4 * 0 0 * * 0 0 GTATAGANNAATAAGAAAAAAAAAAATGAAGACTTTCNNANNTCTGNANNNNNNNTCTTTTTTCAGNNGTAG hhhhhhh;;hhhhhhhhhhhhhhhhhhhhhhhhhhhh;;h;;hhhh;h;;;;;;;hhhhhhhhhhh;;hhVh

-------

**BFAST settings**

All of the options have a default value. You can change any of them. Most of the options in BFAST have been implemented here.
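
As a rough illustration of how to read the FLAG column described under **Outputs**, the following Python sketch decodes a FLAG value into the descriptions from the flag table. The names ``SAM_FLAGS`` and ``describe_flag`` are hypothetical helpers introduced only for this example; they are not part of BFAST or Galaxy.

```python
# Bit values and descriptions copied from the flag table in this help text.
# SAM_FLAGS and describe_flag are hypothetical names used for illustration.
SAM_FLAGS = [
    (0x0001, "the read is paired in sequencing"),
    (0x0002, "the read is mapped in a proper pair"),
    (0x0004, "the query sequence itself is unmapped"),
    (0x0008, "the mate is unmapped"),
    (0x0010, "strand of the query (1 for reverse)"),
    (0x0020, "strand of the mate"),
    (0x0040, "the read is the first read in a pair"),
    (0x0080, "the read is the second read in a pair"),
    (0x0100, "the alignment is not primary"),
]


def describe_flag(flag):
    """Return the description of every bit that is set in a SAM FLAG value."""
    return [text for bit, text in SAM_FLAGS if flag & bit]


# Both records in the example output above carry FLAG 4.
print(describe_flag(4))
```

Both example records have FLAG 4, so the sketch reports only "the query sequence itself is unmapped" for them.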

------

**BFAST parameter list**

This is an exhaustive list of BFAST options:

For **match**::

  -o  STRING   Specifies the offset [Use all]
  -l           Specifies to load all main or secondary indexes into memory
  -A  INT      0: NT space 1: Color space [0]
  -k  INT      Specifies to truncate all indexes to have the given key size
               (must be greater than the hash width) [Not Using]
  -K  INT      Specifies the maximum number of matches to allow before a key
               is ignored [8]
  -M  INT      Specifies the maximum total number of matches to consider
               before the read is discarded [384]
  -w  INT      0: consider both strands
               1: forward strand only
               2: reverse strand only [0]
  -n  INT      Specifies the number of threads to use [1]
  -t           Specifies to output timing information

For **localalign**::

  -x  FILE     Specifies the file name storing the scoring matrix
  -u           Do ungapped local alignment (the default is gapped).
  -U           Do not use mask constraints from the match step
  -A  INT      0: NT space 1: Color space [0]
  -o  INT      Specifies the number of bases before and after the match to
               include in the reference genome
  -M  INT      Specifies the maximum total number of matches to consider
               before the read is discarded [384]
  -q  INT      Specifies the average mismatch quality
  -n  INT      Specifies the number of threads to use [1]
  -t           Specifies to output timing information

For **postprocess**::

  -a  INT      Specifies the algorithm to choose the alignment for each end of the read:
               0: No filtering will occur.
               1: All alignments that pass the filters will be output
               2: Only consider reads that have been aligned uniquely
               3: Choose uniquely the alignment with the best score
               4: Choose all alignments with the best score
  -A  INT      0: NT space 1: Color space [0]
  -U           Specifies that pairing should not be performed
  -R           Specifies that paired reads are on opposite strands
  -q  INT      Specifies the average mismatch quality
  -x  FILE     Specifies the file name storing the scoring matrix
  -z           Specifies to output a random best scoring alignment (with -a 3)
  -r  FILE     Specifies to add the RG in the specified file to the SAM
               header and updates the RG tag (and LB/PU tags if present) in
               the reads (SAM only)
  -n  INT      Specifies the number of threads to use [1]
  -t           Specifies to output timing information

bfast
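
To make the parameter list above more concrete, here is a minimal Python sketch that assembles option lists for the **match** and **postprocess** steps. It uses only flags and defaults that appear in the list; the function names ``match_options`` and ``postprocess_options`` are hypothetical, and the sketch does not reproduce how ``bfast_wrapper.py`` actually builds its commands. Input and reference file arguments are deliberately left out because they are not covered by the list.

```python
# Hypothetical helpers for illustration only; flags and defaults are taken
# from the parameter list in this help text.

def match_options(space=0, max_key_matches=8, max_num_matches=384,
                  which_strand=0, num_threads=1, key_size=None, timing=False):
    """Assemble options for the match step."""
    if which_strand not in (0, 1, 2):
        raise ValueError("-w must be 0 (both), 1 (forward) or 2 (reverse)")
    opts = ["-A", str(space),
            "-K", str(max_key_matches),
            "-M", str(max_num_matches),
            "-w", str(which_strand),
            "-n", str(num_threads)]
    if key_size is not None:  # -k has no default ("Not Using")
        opts += ["-k", str(key_size)]
    if timing:
        opts.append("-t")
    return opts


def postprocess_options(algorithm, space=0, unpaired=False,
                        random_best=False, num_threads=1):
    """Assemble options for the postprocess step."""
    if algorithm not in (0, 1, 2, 3, 4):
        raise ValueError("-a must be between 0 and 4")
    if random_best and algorithm != 3:
        # The list documents -z as applying together with -a 3.
        raise ValueError("-z is only meaningful with -a 3")
    opts = ["-a", str(algorithm), "-A", str(space), "-n", str(num_threads)]
    if unpaired:
        opts.append("-U")
    if random_best:
        opts.append("-z")
    return opts


print(match_options(space=1, key_size=14))
print(postprocess_options(3, random_best=True))
```

Rejecting ``-z`` unless the algorithm is 3 mirrors the command template, which only passes the random-best option when the selected algorithm is "3".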