map short reads against reference sequence

lastz_wrapper.py
#if $seq_name.how_to_name=="yes":
--ref_name=$seq_name.ref_name
#else:
--ref_name="None"
#end if
--ref_source=$source.ref_source
--source_select=$params.source_select
--out_format=$out_format
--input2=$input2
#if $source.ref_source=="history":
--input1=$source.input1
--ref_sequences=$input1.metadata.sequences
#else:
--input1=$source.input1_2bit
--ref_sequences="None"
#end if
#if $params.source_select=="pre_set":
--pre_set_options=${params.pre_set_options}
--strand="None"
--seed="None"
--gfextend="None"
--chain="None"
--transition="None"
--O="None"
--E="None"
--X="None"
--Y="None"
--K="None"
--L="None"
--entropy="None"
#else:
--pre_set_options="None"
--strand=$params.strand
--seed=$params.seed
--gfextend=$params.gfextend
--chain=$params.chain
--transition="$params.transition"
--O=$params.O
--E=$params.E
--X=$params.X
--Y=$params.Y
--K=$params.K
--L=$params.L
--entropy=$params.entropy
#end if
--identity_min=$min_ident
--identity_max=$max_ident
--coverage=$min_cvrg
--output=$output1
--unmask=$unmask
--lastzSeqsFileDir=${GALAXY_DATA_INDEX_DIR}

lastz

**What it does**

**LASTZ** is a high-performance pairwise sequence aligner derived from BLASTZ. It was written by Bob Harris in Webb Miller's laboratory at Penn State University. Special scoring sets were derived to improve runtime performance and quality. This Galaxy version of LASTZ is geared towards aligning short (Illumina/Solexa, AB/SOLiD) and medium (Roche/454) reads against a reference sequence. There is excellent, extensive documentation on LASTZ available here_.

.. _here: http://www.bx.psu.edu/miller_lab/dist/README.lastz-1.02.00/README.lastz-1.02.00.html

------

**Input formats**

LASTZ accepts the reference and the reads in FASTA format. However, because Galaxy supports implicit format conversion, the tool will also recognize FASTQ and other method-specific formats.

------

**Outputs**

LASTZ generates one output.
Depending on the choice you make in the *Select output format* drop-down, LASTZ will produce a SAM file showing sequence alignments, a list of differences between the reads and reference (Polymorphisms), or a general table with one line per alignment block (Tabular). Examples of these outputs are shown below.

**SAM output**

SAM has 12 columns::

  1                                   2    3     4         5   6    7  8         9     10                                    11                                    12
  ----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
  HWI-EAS91_1_30788AAXX:1:2:1670:915  99   chr9  58119878  60  36M  =  58120234  392   GACCCCTACCCCACCGTGCTCTGGATCTCAGTGTTT  IIIIIIIIIIIIIIIIEIIIIIII7IIIIIIIIIII  XT:A:U NM:i:0 SM:i:37 AM:i:37 X0:i:1 X1:i:0 XM:i:0 XO:i:0 XG:i:0 MD:Z:36
  HWI-EAS91_1_30788AAXX:1:2:1670:915  147  chr9  58120234  60  36M  =  58119878  -392  ATGAGTCGAATTCTATTTTCCAAACTGTTAACAAAA  IFIIDI;IIICIIIIIIIIIIIIIIIIIIIIIIIII  XT:A:U NM:i:0 SM:i:37 AM:i:37 X0:i:1 X1:i:0 XM:i:0 XO:i:0 XG:i:0 MD:Z:36

where::

  Column      Description
  ---------   ---------------------------------------------------------------------
   1. QNAME   Query (pair) NAME
   2. FLAG    bitwise FLAG
   3. RNAME   Reference sequence NAME
   4. POS     1-based leftmost POSition/coordinate of clipped sequence
   5. MAPQ    MAPping Quality (Phred-scaled)
   6. CIGAR   extended CIGAR string
   7. MRNM    Mate Reference sequence NaMe ('=' if same as RNAME)
   8. MPOS    1-based Mate POSition
   9. ISIZE   Inferred insert SIZE
  10. SEQ     query SEQuence on the same strand as the reference
  11. QUAL    query QUALity (ASCII-33 gives the Phred base quality)
  12. OPT     variable OPTional fields in the format TAG:VTYPE:VALUE, tab-separated

The flags are as follows::

  Flag    Description
  ------  -------------------------------------
  0x0001  the read is paired in sequencing
  0x0002  the read is mapped in a proper pair
  0x0004  the query sequence itself is unmapped
  0x0008  the mate is unmapped
  0x0010  strand of the query (1 for reverse)
  0x0020  strand of the mate
  0x0040  the read is the first read in a pair
  0x0080  the read is the second read in a pair
  0x0100  the alignment is not primary

**Polymorphism (SNP or differences) output**

Polymorphism output contains 14 columns::

  1     2     3     4  5     6                                   7   8   9  10  11  12  13                                    14
  --------------------------------------------------------------------------------------------------------------------------------------------------------------
  chrM  2490  2491  +  5386  HWI-EAS91_1_306UPAAXX:6:1:486:822   10  11  -  36  C   A   ACCTGTTTTACAGACACCTAAAGCTACATCGTCAAC  ACCTGTTTTAAAGACACCTAAAGCTACATCGTCAAC
  chrM  2173  2174  +  5386  HWI-EAS91_1_306UPAAXX:6:1:259:1389  26  27  +  36  G   T   GCGTACTTATTCGCCACCATGATTATGACCAGTGTT  GCGTACTTATTCGCCACCATGATTATTACCAGTGTT

where::

   1. (chrM)   - Reference sequence id
   2. (2490)   - Start position of the difference in the reference
   3. (2491)   - End position of the difference in the reference
   4. (+)      - Strand of the reference (always plus)
   5. (5386)   - Length of the reference sequence
   6. (HWI...) - Read id
   7. (10)     - Start position of the difference in the read
   8. (11)     - End position of the difference in the read
   9. (+)      - Strand of the read
  10. (36)     - Length of the read
  11. (C)      - Nucleotide in the reference
  12. (A)      - Nucleotide in the read
  13. (ACC...) - Reference side of the alignment
  14. (ACC...) - Read side of the alignment

**Tabular output**

Tabular output is a tab-separated format with 30 columns::

  1   2        3  4     5     6     7     8   9                 10              11  12   13   14   15   16   17   18   19  20                21                22   23     24      25      26    27    28    29    30
  -------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
  14  PHIX174  +  5386  4648  4647  4661  14  ATTTTCGTGATATT    EYKX4VC01BV8HS  +   204  154  153  167  154  153  167  14  ATTTTCGTGATATT    ..............    14M  14/14  100.0%  14/204  6.9%  0/14  0.0%  4494  NA
  16  PHIX174  +  5386  3363  3362  3378  16  GACGCCGGATTTGAGA  EYKX4VC01AWJ88  -   259  36   35   51   209  208  224  16  GACGCCGGATTTGAGA  ................  16M  16/16  100.0%  16/259  6.2%  0/16  0.0%  3327  NA

The following columns are present::

  Field             Meaning
  ----------------  ---------------------------------------------------------------------------------------------------------------
   1. score         Score of the alignment block. The scale and meaning of this number will vary, depending on the final stage performed and other command-line options.
   2. name1         Name of the target sequence.
   3. strand1       Target sequence strand, either "+" or "−".
   4. size1         Size of the entire target sequence.
   5. start1        Starting position of the alignment block in the target, origin-one.
   6. zstart1       Starting position of the alignment block in the target, origin-zero.
   7. end1          Ending position of the alignment block in the target, expressed either as origin-one closed or origin-zero half-open (the ending value is the same in both systems).
   8. length1       Length of the alignment block in the target (excluding gaps).
   9. text1         Aligned characters in the target, including gap characters.
  10. name2         Name of the query sequence.
  11. strand2       Query sequence strand, either "+" or "−".
  12. size2         Size of the entire query sequence.
  13. start2        Starting position of the alignment block in the query, origin-one.
  14. zstart2       Starting position of the alignment block in the query, origin-zero.
  15. end2          Ending position of the alignment block in the query, expressed either as origin-one closed or origin-zero half-open (the ending value is the same in both systems).
  16. start2+       Starting position of the alignment block in the query, counting along the query sequence's positive strand (regardless of which query strand was aligned), origin-one. Note that if strand2 is "−", then this is the other end of the block from start2.
  17. zstart2+      Starting position of the alignment block in the query, counting along the query sequence's positive strand (regardless of which query strand was aligned), origin-zero. Note that if strand2 is "−", then this is the other end of the block from zstart2.
  18. end2+         Ending position of the alignment block in the query, counting along the query sequence's positive strand (regardless of which query strand was aligned), expressed either as origin-one closed or origin-zero half-open (the ending value is the same in both systems). Note that if strand2 is "−", then this is the other end of the block from end2.
  19. length2       Length of the alignment block in the query (excluding gaps).
  20. text2         Aligned characters in the query, including gap characters.
  21. diff          Differences between what would be written for text1 and text2. Matches are written as . (period), transitions as : (colon), transversions as X, and gaps as - (hyphen).
  22. cigar         A CIGAR-like representation of the alignment's path through the Dynamic Programming matrix. This is the short representation, without spaces, described in the Ensembl CIGAR specification.
  23./24. identity  Fraction of aligned bases in the block that are matches (see Identity). This is written as two fields. The first field is a fraction, written as <n>/<d>. The second field contains the same value, computed as a percentage.
  25./26. coverage  Fraction of the entire input sequence (target or query, whichever is shorter) that is covered by the alignment block (see Coverage). This is written as two fields. The first field is a fraction, written as <n>/<d>. The second field contains the same value, computed as a percentage.
  27./28. gaprate   Rate of gaps (also called indels) in the alignment block. This is written as two fields. The first field is a fraction, written as <n>/<d>, with the numerator being the number of alignment columns containing gaps and the denominator being the number without gaps. The second field contains the same value, computed as a percentage.
  29. diagonal      The diagonal of the start of the alignment block in the dynamic programming matrix, expressed as an identifying number start1-start2.
  30. shingle       A measurement of the shingle overlap between the target and the query. This is intended for the case where both the target and query are relatively short, and their ends are expected to overlap.

-------

**LASTZ Settings**

There are two setting modes: (1) **Commonly used settings** and (2) **Full Parameter List**.

**Commonly used settings**

There are seven modes::

  Illumina-Solexa/AB-SOLiD 95% identity
  Illumina-Solexa/AB-SOLiD 85% identity
  Roche-454 98% identity
  Roche-454 95% identity
  Roche-454 90% identity
  Roche-454 85% identity
  Roche-454 75% identity

When deciding which one to use, consider the following: a 36 bp read with two differences will be 34/36 = 94% identical to the reference, so it falls just below a 95% identity cutoff but clears an 85% one.

**Full Parameter List**

This mode gives you fuller control over lastz. The description of these and other parameters is found at the end of this page. Note that not all parameters are included in this interface. If you would like to make additional options available through Galaxy, e-mail us at galaxy-bugs@bx.psu.edu.
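
To make the identity arithmetic under **Commonly used settings** concrete, here is a minimal Python sketch. It is not part of the Galaxy wrapper: the function names are hypothetical, and it assumes an identity cutoff is inclusive (a read passes when its percent identity is at least the cutoff).

```python
def percent_identity(read_length, differences):
    """Percent of read positions that match the reference.

    Hypothetical helper for illustration; not part of LASTZ or the wrapper.
    """
    if read_length <= 0:
        raise ValueError("read_length must be positive")
    if not 0 <= differences <= read_length:
        raise ValueError("differences must be between 0 and read_length")
    return 100.0 * (read_length - differences) / read_length


def passes_cutoff(read_length, differences, min_identity):
    """True if the read meets the cutoff (assumed inclusive)."""
    return percent_identity(read_length, differences) >= min_identity


if __name__ == "__main__":
    # The 36 bp read with two differences from the text above.
    identity = percent_identity(36, 2)
    print("identity: %.1f%%" % identity)
    for cutoff in (95, 85):
        print("passes %d%% cutoff: %s" % (cutoff, passes_cutoff(36, 2, cutoff)))
```

Under these assumptions, a 36 bp Illumina read can tolerate only one difference at a 95% cutoff (35/36 is about 97.2%), which is worth keeping in mind when choosing between the 95% and 85% modes.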
------

**Do you want to modify the reference name?**

This option allows you to set the name of the reference sequence manually. This is helpful when, for example, you would like to make the reference name compatible with the UCSC naming conventions to be able to display your lastz results as a custom track at the UCSC Genome Browser.

------

**LASTZ parameter list**

This is an exhaustive list of LASTZ options. Once again, please note that not all options are included in this interface. If you would like to make additional options available through Galaxy, e-mail us at galaxy-bugs@bx.psu.edu::

  target[[s..e]][-]               spec/file containing target sequence (fasta or nib)
                                  [s..e] defines a subrange of the file
                                  - indicates reverse-complement
                                  (use --help=files for more details)
  query[[s..e]][-]                spec/file containing query sequences (fasta or nib)
                                  if absent, queries come from stdin (unless they
                                  aren't needed, as for --self or --tableonly)
                                  (use --help=files for more details)
  --self                          the target sequence is also the query
  --quantum                       the query sequence contains quantum DNA
  --seed=match<length>            use a word with no gaps instead of a seed pattern
  --seed=half<length>             use space-free half-weight word instead of seed pattern
  --match=<reward>[,<penalty>]    set the score values for a match (+<reward>)
                                  and mismatch (-<penalty>)
  --[no]trans[ition][=2]          allow one or two transitions in a seed hit
                                  (by default a transition is allowed)
  --word=<bits>                   set max bits for word hash; use this to trade time for
                                  memory, eliminating thrashing for heavy seeds
                                  (default is 28 bits)
  --[no]filter=[<T>:]<M>          filter half-weight seed hits, requiring at least M
                                  matches and allowing no more than T transversions
                                  (default is no filtering)
  --notwins                       require just one seed hit
  --twins=[<min>:]<maxgap>        require two nearby seed hits on the same diagonal
                                  (default is twins aren't required)
  --notwins                       allow single, isolated seeds
  --[no]recoverseeds              avoid losing seeds in hash collisions. Cannot be used with --twins
  --seedqueue=<entries>           set number of entries in seed hit queue
                                  (default is 262144)
  --anchors=<file>                read anchors from a file, instead of discovering anchors
                                  via seeding
  --recoverhits                   recover hash-collision seed hits
                                  (default is not to recover seed hits)
  --step=<length>                 set step length (default is 1)
  --maxwordcount=<limit>          words occurring more often than <limit> in the target
                                  are not eligible for seeds
  --strand=both                   search both strands
  --strand=plus                   search + strand only (matching strand of query spec)
  --strand=minus                  search - strand only (opposite strand of query spec)
                                  (by default both strands are searched)
  --ambiguousn                    treat N as an ambiguous nucleotide
                                  (by default N is treated as a sequence splicing character)
  --[no]gfextend                  perform gap-free extension of seed hits to HSPs
                                  (by default no extension is performed)
  --[no]chain                     perform chaining
  --chain=<diag,anti>             perform chaining with given penalties for diagonal and
                                  anti-diagonal
                                  (by default no chaining is performed)
  --[no]gapped                    perform gapped alignment (instead of gap-free)
                                  (by default gapped alignment is performed)
  --score[s]=<file>               read substitution scores from a file
                                  (default is HOXD70)
  --unitscore[s]                  scores are +1/-1 for match/mismatch
  --gap=<[open,]extend>           set gap open and extend penalties (default is 400,30)
  --xdrop=<score>                 set x-drop threshold (default is 10*sub[A][A])
  --ydrop=<score>                 set y-drop threshold (default is open+300*extend)
  --infer[=<control>]             infer scores from the sequences, then use them
  --inferonly[=<control>]         infer scores, but don't use them (requires --infscores)
                                  all inference options are read from the control file
  --infscores[=<file>]            write inferred scores to a file
  --hspthresh=<score>             set threshold for high scoring pairs (default is 3000)
                                  ungapped extensions scoring lower are discarded
                                  <score> can also be a percentage or base count
  --entropy                       adjust for entropy when qualifying HSPs in the x-drop extension
                                  method
  --noentropy                     don't adjust for entropy when qualifying HSPs
  --exact=<length>                set threshold for exact matches
                                  if specified, exact matches are found rather than high
                                  scoring pairs (replaces --hspthresh)
  --inner=<score>                 set threshold for HSPs during interpolation
                                  (default is no interpolation)
  --gappedthresh=<score>          set threshold for gapped alignments
                                  gapped extensions scoring lower are discarded
                                  <score> can also be a percentage or base count
                                  (default is to use same value as --hspthresh)
  --ball=<score>                  set minimum score required of words 'in' a quantum ball
  --[no]entropy                   involve entropy in filtering high scoring pairs
                                  (default is "entropy")
  --[no]mirror                    report/use mirror image of all gap-free alignments
                                  (default is "mirror" for self-alignments only)
  --traceback=<bytes>             space for trace-back information
                                  (default is 80.0M)
  --masking=<count>               mask any position in target hit this many times
                                  zero indicates no masking
                                  (default is no masking)
  --targetcapsule=<capsule_file>  the target seed word position table and seed (as well as
                                  the target sequence) are read from specified file
  --segments=<segment_file>       read segments from a file, instead of discovering them
                                  via seeding. Replaces other seeding or gap-free extension
                                  options
  --[no]census[=<file>]           count/report how many times each target base aligns
                                  (default is to not report census)
  --identity=<min>[..<max>]       filter alignments by percent identity
                                  0<=min<=max<=100; blocks (or HSPs) outside min..max
                                  are discarded
                                  (default is no identity filtering)
  --coverage=<min>[..<max>]       filter alignments by percentage of query covered
                                  0<=min<=max<=100; blocks (or HSPs) outside min..max
                                  are discarded
                                  (default is no query coverage filtering)
  --notrivial                     do not output trivial self-alignment block if the target
                                  and query sequences are identical. Using --self enables
                                  this option automatically
  --output=<output_file>          write the alignments to the specified file name instead of
                                  stdout
  --code=<file>                   give quantum code for query sequence (only for display)
  --format=<type>                 specify output format; one of lav, axt, maf, maf+, maf-,
                                  text, lav+text, cigar, rdotplot, general, or
                                  general:<fields>
                                  (by default output is LAV)
  --rdotplot=<file>               create an additional output file suitable for plotting the
                                  alignments with the R statistical package.
  --markend                       Just before normal completion, write "# lastz end-of-file"
                                  to output file
  --census[=<output_file>]        count and report how many times each target base aligns, up
                                  to 255. Ns are included in the count
  --census16[=<output_file>]      count and report how many times each target base aligns, up
                                  to 65 thousand
  --census32[=<output_file>]      count and report how many times each target base aligns, up
                                  to 4 billion
  --writecapsule=<capsule_file>   just write out a target capsule file and quit; don't
                                  search for seeds or perform subsequent stages
  --verbosity=<level>             set info level (0 is minimum, 10 is everything)
                                  (default is 0)
  --[no]runtime                   report runtime in the output file
                                  (default is to not report runtime)
  --tableonly[=count]             just produce the target position table, don't
                                  search for seeds
  --[no]stats[=<file>]            show search statistics (or don't)
                                  (not available in this build)
  --version                       report the program version and quit
  --help                          list all options
  --help=files                    list information about file specifiers
  --help=short[cuts]              list blastz-compatible shortcuts
  --help=yasra                    list yasra-specific shortcuts
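
As a rough illustration of how a few of the options listed above fit together, the Python sketch below assembles a lastz argument list. It is an assumption-laden example rather than the code of ``lastz_wrapper.py``: the function name and the file names are hypothetical, and only options that appear in the list above (``--identity``, ``--coverage``, ``--format``, ``--output``) are used.

```python
def build_lastz_args(target, query, min_identity=None, max_identity=None,
                     min_coverage=None, out_format=None, output=None):
    """Assemble a lastz argument list (hypothetical helper, illustration only)."""
    args = ["lastz", target, query]

    if min_identity is not None:
        upper = 100 if max_identity is None else max_identity
        # The option list requires 0 <= min <= max <= 100.
        if not 0 <= min_identity <= upper <= 100:
            raise ValueError("identity must satisfy 0 <= min <= max <= 100")
        spec = "--identity=%s" % min_identity
        if max_identity is not None:
            spec += "..%s" % max_identity
        args.append(spec)

    if min_coverage is not None:
        if not 0 <= min_coverage <= 100:
            raise ValueError("coverage must be between 0 and 100")
        args.append("--coverage=%s" % min_coverage)

    if out_format is not None:
        args.append("--format=%s" % out_format)
    if output is not None:
        args.append("--output=%s" % output)
    return args


if __name__ == "__main__":
    # File names here are placeholders, not files shipped with Galaxy.
    print(" ".join(build_lastz_args("ref.fa", "reads.fa",
                                    min_identity=95, max_identity=100,
                                    min_coverage=90,
                                    out_format="general",
                                    output="out.tab")))
```

Building the command as a list rather than a single string keeps each option a separate argument, which avoids shell-quoting problems if the list is later passed to something like ``subprocess.run``.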