indel_analysis.py --input=$input1 --threshold=$threshold --out_ins=$out_ins --out_del=$out_del **What it does** Given an input sam file, this tool provides analysis of the indels. It filters out matches that do not meet the frequency threshold. The way this frequency of occurence is calculated is different for deletions and insertions. The CIGAR string's "M" can indicate an exact match or a mismatch. For SAM containing the following bits of information (assuming the reference "ACTGCTCGAT"):: CHROM POS CIGAR SEQ ref 3 2M1I3M TACTTC ref 1 2M1D3M ACGCT ref 4 4M2I3M GTTCAAGAT ref 2 2M2D3M CTCCG ref 1 3M1D4M AACCTGG ref 6 3M1I2M TTCAAT ref 5 3M1I3M CTCTGTT ref 7 4M CTAT ref 5 5M CGCTA ref 3 2M1D2M TGCC The following totals would be calculated (this is an intermediate step and not output):: ------------------------------------------------------------------------------------------------------- POS BASE NUMREADS DELPROPCALC DELPROP INSPROPSTARTCALC INSSTARTPROP INSPROPENDCALC INSENDPROP ------------------------------------------------------------------------------------------------------- 1 A 2 2/2 1.00 --- --- --- --- 2 A 1 1/3 0.33 --- --- --- --- C 2 2/3 0.67 --- --- --- --- 3 C 1 1/5 0.20 --- --- --- --- T 3 3/5 0.60 --- --- --- --- - 1 1/5 0.20 --- --- --- --- 4 A 1 1/6 0.17 --- --- --- --- G 3 3/6 0.50 --- --- --- --- - 1 1/6 0.17 --- --- --- --- -- 1 1/6 0.17 --- --- --- --- 5 C 4 4/7 0.57 --- --- --- --- T 2 2/7 0.29 --- --- --- --- - 1 1/7 0.14 --- --- --- --- +C 1 --- --- 1/7 0.14 1/9 0.11 6 C 2 2/9 0.22 --- --- --- --- G 1 1/9 0.11 --- --- --- --- T 6 6/9 0.67 --- --- --- --- 7 C 7 7/9 0.78 --- --- --- --- G 1 1/9 0.11 --- --- --- --- T 1 1/9 0.11 --- --- --- --- 8 C 1 1/7 0.14 --- --- --- --- G 4 4/7 0.57 --- --- --- --- T 2 2/7 0.29 --- --- --- --- +T 1 --- --- 1/8 0.13 1/6 0.17 +AA 1 --- --- 1/8 0.13 1/6 0.17 9 A 4 4/5 0.80 --- --- --- --- T 1 1/5 0.20 --- --- --- --- +A 1 --- --- 1/6 0.17 1/5 0.20 10 T 4 4/4 1.00 --- --- --- --- The general idea for calculating these is that we want to find out the proportion of times a particular event occurred at a position among all reads that touch that base in some way. First, the basic total number of reads at a given position is the number of reads with each particular base plus the number of reads with that a deletion at that given position (including the bases that are "mismatches"). Note that deletions of two bases and one base would be counted completely separately. Insertions are not counted in this total. For position 4 above, the reference base is G, and there are 3 occurrences of it along with one mismatching base, A. Also, there is a 1-base deletion and another 2-base deletion. So there are a total of 5 matches/mismatches/deletions, and the proportions for each base are 1/6 = 0.17 (A) and 3/6 = 0.50 (G), and for each deletion it is 1/6 = 0.17. Insertions are slightly more complicated. We actually want to get the frequency of occurrence for both the associated start and end positions, since an insertion appears between those two bases. Each insertion is regarded individually, and the total number of occurrences of that insertion is divided by the sum of the number of its occurrences and the basic total for either the start or end. So for the insertions at position 8, there are a total of 7 matches/mismatches/deletions at position 8, and two insertions that each occur once, so each has an INSSTARTPROP of 1/8 = 0.13. For the end position there are 5 matches/mismatches/deletions, so the INSENDPROP is 1/6 = 0.17 for both insertions (T and AA). These proportions (DELPROP and either INSSTARTPROP or INSENDPROP) need to be greater than the threshold frequency specified by the user in order for that base, deletion or insertion to be included in the output. ** Output format ** The output varies for deletions and insertions, although for both, the first three columns are chromosome, start position, and end position. Columns in the deletions file:: Column Description ----------------------------- --------------------------------------------------------------------------------------------------- 1 Chrom Chromosome 2 Start Starting position 3 End Ending position 4 Coverage Number of reads containing this exact deletion 5 Frequency Percentage Frequency of this exact deletion (2 and 1 are mutually exclusive, for instance), as percentage (%) Columns in the insertions file:: Column Description ------------------------ ----------------------------------------------------------------------------------------------------------------- 1 Chrom Chromosome 2 Start Starting position 3 End Ending position (always Start + 1 for insertions) 4 Inserted Base(s) The exact base(s) inserted at Start position 5 Coverage Number of reads containing this exact insertion 6 Freq. Perc. at Start Frequency of this exact insertion given Start position ("GG" and "G" are considered distinct), as percentage (%) 7 Freq. Perc. at End Frequency of this exact insertion given End position ("GG" and "G" are considered distinct), as percentage (%) Before using this tool, you may will want to use the Filter SAM for indels tool to filter out indels on bases with insufficient quality scores, but this is not required. ----- **Example** If you set the threshold to 0.0 and have the following SAM file:: r327 16 chrM 11 37 8M1D10M * 0 0 CTTACCAGATAGTCATCA -+<2;?@BA@?-,.+4=4 XT:A:U NM:i:1 X0:i:1 X1:i:0 XM:i:0 XO:i:1 XG:i:1 MD:Z:41^C35 r457 0 chr1 14 37 14M * 0 0 ACCTGACAGATATC =/DF;?@1A@?-,. XT:A:U NM:i:0 X0:i:1 X1:i:0 XM:i:0 XO:i:0 XG:i:0 MD:Z:76 r501 16 chrM 6 23 7M1I13M * 0 0 TCTGTGCCTACCAGACATTCA +=$2;?@BA@?-,.+4=4=4A XT:A:U NM:i:3 X0:i:1 X1:i:1 XM:i:2 XO:i:1 XG:i:1 MD:Z:28C36G9 XA:Z:chrM,+134263658,14M1I61M,4; r1288 16 chrM 8 37 11M1I7M * 0 0 TCACTTACCTGTACACACA /*F2;?@%A@?-,.+4=4= XT:A:U NM:i:4 X0:i:1 X1:i:0 XM:i:3 XO:i:1 XG:i:1 MD:Z:2T0T1A69 r1902 0 chr1 4 37 7M2D18M * 0 0 AGTCTCTTACCTGACGGTTATGA <2;?@BA@?-,.+4=4=4AA663 XT:A:U NM:i:3 X0:i:1 X1:i:0 XM:i:1 XO:i:1 XG:i:2 MD:Z:17^CA58A0 r2204 16 chrM 9 0 19M * 0 0 CTGGTACCTGACAGGTATC 2;?@BA@?-,.+4=4=4AA XT:A:R NM:i:1 X0:i:2 X1:i:0 XM:i:1 XO:i:0 XG:i:0 MD:Z:0T75 XA:Z:chrM,-564927,76M,1; r2314 16 chrM 6 37 10M2D8M * 0 0 TCACTCTTACGTCTGA <2;?@BA@?-,.+4=4 XT:A:U NM:i:3 X0:i:1 X1:i:0 XM:i:1 XO:i:1 XG:i:2 MD:Z:25A5^CA45 r3001 0 chrM 13 37 3M1D5M2I7M * 0 0 TACAGTCACCCTCATCA <2;?@BA/(@?-,$& XT:A:U NM:i:3 X0:i:1 X1:i:0 XM:i:1 XO:i:1 XG:i:2 MD:Z:17^CA58A0 r3218 0 chr1 13 37 8M1D7M * 0 0 TACAGTCACTCATCA <2;?@BA/(@?-,$& XT:A:U NM:i:3 X0:i:1 X1:i:0 XM:i:1 XO:i:1 XG:i:2 MD:Z:17^CA58A0 r4767 16 chr2 3 37 15M2I7M * 0 0 CAGACTCTCTTACCAAAGACAGAC <2;?@BA/(@?-,.+4=4=4AA66 XT:A:U NM:i:4 X0:i:1 X1:i:0 XM:i:3 XO:i:1 XG:i:1 MD:Z:2T1A4T65 r5333 0 chrM 5 37 17M1D8M * 0 0 GTCTCTCATACCAGACAACGGCAT FB3$@BA/(@?-,.+4=4=4AA66 XT:A:U NM:i:4 X0:i:1 X1:i:0 XM:i:3 XO:i:1 XG:i:1 MD:Z:45C10^C0C5C13 r6690 16 chrM 7 23 20M * 0 0 CTCTCTTACCAGACAGACAT 2;?@BA/(@?-,.+4=4=4A XT:A:U NM:i:0 X0:i:1 X1:i:1 XM:i:0 XO:i:0 XG:i:0 MD:Z:76 XA:Z:chrM,-568532,76M,1; r7211 0 chrM 7 37 24M * 0 0 CGACAGAGACAAAATAACATTTAA //<2;?@BA@?-,.+4=442;;6: XT:A:U NM:i:3 X0:i:1 X1:i:0 XM:i:2 XO:i:1 XG:i:1 MD:Z:73G0G0 r9922 16 chrM 4 0 7M3I9M * 0 0 CCAGACATTTGAAATCAGG F/D4=44^D++26632;;6 XT:A:U NM:i:0 X0:i:1 X1:i:1 XM:i:0 XO:i:0 XG:i:0 MD:Z:76 r9987 16 chrM 4 0 9M1I18M * 0 0 AGGTTCTCATTACCTGACACTCATCTTG G/AD6"/+4=4426632;;6:<2;?@BA XT:A:U NM:i:0 X0:i:1 X1:i:1 XM:i:0 XO:i:0 XG:i:0 MD:Z:76 r10145 16 chr1 16 0 5M2D7M * 0 0 CACATTGTTGTA G//+4=44=4AA XT:A:U NM:i:0 X0:i:1 X1:i:1 XM:i:0 XO:i:0 XG:i:0 MD:Z:76 r10324 16 chrM 15 0 6M1D5M * 0 0 CCGTTCTACTTG A@??8.G//+4= XT:A:U NM:i:0 X0:i:1 X1:i:1 XM:i:0 XO:i:0 XG:i:0 MD:Z:76 r12331 16 chrM 17 0 4M2I6M * 0 0 AGTCGAATACGTG 632;;6:<2;?@B XT:A:U NM:i:0 X0:i:1 X1:i:1 XM:i:0 XO:i:0 XG:i:0 MD:Z:76 r12914 16 chr2 24 0 4M3I3M * 0 0 ACTACCCCAA G//+4=42,. XT:A:U NM:i:0 X0:i:1 X1:i:1 XM:i:0 XO:i:0 XG:i:0 MD:Z:76 The following will be produced (deletions file followed by insertions file):: chr1 11 13 1 100.00 chr1 21 22 1 25.00 chr1 21 23 1 25.00 chrM 16 18 1 9.09 chrM 19 20 1 8.33 chrM 21 22 1 9.09 chrM 22 23 1 9.09 chr2 18 19 AA 1 50.00 50.00 chr2 28 29 CCC 1 50.00 50.00 chrM 11 12 TTT 1 9.09 9.09 chrM 13 14 C 1 9.09 9.09 chrM 13 14 T 1 9.09 9.09 chrM 19 20 T 1 7.69 8.33 chrM 21 22 GA 1 8.33 8.33