| 1 | <tool id="rgLDIndep1" name="LD Independent:"> | 
|---|
| 2 | <code file="rgLDIndep_code.py"/> | 
|---|
| 3 |  | 
|---|
| 4 | <description>filter high LD pairs - decrease redundancy</description> | 
|---|
| 5 |  | 
|---|
| 6 | <command interpreter="python"> | 
|---|
| 7 | rgLDIndep.py '$input_file.extra_files_path' '$input_file.metadata.base_name' '$title1' '$mind' | 
|---|
| 8 | '$geno' '$hwe' '$maf' '$mef' '$mei' '$out_file1' | 
|---|
| 9 | '$out_file1.files_path'  '$window' '$step' '$r2' | 
|---|
| 10 | </command> | 
|---|
| 11 |  | 
|---|
| 12 | <inputs> | 
|---|
| 13 | <param name="input_file"  type="data" label="RGenetics genotype data from your current history" | 
|---|
| 14 | size="80" format="pbed" /> | 
|---|
| 15 | <param name="title1" type="text" size="80" label="Descriptive title for cleaned genotype file" value="LD_Independent"/> | 
|---|
| 16 | <param name="r2" type="float" value = "0.1" | 
|---|
| 17 | label="r2 threshold: Select only pairs at or below this r^2 threshold (eg 0.1)" | 
|---|
| 18 | help="LD threshold defining LD independent markers" /> | 
|---|
| 19 | <param name="window" type="integer" value = "40" label="Window: Window size to limit LD pairwise" | 
|---|
| 20 | help = "Bigger is better but time taken blows up exponentially as the window grows!" /> | 
|---|
| 21 | <param name="step" type="integer" value = "30" label="Step: Move window this far and recompute" | 
|---|
| 22 | help = "Smaller is better but of course, time increases..." /> | 
|---|
| 23 | <param name="geno"  type="float" label="Maximum Missing Fraction: Markers" value="1.0" /> | 
|---|
| 24 | <param name="mind" type="float" value="1.0" label="Maximum Missing Fraction: Subjects"/> | 
|---|
| 25 | <param name="mef"  type="float" label="Maximum Mendel Error Rate: Family" value="1.0"/> | 
|---|
| 26 | <param name="mei"  type="float" label="Maximum Mendel Error Rate: Marker" value="1.0"/> | 
|---|
| 27 | <param name="hwe" type="float" value="0.0" label="Smallest HWE p value (set to 0 for all)" /> | 
|---|
| 28 | <param name="maf" type="float" value="0.0" | 
|---|
| 29 | label="Smallest Allowable Minor Allele Frequency (set to 0.0 for all)"/> | 
|---|
| 30 |  | 
|---|
| 31 | </inputs> | 
|---|
| 32 |  | 
|---|
| 33 | <outputs> | 
|---|
| 34 | <data format="pbed" name="out_file1" metadata_source="input_file"  /> | 
|---|
| 35 | </outputs> | 
|---|
| 36 | <tests> | 
|---|
| 37 | <test> | 
|---|
| 38 |  | 
|---|
| 39 | <param name='input_file' value='tinywga' ftype='pbed' > | 
|---|
| 40 | <metadata name='base_name' value='tinywga' /> | 
|---|
| 41 | <composite_data value='tinywga.bim' /> | 
|---|
| 42 | <composite_data value='tinywga.bed' /> | 
|---|
| 43 | <composite_data value='tinywga.fam' /> | 
|---|
| 44 | <edit_attributes type='name' value='tinywga' /> | 
|---|
| 45 | </param> | 
|---|
| 46 | <param name='title1' value='rgLDIndeptest1' /> | 
|---|
| 47 | <param name="mind" value="1" /> | 
|---|
| 48 | <param name="geno" value="1" /> | 
|---|
| 49 | <param name="hwe" value="0" /> | 
|---|
| 50 | <param name="maf" value="0" /> | 
|---|
| 51 | <param name="mef" value="1" /> | 
|---|
| 52 | <param name="mei" value="1" /> | 
|---|
| 53 | <param name="window" value="10000" /> | 
|---|
| 54 | <param name="step" value="5000" /> | 
|---|
| 55 | <param name="r2" value="0.1" /> | 
|---|
| 56 | <output name='out_file1' file='rgtestouts/rgLDIndep/rgLDIndeptest1.pbed' ftype='pbed' compare="diff" lines_diff='7'> | 
|---|
| 57 | <extra_files type="file" name='rgLDIndeptest1.bim' value="rgtestouts/rgLDIndep/rgLDIndeptest1.bim" compare="sim_size" delta="1000"/> | 
|---|
| 58 | <extra_files type="file" name='rgLDIndeptest1.fam' value="rgtestouts/rgLDIndep/rgLDIndeptest1.fam" compare="diff" /> | 
|---|
| 59 | <extra_files type="file" name='rgLDIndeptest1.bed' value="rgtestouts/rgLDIndep/rgLDIndeptest1.bed" compare="sim_size" delta = "1000" /> | 
|---|
| 60 | </output> | 
|---|
| 61 | </test> | 
|---|
| 62 | </tests> | 
|---|
| 63 | <help> | 
|---|
| 64 |  | 
|---|
| 65 | .. class:: infomark | 
|---|
| 66 |  | 
|---|
| 67 | **Attribution** | 
|---|
| 68 |  | 
|---|
| 69 | This tool relies on Plink from Shaun Purcell. For full documentation, please see his web site | 
|---|
| 70 | at http://pngu.mgh.harvard.edu/~purcell/plink/ where there is excellent documentation describing | 
|---|
| 71 | the parameters you can set here. | 
|---|
| 72 |  | 
|---|
| 73 | Rgenetics merely exposes them, wrapping Plink so you can use it in Galaxy. | 
|---|
| 74 |  | 
|---|
| 75 | **Summary** | 
|---|
| 76 |  | 
|---|
| 77 | In addition to filtering some marker and sample quality measures, | 
|---|
| 78 | this tool reduces the amount of overlapping information, by removing | 
|---|
| 79 | most of the duplicate information contained in linkage disequilibrium. This is | 
|---|
| 80 | a lossy process and for some methods, signal may be lost. However, this makes | 
|---|
| 81 | the dataset far more compact (eg 10% of the original storage size) while still | 
|---|
| 82 | being highly informative and less biased for some (note NOT all!) statistical methods. | 
|---|
| 83 | This is the Clean tool with additional data reduction via Plink LD pruning. | 
|---|
| 84 | Use the Clean tool if you don't want LD pruning - which you don't for most statistical testing. | 
|---|
| 85 | For ancestry and relatedness, you may well want LD pruned data as it has | 
|---|
| 86 | some specific desirable properties. | 
|---|
| 87 |  | 
|---|
| 88 | **LD** | 
|---|
| 89 |  | 
|---|
| 90 | Pairwise Linkage disequilibrium (LD) measures the extent to which the genotype at one locus | 
|---|
| 91 | predicts the state of another locus at the level of an entire population. | 
|---|
| 92 | When population LD between a pair of markers is high, | 
|---|
| 93 | knowing an individual's genotype at one locus allows confident prediction of the genotype at the other. | 
|---|
| 94 | In other words, high LD means information redundancy between markers. For some | 
|---|
| 95 | purposes, removing some of this redundancy can improve the performance of some analyses. | 
|---|
| 96 | Executing this tool will create a new genotype dataset in your current history containing | 
|---|
| 97 | LD independent markers - most of the genetic information is retained but without as much redundancy. | 
|---|
| 98 |  | 
|---|
| 99 | Set a pairwise LD threshold (eg r^2 = 0.2) and the (smaller) resulting dataset will have no | 
|---|
| 100 | pairs of marker with r^2 greater than 0.2. Additional filters are available to remove markers | 
|---|
| 101 | below a specific minor allele frequency, or above a specific level of missingness, | 
|---|
| 102 | and to remove subjects using similar criteria. Subjects and markers for family data can be | 
|---|
| 103 | filtered by proportions of Mendelian errors in observed transmission. | 
|---|
| 104 |  | 
|---|
| 105 | ----- | 
|---|
| 106 |  | 
|---|
| 107 | **Syntax** | 
|---|
| 108 |  | 
|---|
| 109 | - **Genotype data** is the input pedfile chosen from available library files | 
|---|
| 110 | - **New name** is the name to use for the filtered output file | 
|---|
| 111 | - **Missfrac threshold: subjects** is the threshold for missingness by subject. Subjects with more than this fraction missing will be excluded from the import | 
|---|
| 112 | - **Missfrac threshold: markers** is the threshold for missingness by marker. Markers with more than this fraction missing will be excluded from the import | 
|---|
| 113 | - **MaxMendel Individuals** Mendel error fraction above which to exclude subjects with more than the specified fraction of mendelian errors in transmission (for family data only) | 
|---|
| 114 | - **MaxMendel Families** Mendel error fraction above which to exclude families with more than the specified fraction of mendelian errors in transmission (for family data only) | 
|---|
| 115 | - **HWE** is the threshold for HWE test p values below which the marker will not be imported. Set this to -1 and all markers will be imported regardless of HWE p value | 
|---|
| 116 | - **MAF** is the threshold for minor allele frequency - SNPs with lower MAF will be excluded | 
|---|
| 117 | - **r^2** is the pairwise LD threshold as r^2. Lower -> less marker redundancy -> fewer markers | 
|---|
| 118 | - **Window** is the window width for LD threshold. Bigger -> slower -> more complete | 
|---|
| 119 | - **Skip** is the distance to move the window along the genome. Should be window or less. | 
|---|
| 120 |  | 
|---|
| 121 | ----- | 
|---|
| 122 |  | 
|---|
| 123 | **Disclaimer** | 
|---|
| 124 |  | 
|---|
| 125 | This tool relies on Plink from Shaun Purcell. For full documentation, please see his web site | 
|---|
| 126 | at http://pngu.mgh.harvard.edu/~purcell/plink/ where thereis excellent documentation describing | 
|---|
| 127 | the parameters you can set here. Rgenetics merely exposes them, and wraps Plink so you can use it in Galaxy. | 
|---|
| 128 |  | 
|---|
| 129 | This tool is designed to create genotype data files with more or less LD independent sets of markers. These | 
|---|
| 130 | reduced genotype data files are particularly useful for purposes such as evaluating | 
|---|
| 131 | ancestry (eg eigenstrat) or relatedness (eg rgGRR) | 
|---|
| 132 |  | 
|---|
| 133 | LD pruning decreases redundancy among the genotype data by removing one of each pair of markers | 
|---|
| 134 | in strong LD (above the r^2 threshold) over successive genomic windows (the Window parameter), | 
|---|
| 135 | skipping (the Skip parameter bases between windows. The defaults should produce useable outputs. | 
|---|
| 136 |  | 
|---|
| 137 | This might be more efficient for rgGRR and | 
|---|
| 138 | eigenstrat...The core quote is | 
|---|
| 139 |  | 
|---|
| 140 | "This generates the same output files as the first version; | 
|---|
| 141 | the only difference is that a simple pairwise threshold is used. | 
|---|
| 142 | The first two parameters (50 and 5) are the same as above (window size and step); | 
|---|
| 143 | the third parameter represents the r^2 threshold. | 
|---|
| 144 | Note: this represents the pairwise SNP-SNP metric now, not the | 
|---|
| 145 | multiple correlation coefficient; also note, this is based on the | 
|---|
| 146 | genotypic correlation, i.e. it does not involve phasing. | 
|---|
| 147 | " | 
|---|
| 148 |  | 
|---|
| 149 | ----- | 
|---|
| 150 |  | 
|---|
| 151 |  | 
|---|
| 152 |  | 
|---|
| 153 | This Galaxy tool was written by Ross Lazarus for the Rgenetics project | 
|---|
| 154 | It uses Plink for most calculations - for full Plink attribution, source code and documentation, | 
|---|
| 155 | please see http://pngu.mgh.harvard.edu/~purcell/plink/ plus some custom python code | 
|---|
| 156 |  | 
|---|
| 157 | </help> | 
|---|
| 158 | </tool> | 
|---|