| 1 | <tool id="rgClean1" name="Clean genotypes:"> |
|---|
| 2 | <code file="rgClean_code.py"/> |
|---|
| 3 | |
|---|
| 4 | <description>filter markers, subjects</description> |
|---|
| 5 | |
|---|
| 6 | <command interpreter="python"> |
|---|
| 7 | rgClean.py '$input_file.extra_files_path' '$input_file.metadata.base_name' '$title' '$mind' |
|---|
| 8 | '$geno' '$hwe' '$maf' '$mef' '$mei' '$out_file1' '$out_file1.files_path' |
|---|
| 9 | '$relfilter' '$afffilter' '$sexfilter' '$fixaff' |
|---|
| 10 | </command> |
|---|
| 11 | |
|---|
| 12 | <inputs> |
|---|
| 13 | <param name="input_file" type="data" label="RGenetics genotype library file in compressed Plink format" |
|---|
| 14 | size="120" format="pbed" /> |
|---|
| 15 | <param name="title" type="text" size="80" label="Descriptive title for cleaned genotype file" value="Cleaned_data"/> |
|---|
| 16 | <param name="geno" type="text" label="Maximum Missing Fraction: Markers" value="0.05" /> |
|---|
| 17 | <param name="mind" type="text" value="0.1" label="Maximum Missing Fraction: Subjects"/> |
|---|
| 18 | <param name="mef" type="text" label="Maximum Mendel Error Rate: Family" value="0.05"/> |
|---|
| 19 | <param name="mei" type="text" label="Maximum Mendel Error Rate: Marker" value="0.05"/> |
|---|
| 20 | <param name="hwe" type="text" value="0" label="Smallest HWE p value (set to 0 for all)" /> |
|---|
| 21 | <param name="maf" type="text" value="0.01" |
|---|
| 22 | label="Smallest Minor Allele Frequency (set to 0 for all)"/> |
|---|
| 23 | <param name='relfilter' label = "Filter on pedigree relatedness" type="select" |
|---|
| 24 | optional="false" size="132" |
|---|
| 25 | help="Optionally remove related subjects if pedigree identifies founders and their offspring"> |
|---|
| 26 | <option value="all" selected='true'>No filter on relatedness</option> |
|---|
| 27 | <option value="fo" >Keep Founders only (pedigree m/f ID = "0")</option> |
|---|
| 28 | <option value="oo" >Keep Offspring only (one randomly chosen if >1 sibs in family)</option> |
|---|
| 29 | </param> |
|---|
| 30 | <param name='afffilter' label = "Filter on affection status" type="select" |
|---|
| 31 | optional="false" size="132" |
|---|
| 32 | help="Optionally remove affected or non affected subjects"> |
|---|
| 33 | <option value="allaff" selected='true'>No filter on affection status</option> |
|---|
| 34 | <option value="affonly" >Keep Controls only (affection='1')</option> |
|---|
| 35 | <option value="unaffonly" >Keep Cases only (affection='2')</option> |
|---|
| 36 | </param> |
|---|
| 37 | <param name='sexfilter' label = "Filter on gender" type="select" |
|---|
| 38 | optional="false" size="132" |
|---|
| 39 | help="Optionally remove all male or all female subjects"> |
|---|
| 40 | <option value="allsex" selected='true'>No filter on gender status</option> |
|---|
| 41 | <option value="msex" >Keep Males only (pedigree gender='1')</option> |
|---|
| 42 | <option value="fsex" >Keep Females only (pedigree gender='2')</option> |
|---|
| 43 | </param> |
|---|
| 44 | <param name="fixaff" type="text" value="0" |
|---|
| 45 | label = "Change ALL subjects affection status to (0=no change,1=unaff,2=aff)" |
|---|
| 46 | help="Use this option to switch the affection status to a new value for all output subjects" /> |
|---|
| 47 | </inputs> |
|---|
| 48 | |
|---|
| 49 | <outputs> |
|---|
| 50 | <data format="pbed" name="out_file1" metadata_source="input_file" /> |
|---|
| 51 | </outputs> |
|---|
| 52 | |
|---|
| 53 | <tests> |
|---|
| 54 | <test> |
|---|
| 55 | <param name='input_file' value='tinywga' ftype='pbed' > |
|---|
| 56 | <metadata name='base_name' value='tinywga' /> |
|---|
| 57 | <composite_data value='tinywga.bim' /> |
|---|
| 58 | <composite_data value='tinywga.bed' /> |
|---|
| 59 | <composite_data value='tinywga.fam' /> |
|---|
| 60 | <edit_attributes type='name' value='tinywga' /> |
|---|
| 61 | </param> |
|---|
| 62 | <param name='title' value='rgCleantest1' /> |
|---|
| 63 | <param name="geno" value="1" /> |
|---|
| 64 | <param name="mind" value="1" /> |
|---|
| 65 | <param name="mef" value="0" /> |
|---|
| 66 | <param name="mei" value="0" /> |
|---|
| 67 | <param name="hwe" value="0" /> |
|---|
| 68 | <param name="maf" value="0" /> |
|---|
| 69 | <param name="relfilter" value="all" /> |
|---|
| 70 | <param name="afffilter" value="allaff" /> |
|---|
| 71 | <param name="sexfilter" value="allsex" /> |
|---|
| 72 | <param name="fixaff" value="0" /> |
|---|
| 73 | <output name='out_file1' file='rgtestouts/rgClean/rgCleantest1.pbed' compare="diff" lines_diff="25" > |
|---|
| 74 | <extra_files type="file" name='rgCleantest1.bim' value="rgtestouts/rgClean/rgCleantest1.bim" compare="diff" /> |
|---|
| 75 | <extra_files type="file" name='rgCleantest1.fam' value="rgtestouts/rgClean/rgCleantest1.fam" compare="diff" /> |
|---|
| 76 | <extra_files type="file" name='rgCleantest1.bed' value="rgtestouts/rgClean/rgCleantest1.bed" compare="diff" /> |
|---|
| 77 | </output> |
|---|
| 78 | </test> |
|---|
| 79 | </tests> |
|---|
| 80 | <help> |
|---|
| 81 | |
|---|
| 82 | .. class:: infomark |
|---|
| 83 | |
|---|
| 84 | **Syntax** |
|---|
| 85 | |
|---|
| 86 | - **Genotype data** is the input genotype file chosen from your current history |
|---|
| 87 | - **Descriptive title** is the name to use for the filtered output file |
|---|
| 88 | - **Missfrac threshold: subjects** is the threshold for missingness by subject. Subjects with more than this fraction of genotypes missing will be excluded from the output |
|---|
| 89 | - **Missfrac threshold: markers** is the threshold for missingness by marker. Markers with more than this fraction of genotypes missing will be excluded from the output |
|---|
| 90 | - **MaxMendel Individuals** is the Mendelian error threshold for subjects. Subjects with more than this fraction of Mendelian errors in transmission will be excluded (family data only) |
|---|
| 91 | - **MaxMendel Families** is the Mendelian error threshold for families. Families with more than this fraction of Mendelian errors in transmission will be excluded (family data only) |
|---|
| 92 | - **HWE** is the threshold for HWE test p values. Markers with a p value below it will be excluded. Set this to 0 and all markers will be kept regardless of HWE p value |
|---|
| 93 | - **MAF** is the threshold for minor allele frequency - SNPs with lower MAF will be excluded |
|---|
| 94 | - **Filters** for founders/offspring or affected/unaffected or males/females are optionally available if needed |
|---|
| 95 | - **Change Affection** is only needed if you want to change the affection status for creating new analysis datasets |
|---|
| 96 | |
|---|
| 97 | ----- |
|---|
| 98 | |
|---|
| 99 | **Attribution** |
|---|
| 100 | |
|---|
| 101 | This tool relies on the work of many people. It uses Plink http://pngu.mgh.harvard.edu/~purcell/plink/, |
|---|
| 102 | and the R http://cran.r-project.org/ and |
|---|
| 103 | Bioconductor http://www.bioconductor.org/ projects. |
|---|
| 104 | |
|---|
| 105 | |
|---|
| 106 | In particular, http://pngu.mgh.harvard.edu/~purcell/plink/ |
|---|
| 107 | has excellent documentation describing the parameters you can set here. |
|---|
| 108 | |
|---|
| 109 | This implementation is a Galaxy tool wrapper around these third party applications. |
|---|
| 110 | It was originally designed and written for family based data from the CAMP Illumina run of 2007 by |
|---|
| 111 | Ross Lazarus (ross.lazarus@gmail.com) and incorporated into the Rgenetics toolkit. |
|---|
| 112 | |
|---|
| 113 | Rgenetics merely exposes these applications, wrapping Plink so you can use it in Galaxy. |
|---|
| 114 | |
|---|
| 115 | ----- |
|---|
| 116 | |
|---|
| 117 | **Summary** |
|---|
| 118 | |
|---|
| 119 | Reliable statistical inference depends on reliable data. Poor quality samples and markers |
|---|
| 120 | may add more noise than signal, decreasing statistical power. Removing the worst of them |
|---|
| 121 | can be done by setting thresholds for some of the commonly used technical quality measures |
|---|
| 122 | for genotype data. Of course discordant replicate calls are also very informative but are not |
|---|
| 123 | in scope here. |
|---|
| 124 | |
|---|
| 125 | Marker cleaning: Filters are available to remove markers below a minor allele |
|---|
| 126 | frequency threshold, with a Hardy-Weinberg test p value below a threshold, |
|---|
| 127 | or above a threshold for missingness. If family data are available, thresholds for Mendelian |
|---|
| 128 | error can be set. |
|---|
| 129 | |
|---|
| 130 | Subject cleaning: Filters are available to remove subjects with many missing calls. Subjects and markers for family data can be filtered by proportions |
|---|
| 131 | of Mendelian errors in observed transmission. Use the QC reporting tool to |
|---|
| 132 | generate a comprehensive series of reports for quality control. |
|---|
| 133 | |
|---|
| 134 | Note that ancestry and cryptic relatedness should also be checked using the relevant tools. |
|---|
| 135 | |
|---|
| 136 | ----- |
|---|
| 137 | |
|---|
| 138 | .. class:: infomark |
|---|
| 139 | |
|---|
| 140 | **Tip** |
|---|
| 141 | |
|---|
| 142 | You can check that you got what you asked for by running the QC tool to ensure that the distributions |
|---|
| 143 | are truncated the way you expect. Note that the results will not necessarily match the thresholds |
|---|
| 144 | exactly - some bad assays and subjects fail multiple QC measures, so the number of samples or |
|---|
| 145 | markers removed may differ from what each threshold alone would suggest. Finally, the ordering of |
|---|
| 146 | operations matters and Plink is somewhat restrictive about what it will do on each pass |
|---|
| 147 | of the data. At least the ordering is fixed. |
|---|
| 148 | |
|---|
| 149 | ----- |
|---|
| 150 | |
|---|
| 151 | This Galaxy tool was written by Ross Lazarus for the Rgenetics project. |
|---|
| 152 | It uses Plink for most calculations, plus some custom Python code. For full Plink attribution, source code and documentation, |
|---|
| 153 | please see http://pngu.mgh.harvard.edu/~purcell/plink/ |
|---|
| 154 | |
|---|
| 155 | </help> |
|---|
| 156 | </tool> |
|---|
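
The help text above describes thresholds that the wrapper passes on to Plink. As a rough illustration only, the sketch below shows how such thresholds could be turned into a Plink 1.x command line. It is not the code in rgClean.py, which is not shown here. The function name `build_plink_filter_args` and its keyword arguments are hypothetical, and the choice to run everything in one pass is an assumption: as the Tip notes, Plink is restrictive about what it does on each pass, so a real wrapper may need several.

```python
# Hypothetical sketch: NOT the actual rgClean.py implementation.
# It maps the tool's threshold parameters onto standard Plink 1.x flags.

# Assumed mapping from the tool's sexfilter values to Plink filter flags.
SEX_FLAGS = {"msex": "--filter-males", "fsex": "--filter-females"}


def build_plink_filter_args(bfile, out, mind="0.1", geno="0.05", hwe="0",
                            maf="0.01", mef="0.05", mei="0.05",
                            sexfilter="allsex"):
    """Return an argument list for one illustrative Plink cleaning pass."""
    args = ["plink", "--noweb", "--bfile", bfile]
    # Missingness thresholds for subjects (--mind) and markers (--geno).
    args += ["--mind", mind, "--geno", geno]
    # A threshold of 0 excludes nothing, so the flag is simply left out.
    if float(hwe) > 0:
        args += ["--hwe", hwe]
    if float(maf) > 0:
        args += ["--maf", maf]
    # Plink's --me takes the family threshold first, then the marker threshold.
    args += ["--me", mef, mei]
    # Optional sex filter; "allsex" adds no flag.
    if sexfilter in SEX_FLAGS:
        args.append(SEX_FLAGS[sexfilter])
    # Write the filtered data as a new compressed (bed/bim/fam) fileset.
    args += ["--make-bed", "--out", out]
    return args


if __name__ == "__main__":
    print(" ".join(build_plink_filter_args("tinywga", "Cleaned_data")))
```

With the tool's default values the HWE flag is omitted because its threshold is 0, while the MAF flag is kept because its default is 0.01. The resulting list could be handed to `subprocess.call` on a machine where Plink is installed.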