| 1 | <tool id="rgClean1" name="Clean genotypes:"> |
---|
| 2 | <code file="rgClean_code.py"/> |
---|
| 3 | |
---|
| 4 | <description>filter markers, subjects</description> |
---|
| 5 | |
---|
| 6 | <command interpreter="python"> |
---|
| 7 | rgClean.py '$input_file.extra_files_path' '$input_file.metadata.base_name' '$title' '$mind' |
---|
| 8 | '$geno' '$hwe' '$maf' '$mef' '$mei' '$out_file1' '$out_file1.files_path' |
---|
| 9 | '$relfilter' '$afffilter' '$sexfilter' '$fixaff' |
---|
| 10 | </command> |
---|
| 11 | |
---|
| 12 | <inputs> |
---|
| 13 | <param name="input_file" type="data" label="RGenetics genotype library file in compressed Plink format" |
---|
| 14 | size="120" format="pbed" /> |
---|
| 15 | <param name="title" type="text" size="80" label="Descriptive title for cleaned genotype file" value="Cleaned_data"/> |
---|
| 16 | <param name="geno" type="text" label="Maximum Missing Fraction: Markers" value="0.05" /> |
---|
| 17 | <param name="mind" type="text" value="0.1" label="Maximum Missing Fraction: Subjects"/> |
---|
| 18 | <param name="mef" type="text" label="Maximum Mendel Error Rate: Family" value="0.05"/> |
---|
| 19 | <param name="mei" type="text" label="Maximum Mendel Error Rate: Marker" value="0.05"/> |
---|
| 20 | <param name="hwe" type="text" value="0" label="Smallest HWE p value (set to 0 for all)" /> |
---|
| 21 | <param name="maf" type="text" value="0.01" |
---|
| 22 | label="Smallest Minor Allele Frequency (set to 0 for all)"/> |
---|
| 23 | <param name='relfilter' label = "Filter on pedigree relatedness" type="select" |
---|
| 24 | optional="false" size="132" |
---|
| 25 | help="Optionally remove related subjects if pedigree identifies founders and their offspring"> |
---|
| 26 | <option value="all" selected='true'>No filter on relatedness</option> |
---|
| 27 | <option value="fo" >Keep Founders only (pedigree m/f ID = "0")</option> |
---|
| 28 | <option value="oo" >Keep Offspring only (one randomly chosen if >1 sibs in family)</option> |
---|
| 29 | </param> |
---|
| 30 | <param name='afffilter' label = "Filter on affection status" type="select" |
---|
| 31 | optional="false" size="132" |
---|
| 32 | help="Optionally remove affected or non affected subjects"> |
---|
| 33 | <option value="allaff" selected='true'>No filter on affection status</option> |
---|
| 34 | <option value="affonly" >Keep Controls only (affection='1')</option> |
---|
| 35 | <option value="unaffonly" >Keep Cases only (affection='2')</option> |
---|
| 36 | </param> |
---|
| 37 | <param name='sexfilter' label = "Filter on gender" type="select" |
---|
| 38 | optional="false" size="132" |
---|
| 39 | help="Optionally remove all male or all female subjects"> |
---|
| 40 | <option value="allsex" selected='true'>No filter on gender status</option> |
---|
| 41 | <option value="msex" >Keep Males only (pedigree gender='1')</option> |
---|
| 42 | <option value="fsex" >Keep Females only (pedigree gender='2')</option> |
---|
| 43 | </param> |
---|
| 44 | <param name="fixaff" type="text" value="0" |
---|
| 45 | label = "Change ALL subjects affection status to (0=no change,1=unaff,2=aff)" |
---|
| 46 | help="Use this option to switch the affection status to a new value for all output subjects" /> |
---|
| 47 | </inputs> |
---|
| 48 | |
---|
| 49 | <outputs> |
---|
| 50 | <data format="pbed" name="out_file1" metadata_source="input_file" /> |
---|
| 51 | </outputs> |
---|
| 52 | |
---|
| 53 | <tests> |
---|
| 54 | <test> |
---|
| 55 | <param name='input_file' value='tinywga' ftype='pbed' > |
---|
| 56 | <metadata name='base_name' value='tinywga' /> |
---|
| 57 | <composite_data value='tinywga.bim' /> |
---|
| 58 | <composite_data value='tinywga.bed' /> |
---|
| 59 | <composite_data value='tinywga.fam' /> |
---|
| 60 | <edit_attributes type='name' value='tinywga' /> |
---|
| 61 | </param> |
---|
| 62 | <param name='title' value='rgCleantest1' /> |
---|
| 63 | <param name="geno" value="1" /> |
---|
| 64 | <param name="mind" value="1" /> |
---|
| 65 | <param name="mef" value="0" /> |
---|
| 66 | <param name="mei" value="0" /> |
---|
| 67 | <param name="hwe" value="0" /> |
---|
| 68 | <param name="maf" value="0" /> |
---|
| 69 | <param name="relfilter" value="all" /> |
---|
| 70 | <param name="afffilter" value="allaff" /> |
---|
| 71 | <param name="sexfilter" value="allsex" /> |
---|
| 72 | <param name="fixaff" value="0" /> |
---|
| 73 | <output name='out_file1' file='rgtestouts/rgClean/rgCleantest1.pbed' compare="diff" lines_diff="25" > |
---|
| 74 | <extra_files type="file" name='rgCleantest1.bim' value="rgtestouts/rgClean/rgCleantest1.bim" compare="diff" /> |
---|
| 75 | <extra_files type="file" name='rgCleantest1.fam' value="rgtestouts/rgClean/rgCleantest1.fam" compare="diff" /> |
---|
| 76 | <extra_files type="file" name='rgCleantest1.bed' value="rgtestouts/rgClean/rgCleantest1.bed" compare="diff" /> |
---|
| 77 | </output> |
---|
| 78 | </test> |
---|
| 79 | </tests> |
---|
| 80 | <help> |
---|
| 81 | |
---|
| 82 | .. class:: infomark |
---|
| 83 | |
---|
| 84 | **Syntax** |
---|
| 85 | |
---|
| 86 | - **Genotype data** is the input genotype file chosen from your current history |
---|
| 87 | - **Descriptive title** is the name to use for the filtered output file |
---|
| 88 | - **Missfrac threshold: subjects** is the threshold for missingness by subject. Subjects with more than this fraction of missing genotypes are excluded from the output |
---|
| 89 | - **Missfrac threshold: markers** is the threshold for missingness by marker. Markers with more than this fraction of missing genotypes are excluded from the output |
---|
| 90 | - **MaxMendel Individuals** is the Mendelian error fraction above which subjects are excluded (for family data only) |
---|
| 91 | - **MaxMendel Families** is the Mendelian error fraction above which families are excluded (for family data only) |
---|
| 92 | - **HWE** is the threshold for Hardy-Weinberg equilibrium (HWE) test p values. Markers with smaller p values are excluded. Set this to 0 to keep all markers regardless of HWE p value |
---|
| 93 | - **MAF** is the threshold for minor allele frequency - SNPs with lower MAF will be excluded |
---|
| 94 | - **Filters** for founders/offspring or affected/unaffected or males/females are optionally available if needed |
---|
| 95 | - **Change Affection** is only needed if you want to change the affection status for creating new analysis datasets |
---|
| 96 | |
---|
| 97 | ----- |
---|
| 98 | |
---|
| 99 | **Attribution** |
---|
| 100 | |
---|
| 101 | This tool relies on the work of many people. It uses Plink (http://pngu.mgh.harvard.edu/~purcell/plink/) |
---|
| 102 | and the R (http://cran.r-project.org/) and |
---|
| 103 | Bioconductor (http://www.bioconductor.org/) |
---|
| 104 | projects. |
---|
| 105 | |
---|
| 106 | In particular, http://pngu.mgh.harvard.edu/~purcell/plink/ |
---|
| 107 | has excellent documentation describing the parameters you can set here. |
---|
| 108 | |
---|
| 109 | This implementation is a Galaxy tool wrapper around these third party applications. |
---|
| 110 | It was originally designed and written for family based data from the CAMP Illumina run of 2007 by |
---|
| 111 | Ross Lazarus (ross.lazarus@gmail.com) and incorporated into the Rgenetics toolkit. |
---|
| 112 | |
---|
| 113 | Rgenetics merely exposes these applications, wrapping Plink so you can use it in Galaxy. |
---|
| 114 | |
---|
| 115 | ----- |
---|
| 116 | |
---|
| 117 | **Summary** |
---|
| 118 | |
---|
| 119 | Reliable statistical inference depends on reliable data. Poor quality samples and markers |
---|
| 120 | may add more noise than signal, decreasing statistical power. Removing the worst of them |
---|
| 121 | can be done by setting thresholds for some of the commonly used technical quality measures |
---|
| 122 | for genotype data. Of course discordant replicate calls are also very informative but are not |
---|
| 123 | in scope here. |
---|
| 124 | |
---|
| 125 | Marker cleaning: Filters are available to remove markers below a minor allele |
---|
| 126 | frequency threshold, beyond a Hardy-Weinberg threshold, |
---|
| 127 | or above a threshold for missingness. If family data are available, thresholds for Mendelian |
---|
| 128 | error can be set. |
---|
| 129 | |
---|
| 130 | Subject cleaning: Filters are available to remove subjects with many missing calls. Subjects and markers for family data can be filtered by proportions |
---|
| 131 | of Mendelian errors in observed transmission. Use the QC reporting tool to |
---|
| 132 | generate a comprehensive series of reports for quality control. |
---|
| 133 | |
---|
| 134 | Note that ancestry and cryptic relatedness should also be checked using the relevant tools. |
---|
| 135 | |
---|
| 136 | ----- |
---|
| 137 | |
---|
| 138 | .. class:: infomark |
---|
| 139 | |
---|
| 140 | **Tip** |
---|
| 141 | |
---|
| 142 | You can check that you got what you asked for by running the QC tool to ensure that the distributions |
---|
| 143 | are truncated the way you expect. Do not expect the results to match the thresholds you set |
---|
| 144 | exactly: some bad assays and subjects fail multiple QC measures, so the number of samples or markers |
---|
| 145 | remaining may differ from what each threshold alone would give. Finally, the ordering of |
---|
| 146 | operations matters and Plink is somewhat restrictive about what it will do on each pass |
---|
| 147 | of the data. At least the ordering is fixed. |
---|
| 148 | |
---|
| 149 | ----- |
---|
| 150 | |
---|
| 151 | This Galaxy tool was written by Ross Lazarus for the Rgenetics project. |
---|
| 152 | It uses Plink for most calculations, plus some custom Python code. For full Plink attribution, source code and documentation, |
---|
| 153 | please see http://pngu.mgh.harvard.edu/~purcell/plink/ |
---|
| 154 | |
---|
| 155 | </help> |
---|
| 156 | </tool> |
---|
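
The `<command>` template above passes fifteen positional arguments to rgClean.py, which is not shown here. As a rough illustration of how a wrapper script might name those arguments and turn the thresholds into a Plink call, here is a minimal sketch. The names `CleanArgs`, `parse_args` and `plink_command` are hypothetical, as is the choice of a single Plink invocation; the real script may well run several passes, and the affection filter, `fixaff`, and the random choice of one sibling for the `oo` option are deliberately left out. The Plink flags themselves (`--bfile`, `--mind`, `--geno`, `--hwe`, `--maf`, `--me`, `--make-bed`, `--out`, `--noweb`, `--filter-founders`, `--filter-nonfounders`, `--filter-males`, `--filter-females`) are documented Plink 1.07 options.

```python
from collections import namedtuple

# Field order mirrors the positional arguments in the <command> template.
# The field names are illustrative; only the order comes from the tool XML.
CleanArgs = namedtuple(
    "CleanArgs",
    "in_dir base_name title mind geno hwe maf mef mei "
    "out_file out_dir relfilter afffilter sexfilter fixaff",
)

# Assumed mapping from the tool's select values to Plink filter flags.
REL_FLAGS = {"fo": "--filter-founders", "oo": "--filter-nonfounders"}
SEX_FLAGS = {"msex": "--filter-males", "fsex": "--filter-females"}


def parse_args(argv):
    """Name the 15 positional arguments (argv excludes the script name)."""
    expected = len(CleanArgs._fields)
    if len(argv) != expected:
        raise ValueError("expected %d arguments, got %d" % (expected, len(argv)))
    return CleanArgs(*argv)


def plink_command(a):
    """Build a Plink argument list (a sketch, not the real rgClean.py logic)."""
    cmd = [
        "plink", "--noweb",
        "--bfile", "%s/%s" % (a.in_dir, a.base_name),
        "--mind", a.mind, "--geno", a.geno,
        "--hwe", a.hwe, "--maf", a.maf,
        "--me", a.mef, a.mei,  # --me takes the family rate, then the marker rate
        "--make-bed", "--out", "%s/%s" % (a.out_dir, a.title),
    ]
    # Append a filter flag only when a non-default select value was chosen.
    for table, key in ((REL_FLAGS, a.relfilter), (SEX_FLAGS, a.sexfilter)):
        if key in table:
            cmd.append(table[key])
    return cmd
```

A script using this would call `parse_args(sys.argv[1:])` and hand the resulting list to `subprocess`. Keeping the values as strings avoids reformatting the thresholds the user typed into the form.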