1 | <tool id="rgClean1" name="Clean genotypes:"> |
---|
2 | <code file="rgClean_code.py"/> |
---|
3 | |
---|
4 | <description>filter markers, subjects</description> |
---|
5 | |
---|
6 | <command interpreter="python"> |
---|
7 | rgClean.py '$input_file.extra_files_path' '$input_file.metadata.base_name' '$title' '$mind' |
---|
8 | '$geno' '$hwe' '$maf' '$mef' '$mei' '$out_file1' '$out_file1.files_path' |
---|
9 | '$relfilter' '$afffilter' '$sexfilter' '$fixaff' |
---|
10 | </command> |
---|
11 | |
---|
12 | <inputs> |
---|
13 | <param name="input_file" type="data" label="RGenetics genotype library file in compressed Plink format" |
---|
14 | size="120" format="pbed" /> |
---|
15 | <param name="title" type="text" size="80" label="Descriptive title for cleaned genotype file" value="Cleaned_data"/> |
---|
16 | <param name="geno" type="text" label="Maximum Missing Fraction: Markers" value="0.05" /> |
---|
17 | <param name="mind" type="text" value="0.1" label="Maximum Missing Fraction: Subjects"/> |
---|
18 | <param name="mef" type="text" label="Maximum Mendel Error Rate: Family" value="0.05"/> |
---|
19 | <param name="mei" type="text" label="Maximum Mendel Error Rate: Marker" value="0.05"/> |
---|
20 | <param name="hwe" type="text" value="0" label="Smallest HWE p value (set to 0 for all)" /> |
---|
21 | <param name="maf" type="text" value="0.01" |
---|
22 | label="Smallest Minor Allele Frequency (set to 0 for all)"/> |
---|
23 | <param name='relfilter' label = "Filter on pedigree relatedness" type="select" |
---|
24 | optional="false" size="132" |
---|
25 | help="Optionally remove related subjects if pedigree identifies founders and their offspring"> |
---|
26 | <option value="all" selected='true'>No filter on relatedness</option> |
---|
27 | <option value="fo" >Keep Founders only (pedigree m/f ID = "0")</option> |
---|
28 | <option value="oo" >Keep Offspring only (one randomly chosen if >1 sibs in family)</option> |
---|
29 | </param> |
---|
30 | <param name='afffilter' label = "Filter on affection status" type="select" |
---|
31 | optional="false" size="132" |
---|
32 | help="Optionally remove affected or unaffected subjects"> |
---|
33 | <option value="allaff" selected='true'>No filter on affection status</option> |
---|
34 | <option value="affonly" >Keep Controls only (affection='1')</option> |
---|
35 | <option value="unaffonly" >Keep Cases only (affection='2')</option> |
---|
36 | </param> |
---|
37 | <param name='sexfilter' label = "Filter on gender" type="select" |
---|
38 | optional="false" size="132" |
---|
39 | help="Optionally remove all male or all female subjects"> |
---|
40 | <option value="allsex" selected='true'>No filter on gender status</option> |
---|
41 | <option value="msex" >Keep Males only (pedigree gender='1')</option> |
---|
42 | <option value="fsex" >Keep Females only (pedigree gender='2')</option> |
---|
43 | </param> |
---|
44 | <param name="fixaff" type="text" value="0" |
---|
45 | label = "Change ALL subjects affection status to (0=no change,1=unaff,2=aff)" |
---|
46 | help="Use this option to switch the affection status to a new value for all output subjects" /> |
---|
47 | </inputs> |
---|
48 | |
---|
49 | <outputs> |
---|
50 | <data format="pbed" name="out_file1" metadata_source="input_file" /> |
---|
51 | </outputs> |
---|
52 | |
---|
53 | <tests> |
---|
54 | <test> |
---|
55 | <param name='input_file' value='tinywga' ftype='pbed' > |
---|
56 | <metadata name='base_name' value='tinywga' /> |
---|
57 | <composite_data value='tinywga.bim' /> |
---|
58 | <composite_data value='tinywga.bed' /> |
---|
59 | <composite_data value='tinywga.fam' /> |
---|
60 | <edit_attributes type='name' value='tinywga' /> |
---|
61 | </param> |
---|
62 | <param name='title' value='rgCleantest1' /> |
---|
63 | <param name="geno" value="1" /> |
---|
64 | <param name="mind" value="1" /> |
---|
65 | <param name="mef" value="0" /> |
---|
66 | <param name="mei" value="0" /> |
---|
67 | <param name="hwe" value="0" /> |
---|
68 | <param name="maf" value="0" /> |
---|
69 | <param name="relfilter" value="all" /> |
---|
70 | <param name="afffilter" value="allaff" /> |
---|
71 | <param name="sexfilter" value="allsex" /> |
---|
72 | <param name="fixaff" value="0" /> |
---|
73 | <output name='out_file1' file='rgtestouts/rgClean/rgCleantest1.pbed' compare="diff" lines_diff="25" > |
---|
74 | <extra_files type="file" name='rgCleantest1.bim' value="rgtestouts/rgClean/rgCleantest1.bim" compare="diff" /> |
---|
75 | <extra_files type="file" name='rgCleantest1.fam' value="rgtestouts/rgClean/rgCleantest1.fam" compare="diff" /> |
---|
76 | <extra_files type="file" name='rgCleantest1.bed' value="rgtestouts/rgClean/rgCleantest1.bed" compare="diff" /> |
---|
77 | </output> |
---|
78 | </test> |
---|
79 | </tests> |
---|
80 | <help> |
---|
81 | |
---|
82 | .. class:: infomark |
---|
83 | |
---|
84 | **Syntax** |
---|
85 | |
---|
86 | - **Genotype data** is the input genotype file chosen from your current history |
---|
87 | - **Descriptive title** is the name to use for the filtered output file |
---|
88 | - **Maximum Missing Fraction: Subjects** is the threshold for missingness by subject. Subjects with more than this fraction of missing genotypes will be excluded from the output |
---|
89 | - **Maximum Missing Fraction: Markers** is the threshold for missingness by marker. Markers with more than this fraction of missing genotypes will be excluded from the output |
---|
90 | - **Maximum Mendel Error Rate: Marker** is the Mendelian error fraction above which a marker is excluded (for family data only) |
---|
91 | - **Maximum Mendel Error Rate: Family** is the Mendelian error fraction above which a family is excluded (for family data only) |
---|
92 | - **HWE** is the threshold for Hardy-Weinberg equilibrium (HWE) test p values; markers with p values below it will be excluded. Set this to 0 and all markers will be kept regardless of HWE p value |
---|
93 | - **MAF** is the threshold for minor allele frequency - SNPs with lower MAF will be excluded |
---|
94 | - **Filters** for founders/offspring or affected/unaffected or males/females are optionally available if needed |
---|
95 | - **Change Affection** is only needed if you want to change the affection status for creating new analysis datasets |
---|
96 | |
---|
97 | ----- |
---|
98 | |
---|
99 | **Attribution** |
---|
100 | |
---|
101 | This tool relies on the work of many people. It uses Plink http://pngu.mgh.harvard.edu/~purcell/plink/, |
---|
102 | and the R http://cran.r-project.org/ and |
---|
103 | Bioconductor http://www.bioconductor.org/ |
---|
104 | projects. |
---|
105 | |
---|
106 | In particular, http://pngu.mgh.harvard.edu/~purcell/plink/ |
---|
107 | has excellent documentation describing the parameters you can set here. |
---|
108 | |
---|
109 | This implementation is a Galaxy tool wrapper around these third party applications. |
---|
110 | It was originally designed and written for family-based data from the CAMP Illumina run of 2007 by |
---|
111 | Ross Lazarus (ross.lazarus@gmail.com) and incorporated into the Rgenetics toolkit. |
---|
112 | |
---|
113 | Rgenetics merely exposes them, wrapping Plink so you can use it in Galaxy. |
---|
114 | |
---|
115 | ----- |
---|
116 | |
---|
117 | **Summary** |
---|
118 | |
---|
119 | Reliable statistical inference depends on reliable data. Poor quality samples and markers |
---|
120 | may add more noise than signal, decreasing statistical power. Removing the worst of them |
---|
121 | can be done by setting thresholds for some of the commonly used technical quality measures |
---|
122 | for genotype data. Of course discordant replicate calls are also very informative but are not |
---|
123 | in scope here. |
---|
124 | |
---|
125 | Marker cleaning: Filters are available to remove markers below a minor allele |
---|
126 | frequency threshold, below a Hardy-Weinberg equilibrium p value threshold, |
---|
127 | or above a threshold for missingness. If family data are available, thresholds for Mendelian |
---|
128 | error can be set. |
---|
129 | |
---|
130 | Subject cleaning: Filters are available to remove subjects with many missing calls. Subjects and markers for family data can be filtered by proportions |
---|
131 | of Mendelian errors in observed transmission. Use the QC reporting tool to |
---|
132 | generate a comprehensive series of reports for quality control. |
---|
133 | |
---|
134 | Note that ancestry and cryptic relatedness should also be checked using the relevant tools. |
---|
135 | |
---|
136 | ----- |
---|
137 | |
---|
138 | .. class:: infomark |
---|
139 | |
---|
140 | **Tip** |
---|
141 | |
---|
142 | You can check that you got what you asked for by running the QC tool to ensure that the distributions |
---|
143 | are truncated the way you expect. Do not expect the results to match each threshold |
---|
144 | exactly: some bad assays and subjects fail more than one QC measure, so the number of samples |
---|
145 | or markers removed may differ from what each threshold alone would suggest. Finally, the ordering of |
---|
146 | operations matters and Plink is somewhat restrictive about what it will do on each pass |
---|
147 | of the data. At least the ordering is fixed. |
---|
148 | |
---|
149 | ----- |
---|
150 | |
---|
151 | This Galaxy tool was written by Ross Lazarus for the Rgenetics project. |
---|
152 | It uses Plink for most calculations, plus some custom Python code. For full Plink attribution, source code and documentation, |
---|
153 | please see http://pngu.mgh.harvard.edu/~purcell/plink/ |
---|
154 | |
---|
155 | </help> |
---|
156 | </tool> |
---|
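
As a rough illustration of how the wrapper's `<command>` line might reach the script, the sketch below maps the fifteen positional arguments onto names in the order the template lists them, then builds a Plink-style list of threshold flags. The helper names `parse_args` and `plink_threshold_flags` are hypothetical. The flag mapping is an assumption based on Plink's documented `--mind`, `--geno`, `--maf`, `--me` and `--hwe` options, and the real rgClean.py may handle these differently.

```python
# Positional order copied from the <command> template of the wrapper.
ARG_NAMES = [
    "extra_files_path", "base_name", "title", "mind",
    "geno", "hwe", "maf", "mef", "mei", "out_file", "out_files_path",
    "relfilter", "afffilter", "sexfilter", "fixaff",
]


def parse_args(argv):
    """Map the 15 positional arguments onto names (hypothetical helper)."""
    if len(argv) != len(ARG_NAMES):
        raise ValueError(
            "expected %d arguments, got %d" % (len(ARG_NAMES), len(argv))
        )
    return dict(zip(ARG_NAMES, argv))


def plink_threshold_flags(opts):
    """Build Plink threshold flags (an assumed mapping, not rgClean.py itself)."""
    flags = [
        "--mind", opts["mind"],            # max missing fraction per subject
        "--geno", opts["geno"],            # max missing fraction per marker
        "--maf", opts["maf"],              # smallest minor allele frequency
        "--me", opts["mef"], opts["mei"],  # Mendel error rate: family, then marker
    ]
    # The tool's label says an HWE value of 0 keeps all markers,
    # so this sketch only adds the flag for a positive threshold.
    if float(opts["hwe"]) > 0:
        flags += ["--hwe", opts["hwe"]]
    return flags
```

Called with the values from the wrapper's own test case (all thresholds relaxed), `plink_threshold_flags` returns only the missingness, MAF and Mendel flags, because the HWE threshold is 0. The relatedness, affection and sex filters are left out because the listing does not show how the script applies them.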