root/galaxy-central/tools/rgenetics/rgClean.xml

Revision 2, 8.1 KB (committed by hatakeyama, 14 years ago)

import galaxy-central

<tool id="rgClean1" name="Clean genotypes:">
    <code file="rgClean_code.py"/>

    <description>filter markers, subjects</description>

    <command interpreter="python">
        rgClean.py '$input_file.extra_files_path' '$input_file.metadata.base_name' '$title' '$mind'
        '$geno' '$hwe' '$maf' '$mef' '$mei' '$out_file1' '$out_file1.files_path'
        '$relfilter' '$afffilter' '$sexfilter' '$fixaff'
    </command>

    <inputs>
       <param name="input_file"  type="data" label="RGenetics genotype library file in compressed Plink format"
         size="120" format="pbed" />
       <param name="title" type="text" size="80" label="Descriptive title for cleaned genotype file" value="Cleaned_data"/>
       <param name="geno"  type="text" label="Maximum Missing Fraction: Markers" value="0.05" />
       <param name="mind" type="text" value="0.1" label="Maximum Missing Fraction: Subjects"/>
       <param name="mef"  type="text" label="Maximum Mendel Error Rate: Family" value="0.05"/>
       <param name="mei"  type="text" label="Maximum Mendel Error Rate: Marker" value="0.05"/>
       <param name="hwe" type="text" value="0" label="Smallest HWE p value (set to 0 for all)" />
       <param name="maf" type="text" value="0.01"
       label="Smallest Minor Allele Frequency (set to 0 for all)"/>
       <param name='relfilter' label = "Filter on pedigree relatedness" type="select"
             optional="false" size="132"
         help="Optionally remove related subjects if pedigree identifies founders and their offspring">
         <option value="all" selected='true'>No filter on relatedness</option>
         <option value="fo" >Keep Founders only (pedigree m/f ID = "0")</option>
         <option value="oo" >Keep Offspring only (one randomly chosen if >1 sibs in family)</option>
       </param>
       <param name='afffilter' label = "Filter on affection status" type="select"
             optional="false" size="132"
         help="Optionally remove affected or non affected subjects">
         <option value="allaff" selected='true'>No filter on affection status</option>
         <option value="affonly" >Keep Controls only (affection='1')</option>
         <option value="unaffonly" >Keep Cases only (affection='2')</option>
       </param>
       <param name='sexfilter' label = "Filter on gender" type="select"
             optional="false" size="132"
         help="Optionally remove all male or all female subjects">
         <option value="allsex" selected='true'>No filter on gender status</option>
         <option value="msex" >Keep Males only (pedigree gender='1')</option>
         <option value="fsex" >Keep Females only (pedigree gender='2')</option>
       </param>
       <param name="fixaff" type="text" value="0"
          label = "Change ALL subjects affection status to (0=no change,1=unaff,2=aff)"
          help="Use this option to switch the affection status to a new value for all output subjects" />
   </inputs>

   <outputs>
       <data format="pbed" name="out_file1" metadata_source="input_file"  />
   </outputs>

<tests>
 <test>
    <param name='input_file' value='tinywga' ftype='pbed' >
    <metadata name='base_name' value='tinywga' />
    <composite_data value='tinywga.bim' />
    <composite_data value='tinywga.bed' />
    <composite_data value='tinywga.fam' />
    <edit_attributes type='name' value='tinywga' />
    </param>
    <param name='title' value='rgCleantest1' />
    <param name="geno" value="1" />
    <param name="mind" value="1" />
    <param name="mef" value="0" />
    <param name="mei" value="0" />
    <param name="hwe" value="0" />
    <param name="maf" value="0" />
    <param name="relfilter" value="all" />
    <param name="afffilter" value="allaff" />
    <param name="sexfilter" value="allsex" />
    <param name="fixaff" value="0" />
    <output name='out_file1' file='rgtestouts/rgClean/rgCleantest1.pbed' compare="diff" lines_diff="25" >
    <extra_files type="file" name='rgCleantest1.bim' value="rgtestouts/rgClean/rgCleantest1.bim" compare="diff" />
    <extra_files type="file" name='rgCleantest1.fam' value="rgtestouts/rgClean/rgCleantest1.fam" compare="diff" />
    <extra_files type="file" name='rgCleantest1.bed' value="rgtestouts/rgClean/rgCleantest1.bed" compare="diff" />
    </output>
 </test>
</tests>
<help>

.. class:: infomark

**Syntax**

- **Genotype data** is the input genotype file chosen from your current history
- **Descriptive title** is the name to use for the filtered output file
- **Maximum Missing Fraction: Subjects** is the threshold for missingness by subject. Subjects with more than this fraction of genotypes missing will be excluded from the output
- **Maximum Missing Fraction: Markers** is the threshold for missingness by marker. Markers with more than this fraction of genotypes missing will be excluded from the output
- **Maximum Mendel Error Rate: Marker** is the fraction of Mendelian errors in transmission above which a marker will be excluded (for family data only)
- **Maximum Mendel Error Rate: Family** is the fraction of Mendelian errors in transmission above which a family will be excluded (for family data only)
- **HWE** is the threshold for Hardy-Weinberg equilibrium (HWE) test p values. Markers with a p value below it will be excluded. Set this to 0 and all markers will be kept regardless of HWE p value
- **MAF** is the threshold for minor allele frequency. SNPs with a lower MAF will be excluded. Set this to 0 to keep all markers
- **Filters** for founders/offspring, affected/unaffected or males/females are optionally available if needed
- **Change Affection** is only needed if you want to change the affection status of all output subjects when creating new analysis datasets

-----

**Attribution**

This tool relies on the work of many people. It uses Plink http://pngu.mgh.harvard.edu/~purcell/plink/,
and the R http://cran.r-project.org/ and
Bioconductor http://www.bioconductor.org/ projects.

In particular, http://pngu.mgh.harvard.edu/~purcell/plink/
has excellent documentation describing the parameters you can set here.

This implementation is a Galaxy tool wrapper around these third party applications.
It was originally designed and written for family based data from the CAMP Illumina run of 2007 by
Ross Lazarus (ross.lazarus@gmail.com) and incorporated into the rgenetics toolkit.

Rgenetics merely exposes them, wrapping Plink so you can use it in Galaxy.

-----

**Summary**

Reliable statistical inference depends on reliable data. Poor quality samples and markers
may add more noise than signal, decreasing statistical power. Removing the worst of them
can be done by setting thresholds for some of the commonly used technical quality measures
for genotype data. Discordant replicate calls are also very informative but are not
in scope here.

Marker cleaning: Filters are available to remove markers below a minor allele
frequency threshold, beyond a Hardy-Weinberg threshold,
or above a threshold for missingness. If family data are available, thresholds for Mendelian
error can be set.

Subject cleaning: Filters are available to remove subjects with many missing calls. Subjects and markers for family data can be filtered by proportions
of Mendelian errors in observed transmission. Use the QC reporting tool to
generate a comprehensive series of reports for quality control.

Note that ancestry and cryptic relatedness should also be checked using the relevant tools.

-----

.. class:: infomark

**Tip**

You can check that you got what you asked for by running the QC tool to ensure that the distributions
are truncated the way you expect. Note that the results will not necessarily match the thresholds
you set exactly: some bad assays and subjects fail multiple QC measures, so you sometimes have
more samples or markers left than each threshold considered separately would suggest. Finally, the ordering of
operations matters and Plink is somewhat restrictive about what it will do on each pass
of the data. At least the ordering is fixed.

-----

This Galaxy tool was written by Ross Lazarus for the Rgenetics project.
It uses Plink for most calculations, plus some custom Python code. For full Plink attribution, source code and documentation,
please see http://pngu.mgh.harvard.edu/~purcell/plink/

</help>
</tool>
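
The thresholds and filters declared in the tool definition correspond closely to standard Plink 1.x options. The wrapper script rgClean.py is not shown on this page, so the following is only a minimal sketch of how such a mapping might look, not the actual implementation. The function name `build_plink_args` and the choice to skip `--hwe` and `--maf` when the value is 0 are assumptions made for illustration. The affection filter, `fixaff`, and the random choice of one offspring per family are left out.

```python
# Hypothetical sketch: maps the form values of the rgClean tool to a
# Plink 1.x command line. This is NOT the code in rgClean.py.

def build_plink_args(bfile, out, geno="0.05", mind="0.1", hwe="0", maf="0.01",
                     mef="0.05", mei="0.05", relfilter="all", sexfilter="allsex"):
    """Return a Plink argument list for the given thresholds (illustrative only)."""
    args = ["plink", "--noweb", "--bfile", bfile,
            "--geno", geno,      # maximum missing fraction per marker
            "--mind", mind,      # maximum missing fraction per subject
            "--me", mef, mei]    # Plink order: family rate first, then marker rate
    # Assumption: a value of 0 means "keep everything", so the flag is omitted.
    if float(hwe) > 0:
        args += ["--hwe", hwe]
    if float(maf) > 0:
        args += ["--maf", maf]
    # Select-box values from the tool form mapped to Plink filter flags.
    rel_flags = {"fo": "--filter-founders", "oo": "--filter-nonfounders"}
    sex_flags = {"msex": "--filter-males", "fsex": "--filter-females"}
    if relfilter in rel_flags:
        args.append(rel_flags[relfilter])
    if sexfilter in sex_flags:
        args.append(sex_flags[sexfilter])
    # Write the filtered data as a new compressed (bed/bim/fam) fileset.
    args += ["--make-bed", "--out", out]
    return args


if __name__ == "__main__":
    print(" ".join(build_plink_args("tinywga", "rgCleantest1")))
```

Building the command as a list rather than a single string keeps each value a separate argument, which avoids shell quoting problems if the list is later passed to `subprocess`.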