root/galaxy-central/tools/rgenetics/rgLDIndep.xml

リビジョン 2, 8.3 KB (コミッタ: hatakeyama, 14 年 前)

import galaxy-central

行番号 
1<tool id="rgLDIndep1" name="LD Independent:">
2    <code file="rgLDIndep_code.py"/>
3
4    <description>filter high LD pairs - decrease redundancy</description>
5
6    <command interpreter="python">
7        rgLDIndep.py '$input_file.extra_files_path' '$input_file.metadata.base_name' '$title1' '$mind'
8        '$geno' '$hwe' '$maf' '$mef' '$mei' '$out_file1'
9        '$out_file1.files_path'  '$window' '$step' '$r2'
10    </command>
11
12    <inputs>
13       <param name="input_file"  type="data" label="RGenetics genotype data from your current history"
14         size="80" format="pbed" />
15       <param name="title1" type="text" size="80" label="Descriptive title for cleaned genotype file" value="LD_Independent"/>
16       <param name="r2" type="float" value = "0.1"
17       label="r2 threshold: Select only pairs at or below this r^2 threshold (eg 0.1)"
18       help="LD threshold defining LD independent markers" />
19       <param name="window" type="integer" value = "40" label="Window: Window size to limit LD pairwise"
20       help = "Bigger is better but time taken blows up exponentially as the window grows!" />
21       <param name="step" type="integer" value = "30" label="Step: Move window this far and recompute"
22       help = "Smaller is better but of course, time increases..." />
23       <param name="geno"  type="float" label="Maximum Missing Fraction: Markers" value="1.0" />
24       <param name="mind" type="float" value="1.0" label="Maximum Missing Fraction: Subjects"/>
25       <param name="mef"  type="float" label="Maximum Mendel Error Rate: Family" value="1.0"/>
26       <param name="mei"  type="float" label="Maximum Mendel Error Rate: Marker" value="1.0"/>
27       <param name="hwe" type="float" value="0.0" label="Smallest HWE p value (set to 0 for all)" />
28       <param name="maf" type="float" value="0.0"
29       label="Smallest Allowable Minor Allele Frequency (set to 0.0 for all)"/>
30
31   </inputs>
32
33   <outputs>
34       <data format="pbed" name="out_file1" metadata_source="input_file"  />
35   </outputs>
36<tests>
37 <test>
38
39    <param name='input_file' value='tinywga' ftype='pbed' >
40    <metadata name='base_name' value='tinywga' />
41    <composite_data value='tinywga.bim' />
42    <composite_data value='tinywga.bed' />
43    <composite_data value='tinywga.fam' />
44    <edit_attributes type='name' value='tinywga' />
45    </param>
46    <param name='title1' value='rgLDIndeptest1' />
47    <param name="mind" value="1" />
48    <param name="geno" value="1" />
49    <param name="hwe" value="0" />
50    <param name="maf" value="0" />
51    <param name="mef" value="1" />
52    <param name="mei" value="1" />
53    <param name="window" value="10000" />
54    <param name="step" value="5000" />
55    <param name="r2" value="0.1" />
56    <output name='out_file1' file='rgtestouts/rgLDIndep/rgLDIndeptest1.pbed' ftype='pbed' compare="diff" lines_diff='7'>
57    <extra_files type="file" name='rgLDIndeptest1.bim' value="rgtestouts/rgLDIndep/rgLDIndeptest1.bim" compare="sim_size" delta="1000"/>
58    <extra_files type="file" name='rgLDIndeptest1.fam' value="rgtestouts/rgLDIndep/rgLDIndeptest1.fam" compare="diff" />
59    <extra_files type="file" name='rgLDIndeptest1.bed' value="rgtestouts/rgLDIndep/rgLDIndeptest1.bed" compare="sim_size" delta = "1000" />
60    </output>
61 </test>
62</tests>
63<help>
64
65.. class:: infomark
66
67**Attribution**
68
69This tool relies on Plink from Shaun Purcell. For full documentation, please see his web site
70at http://pngu.mgh.harvard.edu/~purcell/plink/ where there is excellent documentation describing
71the parameters you can set here.
72
73Rgenetics merely exposes them, wrapping Plink so you can use it in Galaxy.
74
75**Summary**
76
77In addition to filtering some marker and sample quality measures,
78this tool reduces the amount of overlapping information, by removing
79most of the duplicate information contained in linkage disequilibrium. This is
80a lossy process and for some methods, signal may be lost. However, this makes
81the dataset far more compact (eg 10% of the original storage size) while still
82being highly informative and less biased for some (note NOT all!) statistical methods.
83This is the Clean tool with additional data reduction via Plink LD pruning.
84Use the Clean tool if you don't want LD pruning - which you don't for most statistical testing.
85For ancestry and relatedness, you may well want LD pruned data as it has
86some specific desirable properties.
87
88**LD**
89
90Pairwise Linkage disequilibrium (LD) measures the extent to which the genotype at one locus
91predicts the state of another locus at the level of an entire population.
92When population LD between a pair of markers is high,
93knowing an individual's genotype at one locus allows confident prediction of the genotype at the other.
94In other words, high LD means information redundancy between markers. For some
95purposes, removing some of this redundancy can improve the performance of some analyses.
96Executing this tool will create a new genotype dataset in your current history containing
97LD independent markers - most of the genetic information is retained but without as much redundancy.
98
99Set a pairwise LD threshold (eg r^2 = 0.2) and the (smaller) resulting dataset will have no
100pairs of marker with r^2 greater than 0.2. Additional filters are available to remove markers
101below a specific minor allele frequency, or above a specific level of missingness,
102and to remove subjects using similar criteria. Subjects and markers for family data can be
103filtered by proportions of Mendelian errors in observed transmission.
104
105-----
106
107**Syntax**
108
109- **Genotype data** is the input pedfile chosen from available library files
110- **New name** is the name to use for the filtered output file
111- **Missfrac threshold: subjects** is the threshold for missingness by subject. Subjects with more than this fraction missing will be excluded from the import
112- **Missfrac threshold: markers** is the threshold for missingness by marker. Markers with more than this fraction missing will be excluded from the import
113- **MaxMendel Individuals** Mendel error fraction above which to exclude subjects with more than the specified fraction of mendelian errors in transmission (for family data only)
114- **MaxMendel Families** Mendel error fraction above which to exclude families with more than the specified fraction of mendelian errors in transmission (for family data only)
115- **HWE** is the threshold for HWE test p values below which the marker will not be imported. Set this to -1 and all markers will be imported regardless of HWE p value
116- **MAF** is the threshold for minor allele frequency - SNPs with lower MAF will be excluded
117- **r^2** is the pairwise LD threshold as r^2. Lower -> less marker redundancy -> fewer markers
118- **Window** is the window width for LD threshold. Bigger -> slower -> more complete
119- **Skip** is the distance to move the window along the genome. Should be window or less.
120
121-----
122
123**Disclaimer**
124
125This tool relies on Plink from Shaun Purcell. For full documentation, please see his web site
126at http://pngu.mgh.harvard.edu/~purcell/plink/ where thereis excellent documentation describing
127the parameters you can set here. Rgenetics merely exposes them, and wraps Plink so you can use it in Galaxy.
128
129This tool is designed to create genotype data files with more or less LD independent sets of markers. These
130reduced genotype data files are particularly useful for purposes such as evaluating
131ancestry (eg eigenstrat) or relatedness (eg rgGRR)
132
133LD pruning decreases redundancy among the genotype data by removing one of each pair of markers
134in strong LD (above the r^2 threshold) over successive genomic windows (the Window parameter),
135skipping (the Skip parameter bases between windows. The defaults should produce useable outputs.
136
137This might be more efficient for rgGRR and
138eigenstrat...The core quote is
139
140    "This generates the same output files as the first version;
141    the only difference is that a simple pairwise threshold is used.
142    The first two parameters (50 and 5) are the same as above (window size and step);
143    the third parameter represents the r^2 threshold.
144    Note: this represents the pairwise SNP-SNP metric now, not the
145    multiple correlation coefficient; also note, this is based on the
146    genotypic correlation, i.e. it does not involve phasing.
147    "
148
149-----
150
151
152
153This Galaxy tool was written by Ross Lazarus for the Rgenetics project
154It uses Plink for most calculations - for full Plink attribution, source code and documentation,
155please see http://pngu.mgh.harvard.edu/~purcell/plink/ plus some custom python code
156
157</help>
158</tool>
Note: リポジトリブラウザについてのヘルプは TracBrowser を参照してください。