Context Navigation

rgLDIndep.xml

リビジョン 2, 8.3 KB (コミッタ: hatakeyama, 15 年前)
import galaxy-central

行番号
1	<tool id="rgLDIndep1" name="LD Independent:">
2	<code file="rgLDIndep_code.py"/>
3
4	<description>filter high LD pairs - decrease redundancy</description>
5
6	<command interpreter="python">
7	rgLDIndep.py '$input_file.extra_files_path' '$input_file.metadata.base_name' '$title1' '$mind'
8	'$geno' '$hwe' '$maf' '$mef' '$mei' '$out_file1'
9	'$out_file1.files_path' '$window' '$step' '$r2'
10	</command>
11
12	<inputs>
13	<param name="input_file" type="data" label="RGenetics genotype data from your current history"
14	size="80" format="pbed" />
15	<param name="title1" type="text" size="80" label="Descriptive title for cleaned genotype file" value="LD_Independent"/>
16	<param name="r2" type="float" value = "0.1"
17	label="r2 threshold: Select only pairs at or below this r^2 threshold (eg 0.1)"
18	help="LD threshold defining LD independent markers" />
19	<param name="window" type="integer" value = "40" label="Window: Window size to limit LD pairwise"
20	help = "Bigger is better but time taken blows up exponentially as the window grows!" />
21	<param name="step" type="integer" value = "30" label="Step: Move window this far and recompute"
22	help = "Smaller is better but of course, time increases..." />
23	<param name="geno" type="float" label="Maximum Missing Fraction: Markers" value="1.0" />
24	<param name="mind" type="float" value="1.0" label="Maximum Missing Fraction: Subjects"/>
25	<param name="mef" type="float" label="Maximum Mendel Error Rate: Family" value="1.0"/>
26	<param name="mei" type="float" label="Maximum Mendel Error Rate: Marker" value="1.0"/>
27	<param name="hwe" type="float" value="0.0" label="Smallest HWE p value (set to 0 for all)" />
28	<param name="maf" type="float" value="0.0"
29	label="Smallest Allowable Minor Allele Frequency (set to 0.0 for all)"/>
30
31	</inputs>
32
33	<outputs>
34	<data format="pbed" name="out_file1" metadata_source="input_file" />
35	</outputs>
36	<tests>
37	<test>
38
39	<param name='input_file' value='tinywga' ftype='pbed' >
40	<metadata name='base_name' value='tinywga' />
41	<composite_data value='tinywga.bim' />
42	<composite_data value='tinywga.bed' />
43	<composite_data value='tinywga.fam' />
44	<edit_attributes type='name' value='tinywga' />
45	</param>
46	<param name='title1' value='rgLDIndeptest1' />
47	<param name="mind" value="1" />
48	<param name="geno" value="1" />
49	<param name="hwe" value="0" />
50	<param name="maf" value="0" />
51	<param name="mef" value="1" />
52	<param name="mei" value="1" />
53	<param name="window" value="10000" />
54	<param name="step" value="5000" />
55	<param name="r2" value="0.1" />
56	<output name='out_file1' file='rgtestouts/rgLDIndep/rgLDIndeptest1.pbed' ftype='pbed' compare="diff" lines_diff='7'>
57	<extra_files type="file" name='rgLDIndeptest1.bim' value="rgtestouts/rgLDIndep/rgLDIndeptest1.bim" compare="sim_size" delta="1000"/>
58	<extra_files type="file" name='rgLDIndeptest1.fam' value="rgtestouts/rgLDIndep/rgLDIndeptest1.fam" compare="diff" />
59	<extra_files type="file" name='rgLDIndeptest1.bed' value="rgtestouts/rgLDIndep/rgLDIndeptest1.bed" compare="sim_size" delta = "1000" />
60	</output>
61	</test>
62	</tests>
63	<help>
64
65	.. class:: infomark
66
67	Attribution
68
69	This tool relies on Plink from Shaun Purcell. For full documentation, please see his web site
70	at http://pngu.mgh.harvard.edu/~purcell/plink/ where there is excellent documentation describing
71	the parameters you can set here.
72
73	Rgenetics merely exposes them, wrapping Plink so you can use it in Galaxy.
74
75	Summary
76
77	In addition to filtering some marker and sample quality measures,
78	this tool reduces the amount of overlapping information, by removing
79	most of the duplicate information contained in linkage disequilibrium. This is
80	a lossy process and for some methods, signal may be lost. However, this makes
81	the dataset far more compact (eg 10% of the original storage size) while still
82	being highly informative and less biased for some (note NOT all!) statistical methods.
83	This is the Clean tool with additional data reduction via Plink LD pruning.
84	Use the Clean tool if you don't want LD pruning - which you don't for most statistical testing.
85	For ancestry and relatedness, you may well want LD pruned data as it has
86	some specific desirable properties.
87
88	LD
89
90	Pairwise Linkage disequilibrium (LD) measures the extent to which the genotype at one locus
91	predicts the state of another locus at the level of an entire population.
92	When population LD between a pair of markers is high,
93	knowing an individual's genotype at one locus allows confident prediction of the genotype at the other.
94	In other words, high LD means information redundancy between markers. For some
95	purposes, removing some of this redundancy can improve the performance of some analyses.
96	Executing this tool will create a new genotype dataset in your current history containing
97	LD independent markers - most of the genetic information is retained but without as much redundancy.
98
99	Set a pairwise LD threshold (eg r^2 = 0.2) and the (smaller) resulting dataset will have no
100	pairs of marker with r^2 greater than 0.2. Additional filters are available to remove markers
101	below a specific minor allele frequency, or above a specific level of missingness,
102	and to remove subjects using similar criteria. Subjects and markers for family data can be
103	filtered by proportions of Mendelian errors in observed transmission.
104
105	-----
106
107	Syntax
108
109	- Genotype data is the input pedfile chosen from available library files
110	- New name is the name to use for the filtered output file
111	- Missfrac threshold: subjects is the threshold for missingness by subject. Subjects with more than this fraction missing will be excluded from the import
112	- Missfrac threshold: markers is the threshold for missingness by marker. Markers with more than this fraction missing will be excluded from the import
113	- MaxMendel Individuals Mendel error fraction above which to exclude subjects with more than the specified fraction of mendelian errors in transmission (for family data only)
114	- MaxMendel Families Mendel error fraction above which to exclude families with more than the specified fraction of mendelian errors in transmission (for family data only)
115	- HWE is the threshold for HWE test p values below which the marker will not be imported. Set this to -1 and all markers will be imported regardless of HWE p value
116	- MAF is the threshold for minor allele frequency - SNPs with lower MAF will be excluded
117	- r^2 is the pairwise LD threshold as r^2. Lower -> less marker redundancy -> fewer markers
118	- Window is the window width for LD threshold. Bigger -> slower -> more complete
119	- Skip is the distance to move the window along the genome. Should be window or less.
120
121	-----
122
123	Disclaimer
124
125	This tool relies on Plink from Shaun Purcell. For full documentation, please see his web site
126	at http://pngu.mgh.harvard.edu/~purcell/plink/ where thereis excellent documentation describing
127	the parameters you can set here. Rgenetics merely exposes them, and wraps Plink so you can use it in Galaxy.
128
129	This tool is designed to create genotype data files with more or less LD independent sets of markers. These
130	reduced genotype data files are particularly useful for purposes such as evaluating
131	ancestry (eg eigenstrat) or relatedness (eg rgGRR)
132
133	LD pruning decreases redundancy among the genotype data by removing one of each pair of markers
134	in strong LD (above the r^2 threshold) over successive genomic windows (the Window parameter),
135	skipping (the Skip parameter bases between windows. The defaults should produce useable outputs.
136
137	This might be more efficient for rgGRR and
138	eigenstrat...The core quote is
139
140	"This generates the same output files as the first version;
141	the only difference is that a simple pairwise threshold is used.
142	The first two parameters (50 and 5) are the same as above (window size and step);
143	the third parameter represents the r^2 threshold.
144	Note: this represents the pairwise SNP-SNP metric now, not the
145	multiple correlation coefficient; also note, this is based on the
146	genotypic correlation, i.e. it does not involve phasing.
147	"
148
149	-----
150
151
152
153	This Galaxy tool was written by Ross Lazarus for the Rgenetics project
154	It uses Plink for most calculations - for full Plink attribution, source code and documentation,
155	please see http://pngu.mgh.harvard.edu/~purcell/plink/ plus some custom python code
156
157	</help>
158	</tool>

Note: リポジトリブラウザについてのヘルプは TracBrowser を参照してください。

Context Navigation

root/galaxy-central/tools/rgenetics/rgLDIndep.xml

異なるフォーマットでダウンロード: