Context Navigation

rgQQ.py

リビジョン 2, 13.0 KB (コミッタ: hatakeyama, 15 年前)
import galaxy-central

行番号
1	"""
2	oct 2009 - multiple output files
3	Dear Matthias,
4
5	Yes, you can define number of outputs dynamically in Galaxy. For doing
6	this, you'll have to declare one output dataset in your xml and pass
7	its ID ($out_file.id) to your python script. Also, set
8	force_history_refresh="True" in your tool tag in xml, like this:
9	<tool id="split1" name="Split" force_history_refresh="True">
10	In your script, if your outputs are named in the following format,
11	primary_associatedWithDatasetID_designation_visibility_extension
12	(_DBKEY), all your datasets will show up in the history pane.
13	associatedWithDatasetID is the $out_file.ID passed from xml,
14	designation will be a unique identifier for each output (set in your
15	script),
16	visibility can be set to visible if you want the dataset visible in
17	your history, or notvisible otherwise
18	extension is the required format for your dataset (bed, tabular, fasta
19	etc)
20	DBKEY is optional, and can be set if required (e.g. hg18, mm9 etc)
21
22	One of our tools "MAF to Interval converter" (tools/maf/
23	maf_to_interval.xml) already uses this feature. You can use it as a
24	reference.
25
26	qq.chisq Quantile-quantile plot for chi-squared tests
27	Description
28	This function plots ranked observed chi-squared test statistics against the corresponding expected
29	order statistics. It also estimates an inflation (or deflation) factor, lambda, by the ratio of the trimmed
30	means of observed and expected values. This is useful for inspecting the results of whole-genome
31	association studies for overdispersion due to population substructure and other sources of bias or
32	confounding.
33	Usage
34	qq.chisq(x, df=1, x.max, main="QQ plot",
35	sub=paste("Expected distribution: chi-squared (",df," df)", sep=""),
36	xlab="Expected", ylab="Observed",
37	conc=c(0.025, 0.975), overdisp=FALSE, trim=0.5,
38	slope.one=FALSE, slope.lambda=FALSE,
39	thin=c(0.25,50), oor.pch=24, col.shade="gray", ...)
40	Arguments
41	x A vector of observed chi-squared test values
42	df The degreees of freedom for the tests
43	x.max If present, truncate the observed value (Y) axis here
44	main The main heading
45	sub The subheading
46	xlab x-axis label (default "Expected")
47	ylab y-axis label (default "Observed")
48	conc Lower and upper probability bounds for concentration band for the plot. Set this
49	to NA to suppress this
50	overdisp If TRUE, an overdispersion factor, lambda, will be estimated and used in calculating
51	concentration band
52	trim Quantile point for trimmed mean calculations for estimation of lambda. Default
53	is to trim at the median
54	slope.one Is a line of slope one to be superimpsed?
55	slope.lambda Is a line of slope lambda to be superimposed?
56	thin A pair of numbers indicating how points will be thinned before plotting (see
57	Details). If NA, no thinning will be carried out
58	oor.pch Observed values greater than x.max are plotted at x.max. This argument sets
59	the plotting symbol to be used for out-of-range observations
60	col.shade The colour with which the concentration band will be filled
61	... Further graphical parameter settings to be passed to points()
62
63	Details
64	To reduce plotting time and the size of plot files, the smallest observed and expected points are
65	thinned so that only a reduced number of (approximately equally spaced) points are plotted. The
66	precise behaviour is controlled by the parameter thin, whose value should be a pair of numbers.
67	The first number must lie between 0 and 1 and sets the proportion of the X axis over which thinning
68	is to be applied. The second number should be an integer and sets the maximum number of points
69	to be plotted in this section.
70	The "concentration band" for the plot is shown in grey. This region is defined by upper and lower
71	probability bounds for each order statistic. The default is to use the 2.5 Note that this is not a
72	simultaneous confidence region; the probability that the plot will stray outside the band at some
73	point exceeds 95
74	When required, he dispersion factor is estimated by the ratio of the observed trimmed mean to its
75	expected value under the chi-squared assumption.
76	Value
77	The function returns the number of tests, the number of values omitted from the plot (greater than
78	x.max), and the estimated dispersion factor, lambda.
79	Note
80	All tests must have the same number of degrees of freedom. If this is not the case, I suggest
81	transforming to p-values and then plotting -2log(p) as chi-squared on 2 df.
82	Author(s)
83	David Clayton hdavid.clayton@cimr.cam.ac.uki
84	References
85	Devlin, B. and Roeder, K. (1999) Genomic control for association studies. Biometrics, 55:997-1004
86	"""
87
88	import sys, random, math, copy,os, subprocess, tempfile
89	from rgutils import RRun, rexe
90
91	rqq = """
92	# modified by ross lazarus for the rgenetics project may 2000
93	# makes a pdf for galaxy from an x vector of chisquare values
94	# from snpMatrix
95	# http://www.bioconductor.org/packages/bioc/html/snpMatrix.html
96	qq.chisq <-
97	function(x, df=1, x.max,
98	main="QQ plot",
99	sub=paste("Expected distribution: chi-squared (",df," df)", sep=""),
100	xlab="Expected", ylab="Observed",
101	conc=c(0.025, 0.975), overdisp=FALSE, trim=0.5,
102	slope.one=T, slope.lambda=FALSE,
103	thin=c(0.5,200), oor.pch=24, col.shade="gray", ofname="qqchi.pdf",
104	h=6,w=6,printpdf=F,...) {
105
106	# Function to shade concentration band
107
108	shade <- function(x1, y1, x2, y2, color=col.shade) {
109	n <- length(x2)
110	polygon(c(x1, x2[n:1]), c(y1, y2[n:1]), border=NA, col=color)
111	}
112
113	# Sort values and see how many out of range
114
115	obsvd <- sort(x, na.last=NA)
116	N <- length(obsvd)
117	if (missing(x.max)) {
118	Np <- N
119	}
120	else {
121	Np <- sum(obsvd<=x.max)
122	}
123	if(Np==0)
124	stop("Nothing to plot")
125
126	# Expected values
127
128	if (df==2) {
129	expctd <- 2*cumsum(1/(N:1))
130	}
131	else {
132	expctd <- qchisq(p=(1:N)/(N+1), df=df)
133	}
134
135	# Concentration bands
136
137	if (!is.null(conc)) {
138	if(conc[1]>0) {
139	e.low <- qchisq(p=qbeta(conc[1], 1:N, N:1), df=df)
140	}
141	else {
142	e.low <- rep(0, N)
143	}
144	if (conc[2]<1) {
145	e.high <- qchisq(p=qbeta(conc[2], 1:N, N:1), df=df)
146	}
147	else {
148	e.high <- 1.1*rep(max(x),N)
149	}
150	}
151
152	# Plot outline
153
154	if (Np < N)
155	top <- x.max
156	else
157	top <- obsvd[N]
158	right <- expctd[N]
159	if (printpdf) {pdf(ofname,h,w)}
160	plot(c(0, right), c(0, top), type="n", xlab=xlab, ylab=ylab,
161	main=main, sub=sub)
162
163	# Thinning
164
165	if (is.na(thin[1])) {
166	show <- 1:Np
167	}
168	else if (length(thin)!=2 \|\| thin[1]<0 \|\| thin[1]>1 \|\| thin[2]<1) {
169	warning("invalid thin parameter; no thinning carried out")
170	show <- 1:Np
171	}
172	else {
173	space <- right*thin[1]/floor(thin[2])
174	iat <- round((N+1)pchisq(q=(1:floor(thin[2]))space, df=df))
175	if (max(iat)>thin[2])
176	show <- unique(c(iat, (1+max(iat)):Np))
177	else
178	show <- 1:Np
179	}
180	Nu <- floor(trim*N)
181	if (Nu>0)
182	lambda <- mean(obsvd[1:Nu])/mean(expctd[1:Nu])
183	if (!is.null(conc)) {
184	if (Np<N)
185	vert <- c(show, (Np+1):N)
186	else
187	vert <- show
188	if (overdisp)
189	shade(expctd[vert], lambda*e.low[vert],
190	expctd[vert], lambda*e.high[vert])
191	else
192	shade(expctd[vert], e.low[vert], expctd[vert], e.high[vert])
193	}
194	points(expctd[show], obsvd[show], ...)
195	# Overflow
196	if (Np<N) {
197	over <- (Np+1):N
198	points(expctd[over], rep(x.max, N-Np), pch=oor.pch)
199	}
200	# Lines
201	line.types <- c("solid", "dashed", "dotted")
202	key <- NULL
203	txt <- NULL
204	if (slope.one) {
205	key <- c(key, line.types[1])
206	txt <- c(txt, "y = x")
207	abline(a=0, b=1, lty=line.types[1])
208	}
209	if (slope.lambda && Nu>0) {
210	key <- c(key, line.types[2])
211	txt <- c(txt, paste("y = ", format(lambda, digits=4), "x", sep=""))
212	if (!is.null(conc)) {
213	if (Np<N)
214	vert <- c(show, (Np+1):N)
215	else
216	vert <- show
217	}
218	abline(a=0, b=lambda, lty=line.types[2])
219	}
220	if (printpdf) {dev.off()}
221	# Returned value
222
223	if (!is.null(key))
224	legend(0, top, legend=txt, lty=key)
225	c(N=N, omitted=N-Np, lambda=lambda)
226
227	}
228
229	"""
230
231
232
233
234	def makeQQ(dat=[], sample=1.0, maxveclen=4000, fname='fname',title='title',
235	xvar='Sample',h=8,w=8,logscale=True,outdir=None):
236	"""
237	y is data for a qq plot and ends up on the x axis go figure
238	if sampling, oversample low values - all the top 1% ?
239	assume we have 0-1 p values
240	"""
241	R = []
242	colour="maroon"
243	nrows = len(dat)
244	dat.sort() # small to large
245	fn = float(nrows)
246	unifx = [x/fn for x in range(1,(nrows+1))]
247	if logscale:
248	unifx = [-math.log10(x) for x in unifx] # uniform distribution
249	if sample < 1.0 and len(dat) > maxveclen:
250	# now have half a million markers eg - too many to plot all for a pdf - sample to get 10k or so points
251	# oversample part of the distribution
252	always = min(1000,nrows/20) # oversample smaller of lowest few hundred items or 5%
253	skip = int(nrows/float(maxveclen)) # take 1 in skip to get about maxveclen points
254	if skip <= 1:
255	skip = 2
256	samplei = [i for i in range(nrows) if (i < always) or (i % skip == 0)]
257	# always oversample first sorted (here lowest) values
258	yvec = [dat[i] for i in samplei] # always get first and last
259	xvec = [unifx[i] for i in samplei] # and sample xvec same way
260	maint='QQ %s (random %d of %d)' % (title,len(yvec),nrows)
261	else:
262	yvec = [x for x in dat]
263	maint='QQ %s (n=%d)' % (title,nrows)
264	xvec = unifx
265	if logscale:
266	maint = 'Log%s' % maint
267	mx = [0,math.log10(nrows)] # if 1000, becomes 3 for the null line
268	ylab = '-log10(%s) Quantiles' % title
269	xlab = '-log10(Uniform 0-1) Quantiles'
270	yvec = [-math.log10(x) for x in yvec if x > 0.0]
271	else:
272	mx = [0,1]
273	ylab = '%s Quantiles' % title
274	xlab = 'Uniform 0-1 Quantiles'
275
276	xv = ['%f' % x for x in xvec]
277	R.append('xvec = c(%s)' % ','.join(xv))
278	yv = ['%f' % x for x in yvec]
279	R.append('yvec = c(%s)' % ','.join(yv))
280	R.append('mx = c(0,%f)' % (math.log10(fn)))
281	R.append('pdf("%s",h=%d,w=%d)' % (fname,h,w))
282	R.append("par(lab=c(10,10,10))")
283	R.append('qqplot(xvec,yvec,xlab="%s",ylab="%s",main="%s",sub="%s",pch=19,col="%s",cex=0.8)' % (xlab,ylab,maint,title,colour))
284	R.append('points(mx,mx,type="l")')
285	R.append('grid(col="lightgray",lty="dotted")')
286	R.append('dev.off()')
287	RRun(rcmd=R,title='makeQQplot',outdir=outdir)
288
289
290
291	def main():
292	u = """
293	"""
294	u = """<command interpreter="python">
295	rgQQ.py "$input1" "$name" $sample "$cols" $allqq $height $width $logtrans $allqq.id $__new_file_path__
296	</command>
297
298	</command>
299	"""
300	print >> sys.stdout,'## rgQQ.py. cl=',sys.argv
301	npar = 11
302	if len(sys.argv) < npar:
303	print >> sys.stdout, '## error - too few command line parameters - wanting %d' % npar
304	print >> sys.stdout, u
305	sys.exit(1)
306	in_fname = sys.argv[1]
307	name = sys.argv[2]
308	sample = float(sys.argv[3])
309	head = None
310	columns = [int(x) for x in sys.argv[4].strip().split(',')] # work with python columns!
311	allout = sys.argv[5]
312	height = int(sys.argv[6])
313	width = int(sys.argv[7])
314	logscale = (sys.argv[8].lower() == 'true')
315	outid = sys.argv[9] # this is used to allow multiple output files
316	outdir = sys.argv[10]
317	nan_row = False
318	rows = []
319	for i, line in enumerate( file( sys.argv[1] ) ):
320	# Skip comments
321	if line.startswith( '#' ) or ( i == 0 ):
322	if i == 0:
323	head = line.strip().split("\t")
324	continue
325	if len(line.strip()) == 0:
326	continue
327	# Extract values and convert to floats
328	fields = line.strip().split( "\t" )
329	row = []
330	nan_row = False
331	for column in columns:
332	if len( fields ) <= column:
333	return fail( "No column %d on line %d: %s" % ( column, i, fields ) )
334	val = fields[column]
335	if val.lower() == "na":
336	nan_row = True
337	else:
338	try:
339	row.append( float( fields[column] ) )
340	except ValueError:
341	return fail( "Value '%s' in column %d on line %d is not numeric" % ( fields[column], column+1, i ) )
342	if not nan_row:
343	rows.append( row )
344	if i > 1:
345	i = i-1 # remove header row from count
346	if head == None:
347	head = ['Col%d' % (x+1) for x in columns]
348	R = []
349	for c,column in enumerate(columns): # we appended each column in turn
350	outname = allout
351	if c > 0: # after first time
352	outname = 'primary_%s_col%s_visible_pdf' % (outid,column)
353	outname = os.path.join(outdir,outname)
354	dat = []
355	nrows = len(rows) # sometimes lots of NA's!!
356	for arow in rows:
357	dat.append(arow[c]) # remember, we appended each col in turn!
358	cname = head[column]
359	makeQQ(dat=dat,sample=sample,fname=outname,title='%s_%s' % (name,cname),
360	xvar=cname,h=height,w=width,logscale=logscale,outdir=outdir)
361
362
363
364	if __name__ == "__main__":
365	main()

Note: リポジトリブラウザについてのヘルプは TracBrowser を参照してください。

Context Navigation

root/galaxy-central/tools/rgenetics/rgQQ.py

異なるフォーマットでダウンロード: