Context Navigation

rgQQ.py @ 2

リビジョン 2, 13.0 KB (コミッタ: hatakeyama, 14 年前)
import galaxy-central

Rev	行番号
[2]	1	"""
	2	oct 2009 - multiple output files
	3	Dear Matthias,
	4
	5	Yes, you can define number of outputs dynamically in Galaxy. For doing
	6	this, you'll have to declare one output dataset in your xml and pass
	7	its ID ($out_file.id) to your python script. Also, set
	8	force_history_refresh="True" in your tool tag in xml, like this:
	9	<tool id="split1" name="Split" force_history_refresh="True">
	10	In your script, if your outputs are named in the following format,
	11	primary_associatedWithDatasetID_designation_visibility_extension
	12	(_DBKEY), all your datasets will show up in the history pane.
	13	associatedWithDatasetID is the $out_file.ID passed from xml,
	14	designation will be a unique identifier for each output (set in your
	15	script),
	16	visibility can be set to visible if you want the dataset visible in
	17	your history, or notvisible otherwise
	18	extension is the required format for your dataset (bed, tabular, fasta
	19	etc)
	20	DBKEY is optional, and can be set if required (e.g. hg18, mm9 etc)
	21
	22	One of our tools "MAF to Interval converter" (tools/maf/
	23	maf_to_interval.xml) already uses this feature. You can use it as a
	24	reference.
	25
	26	qq.chisq Quantile-quantile plot for chi-squared tests
	27	Description
	28	This function plots ranked observed chi-squared test statistics against the corresponding expected
	29	order statistics. It also estimates an inflation (or deflation) factor, lambda, by the ratio of the trimmed
	30	means of observed and expected values. This is useful for inspecting the results of whole-genome
	31	association studies for overdispersion due to population substructure and other sources of bias or
	32	confounding.
	33	Usage
	34	qq.chisq(x, df=1, x.max, main="QQ plot",
	35	sub=paste("Expected distribution: chi-squared (",df," df)", sep=""),
	36	xlab="Expected", ylab="Observed",
	37	conc=c(0.025, 0.975), overdisp=FALSE, trim=0.5,
	38	slope.one=FALSE, slope.lambda=FALSE,
	39	thin=c(0.25,50), oor.pch=24, col.shade="gray", ...)
	40	Arguments
	41	x A vector of observed chi-squared test values
	42	df The degreees of freedom for the tests
	43	x.max If present, truncate the observed value (Y) axis here
	44	main The main heading
	45	sub The subheading
	46	xlab x-axis label (default "Expected")
	47	ylab y-axis label (default "Observed")
	48	conc Lower and upper probability bounds for concentration band for the plot. Set this
	49	to NA to suppress this
	50	overdisp If TRUE, an overdispersion factor, lambda, will be estimated and used in calculating
	51	concentration band
	52	trim Quantile point for trimmed mean calculations for estimation of lambda. Default
	53	is to trim at the median
	54	slope.one Is a line of slope one to be superimpsed?
	55	slope.lambda Is a line of slope lambda to be superimposed?
	56	thin A pair of numbers indicating how points will be thinned before plotting (see
	57	Details). If NA, no thinning will be carried out
	58	oor.pch Observed values greater than x.max are plotted at x.max. This argument sets
	59	the plotting symbol to be used for out-of-range observations
	60	col.shade The colour with which the concentration band will be filled
	61	... Further graphical parameter settings to be passed to points()
	62
	63	Details
	64	To reduce plotting time and the size of plot files, the smallest observed and expected points are
	65	thinned so that only a reduced number of (approximately equally spaced) points are plotted. The
	66	precise behaviour is controlled by the parameter thin, whose value should be a pair of numbers.
	67	The first number must lie between 0 and 1 and sets the proportion of the X axis over which thinning
	68	is to be applied. The second number should be an integer and sets the maximum number of points
	69	to be plotted in this section.
	70	The "concentration band" for the plot is shown in grey. This region is defined by upper and lower
	71	probability bounds for each order statistic. The default is to use the 2.5 Note that this is not a
	72	simultaneous confidence region; the probability that the plot will stray outside the band at some
	73	point exceeds 95
	74	When required, he dispersion factor is estimated by the ratio of the observed trimmed mean to its
	75	expected value under the chi-squared assumption.
	76	Value
	77	The function returns the number of tests, the number of values omitted from the plot (greater than
	78	x.max), and the estimated dispersion factor, lambda.
	79	Note
	80	All tests must have the same number of degrees of freedom. If this is not the case, I suggest
	81	transforming to p-values and then plotting -2log(p) as chi-squared on 2 df.
	82	Author(s)
	83	David Clayton hdavid.clayton@cimr.cam.ac.uki
	84	References
	85	Devlin, B. and Roeder, K. (1999) Genomic control for association studies. Biometrics, 55:997-1004
	86	"""
	87
	88	import sys, random, math, copy,os, subprocess, tempfile
	89	from rgutils import RRun, rexe
	90
	91	rqq = """
	92	# modified by ross lazarus for the rgenetics project may 2000
	93	# makes a pdf for galaxy from an x vector of chisquare values
	94	# from snpMatrix
	95	# http://www.bioconductor.org/packages/bioc/html/snpMatrix.html
	96	qq.chisq <-
	97	function(x, df=1, x.max,
	98	main="QQ plot",
	99	sub=paste("Expected distribution: chi-squared (",df," df)", sep=""),
	100	xlab="Expected", ylab="Observed",
	101	conc=c(0.025, 0.975), overdisp=FALSE, trim=0.5,
	102	slope.one=T, slope.lambda=FALSE,
	103	thin=c(0.5,200), oor.pch=24, col.shade="gray", ofname="qqchi.pdf",
	104	h=6,w=6,printpdf=F,...) {
	105
	106	# Function to shade concentration band
	107
	108	shade <- function(x1, y1, x2, y2, color=col.shade) {
	109	n <- length(x2)
	110	polygon(c(x1, x2[n:1]), c(y1, y2[n:1]), border=NA, col=color)
	111	}
	112
	113	# Sort values and see how many out of range
	114
	115	obsvd <- sort(x, na.last=NA)
	116	N <- length(obsvd)
	117	if (missing(x.max)) {
	118	Np <- N
	119	}
	120	else {
	121	Np <- sum(obsvd<=x.max)
	122	}
	123	if(Np==0)
	124	stop("Nothing to plot")
	125
	126	# Expected values
	127
	128	if (df==2) {
	129	expctd <- 2*cumsum(1/(N:1))
	130	}
	131	else {
	132	expctd <- qchisq(p=(1:N)/(N+1), df=df)
	133	}
	134
	135	# Concentration bands
	136
	137	if (!is.null(conc)) {
	138	if(conc[1]>0) {
	139	e.low <- qchisq(p=qbeta(conc[1], 1:N, N:1), df=df)
	140	}
	141	else {
	142	e.low <- rep(0, N)
	143	}
	144	if (conc[2]<1) {
	145	e.high <- qchisq(p=qbeta(conc[2], 1:N, N:1), df=df)
	146	}
	147	else {
	148	e.high <- 1.1*rep(max(x),N)
	149	}
	150	}
	151
	152	# Plot outline
	153
	154	if (Np < N)
	155	top <- x.max
	156	else
	157	top <- obsvd[N]
	158	right <- expctd[N]
	159	if (printpdf) {pdf(ofname,h,w)}
	160	plot(c(0, right), c(0, top), type="n", xlab=xlab, ylab=ylab,
	161	main=main, sub=sub)
	162
	163	# Thinning
	164
	165	if (is.na(thin[1])) {
	166	show <- 1:Np
	167	}
	168	else if (length(thin)!=2 \|\| thin[1]<0 \|\| thin[1]>1 \|\| thin[2]<1) {
	169	warning("invalid thin parameter; no thinning carried out")
	170	show <- 1:Np
	171	}
	172	else {
	173	space <- right*thin[1]/floor(thin[2])
	174	iat <- round((N+1)pchisq(q=(1:floor(thin[2]))space, df=df))
	175	if (max(iat)>thin[2])
	176	show <- unique(c(iat, (1+max(iat)):Np))
	177	else
	178	show <- 1:Np
	179	}
	180	Nu <- floor(trim*N)
	181	if (Nu>0)
	182	lambda <- mean(obsvd[1:Nu])/mean(expctd[1:Nu])
	183	if (!is.null(conc)) {
	184	if (Np<N)
	185	vert <- c(show, (Np+1):N)
	186	else
	187	vert <- show
	188	if (overdisp)
	189	shade(expctd[vert], lambda*e.low[vert],
	190	expctd[vert], lambda*e.high[vert])
	191	else
	192	shade(expctd[vert], e.low[vert], expctd[vert], e.high[vert])
	193	}
	194	points(expctd[show], obsvd[show], ...)
	195	# Overflow
	196	if (Np<N) {
	197	over <- (Np+1):N
	198	points(expctd[over], rep(x.max, N-Np), pch=oor.pch)
	199	}
	200	# Lines
	201	line.types <- c("solid", "dashed", "dotted")
	202	key <- NULL
	203	txt <- NULL
	204	if (slope.one) {
	205	key <- c(key, line.types[1])
	206	txt <- c(txt, "y = x")
	207	abline(a=0, b=1, lty=line.types[1])
	208	}
	209	if (slope.lambda && Nu>0) {
	210	key <- c(key, line.types[2])
	211	txt <- c(txt, paste("y = ", format(lambda, digits=4), "x", sep=""))
	212	if (!is.null(conc)) {
	213	if (Np<N)
	214	vert <- c(show, (Np+1):N)
	215	else
	216	vert <- show
	217	}
	218	abline(a=0, b=lambda, lty=line.types[2])
	219	}
	220	if (printpdf) {dev.off()}
	221	# Returned value
	222
	223	if (!is.null(key))
	224	legend(0, top, legend=txt, lty=key)
	225	c(N=N, omitted=N-Np, lambda=lambda)
	226
	227	}
	228
	229	"""
	230
	231
	232
	233
	234	def makeQQ(dat=[], sample=1.0, maxveclen=4000, fname='fname',title='title',
	235	xvar='Sample',h=8,w=8,logscale=True,outdir=None):
	236	"""
	237	y is data for a qq plot and ends up on the x axis go figure
	238	if sampling, oversample low values - all the top 1% ?
	239	assume we have 0-1 p values
	240	"""
	241	R = []
	242	colour="maroon"
	243	nrows = len(dat)
	244	dat.sort() # small to large
	245	fn = float(nrows)
	246	unifx = [x/fn for x in range(1,(nrows+1))]
	247	if logscale:
	248	unifx = [-math.log10(x) for x in unifx] # uniform distribution
	249	if sample < 1.0 and len(dat) > maxveclen:
	250	# now have half a million markers eg - too many to plot all for a pdf - sample to get 10k or so points
	251	# oversample part of the distribution
	252	always = min(1000,nrows/20) # oversample smaller of lowest few hundred items or 5%
	253	skip = int(nrows/float(maxveclen)) # take 1 in skip to get about maxveclen points
	254	if skip <= 1:
	255	skip = 2
	256	samplei = [i for i in range(nrows) if (i < always) or (i % skip == 0)]
	257	# always oversample first sorted (here lowest) values
	258	yvec = [dat[i] for i in samplei] # always get first and last
	259	xvec = [unifx[i] for i in samplei] # and sample xvec same way
	260	maint='QQ %s (random %d of %d)' % (title,len(yvec),nrows)
	261	else:
	262	yvec = [x for x in dat]
	263	maint='QQ %s (n=%d)' % (title,nrows)
	264	xvec = unifx
	265	if logscale:
	266	maint = 'Log%s' % maint
	267	mx = [0,math.log10(nrows)] # if 1000, becomes 3 for the null line
	268	ylab = '-log10(%s) Quantiles' % title
	269	xlab = '-log10(Uniform 0-1) Quantiles'
	270	yvec = [-math.log10(x) for x in yvec if x > 0.0]
	271	else:
	272	mx = [0,1]
	273	ylab = '%s Quantiles' % title
	274	xlab = 'Uniform 0-1 Quantiles'
	275
	276	xv = ['%f' % x for x in xvec]
	277	R.append('xvec = c(%s)' % ','.join(xv))
	278	yv = ['%f' % x for x in yvec]
	279	R.append('yvec = c(%s)' % ','.join(yv))
	280	R.append('mx = c(0,%f)' % (math.log10(fn)))
	281	R.append('pdf("%s",h=%d,w=%d)' % (fname,h,w))
	282	R.append("par(lab=c(10,10,10))")
	283	R.append('qqplot(xvec,yvec,xlab="%s",ylab="%s",main="%s",sub="%s",pch=19,col="%s",cex=0.8)' % (xlab,ylab,maint,title,colour))
	284	R.append('points(mx,mx,type="l")')
	285	R.append('grid(col="lightgray",lty="dotted")')
	286	R.append('dev.off()')
	287	RRun(rcmd=R,title='makeQQplot',outdir=outdir)
	288
	289
	290
	291	def main():
	292	u = """
	293	"""
	294	u = """<command interpreter="python">
	295	rgQQ.py "$input1" "$name" $sample "$cols" $allqq $height $width $logtrans $allqq.id $__new_file_path__
	296	</command>
	297
	298	</command>
	299	"""
	300	print >> sys.stdout,'## rgQQ.py. cl=',sys.argv
	301	npar = 11
	302	if len(sys.argv) < npar:
	303	print >> sys.stdout, '## error - too few command line parameters - wanting %d' % npar
	304	print >> sys.stdout, u
	305	sys.exit(1)
	306	in_fname = sys.argv[1]
	307	name = sys.argv[2]
	308	sample = float(sys.argv[3])
	309	head = None
	310	columns = [int(x) for x in sys.argv[4].strip().split(',')] # work with python columns!
	311	allout = sys.argv[5]
	312	height = int(sys.argv[6])
	313	width = int(sys.argv[7])
	314	logscale = (sys.argv[8].lower() == 'true')
	315	outid = sys.argv[9] # this is used to allow multiple output files
	316	outdir = sys.argv[10]
	317	nan_row = False
	318	rows = []
	319	for i, line in enumerate( file( sys.argv[1] ) ):
	320	# Skip comments
	321	if line.startswith( '#' ) or ( i == 0 ):
	322	if i == 0:
	323	head = line.strip().split("\t")
	324	continue
	325	if len(line.strip()) == 0:
	326	continue
	327	# Extract values and convert to floats
	328	fields = line.strip().split( "\t" )
	329	row = []
	330	nan_row = False
	331	for column in columns:
	332	if len( fields ) <= column:
	333	return fail( "No column %d on line %d: %s" % ( column, i, fields ) )
	334	val = fields[column]
	335	if val.lower() == "na":
	336	nan_row = True
	337	else:
	338	try:
	339	row.append( float( fields[column] ) )
	340	except ValueError:
	341	return fail( "Value '%s' in column %d on line %d is not numeric" % ( fields[column], column+1, i ) )
	342	if not nan_row:
	343	rows.append( row )
	344	if i > 1:
	345	i = i-1 # remove header row from count
	346	if head == None:
	347	head = ['Col%d' % (x+1) for x in columns]
	348	R = []
	349	for c,column in enumerate(columns): # we appended each column in turn
	350	outname = allout
	351	if c > 0: # after first time
	352	outname = 'primary_%s_col%s_visible_pdf' % (outid,column)
	353	outname = os.path.join(outdir,outname)
	354	dat = []
	355	nrows = len(rows) # sometimes lots of NA's!!
	356	for arow in rows:
	357	dat.append(arow[c]) # remember, we appended each col in turn!
	358	cname = head[column]
	359	makeQQ(dat=dat,sample=sample,fname=outname,title='%s_%s' % (name,cname),
	360	xvar=cname,h=height,w=width,logscale=logscale,outdir=outdir)
	361
	362
	363
	364	if __name__ == "__main__":
	365	main()

Note: リポジトリブラウザについてのヘルプは TracBrowser を参照してください。

Context Navigation

root/galaxy-central/tools/rgenetics/rgQQ.py @ 2

異なるフォーマットでダウンロード: