[2] | 1 | """ |
---|
| 2 | oct 2009 - multiple output files |
---|
| 3 | Dear Matthias, |
---|
| 4 | |
---|
| 5 | Yes, you can define number of outputs dynamically in Galaxy. For doing |
---|
| 6 | this, you'll have to declare one output dataset in your xml and pass |
---|
| 7 | its ID ($out_file.id) to your python script. Also, set |
---|
| 8 | force_history_refresh="True" in your tool tag in xml, like this: |
---|
| 9 | <tool id="split1" name="Split" force_history_refresh="True"> |
---|
| 10 | In your script, if your outputs are named in the following format, |
---|
| 11 | primary_associatedWithDatasetID_designation_visibility_extension |
---|
| 12 | (_DBKEY), all your datasets will show up in the history pane. |
---|
| 13 | associatedWithDatasetID is the $out_file.ID passed from xml, |
---|
| 14 | designation will be a unique identifier for each output (set in your |
---|
| 15 | script), |
---|
| 16 | visibility can be set to visible if you want the dataset visible in |
---|
| 17 | your history, or notvisible otherwise |
---|
| 18 | extension is the required format for your dataset (bed, tabular, fasta |
---|
| 19 | etc) |
---|
| 20 | DBKEY is optional, and can be set if required (e.g. hg18, mm9 etc) |
---|
| 21 | |
---|
| 22 | One of our tools "MAF to Interval converter" (tools/maf/ |
---|
| 23 | maf_to_interval.xml) already uses this feature. You can use it as a |
---|
| 24 | reference. |
---|
| 25 | |
---|
| 26 | qq.chisq Quantile-quantile plot for chi-squared tests |
---|
| 27 | Description |
---|
| 28 | This function plots ranked observed chi-squared test statistics against the corresponding expected |
---|
| 29 | order statistics. It also estimates an inflation (or deflation) factor, lambda, by the ratio of the trimmed |
---|
| 30 | means of observed and expected values. This is useful for inspecting the results of whole-genome |
---|
| 31 | association studies for overdispersion due to population substructure and other sources of bias or |
---|
| 32 | confounding. |
---|
| 33 | Usage |
---|
| 34 | qq.chisq(x, df=1, x.max, main="QQ plot", |
---|
| 35 | sub=paste("Expected distribution: chi-squared (",df," df)", sep=""), |
---|
| 36 | xlab="Expected", ylab="Observed", |
---|
| 37 | conc=c(0.025, 0.975), overdisp=FALSE, trim=0.5, |
---|
| 38 | slope.one=FALSE, slope.lambda=FALSE, |
---|
| 39 | thin=c(0.25,50), oor.pch=24, col.shade="gray", ...) |
---|
| 40 | Arguments |
---|
| 41 | x A vector of observed chi-squared test values |
---|
| 42 | df The degreees of freedom for the tests |
---|
| 43 | x.max If present, truncate the observed value (Y) axis here |
---|
| 44 | main The main heading |
---|
| 45 | sub The subheading |
---|
| 46 | xlab x-axis label (default "Expected") |
---|
| 47 | ylab y-axis label (default "Observed") |
---|
| 48 | conc Lower and upper probability bounds for concentration band for the plot. Set this |
---|
| 49 | to NA to suppress this |
---|
| 50 | overdisp If TRUE, an overdispersion factor, lambda, will be estimated and used in calculating |
---|
| 51 | concentration band |
---|
| 52 | trim Quantile point for trimmed mean calculations for estimation of lambda. Default |
---|
| 53 | is to trim at the median |
---|
| 54 | slope.one Is a line of slope one to be superimpsed? |
---|
| 55 | slope.lambda Is a line of slope lambda to be superimposed? |
---|
| 56 | thin A pair of numbers indicating how points will be thinned before plotting (see |
---|
| 57 | Details). If NA, no thinning will be carried out |
---|
| 58 | oor.pch Observed values greater than x.max are plotted at x.max. This argument sets |
---|
| 59 | the plotting symbol to be used for out-of-range observations |
---|
| 60 | col.shade The colour with which the concentration band will be filled |
---|
| 61 | ... Further graphical parameter settings to be passed to points() |
---|
| 62 | |
---|
| 63 | Details |
---|
| 64 | To reduce plotting time and the size of plot files, the smallest observed and expected points are |
---|
| 65 | thinned so that only a reduced number of (approximately equally spaced) points are plotted. The |
---|
| 66 | precise behaviour is controlled by the parameter thin, whose value should be a pair of numbers. |
---|
| 67 | The first number must lie between 0 and 1 and sets the proportion of the X axis over which thinning |
---|
| 68 | is to be applied. The second number should be an integer and sets the maximum number of points |
---|
| 69 | to be plotted in this section. |
---|
| 70 | The "concentration band" for the plot is shown in grey. This region is defined by upper and lower |
---|
| 71 | probability bounds for each order statistic. The default is to use the 2.5 Note that this is not a |
---|
| 72 | simultaneous confidence region; the probability that the plot will stray outside the band at some |
---|
| 73 | point exceeds 95 |
---|
| 74 | When required, he dispersion factor is estimated by the ratio of the observed trimmed mean to its |
---|
| 75 | expected value under the chi-squared assumption. |
---|
| 76 | Value |
---|
| 77 | The function returns the number of tests, the number of values omitted from the plot (greater than |
---|
| 78 | x.max), and the estimated dispersion factor, lambda. |
---|
| 79 | Note |
---|
| 80 | All tests must have the same number of degrees of freedom. If this is not the case, I suggest |
---|
| 81 | transforming to p-values and then plotting -2log(p) as chi-squared on 2 df. |
---|
| 82 | Author(s) |
---|
| 83 | David Clayton hdavid.clayton@cimr.cam.ac.uki |
---|
| 84 | References |
---|
| 85 | Devlin, B. and Roeder, K. (1999) Genomic control for association studies. Biometrics, 55:997-1004 |
---|
| 86 | """ |
---|
| 87 | |
---|
| 88 | import sys, random, math, copy,os, subprocess, tempfile |
---|
| 89 | from rgutils import RRun, rexe |
---|
| 90 | |
---|
| 91 | rqq = """ |
---|
| 92 | # modified by ross lazarus for the rgenetics project may 2000 |
---|
| 93 | # makes a pdf for galaxy from an x vector of chisquare values |
---|
| 94 | # from snpMatrix |
---|
| 95 | # http://www.bioconductor.org/packages/bioc/html/snpMatrix.html |
---|
| 96 | qq.chisq <- |
---|
| 97 | function(x, df=1, x.max, |
---|
| 98 | main="QQ plot", |
---|
| 99 | sub=paste("Expected distribution: chi-squared (",df," df)", sep=""), |
---|
| 100 | xlab="Expected", ylab="Observed", |
---|
| 101 | conc=c(0.025, 0.975), overdisp=FALSE, trim=0.5, |
---|
| 102 | slope.one=T, slope.lambda=FALSE, |
---|
| 103 | thin=c(0.5,200), oor.pch=24, col.shade="gray", ofname="qqchi.pdf", |
---|
| 104 | h=6,w=6,printpdf=F,...) { |
---|
| 105 | |
---|
| 106 | # Function to shade concentration band |
---|
| 107 | |
---|
| 108 | shade <- function(x1, y1, x2, y2, color=col.shade) { |
---|
| 109 | n <- length(x2) |
---|
| 110 | polygon(c(x1, x2[n:1]), c(y1, y2[n:1]), border=NA, col=color) |
---|
| 111 | } |
---|
| 112 | |
---|
| 113 | # Sort values and see how many out of range |
---|
| 114 | |
---|
| 115 | obsvd <- sort(x, na.last=NA) |
---|
| 116 | N <- length(obsvd) |
---|
| 117 | if (missing(x.max)) { |
---|
| 118 | Np <- N |
---|
| 119 | } |
---|
| 120 | else { |
---|
| 121 | Np <- sum(obsvd<=x.max) |
---|
| 122 | } |
---|
| 123 | if(Np==0) |
---|
| 124 | stop("Nothing to plot") |
---|
| 125 | |
---|
| 126 | # Expected values |
---|
| 127 | |
---|
| 128 | if (df==2) { |
---|
| 129 | expctd <- 2*cumsum(1/(N:1)) |
---|
| 130 | } |
---|
| 131 | else { |
---|
| 132 | expctd <- qchisq(p=(1:N)/(N+1), df=df) |
---|
| 133 | } |
---|
| 134 | |
---|
| 135 | # Concentration bands |
---|
| 136 | |
---|
| 137 | if (!is.null(conc)) { |
---|
| 138 | if(conc[1]>0) { |
---|
| 139 | e.low <- qchisq(p=qbeta(conc[1], 1:N, N:1), df=df) |
---|
| 140 | } |
---|
| 141 | else { |
---|
| 142 | e.low <- rep(0, N) |
---|
| 143 | } |
---|
| 144 | if (conc[2]<1) { |
---|
| 145 | e.high <- qchisq(p=qbeta(conc[2], 1:N, N:1), df=df) |
---|
| 146 | } |
---|
| 147 | else { |
---|
| 148 | e.high <- 1.1*rep(max(x),N) |
---|
| 149 | } |
---|
| 150 | } |
---|
| 151 | |
---|
| 152 | # Plot outline |
---|
| 153 | |
---|
| 154 | if (Np < N) |
---|
| 155 | top <- x.max |
---|
| 156 | else |
---|
| 157 | top <- obsvd[N] |
---|
| 158 | right <- expctd[N] |
---|
| 159 | if (printpdf) {pdf(ofname,h,w)} |
---|
| 160 | plot(c(0, right), c(0, top), type="n", xlab=xlab, ylab=ylab, |
---|
| 161 | main=main, sub=sub) |
---|
| 162 | |
---|
| 163 | # Thinning |
---|
| 164 | |
---|
| 165 | if (is.na(thin[1])) { |
---|
| 166 | show <- 1:Np |
---|
| 167 | } |
---|
| 168 | else if (length(thin)!=2 || thin[1]<0 || thin[1]>1 || thin[2]<1) { |
---|
| 169 | warning("invalid thin parameter; no thinning carried out") |
---|
| 170 | show <- 1:Np |
---|
| 171 | } |
---|
| 172 | else { |
---|
| 173 | space <- right*thin[1]/floor(thin[2]) |
---|
| 174 | iat <- round((N+1)*pchisq(q=(1:floor(thin[2]))*space, df=df)) |
---|
| 175 | if (max(iat)>thin[2]) |
---|
| 176 | show <- unique(c(iat, (1+max(iat)):Np)) |
---|
| 177 | else |
---|
| 178 | show <- 1:Np |
---|
| 179 | } |
---|
| 180 | Nu <- floor(trim*N) |
---|
| 181 | if (Nu>0) |
---|
| 182 | lambda <- mean(obsvd[1:Nu])/mean(expctd[1:Nu]) |
---|
| 183 | if (!is.null(conc)) { |
---|
| 184 | if (Np<N) |
---|
| 185 | vert <- c(show, (Np+1):N) |
---|
| 186 | else |
---|
| 187 | vert <- show |
---|
| 188 | if (overdisp) |
---|
| 189 | shade(expctd[vert], lambda*e.low[vert], |
---|
| 190 | expctd[vert], lambda*e.high[vert]) |
---|
| 191 | else |
---|
| 192 | shade(expctd[vert], e.low[vert], expctd[vert], e.high[vert]) |
---|
| 193 | } |
---|
| 194 | points(expctd[show], obsvd[show], ...) |
---|
| 195 | # Overflow |
---|
| 196 | if (Np<N) { |
---|
| 197 | over <- (Np+1):N |
---|
| 198 | points(expctd[over], rep(x.max, N-Np), pch=oor.pch) |
---|
| 199 | } |
---|
| 200 | # Lines |
---|
| 201 | line.types <- c("solid", "dashed", "dotted") |
---|
| 202 | key <- NULL |
---|
| 203 | txt <- NULL |
---|
| 204 | if (slope.one) { |
---|
| 205 | key <- c(key, line.types[1]) |
---|
| 206 | txt <- c(txt, "y = x") |
---|
| 207 | abline(a=0, b=1, lty=line.types[1]) |
---|
| 208 | } |
---|
| 209 | if (slope.lambda && Nu>0) { |
---|
| 210 | key <- c(key, line.types[2]) |
---|
| 211 | txt <- c(txt, paste("y = ", format(lambda, digits=4), "x", sep="")) |
---|
| 212 | if (!is.null(conc)) { |
---|
| 213 | if (Np<N) |
---|
| 214 | vert <- c(show, (Np+1):N) |
---|
| 215 | else |
---|
| 216 | vert <- show |
---|
| 217 | } |
---|
| 218 | abline(a=0, b=lambda, lty=line.types[2]) |
---|
| 219 | } |
---|
| 220 | if (printpdf) {dev.off()} |
---|
| 221 | # Returned value |
---|
| 222 | |
---|
| 223 | if (!is.null(key)) |
---|
| 224 | legend(0, top, legend=txt, lty=key) |
---|
| 225 | c(N=N, omitted=N-Np, lambda=lambda) |
---|
| 226 | |
---|
| 227 | } |
---|
| 228 | |
---|
| 229 | """ |
---|
| 230 | |
---|
| 231 | |
---|
| 232 | |
---|
| 233 | |
---|
| 234 | def makeQQ(dat=[], sample=1.0, maxveclen=4000, fname='fname',title='title', |
---|
| 235 | xvar='Sample',h=8,w=8,logscale=True,outdir=None): |
---|
| 236 | """ |
---|
| 237 | y is data for a qq plot and ends up on the x axis go figure |
---|
| 238 | if sampling, oversample low values - all the top 1% ? |
---|
| 239 | assume we have 0-1 p values |
---|
| 240 | """ |
---|
| 241 | R = [] |
---|
| 242 | colour="maroon" |
---|
| 243 | nrows = len(dat) |
---|
| 244 | dat.sort() # small to large |
---|
| 245 | fn = float(nrows) |
---|
| 246 | unifx = [x/fn for x in range(1,(nrows+1))] |
---|
| 247 | if logscale: |
---|
| 248 | unifx = [-math.log10(x) for x in unifx] # uniform distribution |
---|
| 249 | if sample < 1.0 and len(dat) > maxveclen: |
---|
| 250 | # now have half a million markers eg - too many to plot all for a pdf - sample to get 10k or so points |
---|
| 251 | # oversample part of the distribution |
---|
| 252 | always = min(1000,nrows/20) # oversample smaller of lowest few hundred items or 5% |
---|
| 253 | skip = int(nrows/float(maxveclen)) # take 1 in skip to get about maxveclen points |
---|
| 254 | if skip <= 1: |
---|
| 255 | skip = 2 |
---|
| 256 | samplei = [i for i in range(nrows) if (i < always) or (i % skip == 0)] |
---|
| 257 | # always oversample first sorted (here lowest) values |
---|
| 258 | yvec = [dat[i] for i in samplei] # always get first and last |
---|
| 259 | xvec = [unifx[i] for i in samplei] # and sample xvec same way |
---|
| 260 | maint='QQ %s (random %d of %d)' % (title,len(yvec),nrows) |
---|
| 261 | else: |
---|
| 262 | yvec = [x for x in dat] |
---|
| 263 | maint='QQ %s (n=%d)' % (title,nrows) |
---|
| 264 | xvec = unifx |
---|
| 265 | if logscale: |
---|
| 266 | maint = 'Log%s' % maint |
---|
| 267 | mx = [0,math.log10(nrows)] # if 1000, becomes 3 for the null line |
---|
| 268 | ylab = '-log10(%s) Quantiles' % title |
---|
| 269 | xlab = '-log10(Uniform 0-1) Quantiles' |
---|
| 270 | yvec = [-math.log10(x) for x in yvec if x > 0.0] |
---|
| 271 | else: |
---|
| 272 | mx = [0,1] |
---|
| 273 | ylab = '%s Quantiles' % title |
---|
| 274 | xlab = 'Uniform 0-1 Quantiles' |
---|
| 275 | |
---|
| 276 | xv = ['%f' % x for x in xvec] |
---|
| 277 | R.append('xvec = c(%s)' % ','.join(xv)) |
---|
| 278 | yv = ['%f' % x for x in yvec] |
---|
| 279 | R.append('yvec = c(%s)' % ','.join(yv)) |
---|
| 280 | R.append('mx = c(0,%f)' % (math.log10(fn))) |
---|
| 281 | R.append('pdf("%s",h=%d,w=%d)' % (fname,h,w)) |
---|
| 282 | R.append("par(lab=c(10,10,10))") |
---|
| 283 | R.append('qqplot(xvec,yvec,xlab="%s",ylab="%s",main="%s",sub="%s",pch=19,col="%s",cex=0.8)' % (xlab,ylab,maint,title,colour)) |
---|
| 284 | R.append('points(mx,mx,type="l")') |
---|
| 285 | R.append('grid(col="lightgray",lty="dotted")') |
---|
| 286 | R.append('dev.off()') |
---|
| 287 | RRun(rcmd=R,title='makeQQplot',outdir=outdir) |
---|
| 288 | |
---|
| 289 | |
---|
| 290 | |
---|
| 291 | def main(): |
---|
| 292 | u = """ |
---|
| 293 | """ |
---|
| 294 | u = """<command interpreter="python"> |
---|
| 295 | rgQQ.py "$input1" "$name" $sample "$cols" $allqq $height $width $logtrans $allqq.id $__new_file_path__ |
---|
| 296 | </command> |
---|
| 297 | |
---|
| 298 | </command> |
---|
| 299 | """ |
---|
| 300 | print >> sys.stdout,'## rgQQ.py. cl=',sys.argv |
---|
| 301 | npar = 11 |
---|
| 302 | if len(sys.argv) < npar: |
---|
| 303 | print >> sys.stdout, '## error - too few command line parameters - wanting %d' % npar |
---|
| 304 | print >> sys.stdout, u |
---|
| 305 | sys.exit(1) |
---|
| 306 | in_fname = sys.argv[1] |
---|
| 307 | name = sys.argv[2] |
---|
| 308 | sample = float(sys.argv[3]) |
---|
| 309 | head = None |
---|
| 310 | columns = [int(x) for x in sys.argv[4].strip().split(',')] # work with python columns! |
---|
| 311 | allout = sys.argv[5] |
---|
| 312 | height = int(sys.argv[6]) |
---|
| 313 | width = int(sys.argv[7]) |
---|
| 314 | logscale = (sys.argv[8].lower() == 'true') |
---|
| 315 | outid = sys.argv[9] # this is used to allow multiple output files |
---|
| 316 | outdir = sys.argv[10] |
---|
| 317 | nan_row = False |
---|
| 318 | rows = [] |
---|
| 319 | for i, line in enumerate( file( sys.argv[1] ) ): |
---|
| 320 | # Skip comments |
---|
| 321 | if line.startswith( '#' ) or ( i == 0 ): |
---|
| 322 | if i == 0: |
---|
| 323 | head = line.strip().split("\t") |
---|
| 324 | continue |
---|
| 325 | if len(line.strip()) == 0: |
---|
| 326 | continue |
---|
| 327 | # Extract values and convert to floats |
---|
| 328 | fields = line.strip().split( "\t" ) |
---|
| 329 | row = [] |
---|
| 330 | nan_row = False |
---|
| 331 | for column in columns: |
---|
| 332 | if len( fields ) <= column: |
---|
| 333 | return fail( "No column %d on line %d: %s" % ( column, i, fields ) ) |
---|
| 334 | val = fields[column] |
---|
| 335 | if val.lower() == "na": |
---|
| 336 | nan_row = True |
---|
| 337 | else: |
---|
| 338 | try: |
---|
| 339 | row.append( float( fields[column] ) ) |
---|
| 340 | except ValueError: |
---|
| 341 | return fail( "Value '%s' in column %d on line %d is not numeric" % ( fields[column], column+1, i ) ) |
---|
| 342 | if not nan_row: |
---|
| 343 | rows.append( row ) |
---|
| 344 | if i > 1: |
---|
| 345 | i = i-1 # remove header row from count |
---|
| 346 | if head == None: |
---|
| 347 | head = ['Col%d' % (x+1) for x in columns] |
---|
| 348 | R = [] |
---|
| 349 | for c,column in enumerate(columns): # we appended each column in turn |
---|
| 350 | outname = allout |
---|
| 351 | if c > 0: # after first time |
---|
| 352 | outname = 'primary_%s_col%s_visible_pdf' % (outid,column) |
---|
| 353 | outname = os.path.join(outdir,outname) |
---|
| 354 | dat = [] |
---|
| 355 | nrows = len(rows) # sometimes lots of NA's!! |
---|
| 356 | for arow in rows: |
---|
| 357 | dat.append(arow[c]) # remember, we appended each col in turn! |
---|
| 358 | cname = head[column] |
---|
| 359 | makeQQ(dat=dat,sample=sample,fname=outname,title='%s_%s' % (name,cname), |
---|
| 360 | xvar=cname,h=height,w=width,logscale=logscale,outdir=outdir) |
---|
| 361 | |
---|
| 362 | |
---|
| 363 | |
---|
| 364 | if __name__ == "__main__": |
---|
| 365 | main() |
---|