root/galaxy-central/tools/rgenetics/rgEigPCA.py @ 2

リビジョン 2, 16.1 KB (コミッタ: hatakeyama, 14 年 前)

import galaxy-central

行番号 
1"""
2run smartpca
3
4This uses galaxy code developed by Dan to deal with
5arbitrary output files using an html dataset with it's own
6subdirectory containing the arbitrary files
7We create that html file and add all the links we need
8
9Note that we execute the smartpca.perl program in the output subdirectory
10to avoid having to clear out the job directory after running
11
12Code to convert linkage format ped files into eigenstratgeno format is left here
13in case we decide to autoconvert
14
15Added a plot in R with better labels than the default eigensoft plot december 26 2007
16
17DOCUMENTATION OF smartpca program:
18
19smartpca runs Principal Components Analysis on input genotype data and
20  outputs principal components (eigenvectors) and eigenvalues.
21  The method assumes that samples are unrelated.  (However, a small number
22  of cryptically related individuals is usually not a problem in practice
23  as they will typically be discarded as outliers.)
24
255 different input formats are supported.  See ../CONVERTF/README
26for documentation on using the convertf program to convert between formats.
27
28The syntax of smartpca is "../bin/smartpca -p parfile".  We illustrate
29how parfile works via a toy example (see example.perl in this directory).
30This example takes input in EIGENSTRAT format.  The syntax of how to take input
31in other formats is analogous to the convertf program, see ../CONVERTF/README.
32
33The smartpca program prints various statistics to standard output.
34To redirect this information to a file, change the above syntax to
35"../bin/smartpca -p parfile >logfile".  For a description of these
36statistics, see the documentation file smartpca.info in this directory.
37
38Estimated running time of the smartpca program is
39  2.5e-12 * nSNP * NSAMPLES^2 hours            if not removing outliers.
40  2.5e-12 * nSNP * NSAMPLES^2 hours * (1+m)    if m outlier removal iterations.
41Thus, under the default of up to 5 outlier removal iterations, running time is
42  up to 1.5e-11 * nSNP * NSAMPLES^2 hours.
43
44------------------------------------------------------------------------
45
46DESCRIPTION OF EACH PARAMETER in parfile for smartpca:
47
48genotypename: input genotype file (in any format: see ../CONVERTF/README)
49snpname:      input snp file      (in any format: see ../CONVERTF/README)
50indivname:    input indiv file    (in any format: see ../CONVERTF/README)
51evecoutname:  output file of eigenvectors.  See numoutevec parameter below.
52evaloutname:  output file of all eigenvalues
53
54OPTIONAL PARAMETERS:
55
56numoutevec:     number of eigenvectors to output.  Default is 10.
57numoutlieriter: maximum number of outlier removal iterations.
58  Default is 5.  To turn off outlier removal, set this parameter to 0.
59numoutlierevec: number of principal components along which to
60  remove outliers during each outlier removal iteration.  Default is 10.
61outliersigmathresh: number of standard deviations which an individual must
62  exceed, along one of the top (numoutlierevec) principal components, in
63  order for that individual to be removed as an outlier.  Default is 6.0.
64outlieroutname: output logfile of outlier individuals removed. If not specified,
65  smartpca will print this information to stdout, which is the default.
66usenorm: Whether to normalize each SNP by a quantity related to allele freq.
67  Default is YES.  (When analyzing microsatellite data, should be set to NO.
68  See Patterson et al. 2006.)
69altnormstyle: Affects very subtle details in normalization formula.
70  Default is YES (normalization formulas of Patterson et al. 2006)
71  To match EIGENSTRAT (normalization formulas of Price et al. 2006), set to NO.
72missingmode: If set to YES, then instead of doing PCA on # reference alleles,
73  do PCA on whether each data point is missing or nonmissing.  Default is NO.
74  May be useful for detecting informative missingness (Clayton et al. 2005).
75nsnpldregress: If set to a positive integer, then LD correction is turned on,
76  and input to PCA will be the residual of a regression involving that many
77  previous SNPs, according to physical location.  See Patterson et al. 2006.
78  Default is 0 (no LD correction).  If desiring LD correction, we recommend 2.
79maxdistldregress: If doing LD correction, this is the maximum genetic distance
80  (in Morgans) for previous SNPs used in LD correction.  Default is no maximum.
81poplistname:   If wishing to infer eigenvectors using only individuals from a
82  subset of populations, and then project individuals from all populations
83  onto those eigenvectors, this input file contains a list of population names,
84  one population name per line, which will be used to infer eigenvectors.
85  It is assumed that the population of each individual is specified in the
86  indiv file.  Default is to use individuals from all populations.
87phylipoutname: output file containing an fst matrix which can be used as input
88  to programs in the PHYLIP package, such as the "fitch" program for
89  constructing phylogenetic trees.
90noxdata:    if set to YES, all SNPs on X chr are excluded from the data set.
91  The smartpca default for this parameter is YES, since different variances
92  for males vs. females on X chr may confound PCA analysis.
93nomalexhet: if set to YES, any het genotypes on X chr for males are changed
94  to missing data.  The smartpca default for this parameter is YES.
95badsnpname: specifies a list of SNPs which should be excluded from the data set.
96  Same format as example.snp.  Cannot be used if input is in
97  PACKEDPED or PACKEDANCESTRYMAP format.
98popsizelimit: If set to a positive integer, the result is that only the first
99  popsizelimit individuals from each population will be included in the
100  analysis. It is assumed that the population of each individual is specified
101  in the indiv file.  Default is to use all individuals in the analysis.
102
103The next 5 optional parameters allow the user to output genotype, snp and
104  indiv files which will be identical to the input files except that:
105    Any individuals set to Ignore in the input indiv file will be
106      removed from the data set (see ../CONVERTF/README)
107    Any data excluded or set to missing based on noxdata, nomalexhet and
108      badsnpname parameters (see above) will be removed from the data set.
109    The user may decide to output these files in any format.
110outputformat:    ANCESTRYMAP,  EIGENSTRAT, PED, PACKEDPED or PACKEDANCESTRYMAP
111genotypeoutname: output genotype file
112snpoutname:      output snp file
113indivoutname:    output indiv file
114outputgroup: see documentation in ../CONVERTF/README
115"""
116import sys,os,time,subprocess,string,glob
117from rgutils import RRun, galhtmlprefix, galhtmlpostfix, timenow, smartpca, rexe, plinke
118verbose = False
119
120def makePlot(eigpca='test.pca',title='test',pdfname='test.pdf',h=8,w=10,nfp=None,rexe=''):
121    """
122    the eigenvec file has a # row with the eigenvectors, then subject ids, eigenvecs and lastly
123    the subject class
124    Rpy not being used here. Write a real R script and run it. Sadly, this means putting numbers
125    somewhere - like in the code as monster R vector constructor c(99.3,2.14) strings
126    At least you have the data and the analysis in one single place. Highly reproducible little
127    piece of research.
128    """
129    debug=False
130    f = file(eigpca,'r')
131    R = []
132    if debug:
133      R.append('sessionInfo()')
134      R.append("print('dir()=:')")
135      R.append('dir()')
136      R.append("print('pdfname=%s')" % pdfname)
137    gvec = []
138    pca1 = []
139    pca2 = []
140    groups = {}
141    glist = [] # list for legend
142    ngroup = 1 # increment for each new group encountered for pch vector
143    for n,row in enumerate(f):
144        if n > 1:
145            rowlist = row.strip().split()
146            group = rowlist[-1]
147            v1 = rowlist[1]
148            v2 = rowlist[2]
149            try:
150                v1 = float(v1)
151            except:
152                v1 = 0.0
153            try:
154                v2 = float(v2)
155            except:
156                v2 = 0.0
157            if not groups.get(group,None):
158                groups[group] = ngroup
159                glist.append(group)
160                ngroup += 1 # for next group
161            gvec.append(groups[group]) # lookup group number
162            pca1.append('%f' % v1)
163            pca2.append('%f' % v2)
164    # now have vectors of group,pca1 and pca2
165    llist = [x.encode('ascii') for x in glist] # remove label unicode - eesh
166    llist = ['"%s"' % x for x in llist] # need to quote for R
167    R.append('llist=c(%s)' % ','.join(llist))
168
169    plist = range(2,len(llist)+2) # pch - avoid black circles
170    R.append('glist=c(%s)' % ','.join(['%d' % x for x in plist]))
171    pgvec = ['%d' % (plist[i-1]) for i in gvec] # plot symbol/colour for each point
172    R.append("par(lab=c(10,10,10))") # so our grid is denser than the default 5
173    R.append("par(mai=c(1,1,1,0.5))")
174    maint = title
175    R.append('pdf("%s",h=%d,w=%d)' % (pdfname,h,w))
176    R.append("par(lab=c(10,10,10))")
177    R.append('pca1 = c(%s)' % ','.join(pca1))
178    R.append('pca2 = c(%s)' % ','.join(pca2))
179    R.append('pgvec = c(%s)' % ','.join(pgvec))
180    s = "plot(pca1,pca2,type='p',main='%s', ylab='Second ancestry eigenvector'," % maint
181    s += "xlab='First ancestry eigenvector',col=pgvec,cex=0.8,pch=pgvec)"
182    R.append(s)
183    R.append('legend("top",legend=llist,pch=glist,col=glist,title="Sample")')
184    R.append('grid(nx = 10, ny = 10, col = "lightgray", lty = "dotted")')
185    R.append('dev.off()')
186    R.append('png("%s.png",h=%d,w=%d,units="in",res=72)' % (pdfname,h,w))
187    s = "plot(pca1,pca2,type='p',main='%s', ylab='Second ancestry eigenvector'," % maint
188    s += "xlab='First ancestry eigenvector',col=pgvec,cex=0.8,pch=pgvec)"
189    R.append(s)
190    R.append('legend("top",legend=llist,pch=glist,col=glist,title="Sample")')
191    R.append('grid(nx = 10, ny = 10, col = "lightgray", lty = "dotted")')
192    R.append('dev.off()')
193    rlog,flist = RRun(rcmd=R,title=title,outdir=nfp)
194    print >> sys.stdout, '\n'.join(R)
195    print >> sys.stdout, rlog
196
197
198def getfSize(fpath,outpath):
199    """
200    format a nice file size string
201    """
202    size = ''
203    fp = os.path.join(outpath,fpath)
204    if os.path.isfile(fp):
205        n = float(os.path.getsize(fp))
206        if n > 2**20:
207            size = ' (%1.1f MB)' % (n/2**20)
208        elif n > 2**10:
209            size = ' (%1.1f KB)' % (n/2**10)
210        elif n > 0:
211            size = ' (%d B)' % (int(n))
212    return size
213
214
215def runEigen():
216    """ run the smartpca prog - documentation follows
217
218    smartpca.perl -i fakeped_100.eigenstratgeno -a fakeped_100.map -b fakeped_100.ind -p fakeped_100 -e fakeped_100.eigenvals -l
219        fakeped_100.eigenlog -o fakeped_100.eigenout
220
221DOCUMENTATION OF smartpca.perl program:
222
223This program calls the smartpca program (see ../POPGEN/README).
224For this to work, the bin directory containing smartpca MUST be in your path.
225See ./example.perl for a toy example.
226
227../bin/smartpca.perl
228-i example.geno  : genotype file in EIGENSTRAT format (see ../CONVERTF/README)
229-a example.snp   : snp file   (see ../CONVERTF/README)
230-b example.ind   : indiv file (see ../CONVERTF/README)
231-k k             : (Default is 10) number of principal components to output
232-o example.pca   : output file of principal components.  Individuals removed
233                   as outliers will have all values set to 0.0 in this file.
234-p example.plot  : prefix of output plot files of top 2 principal components.
235                   (labeling individuals according to labels in indiv file)
236-e example.eval  : output file of all eigenvalues
237-l example.log   : output logfile
238-m maxiter       : (Default is 5) maximum number of outlier removal iterations.
239                   To turn off outlier removal, set -m 0.
240-t topk          : (Default is 10) number of principal components along which
241                   to remove outliers during each outlier removal iteration.
242-s sigma         : (Default is 6.0) number of standard deviations which an
243                   individual must exceed, along one of topk top principal
244                   components, in order to be removed as an outlier.
245
246    now uses https://www.bx.psu.edu/cgi-bin/trac.cgi/galaxy/changeset/1832
247
248All files can be viewed however, by making links in the primary (HTML) history item like:
249<img src="display_child?parent_id=2&designation=SomeImage?" alt="Some Image"/>
250<a href="display_child?parent_id=2&designation=SomeText?">Some Text</a>
251
252    <command interpreter="python">
253    rgEigPCA.py "$i.extra_files_path/$i.metadata.base_name" "$title" "$out_file1"
254    "$out_file1.files_path" "$k" "$m" "$t" "$s" "$pca"
255    </command>
256
257    """
258    if len(sys.argv) < 9:
259        print 'Need an input genotype file root, a title, a temp id and the temp file path for outputs,'
260        print ' and the 4 integer tuning parameters k,m,t and s in order. Given that, will run smartpca for eigensoft'
261        sys.exit(1)
262    else:
263        print >> sys.stdout, 'rgEigPCA.py got %s' % (' '.join(sys.argv))
264    skillme = ' %s' % string.punctuation
265    trantab = string.maketrans(skillme,'_'*len(skillme))
266    ofname = sys.argv[5]
267    progname = os.path.basename(sys.argv[0])
268    infile = sys.argv[1]
269    infpath,base_name = os.path.split(infile) # now takes precomputed or autoconverted ldreduced dataset
270    title = sys.argv[2].translate(trantab) # must replace all of these for urls containing title
271    outfile1 = sys.argv[3]
272    newfilepath = sys.argv[4]
273    try:
274       os.mkdirs(newfilepath)
275    except:
276       pass
277    op = os.path.split(outfile1)[0]
278    try: # for test - needs this done
279        os.makedirs(op)
280    except:
281        pass
282    eigen_k = sys.argv[5]
283    eigen_m = sys.argv[6]
284    eigen_t = sys.argv[7]
285    eigen_s = sys.argv[8]
286    eigpca = sys.argv[9] # path to new dataset for pca results - for later adjustment
287    eigentitle = os.path.join(newfilepath,title)
288    explanations=['Samples plotted in first 2 eigenvector space','Principle components','Eigenvalues',
289    'Smartpca log (contents shown below)']
290    rplotname = 'PCAPlot.pdf'
291    eigenexts = [rplotname, "pca.xls", "eval.xls"]
292    newfiles = ['%s_%s' % (title,x) for x in eigenexts] # produced by eigenstrat
293    rplotout = os.path.join(newfilepath,newfiles[0]) # for R plots
294    eigenouts = [x for x in newfiles]
295    eigenlogf = '%s_log.txt' % title
296    newfiles.append(eigenlogf) # so it will also appear in the links
297    lfname = outfile1
298    lf = file(lfname,'w')
299    lf.write(galhtmlprefix % progname)
300    try:
301        os.makedirs(newfilepath)
302    except:
303        pass
304    smartCL = '%s -i %s.bed -a %s.bim -b %s.fam -o %s -p %s -e %s -l %s -k %s -m %s -t %s -s %s' % \
305          (smartpca,infile, infile, infile, eigenouts[1],'%s_eigensoftplot.pdf' % title,eigenouts[2],eigenlogf, \
306           eigen_k, eigen_m, eigen_t, eigen_s)
307    env = os.environ
308    p=subprocess.Popen(smartCL,shell=True,cwd=newfilepath)
309    retval = p.wait()
310    # copy the eigenvector output file needed for adjustment to the user's eigenstrat library directory
311    elog = file(os.path.join(newfilepath,eigenlogf),'r').read()
312    eeigen = os.path.join(newfilepath,'%s.evec' % eigenouts[1]) # need these for adjusting
313    try:
314        eigpcaRes = file(eeigen,'r').read()
315    except:
316        eigpcaRes = ''
317    file(eigpca,'w').write(eigpcaRes)
318    makePlot(eigpca=eigpca,pdfname=newfiles[0],title=title,nfp=newfilepath,rexe=rexe)
319    s = 'Output from %s run at %s<br/>\n' % (progname,timenow())
320    lf.write('<h4>%s</h4>\n' % s)
321    lf.write('newfilepath=%s, rexe=%s' % (newfilepath,rexe))
322    lf.write('(click on the image below to see a much higher quality PDF version)')
323    thumbnail = '%s.png' % newfiles[0] # foo.pdf.png - who cares?
324    if os.path.exists(os.path.join(newfilepath,thumbnail)):
325        lf.write('<table border="0" cellpadding="10" cellspacing="10"><tr><td>\n')
326        lf.write('<a href="%s"><img src="%s" alt="%s" hspace="10" align="left" /></a></td></tr></table><br/>\n' \
327            % (newfiles[0],thumbnail,explanations[0]))
328    allfiles = os.listdir(newfilepath)
329    allfiles.sort()
330    sizes = [getfSize(x,newfilepath) for x in allfiles]
331    lallfiles = ['<li><a href="%s">%s %s</a></li>\n' % (x,x,sizes[i]) for i,x in enumerate(allfiles)] # html list
332    lf.write('<div class="document">All Files:<ol>%s</ol></div>' % ''.join(lallfiles))
333    lf.write('<div class="document">Log %s contents follow below<p/>' % eigenlogf)
334    lf.write('<pre>%s</pre></div>' % elog) # the eigenlog
335    s = 'If you need to rerun this analysis, the command line used was\n%s\n<p/>' % (smartCL)
336    lf.write(s)
337    lf.write(galhtmlpostfix) # end galhtmlprefix div
338    lf.close()
339
340
341if __name__ == "__main__":
342   runEigen()
Note: リポジトリブラウザについてのヘルプは TracBrowser を参照してください。