1 | """ |
---|
2 | run smartpca |
---|
3 | |
---|
4 | This uses galaxy code developed by Dan to deal with |
---|
5 | arbitrary output files using an html dataset with it's own |
---|
6 | subdirectory containing the arbitrary files |
---|
7 | We create that html file and add all the links we need |
---|
8 | |
---|
9 | Note that we execute the smartpca.perl program in the output subdirectory |
---|
10 | to avoid having to clear out the job directory after running |
---|
11 | |
---|
12 | Code to convert linkage format ped files into eigenstratgeno format is left here |
---|
13 | in case we decide to autoconvert |
---|
14 | |
---|
15 | Added a plot in R with better labels than the default eigensoft plot december 26 2007 |
---|
16 | |
---|
17 | DOCUMENTATION OF smartpca program: |
---|
18 | |
---|
19 | smartpca runs Principal Components Analysis on input genotype data and |
---|
20 | outputs principal components (eigenvectors) and eigenvalues. |
---|
21 | The method assumes that samples are unrelated. (However, a small number |
---|
22 | of cryptically related individuals is usually not a problem in practice |
---|
23 | as they will typically be discarded as outliers.) |
---|
24 | |
---|
25 | 5 different input formats are supported. See ../CONVERTF/README |
---|
26 | for documentation on using the convertf program to convert between formats. |
---|
27 | |
---|
28 | The syntax of smartpca is "../bin/smartpca -p parfile". We illustrate |
---|
29 | how parfile works via a toy example (see example.perl in this directory). |
---|
30 | This example takes input in EIGENSTRAT format. The syntax of how to take input |
---|
31 | in other formats is analogous to the convertf program, see ../CONVERTF/README. |
---|
32 | |
---|
33 | The smartpca program prints various statistics to standard output. |
---|
34 | To redirect this information to a file, change the above syntax to |
---|
35 | "../bin/smartpca -p parfile >logfile". For a description of these |
---|
36 | statistics, see the documentation file smartpca.info in this directory. |
---|
37 | |
---|
38 | Estimated running time of the smartpca program is |
---|
39 | 2.5e-12 * nSNP * NSAMPLES^2 hours if not removing outliers. |
---|
40 | 2.5e-12 * nSNP * NSAMPLES^2 hours * (1+m) if m outlier removal iterations. |
---|
41 | Thus, under the default of up to 5 outlier removal iterations, running time is |
---|
42 | up to 1.5e-11 * nSNP * NSAMPLES^2 hours. |
---|
43 | |
---|
44 | ------------------------------------------------------------------------ |
---|
45 | |
---|
46 | DESCRIPTION OF EACH PARAMETER in parfile for smartpca: |
---|
47 | |
---|
48 | genotypename: input genotype file (in any format: see ../CONVERTF/README) |
---|
49 | snpname: input snp file (in any format: see ../CONVERTF/README) |
---|
50 | indivname: input indiv file (in any format: see ../CONVERTF/README) |
---|
51 | evecoutname: output file of eigenvectors. See numoutevec parameter below. |
---|
52 | evaloutname: output file of all eigenvalues |
---|
53 | |
---|
54 | OPTIONAL PARAMETERS: |
---|
55 | |
---|
56 | numoutevec: number of eigenvectors to output. Default is 10. |
---|
57 | numoutlieriter: maximum number of outlier removal iterations. |
---|
58 | Default is 5. To turn off outlier removal, set this parameter to 0. |
---|
59 | numoutlierevec: number of principal components along which to |
---|
60 | remove outliers during each outlier removal iteration. Default is 10. |
---|
61 | outliersigmathresh: number of standard deviations which an individual must |
---|
62 | exceed, along one of the top (numoutlierevec) principal components, in |
---|
63 | order for that individual to be removed as an outlier. Default is 6.0. |
---|
64 | outlieroutname: output logfile of outlier individuals removed. If not specified, |
---|
65 | smartpca will print this information to stdout, which is the default. |
---|
66 | usenorm: Whether to normalize each SNP by a quantity related to allele freq. |
---|
67 | Default is YES. (When analyzing microsatellite data, should be set to NO. |
---|
68 | See Patterson et al. 2006.) |
---|
69 | altnormstyle: Affects very subtle details in normalization formula. |
---|
70 | Default is YES (normalization formulas of Patterson et al. 2006) |
---|
71 | To match EIGENSTRAT (normalization formulas of Price et al. 2006), set to NO. |
---|
72 | missingmode: If set to YES, then instead of doing PCA on # reference alleles, |
---|
73 | do PCA on whether each data point is missing or nonmissing. Default is NO. |
---|
74 | May be useful for detecting informative missingness (Clayton et al. 2005). |
---|
75 | nsnpldregress: If set to a positive integer, then LD correction is turned on, |
---|
76 | and input to PCA will be the residual of a regression involving that many |
---|
77 | previous SNPs, according to physical location. See Patterson et al. 2006. |
---|
78 | Default is 0 (no LD correction). If desiring LD correction, we recommend 2. |
---|
79 | maxdistldregress: If doing LD correction, this is the maximum genetic distance |
---|
80 | (in Morgans) for previous SNPs used in LD correction. Default is no maximum. |
---|
81 | poplistname: If wishing to infer eigenvectors using only individuals from a |
---|
82 | subset of populations, and then project individuals from all populations |
---|
83 | onto those eigenvectors, this input file contains a list of population names, |
---|
84 | one population name per line, which will be used to infer eigenvectors. |
---|
85 | It is assumed that the population of each individual is specified in the |
---|
86 | indiv file. Default is to use individuals from all populations. |
---|
87 | phylipoutname: output file containing an fst matrix which can be used as input |
---|
88 | to programs in the PHYLIP package, such as the "fitch" program for |
---|
89 | constructing phylogenetic trees. |
---|
90 | noxdata: if set to YES, all SNPs on X chr are excluded from the data set. |
---|
91 | The smartpca default for this parameter is YES, since different variances |
---|
92 | for males vs. females on X chr may confound PCA analysis. |
---|
93 | nomalexhet: if set to YES, any het genotypes on X chr for males are changed |
---|
94 | to missing data. The smartpca default for this parameter is YES. |
---|
95 | badsnpname: specifies a list of SNPs which should be excluded from the data set. |
---|
96 | Same format as example.snp. Cannot be used if input is in |
---|
97 | PACKEDPED or PACKEDANCESTRYMAP format. |
---|
98 | popsizelimit: If set to a positive integer, the result is that only the first |
---|
99 | popsizelimit individuals from each population will be included in the |
---|
100 | analysis. It is assumed that the population of each individual is specified |
---|
101 | in the indiv file. Default is to use all individuals in the analysis. |
---|
102 | |
---|
103 | The next 5 optional parameters allow the user to output genotype, snp and |
---|
104 | indiv files which will be identical to the input files except that: |
---|
105 | Any individuals set to Ignore in the input indiv file will be |
---|
106 | removed from the data set (see ../CONVERTF/README) |
---|
107 | Any data excluded or set to missing based on noxdata, nomalexhet and |
---|
108 | badsnpname parameters (see above) will be removed from the data set. |
---|
109 | The user may decide to output these files in any format. |
---|
110 | outputformat: ANCESTRYMAP, EIGENSTRAT, PED, PACKEDPED or PACKEDANCESTRYMAP |
---|
111 | genotypeoutname: output genotype file |
---|
112 | snpoutname: output snp file |
---|
113 | indivoutname: output indiv file |
---|
114 | outputgroup: see documentation in ../CONVERTF/README |
---|
115 | """ |
---|
116 | import sys,os,time,subprocess,string,glob |
---|
117 | from rgutils import RRun, galhtmlprefix, galhtmlpostfix, timenow, smartpca, rexe, plinke |
---|
118 | verbose = False |
---|
119 | |
---|
120 | def makePlot(eigpca='test.pca',title='test',pdfname='test.pdf',h=8,w=10,nfp=None,rexe=''): |
---|
121 | """ |
---|
122 | the eigenvec file has a # row with the eigenvectors, then subject ids, eigenvecs and lastly |
---|
123 | the subject class |
---|
124 | Rpy not being used here. Write a real R script and run it. Sadly, this means putting numbers |
---|
125 | somewhere - like in the code as monster R vector constructor c(99.3,2.14) strings |
---|
126 | At least you have the data and the analysis in one single place. Highly reproducible little |
---|
127 | piece of research. |
---|
128 | """ |
---|
129 | debug=False |
---|
130 | f = file(eigpca,'r') |
---|
131 | R = [] |
---|
132 | if debug: |
---|
133 | R.append('sessionInfo()') |
---|
134 | R.append("print('dir()=:')") |
---|
135 | R.append('dir()') |
---|
136 | R.append("print('pdfname=%s')" % pdfname) |
---|
137 | gvec = [] |
---|
138 | pca1 = [] |
---|
139 | pca2 = [] |
---|
140 | groups = {} |
---|
141 | glist = [] # list for legend |
---|
142 | ngroup = 1 # increment for each new group encountered for pch vector |
---|
143 | for n,row in enumerate(f): |
---|
144 | if n > 1: |
---|
145 | rowlist = row.strip().split() |
---|
146 | group = rowlist[-1] |
---|
147 | v1 = rowlist[1] |
---|
148 | v2 = rowlist[2] |
---|
149 | try: |
---|
150 | v1 = float(v1) |
---|
151 | except: |
---|
152 | v1 = 0.0 |
---|
153 | try: |
---|
154 | v2 = float(v2) |
---|
155 | except: |
---|
156 | v2 = 0.0 |
---|
157 | if not groups.get(group,None): |
---|
158 | groups[group] = ngroup |
---|
159 | glist.append(group) |
---|
160 | ngroup += 1 # for next group |
---|
161 | gvec.append(groups[group]) # lookup group number |
---|
162 | pca1.append('%f' % v1) |
---|
163 | pca2.append('%f' % v2) |
---|
164 | # now have vectors of group,pca1 and pca2 |
---|
165 | llist = [x.encode('ascii') for x in glist] # remove label unicode - eesh |
---|
166 | llist = ['"%s"' % x for x in llist] # need to quote for R |
---|
167 | R.append('llist=c(%s)' % ','.join(llist)) |
---|
168 | |
---|
169 | plist = range(2,len(llist)+2) # pch - avoid black circles |
---|
170 | R.append('glist=c(%s)' % ','.join(['%d' % x for x in plist])) |
---|
171 | pgvec = ['%d' % (plist[i-1]) for i in gvec] # plot symbol/colour for each point |
---|
172 | R.append("par(lab=c(10,10,10))") # so our grid is denser than the default 5 |
---|
173 | R.append("par(mai=c(1,1,1,0.5))") |
---|
174 | maint = title |
---|
175 | R.append('pdf("%s",h=%d,w=%d)' % (pdfname,h,w)) |
---|
176 | R.append("par(lab=c(10,10,10))") |
---|
177 | R.append('pca1 = c(%s)' % ','.join(pca1)) |
---|
178 | R.append('pca2 = c(%s)' % ','.join(pca2)) |
---|
179 | R.append('pgvec = c(%s)' % ','.join(pgvec)) |
---|
180 | s = "plot(pca1,pca2,type='p',main='%s', ylab='Second ancestry eigenvector'," % maint |
---|
181 | s += "xlab='First ancestry eigenvector',col=pgvec,cex=0.8,pch=pgvec)" |
---|
182 | R.append(s) |
---|
183 | R.append('legend("top",legend=llist,pch=glist,col=glist,title="Sample")') |
---|
184 | R.append('grid(nx = 10, ny = 10, col = "lightgray", lty = "dotted")') |
---|
185 | R.append('dev.off()') |
---|
186 | R.append('png("%s.png",h=%d,w=%d,units="in",res=72)' % (pdfname,h,w)) |
---|
187 | s = "plot(pca1,pca2,type='p',main='%s', ylab='Second ancestry eigenvector'," % maint |
---|
188 | s += "xlab='First ancestry eigenvector',col=pgvec,cex=0.8,pch=pgvec)" |
---|
189 | R.append(s) |
---|
190 | R.append('legend("top",legend=llist,pch=glist,col=glist,title="Sample")') |
---|
191 | R.append('grid(nx = 10, ny = 10, col = "lightgray", lty = "dotted")') |
---|
192 | R.append('dev.off()') |
---|
193 | rlog,flist = RRun(rcmd=R,title=title,outdir=nfp) |
---|
194 | print >> sys.stdout, '\n'.join(R) |
---|
195 | print >> sys.stdout, rlog |
---|
196 | |
---|
197 | |
---|
198 | def getfSize(fpath,outpath): |
---|
199 | """ |
---|
200 | format a nice file size string |
---|
201 | """ |
---|
202 | size = '' |
---|
203 | fp = os.path.join(outpath,fpath) |
---|
204 | if os.path.isfile(fp): |
---|
205 | n = float(os.path.getsize(fp)) |
---|
206 | if n > 2**20: |
---|
207 | size = ' (%1.1f MB)' % (n/2**20) |
---|
208 | elif n > 2**10: |
---|
209 | size = ' (%1.1f KB)' % (n/2**10) |
---|
210 | elif n > 0: |
---|
211 | size = ' (%d B)' % (int(n)) |
---|
212 | return size |
---|
213 | |
---|
214 | |
---|
215 | def runEigen(): |
---|
216 | """ run the smartpca prog - documentation follows |
---|
217 | |
---|
218 | smartpca.perl -i fakeped_100.eigenstratgeno -a fakeped_100.map -b fakeped_100.ind -p fakeped_100 -e fakeped_100.eigenvals -l |
---|
219 | fakeped_100.eigenlog -o fakeped_100.eigenout |
---|
220 | |
---|
221 | DOCUMENTATION OF smartpca.perl program: |
---|
222 | |
---|
223 | This program calls the smartpca program (see ../POPGEN/README). |
---|
224 | For this to work, the bin directory containing smartpca MUST be in your path. |
---|
225 | See ./example.perl for a toy example. |
---|
226 | |
---|
227 | ../bin/smartpca.perl |
---|
228 | -i example.geno : genotype file in EIGENSTRAT format (see ../CONVERTF/README) |
---|
229 | -a example.snp : snp file (see ../CONVERTF/README) |
---|
230 | -b example.ind : indiv file (see ../CONVERTF/README) |
---|
231 | -k k : (Default is 10) number of principal components to output |
---|
232 | -o example.pca : output file of principal components. Individuals removed |
---|
233 | as outliers will have all values set to 0.0 in this file. |
---|
234 | -p example.plot : prefix of output plot files of top 2 principal components. |
---|
235 | (labeling individuals according to labels in indiv file) |
---|
236 | -e example.eval : output file of all eigenvalues |
---|
237 | -l example.log : output logfile |
---|
238 | -m maxiter : (Default is 5) maximum number of outlier removal iterations. |
---|
239 | To turn off outlier removal, set -m 0. |
---|
240 | -t topk : (Default is 10) number of principal components along which |
---|
241 | to remove outliers during each outlier removal iteration. |
---|
242 | -s sigma : (Default is 6.0) number of standard deviations which an |
---|
243 | individual must exceed, along one of topk top principal |
---|
244 | components, in order to be removed as an outlier. |
---|
245 | |
---|
246 | now uses https://www.bx.psu.edu/cgi-bin/trac.cgi/galaxy/changeset/1832 |
---|
247 | |
---|
248 | All files can be viewed however, by making links in the primary (HTML) history item like: |
---|
249 | <img src="display_child?parent_id=2&designation=SomeImage?" alt="Some Image"/> |
---|
250 | <a href="display_child?parent_id=2&designation=SomeText?">Some Text</a> |
---|
251 | |
---|
252 | <command interpreter="python"> |
---|
253 | rgEigPCA.py "$i.extra_files_path/$i.metadata.base_name" "$title" "$out_file1" |
---|
254 | "$out_file1.files_path" "$k" "$m" "$t" "$s" "$pca" |
---|
255 | </command> |
---|
256 | |
---|
257 | """ |
---|
258 | if len(sys.argv) < 9: |
---|
259 | print 'Need an input genotype file root, a title, a temp id and the temp file path for outputs,' |
---|
260 | print ' and the 4 integer tuning parameters k,m,t and s in order. Given that, will run smartpca for eigensoft' |
---|
261 | sys.exit(1) |
---|
262 | else: |
---|
263 | print >> sys.stdout, 'rgEigPCA.py got %s' % (' '.join(sys.argv)) |
---|
264 | skillme = ' %s' % string.punctuation |
---|
265 | trantab = string.maketrans(skillme,'_'*len(skillme)) |
---|
266 | ofname = sys.argv[5] |
---|
267 | progname = os.path.basename(sys.argv[0]) |
---|
268 | infile = sys.argv[1] |
---|
269 | infpath,base_name = os.path.split(infile) # now takes precomputed or autoconverted ldreduced dataset |
---|
270 | title = sys.argv[2].translate(trantab) # must replace all of these for urls containing title |
---|
271 | outfile1 = sys.argv[3] |
---|
272 | newfilepath = sys.argv[4] |
---|
273 | try: |
---|
274 | os.mkdirs(newfilepath) |
---|
275 | except: |
---|
276 | pass |
---|
277 | op = os.path.split(outfile1)[0] |
---|
278 | try: # for test - needs this done |
---|
279 | os.makedirs(op) |
---|
280 | except: |
---|
281 | pass |
---|
282 | eigen_k = sys.argv[5] |
---|
283 | eigen_m = sys.argv[6] |
---|
284 | eigen_t = sys.argv[7] |
---|
285 | eigen_s = sys.argv[8] |
---|
286 | eigpca = sys.argv[9] # path to new dataset for pca results - for later adjustment |
---|
287 | eigentitle = os.path.join(newfilepath,title) |
---|
288 | explanations=['Samples plotted in first 2 eigenvector space','Principle components','Eigenvalues', |
---|
289 | 'Smartpca log (contents shown below)'] |
---|
290 | rplotname = 'PCAPlot.pdf' |
---|
291 | eigenexts = [rplotname, "pca.xls", "eval.xls"] |
---|
292 | newfiles = ['%s_%s' % (title,x) for x in eigenexts] # produced by eigenstrat |
---|
293 | rplotout = os.path.join(newfilepath,newfiles[0]) # for R plots |
---|
294 | eigenouts = [x for x in newfiles] |
---|
295 | eigenlogf = '%s_log.txt' % title |
---|
296 | newfiles.append(eigenlogf) # so it will also appear in the links |
---|
297 | lfname = outfile1 |
---|
298 | lf = file(lfname,'w') |
---|
299 | lf.write(galhtmlprefix % progname) |
---|
300 | try: |
---|
301 | os.makedirs(newfilepath) |
---|
302 | except: |
---|
303 | pass |
---|
304 | smartCL = '%s -i %s.bed -a %s.bim -b %s.fam -o %s -p %s -e %s -l %s -k %s -m %s -t %s -s %s' % \ |
---|
305 | (smartpca,infile, infile, infile, eigenouts[1],'%s_eigensoftplot.pdf' % title,eigenouts[2],eigenlogf, \ |
---|
306 | eigen_k, eigen_m, eigen_t, eigen_s) |
---|
307 | env = os.environ |
---|
308 | p=subprocess.Popen(smartCL,shell=True,cwd=newfilepath) |
---|
309 | retval = p.wait() |
---|
310 | # copy the eigenvector output file needed for adjustment to the user's eigenstrat library directory |
---|
311 | elog = file(os.path.join(newfilepath,eigenlogf),'r').read() |
---|
312 | eeigen = os.path.join(newfilepath,'%s.evec' % eigenouts[1]) # need these for adjusting |
---|
313 | try: |
---|
314 | eigpcaRes = file(eeigen,'r').read() |
---|
315 | except: |
---|
316 | eigpcaRes = '' |
---|
317 | file(eigpca,'w').write(eigpcaRes) |
---|
318 | makePlot(eigpca=eigpca,pdfname=newfiles[0],title=title,nfp=newfilepath,rexe=rexe) |
---|
319 | s = 'Output from %s run at %s<br/>\n' % (progname,timenow()) |
---|
320 | lf.write('<h4>%s</h4>\n' % s) |
---|
321 | lf.write('newfilepath=%s, rexe=%s' % (newfilepath,rexe)) |
---|
322 | lf.write('(click on the image below to see a much higher quality PDF version)') |
---|
323 | thumbnail = '%s.png' % newfiles[0] # foo.pdf.png - who cares? |
---|
324 | if os.path.exists(os.path.join(newfilepath,thumbnail)): |
---|
325 | lf.write('<table border="0" cellpadding="10" cellspacing="10"><tr><td>\n') |
---|
326 | lf.write('<a href="%s"><img src="%s" alt="%s" hspace="10" align="left" /></a></td></tr></table><br/>\n' \ |
---|
327 | % (newfiles[0],thumbnail,explanations[0])) |
---|
328 | allfiles = os.listdir(newfilepath) |
---|
329 | allfiles.sort() |
---|
330 | sizes = [getfSize(x,newfilepath) for x in allfiles] |
---|
331 | lallfiles = ['<li><a href="%s">%s %s</a></li>\n' % (x,x,sizes[i]) for i,x in enumerate(allfiles)] # html list |
---|
332 | lf.write('<div class="document">All Files:<ol>%s</ol></div>' % ''.join(lallfiles)) |
---|
333 | lf.write('<div class="document">Log %s contents follow below<p/>' % eigenlogf) |
---|
334 | lf.write('<pre>%s</pre></div>' % elog) # the eigenlog |
---|
335 | s = 'If you need to rerun this analysis, the command line used was\n%s\n<p/>' % (smartCL) |
---|
336 | lf.write(s) |
---|
337 | lf.write(galhtmlpostfix) # end galhtmlprefix div |
---|
338 | lf.close() |
---|
339 | |
---|
340 | |
---|
341 | if __name__ == "__main__": |
---|
342 | runEigen() |
---|