Context Navigation

rgEigPCA.py @ 2

リビジョン 2, 16.1 KB (コミッタ: hatakeyama, 14 年前)
import galaxy-central

行番号
1	"""
2	run smartpca
3
4	This uses galaxy code developed by Dan to deal with
5	arbitrary output files using an html dataset with it's own
6	subdirectory containing the arbitrary files
7	We create that html file and add all the links we need
8
9	Note that we execute the smartpca.perl program in the output subdirectory
10	to avoid having to clear out the job directory after running
11
12	Code to convert linkage format ped files into eigenstratgeno format is left here
13	in case we decide to autoconvert
14
15	Added a plot in R with better labels than the default eigensoft plot december 26 2007
16
17	DOCUMENTATION OF smartpca program:
18
19	smartpca runs Principal Components Analysis on input genotype data and
20	outputs principal components (eigenvectors) and eigenvalues.
21	The method assumes that samples are unrelated. (However, a small number
22	of cryptically related individuals is usually not a problem in practice
23	as they will typically be discarded as outliers.)
24
25	5 different input formats are supported. See ../CONVERTF/README
26	for documentation on using the convertf program to convert between formats.
27
28	The syntax of smartpca is "../bin/smartpca -p parfile". We illustrate
29	how parfile works via a toy example (see example.perl in this directory).
30	This example takes input in EIGENSTRAT format. The syntax of how to take input
31	in other formats is analogous to the convertf program, see ../CONVERTF/README.
32
33	The smartpca program prints various statistics to standard output.
34	To redirect this information to a file, change the above syntax to
35	"../bin/smartpca -p parfile >logfile". For a description of these
36	statistics, see the documentation file smartpca.info in this directory.
37
38	Estimated running time of the smartpca program is
39	2.5e-12 * nSNP * NSAMPLES^2 hours if not removing outliers.
40	2.5e-12 * nSNP * NSAMPLES^2 hours * (1+m) if m outlier removal iterations.
41	Thus, under the default of up to 5 outlier removal iterations, running time is
42	up to 1.5e-11 * nSNP * NSAMPLES^2 hours.
43
44	------------------------------------------------------------------------
45
46	DESCRIPTION OF EACH PARAMETER in parfile for smartpca:
47
48	genotypename: input genotype file (in any format: see ../CONVERTF/README)
49	snpname: input snp file (in any format: see ../CONVERTF/README)
50	indivname: input indiv file (in any format: see ../CONVERTF/README)
51	evecoutname: output file of eigenvectors. See numoutevec parameter below.
52	evaloutname: output file of all eigenvalues
53
54	OPTIONAL PARAMETERS:
55
56	numoutevec: number of eigenvectors to output. Default is 10.
57	numoutlieriter: maximum number of outlier removal iterations.
58	Default is 5. To turn off outlier removal, set this parameter to 0.
59	numoutlierevec: number of principal components along which to
60	remove outliers during each outlier removal iteration. Default is 10.
61	outliersigmathresh: number of standard deviations which an individual must
62	exceed, along one of the top (numoutlierevec) principal components, in
63	order for that individual to be removed as an outlier. Default is 6.0.
64	outlieroutname: output logfile of outlier individuals removed. If not specified,
65	smartpca will print this information to stdout, which is the default.
66	usenorm: Whether to normalize each SNP by a quantity related to allele freq.
67	Default is YES. (When analyzing microsatellite data, should be set to NO.
68	See Patterson et al. 2006.)
69	altnormstyle: Affects very subtle details in normalization formula.
70	Default is YES (normalization formulas of Patterson et al. 2006)
71	To match EIGENSTRAT (normalization formulas of Price et al. 2006), set to NO.
72	missingmode: If set to YES, then instead of doing PCA on # reference alleles,
73	do PCA on whether each data point is missing or nonmissing. Default is NO.
74	May be useful for detecting informative missingness (Clayton et al. 2005).
75	nsnpldregress: If set to a positive integer, then LD correction is turned on,
76	and input to PCA will be the residual of a regression involving that many
77	previous SNPs, according to physical location. See Patterson et al. 2006.
78	Default is 0 (no LD correction). If desiring LD correction, we recommend 2.
79	maxdistldregress: If doing LD correction, this is the maximum genetic distance
80	(in Morgans) for previous SNPs used in LD correction. Default is no maximum.
81	poplistname: If wishing to infer eigenvectors using only individuals from a
82	subset of populations, and then project individuals from all populations
83	onto those eigenvectors, this input file contains a list of population names,
84	one population name per line, which will be used to infer eigenvectors.
85	It is assumed that the population of each individual is specified in the
86	indiv file. Default is to use individuals from all populations.
87	phylipoutname: output file containing an fst matrix which can be used as input
88	to programs in the PHYLIP package, such as the "fitch" program for
89	constructing phylogenetic trees.
90	noxdata: if set to YES, all SNPs on X chr are excluded from the data set.
91	The smartpca default for this parameter is YES, since different variances
92	for males vs. females on X chr may confound PCA analysis.
93	nomalexhet: if set to YES, any het genotypes on X chr for males are changed
94	to missing data. The smartpca default for this parameter is YES.
95	badsnpname: specifies a list of SNPs which should be excluded from the data set.
96	Same format as example.snp. Cannot be used if input is in
97	PACKEDPED or PACKEDANCESTRYMAP format.
98	popsizelimit: If set to a positive integer, the result is that only the first
99	popsizelimit individuals from each population will be included in the
100	analysis. It is assumed that the population of each individual is specified
101	in the indiv file. Default is to use all individuals in the analysis.
102
103	The next 5 optional parameters allow the user to output genotype, snp and
104	indiv files which will be identical to the input files except that:
105	Any individuals set to Ignore in the input indiv file will be
106	removed from the data set (see ../CONVERTF/README)
107	Any data excluded or set to missing based on noxdata, nomalexhet and
108	badsnpname parameters (see above) will be removed from the data set.
109	The user may decide to output these files in any format.
110	outputformat: ANCESTRYMAP, EIGENSTRAT, PED, PACKEDPED or PACKEDANCESTRYMAP
111	genotypeoutname: output genotype file
112	snpoutname: output snp file
113	indivoutname: output indiv file
114	outputgroup: see documentation in ../CONVERTF/README
115	"""
116	import sys,os,time,subprocess,string,glob
117	from rgutils import RRun, galhtmlprefix, galhtmlpostfix, timenow, smartpca, rexe, plinke
118	verbose = False
119
120	def makePlot(eigpca='test.pca',title='test',pdfname='test.pdf',h=8,w=10,nfp=None,rexe=''):
121	"""
122	the eigenvec file has a # row with the eigenvectors, then subject ids, eigenvecs and lastly
123	the subject class
124	Rpy not being used here. Write a real R script and run it. Sadly, this means putting numbers
125	somewhere - like in the code as monster R vector constructor c(99.3,2.14) strings
126	At least you have the data and the analysis in one single place. Highly reproducible little
127	piece of research.
128	"""
129	debug=False
130	f = file(eigpca,'r')
131	R = []
132	if debug:
133	R.append('sessionInfo()')
134	R.append("print('dir()=:')")
135	R.append('dir()')
136	R.append("print('pdfname=%s')" % pdfname)
137	gvec = []
138	pca1 = []
139	pca2 = []
140	groups = {}
141	glist = [] # list for legend
142	ngroup = 1 # increment for each new group encountered for pch vector
143	for n,row in enumerate(f):
144	if n > 1:
145	rowlist = row.strip().split()
146	group = rowlist[-1]
147	v1 = rowlist[1]
148	v2 = rowlist[2]
149	try:
150	v1 = float(v1)
151	except:
152	v1 = 0.0
153	try:
154	v2 = float(v2)
155	except:
156	v2 = 0.0
157	if not groups.get(group,None):
158	groups[group] = ngroup
159	glist.append(group)
160	ngroup += 1 # for next group
161	gvec.append(groups[group]) # lookup group number
162	pca1.append('%f' % v1)
163	pca2.append('%f' % v2)
164	# now have vectors of group,pca1 and pca2
165	llist = [x.encode('ascii') for x in glist] # remove label unicode - eesh
166	llist = ['"%s"' % x for x in llist] # need to quote for R
167	R.append('llist=c(%s)' % ','.join(llist))
168
169	plist = range(2,len(llist)+2) # pch - avoid black circles
170	R.append('glist=c(%s)' % ','.join(['%d' % x for x in plist]))
171	pgvec = ['%d' % (plist[i-1]) for i in gvec] # plot symbol/colour for each point
172	R.append("par(lab=c(10,10,10))") # so our grid is denser than the default 5
173	R.append("par(mai=c(1,1,1,0.5))")
174	maint = title
175	R.append('pdf("%s",h=%d,w=%d)' % (pdfname,h,w))
176	R.append("par(lab=c(10,10,10))")
177	R.append('pca1 = c(%s)' % ','.join(pca1))
178	R.append('pca2 = c(%s)' % ','.join(pca2))
179	R.append('pgvec = c(%s)' % ','.join(pgvec))
180	s = "plot(pca1,pca2,type='p',main='%s', ylab='Second ancestry eigenvector'," % maint
181	s += "xlab='First ancestry eigenvector',col=pgvec,cex=0.8,pch=pgvec)"
182	R.append(s)
183	R.append('legend("top",legend=llist,pch=glist,col=glist,title="Sample")')
184	R.append('grid(nx = 10, ny = 10, col = "lightgray", lty = "dotted")')
185	R.append('dev.off()')
186	R.append('png("%s.png",h=%d,w=%d,units="in",res=72)' % (pdfname,h,w))
187	s = "plot(pca1,pca2,type='p',main='%s', ylab='Second ancestry eigenvector'," % maint
188	s += "xlab='First ancestry eigenvector',col=pgvec,cex=0.8,pch=pgvec)"
189	R.append(s)
190	R.append('legend("top",legend=llist,pch=glist,col=glist,title="Sample")')
191	R.append('grid(nx = 10, ny = 10, col = "lightgray", lty = "dotted")')
192	R.append('dev.off()')
193	rlog,flist = RRun(rcmd=R,title=title,outdir=nfp)
194	print >> sys.stdout, '\n'.join(R)
195	print >> sys.stdout, rlog
196
197
198	def getfSize(fpath,outpath):
199	"""
200	format a nice file size string
201	"""
202	size = ''
203	fp = os.path.join(outpath,fpath)
204	if os.path.isfile(fp):
205	n = float(os.path.getsize(fp))
206	if n > 2**20:
207	size = ' (%1.1f MB)' % (n/2**20)
208	elif n > 2**10:
209	size = ' (%1.1f KB)' % (n/2**10)
210	elif n > 0:
211	size = ' (%d B)' % (int(n))
212	return size
213
214
215	def runEigen():
216	""" run the smartpca prog - documentation follows
217
218	smartpca.perl -i fakeped_100.eigenstratgeno -a fakeped_100.map -b fakeped_100.ind -p fakeped_100 -e fakeped_100.eigenvals -l
219	fakeped_100.eigenlog -o fakeped_100.eigenout
220
221	DOCUMENTATION OF smartpca.perl program:
222
223	This program calls the smartpca program (see ../POPGEN/README).
224	For this to work, the bin directory containing smartpca MUST be in your path.
225	See ./example.perl for a toy example.
226
227	../bin/smartpca.perl
228	-i example.geno : genotype file in EIGENSTRAT format (see ../CONVERTF/README)
229	-a example.snp : snp file (see ../CONVERTF/README)
230	-b example.ind : indiv file (see ../CONVERTF/README)
231	-k k : (Default is 10) number of principal components to output
232	-o example.pca : output file of principal components. Individuals removed
233	as outliers will have all values set to 0.0 in this file.
234	-p example.plot : prefix of output plot files of top 2 principal components.
235	(labeling individuals according to labels in indiv file)
236	-e example.eval : output file of all eigenvalues
237	-l example.log : output logfile
238	-m maxiter : (Default is 5) maximum number of outlier removal iterations.
239	To turn off outlier removal, set -m 0.
240	-t topk : (Default is 10) number of principal components along which
241	to remove outliers during each outlier removal iteration.
242	-s sigma : (Default is 6.0) number of standard deviations which an
243	individual must exceed, along one of topk top principal
244	components, in order to be removed as an outlier.
245
246	now uses https://www.bx.psu.edu/cgi-bin/trac.cgi/galaxy/changeset/1832
247
248	All files can be viewed however, by making links in the primary (HTML) history item like:
249	<img src="display_child?parent_id=2&designation=SomeImage?" alt="Some Image"/>
250	<a href="display_child?parent_id=2&designation=SomeText?">Some Text</a>
251
252	<command interpreter="python">
253	rgEigPCA.py "$i.extra_files_path/$i.metadata.base_name" "$title" "$out_file1"
254	"$out_file1.files_path" "$k" "$m" "$t" "$s" "$pca"
255	</command>
256
257	"""
258	if len(sys.argv) < 9:
259	print 'Need an input genotype file root, a title, a temp id and the temp file path for outputs,'
260	print ' and the 4 integer tuning parameters k,m,t and s in order. Given that, will run smartpca for eigensoft'
261	sys.exit(1)
262	else:
263	print >> sys.stdout, 'rgEigPCA.py got %s' % (' '.join(sys.argv))
264	skillme = ' %s' % string.punctuation
265	trantab = string.maketrans(skillme,'_'*len(skillme))
266	ofname = sys.argv[5]
267	progname = os.path.basename(sys.argv[0])
268	infile = sys.argv[1]
269	infpath,base_name = os.path.split(infile) # now takes precomputed or autoconverted ldreduced dataset
270	title = sys.argv[2].translate(trantab) # must replace all of these for urls containing title
271	outfile1 = sys.argv[3]
272	newfilepath = sys.argv[4]
273	try:
274	os.mkdirs(newfilepath)
275	except:
276	pass
277	op = os.path.split(outfile1)[0]
278	try: # for test - needs this done
279	os.makedirs(op)
280	except:
281	pass
282	eigen_k = sys.argv[5]
283	eigen_m = sys.argv[6]
284	eigen_t = sys.argv[7]
285	eigen_s = sys.argv[8]
286	eigpca = sys.argv[9] # path to new dataset for pca results - for later adjustment
287	eigentitle = os.path.join(newfilepath,title)
288	explanations=['Samples plotted in first 2 eigenvector space','Principle components','Eigenvalues',
289	'Smartpca log (contents shown below)']
290	rplotname = 'PCAPlot.pdf'
291	eigenexts = [rplotname, "pca.xls", "eval.xls"]
292	newfiles = ['%s_%s' % (title,x) for x in eigenexts] # produced by eigenstrat
293	rplotout = os.path.join(newfilepath,newfiles[0]) # for R plots
294	eigenouts = [x for x in newfiles]
295	eigenlogf = '%s_log.txt' % title
296	newfiles.append(eigenlogf) # so it will also appear in the links
297	lfname = outfile1
298	lf = file(lfname,'w')
299	lf.write(galhtmlprefix % progname)
300	try:
301	os.makedirs(newfilepath)
302	except:
303	pass
304	smartCL = '%s -i %s.bed -a %s.bim -b %s.fam -o %s -p %s -e %s -l %s -k %s -m %s -t %s -s %s' % \
305	(smartpca,infile, infile, infile, eigenouts[1],'%s_eigensoftplot.pdf' % title,eigenouts[2],eigenlogf, \
306	eigen_k, eigen_m, eigen_t, eigen_s)
307	env = os.environ
308	p=subprocess.Popen(smartCL,shell=True,cwd=newfilepath)
309	retval = p.wait()
310	# copy the eigenvector output file needed for adjustment to the user's eigenstrat library directory
311	elog = file(os.path.join(newfilepath,eigenlogf),'r').read()
312	eeigen = os.path.join(newfilepath,'%s.evec' % eigenouts[1]) # need these for adjusting
313	try:
314	eigpcaRes = file(eeigen,'r').read()
315	except:
316	eigpcaRes = ''
317	file(eigpca,'w').write(eigpcaRes)
318	makePlot(eigpca=eigpca,pdfname=newfiles[0],title=title,nfp=newfilepath,rexe=rexe)
319	s = 'Output from %s run at %s<br/>\n' % (progname,timenow())
320	lf.write('<h4>%s</h4>\n' % s)
321	lf.write('newfilepath=%s, rexe=%s' % (newfilepath,rexe))
322	lf.write('(click on the image below to see a much higher quality PDF version)')
323	thumbnail = '%s.png' % newfiles[0] # foo.pdf.png - who cares?
324	if os.path.exists(os.path.join(newfilepath,thumbnail)):
325	lf.write('<table border="0" cellpadding="10" cellspacing="10"><tr><td>\n')
326	lf.write('<a href="%s"><img src="%s" alt="%s" hspace="10" align="left" /></a></td></tr></table><br/>\n' \
327	% (newfiles[0],thumbnail,explanations[0]))
328	allfiles = os.listdir(newfilepath)
329	allfiles.sort()
330	sizes = [getfSize(x,newfilepath) for x in allfiles]
331	lallfiles = ['<li><a href="%s">%s %s</a></li>\n' % (x,x,sizes[i]) for i,x in enumerate(allfiles)] # html list
332	lf.write('<div class="document">All Files:<ol>%s</ol></div>' % ''.join(lallfiles))
333	lf.write('<div class="document">Log %s contents follow below<p/>' % eigenlogf)
334	lf.write('<pre>%s</pre></div>' % elog) # the eigenlog
335	s = 'If you need to rerun this analysis, the command line used was\n%s\n<p/>' % (smartCL)
336	lf.write(s)
337	lf.write(galhtmlpostfix) # end galhtmlprefix div
338	lf.close()
339
340
341	if __name__ == "__main__":
342	runEigen()

Note: リポジトリブラウザについてのヘルプは TracBrowser を参照してください。

Context Navigation

root/galaxy-central/tools/rgenetics/rgEigPCA.py @ 2

異なるフォーマットでダウンロード: