formatHelp.html

Revision 2, 17.6 KB (committed by hatakeyama, 15 years ago)
import galaxy-central

<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN"
"http://www.w3.org/TR/html4/loose.dtd">
<html>
<head>
<title>Galaxy Data Formats</title>
<meta http-equiv="Content-Type" content="text/html; charset=utf-8">
<meta http-equiv="Content-Style-Type" content="text/css">
<style type="text/css">
hr { margin-top: 3ex; margin-bottom: 1ex; border: 1px inset }
</style>
</head>
<body>
<h2>Galaxy Data Formats</h2>
<p>
<br>

<h3>Dataset missing?</h3>
<p>
If you have a dataset in your history that is not appearing in the
drop-down selector for a tool, the most common reason is that it has
the wrong format. Each Galaxy dataset has an associated file format
recorded in its metadata, and tools will only list datasets from your
history that have a format compatible with that particular tool. Of
course, some of these datasets might not actually contain relevant
data, or even the correct columns needed by the tool, but filtering
by format at least shortens the list to select from.
<p>
Some of the formats are defined hierarchically, going from very
general ones like <a href="#tab">Tabular</a> (which includes any text
file with tab-separated columns), to more restrictive sub-formats
like <a href="#interval">Interval</a> (where three of the columns
must be the chromosome, start position, and end position), and on
to even more specific ones such as <a href="#bed">BED</a> that have
additional requirements. So, for example, if a tool's required input
format is Tabular, then all of your history items whose format is
recorded as Tabular will be listed, along with those in all
sub-formats that also qualify as Tabular (Interval, BED, GFF, etc.).
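<p>
The filtering rule above can be sketched in Python. This is only an
illustration: the <code>PARENT</code> table and the helper names are
hypothetical and are not Galaxy's actual datatype registry. The
parent links follow the "also qualifies as" notes in the format
descriptions on this page.

```python
# Hypothetical parent table (not Galaxy's real registry): each format
# points at the more general format it also qualifies as.
PARENT = {
    "tabular": None,
    "interval": "tabular",
    "bed": "interval",
    "bedgraph": "bed",
    "gff": "tabular",
    "gff3": "tabular",
    "gtf": "tabular",
}


def qualifies_as(fmt, required):
    """Return True if fmt is the required format or one of its sub-formats."""
    while fmt is not None:
        if fmt == required:
            return True
        fmt = PARENT.get(fmt)  # unknown formats have no parent
    return False


def selectable(history, required):
    """List the names of history items a tool requiring `required` would offer."""
    return [name for name, fmt in history if qualifies_as(fmt, required)]
```

With a history of one BED, one GFF, and one FASTA dataset, a tool that
requires Tabular would list the first two, and a tool that requires
Interval would list only the BED dataset.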
<p>
There are two usual methods for changing a dataset's format in
Galaxy: if the file contents are already in the required format but
the metadata is wrong (perhaps because the Auto-detect feature of the
Upload File tool guessed it incorrectly), you can fix the metadata
manually by clicking on the pencil icon beside that dataset in your
history. Or, if the file contents really are in a different format,
Galaxy provides a number of format conversion tools (e.g. in the
Text Manipulation and Convert Formats categories). For instance,
if the tool you want to run requires Tabular but your columns are
delimited by spaces or commas, you can use the "Convert delimiters
to TAB" tool under Text Manipulation to reformat your data. However,
if your files are in a completely unsupported format, then you need
to convert them yourself before uploading.
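<p>
If you prefer to fix delimiters yourself before uploading, the same
kind of conversion can be sketched in a few lines of Python. The
function name is hypothetical, and collapsing each run of spaces or
commas into a single tab is an assumption of this sketch; the Galaxy
tool may behave differently.

```python
import re


def delimiters_to_tab(line, delimiters=r"[ ,]+"):
    """Replace each run of delimiter characters with a single tab.

    Assumption: consecutive delimiters are collapsed, so empty fields
    are not preserved.
    """
    return re.sub(delimiters, "\t", line.strip())
```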
<p>
<hr>

<h3>Format Descriptions</h3>
<ul>
<li><a href="#ab1">AB1</a>
<li><a href="#axt">AXT</a>
<li><a href="#bam">BAM</a>
<li><a href="#bed">BED</a>
<li><a href="#bedgraph">BedGraph</a>
<li><a href="#binseq">Binseq.zip</a>
<li><a href="#fasta">FASTA</a>
<li><a href="#fastqsolexa">FastqSolexa</a>
<li><a href="#fped">FPED</a>
<li><a href="#gff">GFF</a>
<li><a href="#gff3">GFF3</a>
<li><a href="#gtf">GTF</a>
<li><a href="#html">HTML</a>
<li><a href="#interval">Interval</a>
<li><a href="#lav">LAV</a>
<li><a href="#lped">LPED</a>
<li><a href="#maf">MAF</a>
<li><a href="#pbed">PBED</a>
<li><a href="#psl">PSL</a>
<li><a href="#scf">SCF</a>
<li><a href="#sff">SFF</a>
<li><a href="#table">Table</a>
<li><a href="#tab">Tabular</a>
<li><a href="#txtseqzip">Txtseq.zip</a>
<li><a href="#wig">Wiggle custom track</a>
<li><a href="#text">Other text type</a>
</ul>
<p>

<div><a name="ab1"></a></div>
<hr>
<strong>AB1</strong>
<p>
This is one of the ABIF family of binary sequence formats from
Applied Biosystems Inc.
<!-- Their PDF
<a href="http://www.appliedbiosystems.com/support/software_community/ABIF_File_Format.pdf"
>format specification</a> is unfortunately password-protected. -->
Files should have a '<code>.ab1</code>' file extension. You must
manually select this file format when uploading the file.
<p>

<div><a name="axt"></a></div>
<hr>
<strong>AXT</strong>
<p>
Used for pairwise alignment output from BLASTZ, after post-processing.
Each alignment block contains three lines: a summary line and two
sequence lines. Blocks are separated from one another by blank lines.
The summary line contains chromosomal position and size information
about the alignment, and consists of nine required fields.
<a href="http://main.genome-browser.bx.psu.edu/goldenPath/help/axt.html"
>More information</a>
<!-- (not available on Main)
<dl><dt>Can be converted to:
<dd><ul>
<li>FASTA<br>
Convert Formats → AXT to FASTA
<li>LAV<br>
Convert Formats → AXT to LAV
</ul></dl>
-->
<p>

<div><a name="bam"></a></div>
<hr>
<strong>BAM</strong>
<p>
A binary alignment file compressed in the BGZF format with a
'<code>.bam</code>' file extension.
<!-- You must manually select this file format when uploading the file. -->
<a href="http://samtools.sourceforge.net/SAM1.pdf">SAM</a>
is the human-readable text version of this format.
<dl><dt>Can be converted to:
<dd><ul>
<li>SAM<br>
NGS: SAM Tools → BAM-to-SAM
<li>Pileup<br>
NGS: SAM Tools → Generate pileup
<li>Interval<br>
First convert to Pileup as above, then use
NGS: SAM Tools → Pileup-to-Interval
</ul></dl>
<p>

<div><a name="bed"></a></div>
<hr>
<strong>BED</strong>
<p>
<ul>
<li> also qualifies as Tabular
<li> also qualifies as Interval
</ul>
This tab-separated format describes a genomic interval, but has
strict field specifications for use in genome browsers. BED files
can have from 3 to 12 columns, but the order of the columns matters,
and only the end ones can be omitted. Some groups of columns must
be all present or all absent. As in Interval format (but unlike
GFF and its relatives), the interval endpoints use a 0-based,
half-open numbering system.
<a href="http://main.genome-browser.bx.psu.edu/goldenPath/help/hgTracksHelp.html#BED"
>Field specifications</a>
<p>
Example:
<pre>
chr22 1000 5000 cloneA 960 + 1000 5000 0 2 567,488, 0,3512
chr22 2000 6000 cloneB 900 - 2000 6000 0 2 433,399, 0,3601
</pre>
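<p>
To make the example concrete, here is a rough Python sketch that reads
the last three columns of a 12-column BED line (block count, block
sizes, and block starts relative to the feature start, as named in the
linked field specifications) and returns the absolute 0-based,
half-open block coordinates. The function name is hypothetical, and
the sketch splits on any whitespace rather than strictly on tabs.

```python
def bed_blocks(line):
    """Return (chrom, [(start, end), ...]) for a 12-column BED line.

    Block starts in column 12 are relative to the feature start in
    column 2; the returned coordinates are absolute and half-open.
    """
    fields = line.split()
    if len(fields) != 12:
        raise ValueError("expected a 12-column BED line")
    chrom, chrom_start = fields[0], int(fields[1])
    block_count = int(fields[9])
    # Trailing commas are allowed, so drop empty pieces.
    sizes = [int(x) for x in fields[10].split(",") if x]
    starts = [int(x) for x in fields[11].split(",") if x]
    if not (len(sizes) == len(starts) == block_count):
        raise ValueError("block count does not match block lists")
    blocks = [(chrom_start + s, chrom_start + s + n)
              for s, n in zip(starts, sizes)]
    return chrom, blocks
```

For the first example line, this gives two blocks on chr22: 1000 to
1567 and 4512 to 5000.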
<dl><dt>Can be converted to:
<dd><ul>
<li>GFF<br>
Convert Formats → BED-to-GFF
</ul></dl>
<p>

<div><a name="bedgraph"></a></div>
<hr>
<strong>BedGraph</strong>
<p>
<ul>
<li> also qualifies as Tabular
<li> also qualifies as Interval
<li> also qualifies as BED
</ul>
<a href="http://main.genome-browser.bx.psu.edu/goldenPath/help/bedgraph.html"
>BedGraph</a> is a BED file whose name column holds a floating-point
value that is displayed as a wiggle score in tracks. Unlike in Wiggle
format, the exact value of this score can be retrieved after the file
is loaded as a track.
<p>

<div><a name="binseq"></a></div>
<hr>
<strong>Binseq.zip</strong>
<p>
A zipped archive consisting of binary sequence files in either AB1
or SCF format. All files in this archive must have the same file
extension, either '<code>.ab1</code>' or '<code>.scf</code>'.
You must manually select this file format when uploading the file.
<p>

<div><a name="fasta"></a></div>
<hr>
<strong>FASTA</strong>
<p>
A sequence in
<a href="http://www.ncbi.nlm.nih.gov/blast/fasta.shtml">FASTA</a>
format consists of a single-line description, followed by lines of
sequence data. The first character of the description line is a
greater-than ('<code>&gt;</code>') symbol. All lines should be
shorter than 80 characters.
<pre>
&gt;sequence1
atgcgtttgcgtgc
gtcggtttcgttgc
&gt;sequence2
tttcgtgcgtatag
tggcgcggtga
</pre>
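<p>
As an illustration of this layout, the following Python sketch reads
FASTA text into a dictionary that maps each description line (without
the leading greater-than symbol) to its joined sequence. The function
name is hypothetical, and the sketch assumes description lines are
unique.

```python
def parse_fasta(text):
    """Map each description line to its concatenated sequence lines."""
    records = {}
    name = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue  # ignore blank lines
        if line.startswith(">"):
            name = line[1:]
            records[name] = []
        elif name is None:
            raise ValueError("sequence data before first description line")
        else:
            records[name].append(line)
    return {key: "".join(parts) for key, parts in records.items()}
```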
<dl><dt>Can be converted to:
<dd><ul>
<li>Tabular<br>
Convert Formats → FASTA-to-Tabular
</ul></dl>
<p>

<div><a name="fastqsolexa"></a></div>
<hr>
<strong>FastqSolexa</strong>
<p>
<a href="http://maq.sourceforge.net/fastq.shtml">FastqSolexa</a>
is the Illumina (Solexa) variant of the FASTQ format, which stores
sequences and quality scores in a single file.
<pre>
@seq1
GACAGCTTGGTTTTTAGTGAGTTGTTCCTTTCTTT
+seq1
hhhhhhhhhhhhhhhhhhhhhhhhhhPW@hhhhhh
@seq2
GCAATGACGGCAGCAATAAACTCAACAGGTGCTGG
+seq2
hhhhhhhhhhhhhhYhhahhhhWhAhFhSIJGChO
</pre>
Or
<pre>
@seq1
GAATTGATCAGGACATAGGACAACTGTAGGCACCAT
+seq1
40 40 40 40 35 40 40 40 25 40 40 26 40 9 33 11 40 35 17 40 40 33 40 7 9 15 3 22 15 30 11 17 9 4 9 4
@seq2
GAGTTCTCGTCGCCTGTAGGCACCATCAATCGTATG
+seq2
40 15 40 17 6 36 40 40 40 25 40 9 35 33 40 14 14 18 15 17 19 28 31 4 24 18 27 14 15 18 2 8 12 8 11 9
</pre>
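<p>
The two quality-line styles shown above can be read with one small
Python helper. This is a sketch under stated assumptions: the function
name is hypothetical, the character form is assumed to use an ASCII
offset of 64 (the usual Solexa convention, under which
'<code>h</code>' decodes to 40), and any line containing a space is
treated as the space-separated integer form.

```python
def decode_quality(quality_line, offset=64):
    """Return the list of integer quality scores for one quality line.

    Assumption: lines containing spaces hold integer scores; otherwise
    each character encodes one score as ord(char) - offset.
    """
    stripped = quality_line.strip()
    if " " in stripped:
        return [int(token) for token in stripped.split()]
    return [ord(char) - offset for char in stripped]
```

A read with a single base and an integer score cannot be told apart
from the character form by this heuristic, so real data would need the
encoding to be known in advance.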
<dl><dt>Can be converted to:
<dd><ul>
<li>FASTA<br>
NGS: QC and manipulation → Generic FASTQ manipulation → FASTQ to FASTA
<li>Tabular<br>
NGS: QC and manipulation → Generic FASTQ manipulation → FASTQ to Tabular
</ul></dl>
<p>

<div><a name="fped"></a></div>
<hr>
<strong>FPED</strong>
<p>
Also known as the FBAT format, for use with the
<a href="http://biosun1.harvard.edu/~fbat/fbat.htm">FBAT</a> program.
It consists of a pedigree file and a phenotype file.
<p>

<div><a name="gff"></a></div>
<hr>
<strong>GFF</strong>
<p>
<ul>
<li> also qualifies as Tabular
</ul>
GFF is a tab-separated format somewhat similar to BED, but it has
different columns and is more flexible. There are
<a href="http://main.genome-browser.bx.psu.edu/FAQ/FAQformat#format3"
>nine required fields</a>.
Note that unlike Interval and BED, GFF and its relatives (GFF3, GTF)
use 1-based inclusive coordinates to specify genomic intervals.
<dl><dt>Can be converted to:
<dd><ul>
<li>BED<br>
Convert Formats → GFF-to-BED
</ul></dl>
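<p>
The difference between the two coordinate systems comes down to the
start position. The Python sketch below shows the arithmetic only; the
function names are hypothetical and this is not the code of the
Galaxy conversion tools.

```python
def bed_to_gff_coords(start, end):
    """0-based half-open (BED, Interval) to 1-based inclusive (GFF)."""
    return start + 1, end


def gff_to_bed_coords(start, end):
    """1-based inclusive (GFF) to 0-based half-open (BED, Interval)."""
    return start - 1, end
```

The first 100 bases of a chromosome are 0 to 100 in BED and 1 to 100
in GFF. The feature length is <code>end - start</code> in BED and
<code>end - start + 1</code> in GFF.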
<p>

<div><a name="gff3"></a></div>
<hr>
<strong>GFF3</strong>
<p>
<ul>
<li> also qualifies as Tabular
</ul>
The <a href="http://www.sequenceontology.org/gff3.shtml">GFF3</a>
format addresses the most common extensions to GFF, while attempting
to preserve compatibility with previous formats.
Note that unlike Interval and BED, GFF and its relatives (GFF3, GTF)
use 1-based inclusive coordinates to specify genomic intervals.
<p>

<div><a name="gtf"></a></div>
<hr>
<strong>GTF</strong>
<p>
<ul>
<li> also qualifies as Tabular
</ul>
<a href="http://main.genome-browser.bx.psu.edu/FAQ/FAQformat#format4"
>GTF</a> is a format for describing genes and other features associated
with DNA, RNA, and protein sequences. It is a refinement of GFF that
tightens the specification.
Note that unlike Interval and BED, GFF and its relatives (GFF3, GTF)
use 1-based inclusive coordinates to specify genomic intervals.
<!-- (not available on Main)
<dl><dt>Can be converted to:
<dd><ul>
<li>BedGraph<br>
Convert Formats → GTF-to-BEDGraph
</ul></dl>
-->
<p>

<div><a name="html"></a></div>
<hr>
<strong>HTML</strong>
<p>
This format is an HTML web page. Click the eye icon next to the
dataset to view it in your browser.
<p>

<div><a name="interval"></a></div>
<hr>
<strong>Interval</strong>
<p>
<ul>
<li> also qualifies as Tabular
</ul>
This Galaxy format represents genomic intervals. It is tab-separated,
but has the added requirement that three of the columns must be the
chromosome name, start position, and end position, where the positions
use a 0-based, half-open numbering system (see below). An optional
strand column can also be specified, and an initial header row can
be used to label the columns, which do not have to be in any special
order. Arbitrary additional columns can also be present.
<p>
Required fields:
<ul>
<li>CHROM - The name of the chromosome (e.g. chr3, chrY, chr2_random)
or contig (e.g. ctgY1).
<li>START - The starting position of the feature in the chromosome or
contig. The first base in a chromosome is numbered 0.
<li>END - The ending position of the feature in the chromosome or
contig. This base is not included in the feature. For example,
the first 100 bases of a chromosome are described as START=0,
END=100, and span the bases numbered 0-99.
</ul>
Optional:
<ul>
<li>STRAND - Defines the strand, either '<code>+</code>' or
'<code>-</code>'.
<li>Header row
</ul>
Example:
<pre>
#CHROM START END STRAND NAME COMMENT
chr1 10 100 + exon myExon
chrX 1000 10050 - gene myGene
</pre>
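<p>
Because the columns may appear in any order, a reader has to locate
them before it can use them. The Python sketch below does this from
the header row, which is an assumption of the sketch: Galaxy records
the column assignments in the dataset's metadata, and the function
name here is hypothetical. It also splits on any whitespace, whereas
real Interval files are tab-separated.

```python
def parse_intervals(text):
    """Parse Interval text whose first line is a '#'-prefixed header row."""
    lines = [line for line in text.splitlines() if line.strip()]
    header = lines[0].lstrip("#").split()
    column = {name.upper(): index for index, name in enumerate(header)}
    records = []
    for line in lines[1:]:
        fields = line.split()
        start = int(fields[column["START"]])
        end = int(fields[column["END"]])
        records.append({
            "chrom": fields[column["CHROM"]],
            "start": start,
            "end": end,
            # STRAND is optional in Interval format.
            "strand": fields[column["STRAND"]] if "STRAND" in column else None,
            # Half-open coordinates: END is not part of the feature.
            "length": end - start,
        })
    return records
```

For the example above, the first record spans 90 bases (bases 10
through 99) and the second spans 9050 bases.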
<dl><dt>Can be converted to:
<dd><ul>
<li>BED<br>
The exact changes needed and tools to run will vary with what fields
are in the Interval file and what type of BED you are converting to.
In general you will likely use Text Manipulation → Compute, Cut,
or Merge Columns.
</ul></dl>
<p>

<div><a name="lav"></a></div>
<hr>
<strong>LAV</strong>
<p>
<a href="http://www.bx.psu.edu/miller_lab/dist/lav_format.html">LAV</a>
is the raw pairwise alignment format that is output by BLASTZ. The
first line begins with <code>#:lav</code>.
<!-- (not available on Main)
<dl><dt>Can be converted to:
<dd><ul>
<li>BED<br>
Convert Formats → LAV to BED
</ul></dl>
-->
<p>

<div><a name="lped"></a></div>
<hr>
<strong>LPED</strong>
<p>
This is the linkage pedigree format, which consists of separate MAP and PED
files. Together these files describe SNPs; the map file contains the position
and an identifier for the SNP, while the pedigree file has the alleles. To
upload this format into Galaxy, do not use Auto-detect for the file format;
instead select <code>lped</code>. You will then be given two sections for
uploading files, one for the pedigree file and one for the map file. For more
information, see
<a href="http://www.broadinstitute.org/science/programs/medical-and-population-genetics/haploview/input-file-formats-0"
>linkage pedigree</a>,
<a href="http://pngu.mgh.harvard.edu/~purcell/plink/data.shtml#map">MAP</a>,
and/or <a href="http://pngu.mgh.harvard.edu/~purcell/plink/data.shtml#ped">PED</a>.
<dl><dt>Can be converted to:
<dd><ul>
<li>PBED<br>Automatic
<li>FPED<br>Automatic
</ul></dl>
<p>

<div><a name="maf"></a></div>
<hr>
<strong>MAF</strong>
<p>
<a href="http://main.genome-browser.bx.psu.edu/FAQ/FAQformat#format5"
>MAF</a> is the multi-sequence alignment format that is output by TBA
and Multiz. The first line begins with '<code>##maf</code>'. This
word is followed by whitespace-separated "variable<code>=</code>value"
pairs. There should be no whitespace surrounding the '<code>=</code>'.
<dl><dt>Can be converted to:
<dd><ul>
<li>BED<br>
Convert Formats → MAF to BED
<li>Interval<br>
Convert Formats → MAF to Interval
<li>FASTA<br>
Convert Formats → MAF to FASTA
</ul></dl>
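<p>
The header rule for MAF files described above can be checked with a
short Python sketch. The function name is hypothetical, and the
variable names used in the usage note are only illustrative.

```python
def parse_maf_header(line):
    """Return the variable=value pairs from a '##maf' header line."""
    tokens = line.split()
    if not tokens or tokens[0] != "##maf":
        raise ValueError("first line must begin with ##maf")
    pairs = {}
    for token in tokens[1:]:
        key, separator, value = token.partition("=")
        # Whitespace around '=' would leave a token without a key or '='.
        if not separator or not key:
            raise ValueError("malformed variable=value pair: %r" % token)
        pairs[key] = value
    return pairs
```

A header such as <code>##maf version=1 scoring=tba.v8</code> yields
two pairs, while <code>##maf version = 1</code> is rejected because of
the whitespace around the equals sign.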
<p>

<div><a name="pbed"></a></div>
<hr>
<strong>PBED</strong>
<p>
This is the binary version of the LPED format.
<dl><dt>Can be converted to:
<dd><ul>
<li>LPED<br>Automatic
</ul></dl>
<p>

<div><a name="psl"></a></div>
<hr>
<strong>PSL</strong>
<p>
<a href="http://main.genome-browser.bx.psu.edu/FAQ/FAQformat#format2">PSL</a>
format is used for alignments returned by
<a href="http://genome.ucsc.edu/cgi-bin/hgBlat?command=start">BLAT</a>.
It does not include any sequence.
<p>

<div><a name="scf"></a></div>
<hr>
<strong>SCF</strong>
<p>
This is a binary sequence format originally designed for the Staden
sequence handling software package. Files should have a
'<code>.scf</code>' file extension. You must manually select this
file format when uploading the file.
<a href="http://staden.sourceforge.net/manual/formats_unix_2.html"
>More information</a>
<p>

<div><a name="sff"></a></div>
<hr>
<strong>SFF</strong>
<p>
This is a binary sequence format used by the Roche 454 GS FLX
sequencing machine, and is documented on p. 528 of their
<a href="http://sequence.otago.ac.nz/download/GS_FLX_Software_Manual.pdf"
>software manual</a>. Files should have a '<code>.sff</code>' file
extension.
<!-- You must manually select this file format when uploading the file. -->
<dl><dt>Can be converted to:
<dd><ul>
<li>FASTA<br>
Convert Formats → SFF converter
<li>FASTQ<br>
Convert Formats → SFF converter
</ul></dl>
<p>

<div><a name="table"></a></div>
<hr>
<strong>Table</strong>
<p>
Text data separated into columns by something other than tabs.
<p>

<div><a name="tab"></a></div>
<hr>
<strong>Tabular (tab-delimited)</strong>
<p>
One or more columns of text data separated by tabs.
<dl><dt>Can be converted to:
<dd><ul>
<li>FASTA<br>
Convert Formats → Tabular-to-FASTA<br>
The Tabular file must have a title and sequence column.
<li>FASTQ<br>
NGS: QC and manipulation → Generic FASTQ manipulation → Tabular to FASTQ
<li>Interval<br>
If the Tabular file has a chromosome column (or is all on one
chromosome) and has a position column, you can create an Interval
file (e.g. for SNPs). If it is all on one chromosome, use
Text Manipulation → Add column to add a CHROM column.
If the given position is 1-based, use
Text Manipulation → Compute with the position column minus 1 to
get the START, and use the original given column for the END.
If the given position is 0-based, use it as the START, and compute
that plus 1 to get the END.
</ul></dl>
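<p>
The Tabular-to-Interval arithmetic for single positions (e.g. SNPs)
can be summarized in Python as follows. The function name is
hypothetical; in Galaxy the same result comes from the Text
Manipulation tools named above.

```python
def position_to_interval(chrom, position, one_based=True):
    """Turn a single-base position into a 0-based, half-open interval."""
    if one_based:
        # 1-based position: START is position - 1, END is the position.
        return chrom, position - 1, position
    # 0-based position: START is the position, END is position + 1.
    return chrom, position, position + 1
```

Either way the resulting interval covers exactly one base, so a
1-based position of 100 and a 0-based position of 99 describe the same
interval.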
<p>

<div><a name="txtseqzip"></a></div>
<hr>
<strong>Txtseq.zip</strong>
<p>
A zipped archive consisting of flat text sequence files. All files
in this archive must have the '<code>.txt</code>' file extension.
You must manually select this file format when uploading the file.
<p>

<div><a name="wig"></a></div>
<hr>
<strong>Wiggle custom track</strong>
<p>
Wiggle tracks are typically used to display per-nucleotide scores
in a genome browser. The Wiggle format for custom tracks is
line-oriented, and the wiggle data is preceded by a track definition
line that specifies which of three different types is being used.
<a href="http://main.genome-browser.bx.psu.edu/goldenPath/help/wiggle.html"
>More information</a>
<dl><dt>Can be converted to:
<dd><ul>
<li>Interval<br>
Get Genomic Scores → Wiggle-to-Interval
<li>As a second step, this could be converted to 3- or 4-column BED
by removing extra columns using
Text Manipulation → Cut columns from a table.
</ul></dl>
<p>

<div><a name="text"></a></div>
<hr>
<strong>Other text type</strong>
<p>
Any text file.
<dl><dt>Can be converted to:
<dd><ul>
<li>Tabular<br>
If the text has fields separated by spaces, commas, or some other
delimiter, it can be converted to Tabular by using
Text Manipulation → Convert delimiters to TAB.
</ul></dl>
<p>

<!-- blank lines so internal links will jump farther to end -->
<br><br><br><br><br><br><br><br><br><br><br><br>
<br><br><br><br><br><br><br><br><br><br><br><br>
</body>
</html>
