root/galaxy-central/static/formatHelp.html

Revision 2, 17.6 KB (committer: hatakeyama, 14 years ago)

import galaxy-central

<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN"
        "http://www.w3.org/TR/html4/loose.dtd">
<html>
<head>
<title>Galaxy Data Formats</title>
<meta http-equiv="Content-Type" content="text/html; charset=utf-8">
<meta http-equiv="Content-Style-Type" content="text/css">
<style type="text/css">
        hr  { margin-top: 3ex; margin-bottom: 1ex; border: 1px inset }
</style>
</head>
<body>
<h2>Galaxy Data Formats</h2>
<p>
<br>

<h3>Dataset missing?</h3>
<p>
If you have a dataset in your history that is not appearing in the
drop-down selector for a tool, the most common reason is that it has
the wrong format.  Each Galaxy dataset has an associated file format
recorded in its metadata, and tools will only list datasets from your
history that have a format compatible with that particular tool.  Of
course some of these datasets might not actually contain relevant
data, or even the correct columns needed by the tool, but filtering
by format at least makes the list to select from a bit shorter.
<p>
Some of the formats are defined hierarchically, going from very
general ones like <a href="#tab">Tabular</a> (which includes any text
file with tab-separated columns), to more restrictive sub-formats
like <a href="#interval">Interval</a> (where three of the columns
must be the chromosome, start position, and end position), and on
to even more specific ones such as <a href="#bed">BED</a> that have
additional requirements.  So, for example, if a tool's required input
format is Tabular, then all of your history items whose format is
recorded as Tabular will be listed, along with those in all
sub-formats that also qualify as Tabular (Interval, BED, GFF, etc.).
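<p>
The filtering rule described above can be sketched in a few lines of
Python.  This is only an illustration, not Galaxy's own code: the
<code>PARENT</code> table and the functions <code>qualifies_as</code>
and <code>selectable</code> are hypothetical names, and the table lists
only the "also qualifies as" relationships stated on this page.

```python
# Hypothetical parent links, taken from the "also qualifies as" notes
# on this page.  Formats not listed here have no parent format.
PARENT = {
    "interval": "tabular",
    "bed": "interval",
    "bedgraph": "bed",
    "gff": "tabular",
    "gff3": "tabular",
    "gtf": "tabular",
}


def qualifies_as(fmt, required):
    """Return True if fmt is the required format or one of its sub-formats."""
    while fmt is not None:
        if fmt == required:
            return True
        fmt = PARENT.get(fmt)  # walk up to the more general format
    return False


def selectable(history, required):
    """Keep the names of (name, format) pairs that a tool would list."""
    return [name for name, fmt in history if qualifies_as(fmt, required)]
```

<p>
With this sketch, a history holding a BED, a FASTA, and a GFF dataset
would offer the BED and GFF datasets to a tool that requires Tabular,
but only the BED dataset to a tool that requires Interval.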
<p>
There are two usual methods for changing a dataset's format in
Galaxy: if the file contents are already in the required format but
the metadata is wrong (perhaps because the Auto-detect feature of the
Upload File tool guessed it incorrectly), you can fix the metadata
manually by clicking on the pencil icon beside that dataset in your
history.  Or, if the file contents really are in a different format,
Galaxy provides a number of format conversion tools (e.g. in the
Text Manipulation and Convert Formats categories).  For instance,
if the tool you want to run requires Tabular but your columns are
delimited by spaces or commas, you can use the "Convert delimiters
to TAB" tool under Text Manipulation to reformat your data.  However,
if your files are in a completely unsupported format, then you need
to convert them yourself before uploading.
<p>
<hr>

<h3>Format Descriptions</h3>
<ul>
<li><a href="#ab1">AB1</a>
<li><a href="#axt">AXT</a>
<li><a href="#bam">BAM</a>
<li><a href="#bed">BED</a>
<li><a href="#bedgraph">BedGraph</a>
<li><a href="#binseq">Binseq.zip</a>
<li><a href="#fasta">FASTA</a>
<li><a href="#fastqsolexa">FastqSolexa</a>
<li><a href="#fped">FPED</a>
<li><a href="#gff">GFF</a>
<li><a href="#gff3">GFF3</a>
<li><a href="#gtf">GTF</a>
<li><a href="#html">HTML</a>
<li><a href="#interval">Interval</a>
<li><a href="#lav">LAV</a>
<li><a href="#lped">LPED</a>
<li><a href="#maf">MAF</a>
<li><a href="#pbed">PBED</a>
<li><a href="#psl">PSL</a>
<li><a href="#scf">SCF</a>
<li><a href="#sff">SFF</a>
<li><a href="#table">Table</a>
<li><a href="#tab">Tabular</a>
<li><a href="#txtseqzip">Txtseq.zip</a>
<li><a href="#wig">Wiggle custom track</a>
<li><a href="#text">Other text type</a>
</ul>
<p>

<div><a name="ab1"></a></div>
<hr>
<strong>AB1</strong>
<p>
This is one of the ABIF family of binary sequence formats from
Applied Biosystems Inc.
<!-- Their PDF
<a href="http://www.appliedbiosystems.com/support/software_community/ABIF_File_Format.pdf"
>format specification</a> is unfortunately password-protected. -->
Files should have a '<code>.ab1</code>' file extension.  You must
manually select this file format when uploading the file.
<p>

<div><a name="axt"></a></div>
<hr>
<strong>AXT</strong>
<p>
Used for pairwise alignment output from BLASTZ, after post-processing.
Each alignment block contains three lines: a summary line and two
sequence lines.  Blocks are separated from one another by blank lines.
The summary line contains chromosomal position and size information
about the alignment, and consists of nine required fields.
<a href="http://main.genome-browser.bx.psu.edu/goldenPath/help/axt.html"
>More information</a>
<!-- (not available on Main)
<dl><dt>Can be converted to:
<dd><ul>
<li>FASTA<br>
Convert Formats &rarr; AXT to FASTA
<li>LAV<br>
Convert Formats &rarr; AXT to LAV
</ul></dl>
-->
<p>

<div><a name="bam"></a></div>
<hr>
<strong>BAM</strong>
<p>
A binary alignment file compressed in the BGZF format with a
'<code>.bam</code>' file extension.
<!-- You must manually select this file format when uploading the file. -->
<a href="http://samtools.sourceforge.net/SAM1.pdf">SAM</a>
is the human-readable text version of this format.
<dl><dt>Can be converted to:
<dd><ul>
<li>SAM<br>
NGS: SAM Tools &rarr; BAM-to-SAM
<li>Pileup<br>
NGS: SAM Tools &rarr; Generate pileup
<li>Interval<br>
First convert to Pileup as above, then use
NGS: SAM Tools &rarr; Pileup-to-Interval
</ul></dl>
<p>

<div><a name="bed"></a></div>
<hr>
<strong>BED</strong>
<p>
<ul>
<li> also qualifies as Tabular
<li> also qualifies as Interval
</ul>
This tab-separated format describes a genomic interval, but has
strict field specifications for use in genome browsers.  BED files
can have from 3 to 12 columns, but the order of the columns matters,
and only the end ones can be omitted.  Some groups of columns must
be all present or all absent.  As in Interval format (but unlike
GFF and its relatives), the interval endpoints use a 0-based,
half-open numbering system.
<a href="http://main.genome-browser.bx.psu.edu/goldenPath/help/hgTracksHelp.html#BED"
>Field specifications</a>
<p>
Example:
<pre>
chr22 1000 5000 cloneA 960 + 1000 5000 0 2 567,488, 0,3512
chr22 2000 6000 cloneB 900 - 2000 6000 0 2 433,399, 0,3601
</pre>
<dl><dt>Can be converted to:
<dd><ul>
<li>GFF<br>
Convert Formats &rarr; BED-to-GFF
</ul></dl>
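<p>
As a rough illustration of the column rules above, the following Python
sketch reads the first three columns of one BED line and checks the
column count.  The function name <code>parse_bed_line</code> and the
returned dictionary layout are assumptions made for this example; the
sketch does not validate the optional columns.

```python
def parse_bed_line(line):
    """Parse one tab-separated BED line into a small dictionary.

    Only the column count and the three required columns are checked;
    the remaining columns are returned unvalidated in "extra".
    """
    fields = line.rstrip("\n").split("\t")
    if not 3 <= len(fields) <= 12:
        raise ValueError("BED lines have between 3 and 12 columns")
    start, end = int(fields[1]), int(fields[2])
    return {
        "chrom": fields[0],
        "start": start,
        "end": end,
        # 0-based, half-open coordinates: the length is simply end - start
        "length": end - start,
        "extra": fields[3:],
    }
```

<p>
Because the end position is not part of the feature, a line with
start 1000 and end 5000 covers 4000 bases.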
<p>

<div><a name="bedgraph"></a></div>
<hr>
<strong>BedGraph</strong>
<p>
<ul>
<li> also qualifies as Tabular
<li> also qualifies as Interval
<li> also qualifies as BED
</ul>
<a href="http://main.genome-browser.bx.psu.edu/goldenPath/help/bedgraph.html"
>BedGraph</a> is a BED file in which the name column holds a
floating-point value that is displayed as a wiggle score in tracks.
Unlike in Wiggle format, the exact value of this score can be
retrieved after being loaded as a track.
<p>

<div><a name="binseq"></a></div>
<hr>
<strong>Binseq.zip</strong>
<p>
A zipped archive consisting of binary sequence files in either AB1
or SCF format.  All files in this archive must have the same file
extension, which must be either '<code>.ab1</code>' or '<code>.scf</code>'.
You must manually select this file format when uploading the file.
<p>

<div><a name="fasta"></a></div>
<hr>
<strong>FASTA</strong>
<p>
A sequence in
<a href="http://www.ncbi.nlm.nih.gov/blast/fasta.shtml">FASTA</a>
format consists of a single-line description, followed by lines of
sequence data.  The first character of the description line is a
greater-than ('<code>&gt;</code>') symbol.  All lines should be
shorter than 80 characters.
<pre>
>sequence1
atgcgtttgcgtgc
gtcggtttcgttgc
>sequence2
tttcgtgcgtatag
tggcgcggtga
</pre>
<dl><dt>Can be converted to:
<dd><ul>
<li>Tabular<br>
Convert Formats &rarr; FASTA-to-Tabular
</ul></dl>
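<p>
To make the layout concrete, here is a small Python sketch that reads
FASTA text into (title, sequence) rows, roughly the shape a
FASTA-to-Tabular conversion produces.  It is not the Galaxy tool
itself, and the function name <code>fasta_to_rows</code> is an
assumption made for this example.

```python
def fasta_to_rows(text):
    """Turn FASTA text into a list of (title, sequence) tuples."""
    rows = []
    title, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue  # ignore blank lines
        if line.startswith(">"):
            # A description line closes the previous record, if any.
            if title is not None:
                rows.append((title, "".join(chunks)))
            title, chunks = line[1:], []
        elif title is not None:
            chunks.append(line)  # sequence data may span several lines
    if title is not None:
        rows.append((title, "".join(chunks)))
    return rows
```

<p>
Applied to the two-record example above, this yields one row per
sequence, with the sequence lines of each record joined together.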
<p>

<div><a name="fastqsolexa"></a></div>
<hr>
<strong>FastqSolexa</strong>
<p>
<a href="http://maq.sourceforge.net/fastq.shtml">FastqSolexa</a>
is the Illumina (Solexa) variant of the FASTQ format, which stores
sequences and quality scores in a single file.
<pre>
@seq1
GACAGCTTGGTTTTTAGTGAGTTGTTCCTTTCTTT
+seq1
hhhhhhhhhhhhhhhhhhhhhhhhhhPW@hhhhhh
@seq2
GCAATGACGGCAGCAATAAACTCAACAGGTGCTGG
+seq2
hhhhhhhhhhhhhhYhhahhhhWhAhFhSIJGChO
</pre>
Or
<pre>
@seq1
GAATTGATCAGGACATAGGACAACTGTAGGCACCAT
+seq1
40 40 40 40 35 40 40 40 25 40 40 26 40 9 33 11 40 35 17 40 40 33 40 7 9 15 3 22 15 30 11 17 9 4 9 4
@seq2
GAGTTCTCGTCGCCTGTAGGCACCATCAATCGTATG
+seq2
40 15 40 17 6 36 40 40 40 25 40 9 35 33 40 14 14 18 15 17 19 28 31 4 24 18 27 14 15 18 2 8 12 8 11 9
</pre>
<dl><dt>Can be converted to:
<dd><ul>
<li>FASTA<br>
NGS: QC and manipulation &rarr; Generic FASTQ manipulation &rarr; FASTQ to FASTA
<li>Tabular<br>
NGS: QC and manipulation &rarr; Generic FASTQ manipulation &rarr; FASTQ to Tabular
</ul></dl>
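<p>
The two quality-line styles shown above can be read with a sketch like
the following.  It assumes that the character style encodes each score
as an ASCII character with an offset of 64 (the usual Solexa
convention, under which '<code>h</code>' corresponds to 40); this
offset is not stated on this page, and the function name
<code>decode_quality</code> is hypothetical.

```python
def decode_quality(line, offset=64):
    """Return the quality scores of one FastqSolexa quality line.

    Space-separated lines are read as decimal scores.  Any other line
    is read as one ASCII character per base, using the assumed offset.
    A numeric line holding a single score cannot be told apart from the
    character style and would be decoded as characters.
    """
    line = line.strip()
    if " " in line:
        return [int(token) for token in line.split()]
    return [ord(ch) - offset for ch in line]
```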
<p>

<div><a name="fped"></a></div>
<hr>
<strong>FPED</strong>
<p>
Also known as the FBAT format, for use with the
<a href="http://biosun1.harvard.edu/~fbat/fbat.htm">FBAT</a> program.
It consists of a pedigree file and a phenotype file.
<p>

<div><a name="gff"></a></div>
<hr>
<strong>GFF</strong>
<p>
<ul>
<li> also qualifies as Tabular
</ul>
GFF is a tab-separated format somewhat similar to BED, but it has
different columns and is more flexible.  There are
<a href="http://main.genome-browser.bx.psu.edu/FAQ/FAQformat#format3"
>nine required fields</a>.
Note that unlike Interval and BED, GFF and its relatives (GFF3, GTF)
use 1-based inclusive coordinates to specify genomic intervals.
<dl><dt>Can be converted to:
<dd><ul>
<li>BED<br>
Convert Formats &rarr; GFF-to-BED
</ul></dl>
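<p>
The difference between the two coordinate systems comes down to the
start position only.  The following Python sketch shows the arithmetic;
the function names are hypothetical and this is not the code of the
GFF-to-BED tool.

```python
def gff_to_bed_coords(start, end):
    """1-based inclusive (GFF) to 0-based half-open (BED, Interval)."""
    return start - 1, end


def bed_to_gff_coords(start, end):
    """0-based half-open (BED, Interval) to 1-based inclusive (GFF)."""
    return start + 1, end
```

<p>
For example, the first 100 bases of a chromosome are 1 to 100 in GFF
and 0 to 100 in BED; the end value is the same in both systems.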
<p>

<div><a name="gff3"></a></div>
<hr>
<strong>GFF3</strong>
<p>
<ul>
<li> also qualifies as Tabular
</ul>
The <a href="http://www.sequenceontology.org/gff3.shtml">GFF3</a>
format addresses the most common extensions to GFF, while attempting
to preserve compatibility with previous formats.
Note that unlike Interval and BED, GFF and its relatives (GFF3, GTF)
use 1-based inclusive coordinates to specify genomic intervals.
<p>

<div><a name="gtf"></a></div>
<hr>
<strong>GTF</strong>
<p>
<ul>
<li> also qualifies as Tabular
</ul>
<a href="http://main.genome-browser.bx.psu.edu/FAQ/FAQformat#format4"
>GTF</a> is a format for describing genes and other features associated
with DNA, RNA, and protein sequences.  It is a refinement to GFF that
tightens the specification.
Note that unlike Interval and BED, GFF and its relatives (GFF3, GTF)
use 1-based inclusive coordinates to specify genomic intervals.
<!-- (not available on Main)
<dl><dt>Can be converted to:
<dd><ul>
<li>BedGraph<br>
Convert Formats &rarr; GTF-to-BEDGraph
</ul></dl>
-->
<p>

<div><a name="html"></a></div>
<hr>
<strong>HTML</strong>
<p>
This format is an HTML web page.  Click the eye icon next to the
dataset to view it in your browser.
<p>

<div><a name="interval"></a></div>
<hr>
<strong>Interval</strong>
<p>
<ul>
<li> also qualifies as Tabular
</ul>
This Galaxy format represents genomic intervals.  It is tab-separated,
but has the added requirement that three of the columns must be the
chromosome name, start position, and end position, where the positions
use a 0-based, half-open numbering system (see below).  An optional
strand column can also be specified, and an initial header row can
be used to label the columns, which do not have to be in any special
order.  Arbitrary additional columns can also be present.
<p>
Required fields:
<ul>
<li>CHROM - The name of the chromosome (e.g. chr3, chrY, chr2_random)
    or contig (e.g. ctgY1).
<li>START - The starting position of the feature in the chromosome or
    contig.  The first base in a chromosome is numbered 0.
<li>END - The ending position of the feature in the chromosome or
    contig.  This base is not included in the feature.  For example,
    the first 100 bases of a chromosome are described as START=0,
    END=100, and span the bases numbered 0-99.
</ul>
Optional:
<ul>
<li>STRAND - Defines the strand, either '<code>+</code>' or
'<code>-</code>'.
<li>Header row
</ul>
Example:
<pre>
    #CHROM  START  END    STRAND  NAME  COMMENT
    chr1    10     100    +       exon  myExon
    chrX    1000   10050  -       gene  myGene
</pre>
<dl><dt>Can be converted to:
<dd><ul>
<li>BED<br>
The exact changes needed and tools to run will vary with what fields
are in the Interval file and what type of BED you are converting to.
In general you will likely use Text Manipulation &rarr; Compute, Cut,
or Merge Columns.
</ul></dl>
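<p>
Since the columns may appear in any order, a reader has to use the
header row to find them.  The Python sketch below illustrates this; the
function name <code>parse_interval</code> is hypothetical, and the
fallback column order used when there is no header row (CHROM, START,
END) is an assumption made for this example.  It splits on any
whitespace so that it also works on the space-aligned example above,
which means it would mishandle fields that contain spaces.

```python
def parse_interval(text):
    """Read Interval text into a list of dictionaries keyed by column name."""
    lines = [ln for ln in text.splitlines() if ln.strip()]
    names = ["CHROM", "START", "END"]  # assumed order when no header is given
    if lines and lines[0].lstrip().startswith("#"):
        # The header row labels the columns, in whatever order they appear.
        names = lines.pop(0).lstrip().lstrip("#").split()
    records = []
    for ln in lines:
        rec = dict(zip(names, ln.split()))
        rec["START"], rec["END"] = int(rec["START"]), int(rec["END"])
        records.append(rec)
    return records
```

<p>
With the example above, the second record has START 1000 and END 10050,
so it spans 9050 bases under the half-open convention.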
<p>

<div><a name="lav"></a></div>
<hr>
<strong>LAV</strong>
<p>
<a href="http://www.bx.psu.edu/miller_lab/dist/lav_format.html">LAV</a>
is the raw pairwise alignment format that is output by BLASTZ.  The
first line begins with <code>#:lav</code>.
<!-- (not available on Main)
<dl><dt>Can be converted to:
<dd><ul>
<li>BED<br>
Convert Formats &rarr; LAV to BED
</ul></dl>
-->
<p>

<div><a name="lped"></a></div>
<hr>
<strong>LPED</strong>
<p>
This is the linkage pedigree format, which consists of separate MAP and PED
files.  Together these files describe SNPs; the map file contains the position
and an identifier for the SNP, while the pedigree file has the alleles.  To
upload this format into Galaxy, do not use Auto-detect for the file format;
instead select <code>lped</code>.  You will then be given two sections for
uploading files, one for the pedigree file and one for the map file.  For more
information, see
<a href="http://www.broadinstitute.org/science/programs/medical-and-population-genetics/haploview/input-file-formats-0"
>linkage pedigree</a>,
<a href="http://pngu.mgh.harvard.edu/~purcell/plink/data.shtml#map">MAP</a>,
and/or <a href="http://pngu.mgh.harvard.edu/~purcell/plink/data.shtml#ped">PED</a>.
<dl><dt>Can be converted to:
<dd><ul>
<li>PBED<br>Automatic
<li>FPED<br>Automatic
</ul></dl>
<p>

<div><a name="maf"></a></div>
<hr>
<strong>MAF</strong>
<p>
<a href="http://main.genome-browser.bx.psu.edu/FAQ/FAQformat#format5"
>MAF</a> is the multi-sequence alignment format that is output by TBA
and Multiz.  The first line begins with '<code>##maf</code>'.  This
word is followed by whitespace-separated "variable<code>=</code>value"
pairs.  There should be no whitespace surrounding the '<code>=</code>'.
<dl><dt>Can be converted to:
<dd><ul>
<li>BED<br>
Convert Formats &rarr; MAF to BED
<li>Interval<br>
Convert Formats &rarr; MAF to Interval
<li>FASTA<br>
Convert Formats &rarr; MAF to FASTA
</ul></dl>
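<p>
The first-line rule above can be checked with a short Python sketch.
The function name <code>parse_maf_header</code> is hypothetical, and
the sample variable names used when trying it out (such as
<code>version</code> and <code>scoring</code>) are illustrative only.

```python
def parse_maf_header(line):
    """Split a '##maf' header line into a dictionary of variable=value pairs."""
    tokens = line.split()
    if not tokens or tokens[0] != "##maf":
        raise ValueError("a MAF file must start with '##maf'")
    pairs = {}
    for token in tokens[1:]:
        key, sep, value = token.partition("=")
        # Whitespace around '=' leaves a token without a key, a value,
        # or the '=' itself, so such lines are rejected here.
        if not (key and sep and value):
            raise ValueError("malformed variable=value pair: %r" % token)
        pairs[key] = value
    return pairs
```

<p>
A header such as <code>##maf version=1 scoring=tba.v8</code> gives two
pairs, while <code>##maf version = 1</code> is rejected because of the
whitespace around the '<code>=</code>'.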
<p>

<div><a name="pbed"></a></div>
<hr>
<strong>PBED</strong>
<p>
This is the binary version of the LPED format.
<dl><dt>Can be converted to:
<dd><ul>
<li>LPED<br>Automatic
</ul></dl>
<p>

<div><a name="psl"></a></div>
<hr>
<strong>PSL</strong>
<p>
<a href="http://main.genome-browser.bx.psu.edu/FAQ/FAQformat#format2">PSL</a>
format is used for alignments returned by
<a href="http://genome.ucsc.edu/cgi-bin/hgBlat?command=start">BLAT</a>.
It does not include any sequence.
<p>

<div><a name="scf"></a></div>
<hr>
<strong>SCF</strong>
<p>
This is a binary sequence format originally designed for the Staden
sequence handling software package.  Files should have a
'<code>.scf</code>' file extension.  You must manually select this
file format when uploading the file.
<a href="http://staden.sourceforge.net/manual/formats_unix_2.html"
>More information</a>
<p>

<div><a name="sff"></a></div>
<hr>
<strong>SFF</strong>
<p>
This is a binary sequence format used by the Roche 454 GS FLX
sequencing machine, and is documented on p.&nbsp;528 of their
<a href="http://sequence.otago.ac.nz/download/GS_FLX_Software_Manual.pdf"
>software manual</a>.  Files should have a '<code>.sff</code>' file
extension.
<!-- You must manually select this file format when uploading the file. -->
<dl><dt>Can be converted to:
<dd><ul>
<li>FASTA<br>
Convert Formats &rarr; SFF converter
<li>FASTQ<br>
Convert Formats &rarr; SFF converter
</ul></dl>
<p>

<div><a name="table"></a></div>
<hr>
<strong>Table</strong>
<p>
Text data separated into columns by something other than tabs.
<p>

<div><a name="tab"></a></div>
<hr>
<strong>Tabular (tab-delimited)</strong>
<p>
One or more columns of text data separated by tabs.
<dl><dt>Can be converted to:
<dd><ul>
<li>FASTA<br>
Convert Formats &rarr; Tabular-to-FASTA<br>
The Tabular file must have a title and sequence column.
<li>FASTQ<br>
NGS: QC and manipulation &rarr; Generic FASTQ manipulation &rarr; Tabular to FASTQ
<li>Interval<br>
If the Tabular file has a chromosome column (or is all on one
chromosome) and has a position column, you can create an Interval
file (e.g. for SNPs).  If it is all on one chromosome, use
Text Manipulation &rarr; Add column to add a CHROM column.
If the given position is 1-based, use
Text Manipulation &rarr; Compute with the position column minus 1 to
get the START, and use the original given column for the END.
If the given position is 0-based, use it as the START, and compute
that plus 1 to get the END.
</ul></dl>
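<p>
The Interval recipe above amounts to the following arithmetic, shown
here as a Python sketch rather than as the Compute tool's own
expression syntax.  The function name
<code>position_to_interval</code> is hypothetical.

```python
def position_to_interval(position, one_based=True):
    """Turn a single-base position into a (START, END) pair for Interval."""
    if one_based:
        # START is the position minus 1; the given value becomes END.
        return position - 1, position
    # A 0-based position is already the START; END is one past it.
    return position, position + 1
```

<p>
Either way the resulting interval is exactly one base long, as expected
for a SNP.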
<p>

<div><a name="txtseqzip"></a></div>
<hr>
<strong>Txtseq.zip</strong>
<p>
A zipped archive consisting of flat text sequence files.  All files
in this archive must have the same file extension of
'<code>.txt</code>'.  You must manually select this file format when
uploading the file.
<p>

<div><a name="wig"></a></div>
<hr>
<strong>Wiggle custom track</strong>
<p>
Wiggle tracks are typically used to display per-nucleotide scores
in a genome browser.  The Wiggle format for custom tracks is
line-oriented, and the wiggle data is preceded by a track definition
line that specifies which of three different types is being used.
<a href="http://main.genome-browser.bx.psu.edu/goldenPath/help/wiggle.html"
>More information</a>
<dl><dt>Can be converted to:
<dd><ul>
<li>Interval<br>
Get Genomic Scores &rarr; Wiggle-to-Interval
<li>As a second step, this could be converted to 3- or 4-column BED
by removing extra columns using
Text Manipulation &rarr; Cut columns from a table.
</ul></dl>
<p>

<div><a name="text"></a></div>
<hr>
<strong>Other text type</strong>
<p>
Any text file.
<dl><dt>Can be converted to:
<dd><ul>
<li>Tabular<br>
If the text has fields separated by spaces, commas, or some other
delimiter, it can be converted to Tabular by using
Text Manipulation &rarr; Convert delimiters to TAB.
</ul></dl>
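<p>
If you prefer to prepare such a file before uploading, the same idea
can be sketched in Python.  This is not the Galaxy tool; the function
name <code>to_tabular</code> and its default pattern (runs of spaces
and commas) are assumptions made for this example.  Note that treating
a run of delimiters as one separator drops empty fields.

```python
import re


def to_tabular(text, delimiter_pattern=r"[ ,]+"):
    """Replace each run of delimiters on every line with a single tab."""
    return "\n".join(
        re.sub(delimiter_pattern, "\t", line.strip())
        for line in text.splitlines()
    )
```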
<p>

<!-- blank lines so internal links will jump farther to end -->
<br><br><br><br><br><br><br><br><br><br><br><br>
<br><br><br><br><br><br><br><br><br><br><br><br>
</body>
</html>