This page describes the input files supported by Gmaj, and their formats. Only the alignment file is required; the others are optional. Except where noted, all information applies to both the stand-alone and applet modes of Gmaj.
For annotations, Gmaj supports two broad categories of file formats. The original set of formats is essentially the same as those used by PipMaker and Laj, where each destination for the data (exons panel, color underlays, etc.) has its own file format tailored for the needs of that display. These files can be cumbersome to prepare manually, though PipMaker's associated utilities, such as PipHelper and the PipTools, can significantly reduce the burden.
However, since sequence annotations are increasingly becoming available in standardized formats from on-line resources such as the UCSC Table Browser, Gmaj can now accept some of these formats as well. These are referred to here as "generic" formats because they are not restricted to a particular biological data type or Gmaj display panel.
The PipMaker-style formats are described below in the sections for each panel, while the generic ones are discussed in a separate section, Generic Annotation Formats.
The annotation files are optional. However, because in some alignments any of the sequences can be viewed as the reference sequence, there are potentially a large number of annotation files to provide: too many to type on the command line or paste into a dialog box every time you want to view the data. For this reason, Gmaj uses a meta-level parameters file that lists the names of all the data files, plus a few other data-related options. Then when running Gmaj, you only have to specify that one file name. If you don't want to use any of these annotations or options, you can instead specify a single alignment file directly in place of a parameters file.
A sample parameters file that you can use as a template is provided at sample.gmaj. It contains detailed comments at the bottom explaining the syntax and meaning of the parameters.
Gmaj supports a "bundle" option, which allows you to collect and compress some or all of the data files into a single file in .zip or .jar format (not .tar, sorry). This is especially useful for streamlining the applet's data download, but is also supported in stand-alone mode. A few tips:
sample.gmaj.
/, \, or : in the bundle. Gmaj needs to remove the path that may have been added to each name by the zip or jar program, and since it doesn't know what platform that program was run on, it treats all of these characters as path separators.
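As a rough illustration of the path handling described above, the following Python sketch builds a .zip bundle with the standard zipfile module and stores every file under its bare name, so no path is added in the first place. The function name make_bundle and the duplicate-name check are assumptions made for this example; this script is not part of Gmaj.

```python
import os
import zipfile

def make_bundle(bundle_path, data_files):
    """Write data_files into a .zip bundle, storing each under its bare name.

    Hypothetical helper, not part of Gmaj.
    """
    seen = set()
    with zipfile.ZipFile(bundle_path, "w", zipfile.ZIP_DEFLATED) as bundle:
        for path in data_files:
            name = os.path.basename(path)
            # Gmaj treats /, \ and : as path separators, so a bare name
            # containing any of them would be misread.
            if any(ch in name for ch in "/\\:"):
                raise ValueError(f"unsafe character in file name: {name!r}")
            # Assumption: once paths are stripped, names must stay distinct.
            if name in seen:
                raise ValueError(f"duplicate file name in bundle: {name!r}")
            seen.add(name)
            bundle.write(path, arcname=name)
    return sorted(seen)
```

Calling make_bundle("data.zip", ["annot/human.exons", "align/tba.maf"]) would store the entries as human.exons and tba.maf (both file names are made up).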
As an alternative to bundling, data files can be compressed individually in .zip, .jar, or .gz format; this gains the compact size for storage and transfer, but still requires overhead for multiple HTTP connections in applet mode. The file name must end with the corresponding extension for the compression format to be recognized. (Such files can also be included in the bundle if desired; though little if any additional compression is typically achieved, this may be more convenient than unzipping a large file just to bundle it.)
If you supply any annotations for Gmaj to display, these files must all use position coordinates that refer to the same original sequences identified in the MAF alignment files (ignoring any display offsets specified in the parameters file). However, even though the MAF coordinates are 0-based, the PipMaker-style annotation files all use a 1-based, closed-interval coordinate system (i.e., the first nucleotide in the sequence is called "1", and specified ranges include both endpoints). This is for consistency with PipMaker, so the same files can be used with both programs, and the same tools can be used to prepare them. Coordinates for generic annotations may be either 1-based or 0-based and closed or half-open, depending on the format, but Gmaj always adjusts them as needed (including the ones in the MAF files) to convert everything to a 1-based, closed-interval system for display.
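To make the MAF adjustment concrete, here is a minimal Python sketch of converting one MAF row to a 1-based, closed interval on the forward strand. It relies on the MAF convention that start is 0-based and, for minus-strand rows, counted from the start of the reverse-complemented sequence. The function name is hypothetical, and this shows only the arithmetic, not Gmaj's actual code.

```python
def maf_to_display(start, size, strand, src_size):
    """Convert MAF (start, size, strand, srcSize) to a 1-based closed interval
    on the forward strand. Hypothetical helper for illustration."""
    if size <= 0:
        raise ValueError("size must be positive")
    if strand == "+":
        return start + 1, start + size
    if strand == "-":
        # Minus-strand positions are counted from the other end of the
        # sequence, so flip them around src_size.
        return src_size - (start + size) + 1, src_size - start
    raise ValueError(f"unknown strand: {strand!r}")
```

For a sequence of length 10, a plus-strand row with start 0 and size 3 covers positions 1 to 3, while a minus-strand row with the same start and size covers positions 8 to 10.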
Gmaj is designed to display multiple-sequence alignments in MAF format. It is especially suited for sequence-symmetric alignments from programs such as TBA, but can also display MAF files that have a fixed reference sequence. (In the latter case it is a good idea to set the refseq field in your parameters file, to prevent displaying the alignments with an inappropriate reference sequence.) It is possible to display several alignment files simultaneously on the same plots, e.g. for comparing output from different alignment programs.
Gmaj normally requires that each sequence name appears at most once in each MAF block, i.e., that the values of the "src" field are unique across all of the s lines within the same block. However, there is a special exception for the case of pairwise self-alignments: if all of the blocks have just two rows, then all of the sequence names can be the same. In this case Gmaj distinguishes the rows in each block by internally adding a ~ suffix to the second row's sequence name; the ~ does not show in the main display, but you may occasionally see it in an error message. The downside of this feature is that sequence names in the MAF files must not end with ~, even for non-self alignments.
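The naming rules above can be checked before loading a file. The sketch below is one possible pre-flight check in Python, assuming a simplified reading of MAF in which each block starts with an "a" line and sequence rows start with "s". The function name and message wording are invented for this example and do not reflect Gmaj's own error messages.

```python
def check_maf_names(lines):
    """Return a list of naming problems found in the 's' lines of MAF text.

    Hypothetical checker; only the src field of each row is examined.
    """
    blocks = []      # one list of src names per alignment block
    current = None
    for line in lines:
        fields = line.split()
        if not fields:
            continue
        if fields[0] == "a":
            current = []
            blocks.append(current)
        elif fields[0] == "s" and current is not None and len(fields) > 1:
            current.append(fields[1])

    # Self-alignment exception: every block has exactly two rows.
    pairwise = bool(blocks) and all(len(names) == 2 for names in blocks)

    problems = []
    for number, names in enumerate(blocks, start=1):
        for name in names:
            if name.endswith("~"):
                problems.append(f"block {number}: name {name!r} ends with '~'")
        if len(set(names)) < len(names) and not pairwise:
            problems.append(f"block {number}: duplicate sequence name")
    return problems
```

An empty result means the file satisfies both rules as described here; it says nothing about other aspects of MAF validity.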
Each of these files lists the locations of genes, exons, and coding regions in a particular reference sequence. The exons and UTRs are displayed as black and gray boxes in a separate panel above the alignment plots.
In the PipMaker-style exons format, the directionality of a gene (>, <, or |), its start and end positions, and name should be on one line, followed by an optional line beginning with a + character that indicates the first and last nucleotides of the translated region (including the initiation codon, Met, and the stop codon). These are followed by lines specifying the start and end positions of each exon, which must be listed in order of increasing address even if the gene is on the reverse strand (<). By default Gmaj will supply exon numbers, but you can override this by specifying your own name or number for individual exons. Blank lines are ignored, and you can put an optional title line at the top. Thus, the file might begin as follows:
    My favorite genomic region

    < 100 800 XYZZY
    + 150 750
    100 200
    600 800

    > 1000 2000 Frobozz gene
    1000 1200 exon 1
    1400 1500 alt. spliced exon
    1800 2000 exon 2

    ... etc.
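For readers who generate or check such files with scripts, the following Python sketch shows one way the layout described above could be parsed. It is an illustration only: the function name, the dictionary layout, and the lenient handling of the title line (any text before the first gene line is skipped) are assumptions, not Gmaj's actual parser.

```python
import re

GENE_LINE = re.compile(r"^([<>|])\s+(\d+)\s+(\d+)(?:\s+(.*))?$")
CDS_LINE = re.compile(r"^\+\s+(\d+)\s+(\d+)$")
EXON_LINE = re.compile(r"^(\d+)\s+(\d+)(?:\s+(.*))?$")

def parse_exons(text):
    """Parse PipMaker-style exons text into a list of gene dictionaries."""
    genes = []
    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            continue                      # blank lines are ignored
        m = GENE_LINE.match(line)
        if m:
            genes.append({"strand": m.group(1),
                          "start": int(m.group(2)), "end": int(m.group(3)),
                          "name": m.group(4) or "",
                          "cds": None, "exons": []})
            continue
        if not genes:
            continue                      # optional title line
        gene = genes[-1]
        m = CDS_LINE.match(line)
        if m:
            gene["cds"] = (int(m.group(1)), int(m.group(2)))
            continue
        m = EXON_LINE.match(line)
        if m:
            start, end = int(m.group(1)), int(m.group(2))
            # Exons must be listed in order of increasing address.
            if gene["exons"] and start < gene["exons"][-1][0]:
                raise ValueError(f"exon out of order: {line!r}")
            gene["exons"].append((start, end, m.group(3)))
            continue
        raise ValueError(f"unrecognized line: {line!r}")
    return genes
```

Each exon is returned as a (start, end, name) tuple, with name set to None when the file leaves numbering to Gmaj.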
Each of these files lists interspersed repeats (and possibly other features such as CpG islands) in a particular reference sequence. These are displayed in a separate panel just below the exons, using the same shapes and shading as PipMaker if possible.
In the PipMaker-style repeats format, the first line identifies this as a simplified repeats file (as opposed to RepeatMasker output, which Gmaj does not yet support). Each subsequent line specifies the start, end, direction, and type of an individual feature.
    %:repeats

    1081 1364 Right Alu
    1365 1405 Simple

    ... etc.

The allowed PipMaker types are: Alu, B1, B2, SINE, LINE1, LINE2, MIR, LTR, DNA, RNA, Simple, CpG60, CpG75, and Other. Of these, all except Simple, CpG60, and CpG75 require a direction (Right or Left).
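The type and direction rules lend themselves to a simple check. The Python sketch below reads the simplified repeats format and enforces the rules stated above; the function name is hypothetical, and accepting a direction on Simple or CpG entries is an assumption, since the text only says that one is not required there.

```python
DIRECTED = {"Alu", "B1", "B2", "SINE", "LINE1", "LINE2",
            "MIR", "LTR", "DNA", "RNA", "Other"}
UNDIRECTED = {"Simple", "CpG60", "CpG75"}

def parse_repeats(text):
    """Parse simplified PipMaker-style repeats text into
    (start, end, direction, type) tuples. Illustrative only."""
    lines = [ln.strip() for ln in text.splitlines() if ln.strip()]
    if not lines or lines[0] != "%:repeats":
        raise ValueError("first line must be %:repeats")
    features = []
    for line in lines[1:]:
        fields = line.split()
        if len(fields) == 4:
            start, end, direction, kind = fields
        elif len(fields) == 3:
            start, end, kind = fields
            direction = None
        else:
            raise ValueError(f"unrecognized line: {line!r}")
        if kind not in DIRECTED | UNDIRECTED:   # type names are case-sensitive
            raise ValueError(f"unknown repeat type: {kind!r}")
        if direction is not None and direction not in ("Right", "Left"):
            raise ValueError(f"bad direction: {direction!r}")
        if kind in DIRECTED and direction is None:
            raise ValueError(f"type {kind} requires a direction")
        features.append((int(start), int(end), direction, kind))
    return features
```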
Each of these files contains reference annotations, i.e., noteworthy regions in a particular reference sequence, which are drawn in a separate panel as colored bars. Typically each bar has an associated URL pointing to a web site with more information about the region, but this is not required. In applet mode Gmaj opens a new browser window to visit the linked site when the user clicks on a bar; in stand-alone mode Gmaj is not running within a web browser, so it just displays the URL for the user to visit manually via copy-and-paste.
The PipMaker-style format first defines various types of links and associates a color with each of them, then specifies the type, position, description, and URL for each annotated region.
    # linkbars for part of the mouse MHC class II region

    %define type
    %name PubMed
    %color Blue

    %define type
    %name LocusLink
    %color Orange

    %define annotation
    %type PubMed
    %range 1 2000
    %label Yang et al. 1997. Daxx, a novel Fas-binding protein...
    %summary Yang, X., Khosravi-Far, R. Chang, H., and Baltimore, D. (1997).
     Daxx, a novel Fas-binding protein that activates JNK and apoptosis.
     Cell 89(7):1067-76.
    %url http://www.ncbi.nlm.nih.gov:80/entrez/
    query.fcgi?cmd=Retrieve&db=PubMed&list_uids=9215629&dopt=Abstract

    ... etc.

Here, for example, the first stanza requests that each feature subsequently identified as a PubMed entry be colored blue. The name must be a single word, perhaps containing underline characters (e.g., Entry_in_GenBank), and the color must come from Gmaj's color list.
The third stanza associates a PubMed link with positions 1-2000 in this sequence. The label should be kept fairly short, as it will be displayed on Gmaj's position indicator line when the user points at this linkbar. The summary is optional; it is used only by PipMaker and will be ignored by Gmaj. Also, while PipMaker allows several summary/URL pairs within a single annotation, Gmaj expects each field to occur at most once. If Gmaj encounters extra URLs, it will just use the first one and display a warning message.
Note that summaries and URLs (but not labels) can be broken into several lines for convenience; the line breaks are removed when the file is read, but they are not replaced with spaces. Thus a continuation line for a summary typically begins with a space to separate it from the last word of the previous line, while a URL continuation does not.
Also note that stanzas should be separated by blank lines, and lines beginning with a # character are comments that will be ignored. The linkbars can appear in the file in any order, and several can overlap at the same position with no problem, since Gmaj will display them in multiple rows if necessary. In PipMaker this format is called "annotations with hyperlinks".
Each of these files specifies underlays (colored bands) to be painted on a particular pairwise pip and its corresponding dotplot. The bands are specified as regions in the reference sequence and are normally drawn vertically; however for a dotplot, Gmaj will also look to see if you have specified an underlay file for the transposed situation where the reference and secondary sequences are swapped, and if so, will draw those underlays as horizontal bands in the secondary sequence.
The PipMaker-style underlay format supported by Gmaj looks like this:
    # partial underlays for the BTK region

    LightYellow Gene
    Green Exon
    Red Strongly_conserved

    35324 72009 (BTK gene) Gene
    49781 49849 (exon 4) Exon
    51403 51484 Exon
    50350 50513 (conserved 84%) Strongly_conserved 84
    52376 52603 (Kilroy was here) Strongly_conserved 92 +

    ... etc.

The first group of lines describes the intended meaning of the colors, while the second group specifies the location of each band. Colors must come from Gmaj's color list, but the meaning of each color can be any single word chosen by you. The text in parentheses is an optional label which will be displayed on Gmaj's position indicator line when the user points the mouse at that band. The parentheses must be present if the label is, and the label itself cannot contain any additional parentheses. The number following the color category is an optional integer score that can be used to interactively adjust which underlays are displayed; see "Underlays Box" in the Menus and Widgets section of Starting and Running Gmaj for more information. (The label and score are extra features not supported by PipMaker.) A + or - character at the end of a location line will paint just the upper or lower half of the band on the pip (but is ignored for dotplots). This allows you to differentiate between the two strands, or to plot potentially overlapping features like gene predictions and database matches.

Note that if two bands overlap, the one that was specified last in the file appears "on top" and obscures the earlier one (except for the special Hatch color). Thus in this example, the green exons and red strongly conserved regions cover up parts of the long yellow band representing the gene. As in the links file, lines beginning with a # character are comments that will be ignored.
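The location-line syntax (start, end, optional parenthesized label, category, optional score, optional + or -) can be captured in a single regular expression. The Python sketch below is one way to do so; the names are hypothetical, and allowing a negative score and requiring every category to be defined before it is used are assumptions rather than documented Gmaj behavior.

```python
import re

LOCATION = re.compile(
    r"^(\d+)\s+(\d+)"          # start and end positions
    r"(?:\s+\(([^()]*)\))?"    # optional (label), no nested parentheses
    r"\s+(\S+)"                # category: a single word
    r"(?:\s+(-?\d+))?"         # optional integer score
    r"(?:\s+([+-]))?$"         # optional upper/lower half indicator
)

def parse_underlays(text):
    """Parse PipMaker-style underlay text.

    Returns (colors, bands): colors maps category -> color name, and each
    band is (start, end, label, category, score, half). Illustrative only.
    """
    colors, bands = {}, []
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue                      # blank lines and comments
        m = LOCATION.match(line)
        if m:
            start, end, label, category, score, half = m.groups()
            if category not in colors:
                raise ValueError(f"undefined category: {category!r}")
            bands.append((int(start), int(end), label, category,
                          int(score) if score is not None else None, half))
            continue
        fields = line.split()
        if len(fields) == 2:              # "Color Meaning" definition line
            colors[fields[1]] = fields[0]
            continue
        raise ValueError(f"unrecognized line: {line!r}")
    return colors, bands
```

Because bands are returned in file order, drawing them in that order reproduces the "last one on top" behavior described above.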
Highlight files are analogous to the underlay files, but each of these specifies colored regions for a particular sequence in the text view, rather than for a plot. If you do not specify a highlight file for a particular sequence, Gmaj will automatically provide default highlights based on the exons file (if you provided one). These will use one color for whole genes, overlaid with different colors to indicate exons on the forward vs. reverse strand. If the exons file specifies a gene's translated region, then the 5´ and 3´ UTRs will be shaded using lighter colors. These default highlights make it easy to examine the putative start/stop codons and splice junctions, as well as providing a visual connection between the graphical and text views. But if for some reason you do not want any text highlights, you can suppress them by specifying an empty highlight file.
The PipMaker-style format for highlights is the same as for underlays, except that any + or - indicators will be ignored, and the Hatch color is not supported for highlights. Just as with underlays, labels can be included which will be shown when the user points at the highlight, scores can be used to limit which entries are displayed, and highlights that are listed later in the file will cover up those that appear earlier.
For Gmaj's PipMaker-style annotations, the available colors are:
    Black    White         Clear
    Gray     LightGray     DarkGray
    Red      LightRed      DarkRed
    Green    LightGreen    DarkGreen
    Blue     LightBlue     DarkBlue
    Yellow   LightYellow   DarkYellow
    Pink     LightPink     DarkPink
    Cyan     LightCyan     DarkCyan
    Purple   LightPurple   DarkPurple
    Orange   LightOrange   DarkOrange
    Brown    LightBrown    DarkBrown

These names are case-sensitive (i.e., capitalization matters). Not all of these are supported by PipMaker. Also, be aware that the appearance of the colors may vary between PipMaker and Gmaj, and from one printer or monitor to the next.
In addition to the regular colors listed above, Gmaj supports a special "color" for underlays called Hatch, which is drawn as a pattern of diagonal gray lines. Normally if two underlays overlap, the one that was specified last in the file appears "on top" and obscures the earlier one. However, Hatch underlays have the special property that they are always drawn after the other colors, and since the space between the diagonal lines is transparent, they allow the other colors to show through. Currently Hatch is only supported for underlays, not for highlights or linkbars.
The standardized generic formats currently supported by Gmaj include GFF (v1 & v2), GTF, and various flavors of BED (including the full BED12 format, a.k.a. "gene BED"). For details on these formats, please see the specifications at the above links; this document will mainly discuss their use by Gmaj.
These formats are all tab-separated, and despite their differences are similar enough that Gmaj can extract comparable fields and treat them more or less the same. Note that Gmaj is not intended as a format validator: parsing is more lenient in some respects than the official format specifications, and Gmaj will ignore fields it has no use for. Also, interpretation of these open-ended formats depends partly on what type of annotation is expected; e.g. if Gmaj is trying to read exons from a GFF v1 file, it will assume that the group field is the gene name. It will generally show warning messages to keep the user apprised of any such assumptions it is making (if these become too annoying they can be individually suppressed in the parameters file; see sample.gmaj for details). Because one of the main reasons for supporting these formats is to enable the use of annotation files obtained from public sources, Gmaj tries not to balk at anomalies that are probably not the user's fault, and when practical will simply skip questionable items with a warning message. Each type of message will generally be displayed only once, and not repeated for every item with the same problem.
In order to distinguish generic files from PipMaker-style ones and handle them appropriately, Gmaj requires that files in generic formats have names ending with any of certain extensions. The default list is .gff, .gtf, .bed, .ct, and .trk, but this can be customized (see sample.gmaj).
Some of the generic formats require text values to be enclosed in double quotes (" "). Even when not strictly required it is usually a good idea to do so, especially if the value contains spaces. The official specifications generally don't say what to do if a value contains embedded quote characters, but Gmaj supports a rudimentary mechanism for escaping them with a backslash (\). However it does not provide for escaping the backslash: quoted values should not end with \ (insert a space before the final quote if necessary).
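A script that writes generic annotation files can apply these quoting rules mechanically. The Python sketch below is a minimal example under the rules just stated: embedded quotes are escaped with a backslash, and a space is appended when the value would otherwise end with a backslash. The function name is hypothetical.

```python
def quote_value(value):
    """Wrap a text value in double quotes for a generic annotation file."""
    if value.endswith("\\"):
        # A trailing backslash would escape the closing quote, and the
        # backslash itself cannot be escaped, so pad with a space.
        value += " "
    return '"' + value.replace('"', '\\"') + '"'
```

For instance, quote_value('say "hi"') yields the text "say \"hi\"" including the outer quotes.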
When reading the generic formats, Gmaj treats two adjacent tab characters as an empty field. However, your files will be easier for humans to read if you do not leave fields completely empty. Gmaj recognizes a value of . (the dot character) to mean "unspecified" for fields such as strand, score, feature, and color, in some cases even when the official formats don't. For instance, GFF v2 explicitly calls for using . when there is no score, but Gmaj allows you to do this with the other generic formats as well, in order to distinguish between "no score" and a score that is truly zero. For colors, in addition to . Gmaj also interprets 0 to mean "unspecified", in keeping with examples at UCSC.
The GFF and GTF formats use 1-based, closed-interval coordinates (i.e., sequence numbering starts with "1", and specified ranges include both endpoints), while BED uses a 0-based, half-open system (the first nucleotide of the sequence is numbered "0", and the ending position is not included in the region). For all of these formats, positions are given relative to the beginning of the named sequence regardless of which strand the feature is on (unlike MAF), and start must be less than or equal to end.
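The difference between the two coordinate systems comes down to a one-position shift of the start. The Python sketch below illustrates the conversion to the 1-based, closed-interval system; the function name and the format strings are assumptions made for this example.

```python
def to_one_based_closed(start, end, fmt):
    """Convert a feature's (start, end) to 1-based, closed-interval form.

    fmt is "gff", "gtf", or "bed" (case-insensitive). Illustrative only.
    """
    if start > end:
        raise ValueError("start must be less than or equal to end")
    fmt = fmt.lower()
    if fmt in ("gff", "gtf"):
        return start, end          # already 1-based and closed
    if fmt == "bed":
        # 0-based start shifts up by one; the half-open end is unchanged
        # because it already equals the last included 1-based position.
        return start + 1, end
    raise ValueError(f"unsupported format: {fmt!r}")
```

Thus the first 100 nucleotides of a sequence are written as 0 100 in BED and 1 100 in GFF, and both convert to (1, 100).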
BED format is relatively fixed in how its fields are used, but GFF and GTF are more variable and require additional conventions for most effective use with Gmaj. In particular, the values of the "feature" field and the optional "attributes" affect how Gmaj will interpret and display an item.
Values of the feature field that are recognized for special treatment include:
    gene or values starting with gene_
    exon or values starting with exon_
    start_codon, str_codon, stop_codon, stp_codon, or cds
    repeatmasker or any of the PipMaker repeat or CpG types
Of these, only the PipMaker types are case-sensitive.
For GFF v2 and GTF, the currently recognized attribute tags are:
    gene or gene_id: the name of the gene, e.g. for grouping exons (transcript_id is ignored)
    name: an optional name for this individual item, e.g. for an exon label
    sequence (when feature is repeatmasker): the name/class/family of the repeat, e.g. AluJb/SINE/Alu
    color: a color specification in UCSC format, e.g. 0,0,255
    url or ucsc_id: the URL for linkbars; $$ will be replaced with the value of name
These keywords are not case-sensitive, but they cannot have multiple values.
Along with the basic formats listed above, Gmaj also supports UCSC custom track headers. Track lines can specify certain settings for an entire track; currently color, itemRgb, offset, and url are supported. They also allow several tracks (even in mixed formats) to be combined in a single file. Gmaj does not currently provide a way to use just one particular track from such a file (it will be treated as one big bag of annotations), but lines in unsupported formats such as WIG are gracefully skipped. Browser lines are also skipped; Gmaj's initial zoom position is controlled by command-line or applet parameters rather than by individual annotation files.
Generic files can also contain annotations for several sequences, because unlike the PipMaker-style formats, they all have a "seqname" or "chrom" field that Gmaj can use to select the appropriate lines. Ideally Gmaj expects this field to match the sequence name from the alignment files, but has two ways to deal with exceptions. If there is only one seqname in the annotation file, then Gmaj will go ahead and use it, but will display a warning (unless the mismatch can be fixed by prepending the organism name, or the organism name plus chr, to the annotation seqname). But if the file has annotations for several sequences and some don't match the alignment files, you need to tell Gmaj which is which by adding an alias in the parameters file (see sample.gmaj).
One of the advantages of using generic formats is that files can be reused in multiple panels without reformatting, e.g. as both exons and underlays. Normally linkbars, underlays, and text highlights are simply handled as arbitrary regions of a specified color, since they could represent any type of biological feature. However, you can ask Gmaj to interpret them as exons or repeats by adding a type hint in the parameters file (see sample.gmaj). Note that currently this will also cause any specified colors in that file to be overridden with Gmaj's defaults.
Combining several biological types of annotations (e.g. exons and repeats) in one file is possible, but not recommended. Gmaj will try to skip lines that are not appropriate for the type it is seeking, but it may draw more than you want.
Currently Gmaj has no special support for multiple transcripts. When inferring UTRs, all of the CDS-related items for a single gene name are combined, and the interval from the lowest coordinate to the highest is used as the CDS. Also, some of the formats' rules specify whether or not the initiation and stop codons should be included in the CDS, but Gmaj does not make adjustments to compensate for that; instead it simply includes all of the given endpoints in the CDS.
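The CDS rule just described (combine all CDS-related items that share a gene name and take the lowest and highest coordinates) can be sketched in a few lines of Python. The function name and the (gene, start, end) tuple layout are assumptions for this illustration, not Gmaj's internal representation.

```python
def infer_cds(items):
    """Collapse CDS-related items into one interval per gene name.

    items is an iterable of (gene_name, start, end) tuples; the result maps
    each gene name to (lowest coordinate, highest coordinate).
    """
    cds = {}
    for gene, start, end in items:
        if gene in cds:
            low, high = cds[gene]
            cds[gene] = (min(low, start), max(high, end))
        else:
            cds[gene] = (start, end)
    return cds
```

Note that with several transcripts under one gene name, this yields the union span of all their coding regions, which is why UTRs may appear shorter than in any single transcript.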
Colors can be specified for individual annotation lines via the itemRgb field (for BED) or a color attribute (for GFF v2 or GTF). However, for custom tracks, these are governed by the track line's itemRgb attribute, which defaults to off per the UCSC specification. Thus if you have track lines and want to use the per-item colors, you need to include itemRgb=On in the track attributes.
Track lines can also have a color attribute for the entire track, which will be used if itemRgb is off, or if an individual item does not have its own color. However in a rare break from the UCSC specification, Gmaj does not use black as the default if the track color is unspecified (black underlays and highlights just don't work with black plots and text). Instead it uses its own default colors, which for genes/exons are the same as the colors for default highlights, or light gray for other annotations. Note that these defaults will also override your colors when type hints are used.
All of the above-mentioned color values are specified in UCSC format, which consists of three comma-separated RGB values from 0-255 (e.g. 0,0,255).
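As a small illustration of this color format, the Python sketch below parses a UCSC-style color string and treats . and 0 as "unspecified", following the conventions described earlier for generic formats. The function name and the use of None for an unspecified color are assumptions for this example.

```python
def parse_ucsc_color(text):
    """Parse 'R,G,B' into a tuple of ints; return None if unspecified."""
    text = text.strip()
    if text in (".", "0"):
        return None                 # "unspecified": caller picks a default
    parts = text.split(",")
    if len(parts) != 3:
        raise ValueError(f"expected three comma-separated values: {text!r}")
    rgb = tuple(int(part) for part in parts)
    if any(value < 0 or value > 255 for value in rgb):
        raise ValueError(f"RGB values must be in 0-255: {text!r}")
    return rgb
```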
The order of the lines is not supposed to matter in these generic formats, but for most of the Gmaj panels it does matter: exons need to be grouped by gene and ordered by position so UTRs can be inferred and exon numbers assigned, early underlays are covered up by later ones, etc. Gmaj solves this problem by sorting the data before it is displayed. Exons are sorted first by gene name in ascending order, and then within each gene by start position (ascending) and lastly in case of a tie, by end position (descending). All other annotation types are sorted first by length in descending order, and then in case of a tie by start position (ascending). This usually produces a reasonable display, but if you need direct control of the order, you can use the PipMaker-style formats instead.
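The two sort orders can be expressed as sort keys. The Python sketch below is an illustration only: the tuple layouts and function names are assumptions, and length is computed on closed intervals (end - start + 1).

```python
def sort_exons(exons):
    """Sort (gene, start, end) tuples: gene name ascending, then start
    ascending, then end descending to break ties."""
    return sorted(exons, key=lambda e: (e[0], e[1], -e[2]))

def sort_regions(regions):
    """Sort (start, end) tuples: longest first, then start ascending."""
    return sorted(regions, key=lambda r: (-(r[1] - r[0] + 1), r[0]))
```

Since later items are painted over earlier ones, putting the longest regions first means shorter regions end up on top, where a long band cannot hide them.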