/*--------------------------------------------------------------------------
 *  Copyright 2008 utgenome.org
 *
 *  Licensed under the Apache License, Version 2.0 (the "License");
 *  you may not use this file except in compliance with the License.
 *  You may obtain a copy of the License at
 *
 *      http://www.apache.org/licenses/LICENSE-2.0
 *
 *  Unless required by applicable law or agreed to in writing, software
 *  distributed under the License is distributed on an "AS IS" BASIS,
 *  WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
 *  See the License for the specific language governing permissions and
 *  limitations under the License.
 *--------------------------------------------------------------------------*/
//--------------------------------------
// utgb-core Project
//
// GFF3Reader.java
// Since: Jul 7, 2008
//
// $URL$
// $Author$
//--------------------------------------
package org.utgenome.format.gff3;

import java.io.Reader;

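/**
 * Reader for the GFF3 format. A GFF3 feature line has nine tab-delimited
 * columns, described below in text taken from the GFF3 specification:
 *
 * <pre>
 * Undefined fields are replaced with the "." character, as described in
 * the original GFF spec.
 *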
 * Column 1: "seqid"
 *
 * The ID of the landmark used to establish the coordinate system for the
 * current feature. IDs may contain any characters, but must escape any
 * characters not in the set [a-zA-Z0-9.:^*$@!+_?-|]. In particular, IDs
 * may not contain unescaped whitespace and must not begin with an
 * unescaped ">".
 *
 * Column 2: "source"
 *
 * The source is a free text qualifier intended to describe the algorithm
 * or operating procedure that generated this feature. Typically this is
 * the name of a piece of software, such as "Genescan" or a database
 * name, such as "Genbank." In effect, the source is used to extend the
 * feature ontology by adding a qualifier to the type creating a new
 * composite type that is a subclass of the type in the type column.
 *
 * Column 3: "type"
 *
 * The type of the feature (previously called the "method"). This is
 * constrained to be either: (a) a term from the "lite" sequence
 * ontology, SOFA; or (b) a SOFA accession number. The latter
 * alternative is distinguished using the syntax SO:000000.
 *
 * Columns 4 & 5: "start" and "end"
 *
 * The start and end of the feature, in 1-based integer coordinates,
 * relative to the landmark given in column 1. Start is always less than
 * or equal to end.
 *
 * For zero-length features, such as insertion sites, start equals end
 * and the implied site is to the right of the indicated base in the
 * direction of the landmark.
 *
 * Column 6: "score"
 *
 * The score of the feature, a floating point number. As in earlier
 * versions of the format, the semantics of the score are ill-defined.
 * It is strongly recommended that E-values be used for sequence
 * similarity features, and that P-values be used for ab initio gene
 * prediction features.
 *
 * Column 7: "strand"
 *
 * The strand of the feature. + for positive strand (relative to the
 * landmark), - for minus strand, and . for features that are not
 * stranded. In addition, ? can be used for features whose strandedness
 * is relevant, but unknown.
 *
 * Column 8: "phase"
 *
 * For features of type "CDS", the phase indicates where the feature
 * begins with reference to the reading frame. The phase is one of the
 * integers 0, 1, or 2, indicating the number of bases that should be
 * removed from the beginning of this feature to reach the first base of
 * the next codon. In other words, a phase of "0" indicates that the next
 * codon begins at the first base of the region described by the current
 * line, a phase of "1" indicates that the next codon begins at the
 * second base of this region, and a phase of "2" indicates that the
 * codon begins at the third base of this region. This is NOT to be
 * confused with the frame, which is simply start modulo 3.
 *
 * For forward strand features, phase is counted from the start
 * field. For reverse strand features, phase is counted from the end
 * field.
 *
 * The phase is REQUIRED for all CDS features.
 *
 * Column 9: "attributes"
 *
 * A list of feature attributes in the format tag=value. Multiple
 * tag=value pairs are separated by semicolons. URL escaping rules are
 * used for tags or values containing the following characters: ",=;".
 * Spaces are allowed in this field, but tabs must be replaced with the
 * %09 URL escape.
 *
 * These tags have predefined meanings:
 *
 * ID            Indicates the name of the feature. IDs must be unique
 *               within the scope of the GFF file.
 *
 * Name          Display name for the feature. This is the name to be
 *               displayed to the user. Unlike IDs, there is no requirement
 *               that the Name be unique within the file.
 *
 * Alias         A secondary name for the feature. It is suggested that
 *               this tag be used whenever a secondary identifier for the
 *               feature is needed, such as locus names and
 *               accession numbers. Unlike ID, there is no requirement
 *               that Alias be unique within the file.
 *
 * Parent        Indicates the parent of the feature. A parent ID can be
 *               used to group exons into transcripts, transcripts into
 *               genes, and so forth. A feature may have multiple parents.
 *               Parent can *only* be used to indicate a part-of
 *               relationship.
 *
 * Target        Indicates the target of a nucleotide-to-nucleotide or
 *               protein-to-nucleotide alignment. The format of the
 *               value is "target_id start end [strand]", where strand
 *               is optional and may be "+" or "-". If the target_id
 *               contains spaces, they must be escaped as hex escape %20.
 *
 * Gap           The alignment of the feature to the target if the two are
 *               not collinear (e.g. contain gaps). The alignment format is
 *               taken from the CIGAR format described in the
 *               Exonerate documentation.
 *               (http://cvsweb.sanger.ac.uk/cgi-bin/cvsweb.cgi/exonerate?cvsroot=Ensembl).
 *               See "THE GAP ATTRIBUTE" for a description of this format.
 *
 * Derives_from  Used to disambiguate the relationship between one
 *               feature and another when the relationship is a temporal
 *               one rather than a purely structural "part of" one. This
 *               is needed for polycistronic genes. See "PATHOLOGICAL CASES"
 *               for further discussion.
 *
 * Note          A free text note.
 *
 * Dbxref        A database cross reference. See the section
 *               "Ontology Associations and Db Cross References" for
 *               details on the format.
 *
 * Ontology_term A cross reference to an ontology term. See
 *               the section "Ontology Associations and Db Cross References"
 *               for details.
 *
 * Multiple attributes of the same type are indicated by separating the
 * values with the comma "," character, as in:
 *
 * Parent=AF2312,AB2812,abc-3
 *
 * Note that attribute names are case sensitive. "Parent" is not the
 * same as "parent".
 *
 * All attributes that begin with an uppercase letter are reserved for
 * later use. Attributes that begin with a lowercase letter can be used
 * freely by applications.
 * </pre>
 *
 * @author leo
 *
 */
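public class GFF3Reader {

    // The character stream to read GFF3 text from
    private final Reader input;

    /**
     * Creates a reader over GFF3-formatted text. Parsing is not implemented
     * yet; the input is only retained for later use.
     */
    public GFF3Reader(Reader input) {
        this.input = input;
    }

}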
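
// ---------------------------------------------------------------------------
// Illustrative sketch only, not part of the original GFF3Reader. It shows one
// way a single feature line in the nine-column layout documented above could
// be split, and how the column 9 attributes (tag=value pairs separated by ";",
// multiple values separated by ",") could be decoded. The class name
// GFF3LineSketch and all of its methods are hypothetical. The sketch assumes a
// well-formed line: it does not validate SOFA types, phases, or coordinates,
// and it decodes only single-byte %XX escapes such as %09, %20, and %3B.
//
// Hypothetical usage:
//   String[] columns = GFF3LineSketch.splitColumns(line);
//   Map<String, List<String>> attributes = GFF3LineSketch.parseAttributes(columns[8]);
// ---------------------------------------------------------------------------
class GFF3LineSketch {

    /** Splits a tab-delimited feature line into its nine columns. */
    static String[] splitColumns(String line) {
        // A negative limit keeps trailing empty columns instead of dropping them
        String[] columns = line.split("\t", -1);
        if (columns.length != 9) {
            throw new IllegalArgumentException("expected 9 columns but found " + columns.length);
        }
        return columns;
    }

    /** Parses column 9 into tag -> values; an undefined field (".") yields an empty map. */
    static java.util.Map<String, java.util.List<String>> parseAttributes(String column9) {
        java.util.Map<String, java.util.List<String>> attributes = new java.util.LinkedHashMap<>();
        if (column9.equals(".")) {
            return attributes;
        }
        for (String pair : column9.split(";")) {
            if (pair.isEmpty()) {
                continue; // tolerate stray semicolons
            }
            int eq = pair.indexOf('=');
            if (eq < 0) {
                throw new IllegalArgumentException("attribute is not in tag=value form: " + pair);
            }
            // Split on "," before unescaping, so an escaped comma (%2C) stays inside its value
            java.util.List<String> values = new java.util.ArrayList<>();
            for (String value : pair.substring(eq + 1).split(",")) {
                values.add(unescape(value));
            }
            attributes.put(unescape(pair.substring(0, eq)), values);
        }
        return attributes;
    }

    /** Decodes %XX escapes; "+" is left as-is because it is a literal character in GFF3. */
    static String unescape(String escaped) {
        StringBuilder out = new StringBuilder();
        for (int i = 0; i < escaped.length(); i++) {
            char c = escaped.charAt(i);
            if (c == '%' && i + 2 < escaped.length()) {
                int hi = Character.digit(escaped.charAt(i + 1), 16);
                int lo = Character.digit(escaped.charAt(i + 2), 16);
                if (hi >= 0 && lo >= 0) {
                    out.append((char) (hi * 16 + lo));
                    i += 2; // skip the two hex digits just consumed
                    continue;
                }
            }
            out.append(c); // ordinary character, or a "%" not followed by two hex digits
        }
        return out.toString();
    }
}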