GFF3Reader xref

/*--------------------------------------------------------------------------
 *  Copyright 2008 utgenome.org
 *
 *  Licensed under the Apache License, Version 2.0 (the "License");
 *  you may not use this file except in compliance with the License.
 *  You may obtain a copy of the License at
 *
 *     http://www.apache.org/licenses/LICENSE-2.0
 *
 *  Unless required by applicable law or agreed to in writing, software
 *  distributed under the License is distributed on an "AS IS" BASIS,
 *  WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
 *  See the License for the specific language governing permissions and
 *  limitations under the License.
 *--------------------------------------------------------------------------*/
//--------------------------------------
// utgb-core Project
//
// GFF3Reader.java
// Since: Jul 7, 2008
//
// $URL$
// $Author$
//--------------------------------------
package org.utgenome.format.gff3;

import java.io.Reader;

/**
 * Reader for data in the GFF3 format. A GFF3 record has nine tab-delimited
 * columns, described below in an excerpt from the GFF3 specification:
 *
 * <pre>
 * Undefined fields are replaced with the &quot;.&quot; character, as described in
 *  the original GFF spec.
 *
 *  Column 1: &quot;seqid&quot;
 *
 *  The ID of the landmark used to establish the coordinate system for the
 *  current feature. IDs may contain any characters, but must escape any
 *  characters not in the set [a-zA-Z0-9.:^*$@!+_?-|].  In particular, IDs
 *  may not contain unescaped whitespace and must not begin with an
 *  unescaped &quot;&gt;&quot;.
 *
 *  Column 2: &quot;source&quot;
 *
 *  The source is a free text qualifier intended to describe the algorithm
 *  or operating procedure that generated this feature.  Typically this is
 *  the name of a piece of software, such as &quot;Genescan&quot; or a database
 *  name, such as &quot;Genbank.&quot;  In effect, the source is used to extend the
 *  feature ontology by adding a qualifier to the type creating a new
 *  composite type that is a subclass of the type in the type column.
 *
 *  Column 3: &quot;type&quot;
 *
 *  The type of the feature (previously called the &quot;method&quot;).  This is
 *  constrained to be either: (a) a term from the &quot;lite&quot; sequence
 *  ontology, SOFA; or (b) a SOFA accession number.  The latter
 *  alternative is distinguished using the syntax SO:000000.
 *
 *  Columns 4 &amp; 5: &quot;start&quot; and &quot;end&quot;
 *
 *  The start and end of the feature, in 1-based integer coordinates,
 *  relative to the landmark given in column 1.  Start is always less than
 *  or equal to end.
 *
 *  For zero-length features, such as insertion sites, start equals end
 *  and the implied site is to the right of the indicated base in the
 *  direction of the landmark.
 *
 *  Column 6: &quot;score&quot;
 *
 *  The score of the feature, a floating point number.  As in earlier
 *  versions of the format, the semantics of the score are ill-defined.
 *  It is strongly recommended that E-values be used for sequence
 *  similarity features, and that P-values be used for ab initio gene
 *  prediction features.
 *
 *  Column 7: &quot;strand&quot;
 *
 *  The strand of the feature.  + for positive strand (relative to the
 *  landmark), - for minus strand, and . for features that are not
 *  stranded.  In addition, ? can be used for features whose strandedness
 *  is relevant, but unknown.
 *
 *  Column 8: &quot;phase&quot;
 *
 *  For features of type &quot;CDS&quot;, the phase indicates where the feature
 *  begins with reference to the reading frame.  The phase is one of the
 *  integers 0, 1, or 2, indicating the number of bases that should be
 *  removed from the beginning of this feature to reach the first base of
 *  the next codon. In other words, a phase of &quot;0&quot; indicates that the next
 *  codon begins at the first base of the region described by the current
 *  line, a phase of &quot;1&quot; indicates that the next codon begins at the
 *  second base of this region, and a phase of &quot;2&quot; indicates that the
 *  codon begins at the third base of this region. This is NOT to be
 *  confused with the frame, which is simply start modulo 3.
 *
 *  For forward strand features, phase is counted from the start
 *  field. For reverse strand features, phase is counted from the end
 *  field.
 *
 *  The phase is REQUIRED for all CDS features.
 *
 *  Column 9: &quot;attributes&quot;
 *
 *  A list of feature attributes in the format tag=value.  Multiple
 *  tag=value pairs are separated by semicolons.  URL escaping rules are
 *  used for tags or values containing the following characters: &quot;,=;&quot;.
 *  Spaces are allowed in this field, but tabs must be replaced with the
 *  %09 URL escape.
 *
 *  These tags have predefined meanings:
 *
 *  ID     Indicates the name of the feature.  IDs must be unique
 *  within the scope of the GFF file.
 *
 *  Name   Display name for the feature.  This is the name to be
 *  displayed to the user.  Unlike IDs, there is no requirement
 *  that the Name be unique within the file.
 *
 *  Alias  A secondary name for the feature.  It is suggested that
 *  this tag be used whenever a secondary identifier for the
 *  feature is needed, such as locus names and
 *  accession numbers.  Unlike ID, there is no requirement
 *  that Alias be unique within the file.
 *
 *  Parent Indicates the parent of the feature.  A parent ID can be
 *  used to group exons into transcripts, transcripts into
 *  genes, and so forth.  A feature may have multiple parents.
 *  Parent can *only* be used to indicate a part-of
 *  relationship.
 *
 *  Target Indicates the target of a nucleotide-to-nucleotide or
 *  protein-to-nucleotide alignment.  The format of the
 *  value is &quot;target_id start end [strand]&quot;, where strand
 *  is optional and may be &quot;+&quot; or &quot;-&quot;.  If the target_id
 *  contains spaces, they must be escaped as hex escape %20.
 *
 *  Gap   The alignment of the feature to the target if the two are
 *  not collinear (e.g. contain gaps).  The alignment format is
 *  taken from the CIGAR format described in the
 *  Exonerate documentation.
 *  (http://cvsweb.sanger.ac.uk/cgi-bin/cvsweb.cgi/exonerate?cvsroot=Ensembl).
 *  See &quot;THE GAP ATTRIBUTE&quot; for a description of this format.
 *
 *  Derives_from
 *  Used to disambiguate the relationship between one
 *  feature and another when the relationship is a temporal
 *  one rather than a purely structural &quot;part of&quot; one.  This
 *  is needed for polycistronic genes.  See &quot;PATHOLOGICAL CASES&quot;
 *  for further discussion.
 *
 *  Note   A free text note.
 *
 *  Dbxref A database cross reference.  See the section
 *  &quot;Ontology Associations and Db Cross References&quot; for
 *  details on the format.
 *
 *  Ontology_term  A cross reference to an ontology term.  See
 *  the section &quot;Ontology Associations and Db Cross References&quot;
 *  for details.
 *
 *  Multiple attributes of the same type are indicated by separating the
 *  values with the comma &quot;,&quot; character, as in:
 *
 *  Parent=AF2312,AB2812,abc-3
 *
 *  Note that attribute names are case sensitive.  &quot;Parent&quot; is not the
 *  same as &quot;parent&quot;.
 *
 *  All attributes that begin with an uppercase letter are reserved for
 *  later use.  Attributes that begin with a lowercase letter can be used
 *  freely by applications.
 *
 * </pre>
 *
 * @author leo
 *
 */
public class GFF3Reader {

	/**
	 * Creates a reader for GFF3 data. Currently a stub: the input is not read.
	 */
	public GFF3Reader(Reader input) {

	}

}
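
The class above is only a stub, so the following is a minimal sketch of how one record could be handled according to the column rules quoted in the Javadoc. It is not part of utgb-core. The class name `Gff3LineSketch`, its method names, and the sample line in `main` are all hypothetical and chosen for illustration. The sketch splits a line into nine tab-delimited columns, then parses column 9 by splitting on ";" and "," before undoing the URL escapes. Escapes are decoded by hand rather than with `java.net.URLDecoder`, because that class turns "+" into a space, which would corrupt a strand value inside a Target attribute.

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper, not part of utgb-core: illustrates the column and
// attribute rules quoted in the GFF3Reader Javadoc.
final class Gff3LineSketch {

    /** Splits one feature line into its nine tab-delimited columns. */
    static String[] splitColumns(String line) {
        // A limit of -1 keeps trailing empty fields so the count stays honest.
        String[] columns = line.split("\t", -1);
        if (columns.length != 9) {
            throw new IllegalArgumentException(
                    "expected 9 columns but found " + columns.length);
        }
        return columns;
    }

    /** Parses column 9 into tag -> values, preserving the order of tags. */
    static Map<String, List<String>> parseAttributes(String column9) {
        Map<String, List<String>> attributes = new LinkedHashMap<>();
        if (column9.equals(".")) {
            return attributes; // "." marks an undefined field
        }
        // Split on the raw separators first; escaped ";" "=" "," are still
        // in %XX form at this point and therefore cannot be mistaken for them.
        for (String pair : column9.split(";")) {
            if (pair.isEmpty()) {
                continue;
            }
            int eq = pair.indexOf('=');
            if (eq < 0) {
                throw new IllegalArgumentException("missing '=' in: " + pair);
            }
            String tag = unescape(pair.substring(0, eq));
            List<String> values =
                    attributes.computeIfAbsent(tag, k -> new ArrayList<>());
            for (String value : pair.substring(eq + 1).split(",", -1)) {
                values.add(unescape(value));
            }
        }
        return attributes;
    }

    /** Decodes %XX escapes as UTF-8 bytes; every other character is kept. */
    static String unescape(String text) {
        if (text.indexOf('%') < 0) {
            return text;
        }
        byte[] raw = text.getBytes(StandardCharsets.UTF_8);
        ByteArrayOutputStream decoded = new ByteArrayOutputStream();
        for (int i = 0; i < raw.length; i++) {
            if (raw[i] == '%' && i + 2 < raw.length) {
                int hi = Character.digit(raw[i + 1] & 0xFF, 16);
                int lo = Character.digit(raw[i + 2] & 0xFF, 16);
                if (hi >= 0 && lo >= 0) {
                    decoded.write(hi * 16 + lo);
                    i += 2;
                    continue;
                }
            }
            decoded.write(raw[i]); // not a valid escape: copy the byte as-is
        }
        return new String(decoded.toByteArray(), StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        // Made-up sample record; the Parent values come from the Javadoc example.
        String line = "ctg123\t.\texon\t1300\t1500\t.\t+\t.\t"
                + "ID=exon00001;Parent=AF2312,AB2812,abc-3;Note=a%3Bb";
        String[] columns = splitColumns(line);
        System.out.println(columns[2] + " " + columns[3] + "-" + columns[4]);
        System.out.println(parseAttributes(columns[8]));
    }
}
```

Storing every tag as a list keeps single-valued and multi-valued attributes (such as Parent) uniform for callers. A real reader would also need to handle comment and directive lines, which the excerpt above does not describe.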