/*--------------------------------------------------------------------------
 *  Copyright 2008 utgenome.org
 *
 *  Licensed under the Apache License, Version 2.0 (the "License");
 *  you may not use this file except in compliance with the License.
 *  You may obtain a copy of the License at
 *
 *     http://www.apache.org/licenses/LICENSE-2.0
 *
 *  Unless required by applicable law or agreed to in writing, software
 *  distributed under the License is distributed on an "AS IS" BASIS,
 *  WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
 *  See the License for the specific language governing permissions and
 *  limitations under the License.
 *--------------------------------------------------------------------------*/
//--------------------------------------
// utgb-core Project
//
// GFF3Reader.java
// Since: Jul 7, 2008
//
// $URL$
// $Author$
//--------------------------------------
package org.utgenome.format.gff3;

import java.io.Reader;

/**
 * Reader for the GFF3 format. Each GFF3 feature line has nine tab-separated
 * columns, described below as in the GFF3 specification:
 *
 * <pre>
 * Undefined fields are replaced with the "." character, as described in
 * the original GFF spec.
 *
 * Column 1: "seqid"
 *
 * The ID of the landmark used to establish the coordinate system for the
 * current feature. IDs may contain any characters, but must escape any
 * characters not in the set [a-zA-Z0-9.:^*$@!+_?-|]. In particular, IDs
 * may not contain unescaped whitespace and must not begin with an
 * unescaped ">".
 *
 * Column 2: "source"
 *
 * The source is a free text qualifier intended to describe the algorithm
 * or operating procedure that generated this feature. Typically this is
 * the name of a piece of software, such as "Genescan" or a database
 * name, such as "Genbank." In effect, the source is used to extend the
 * feature ontology by adding a qualifier to the type creating a new
 * composite type that is a subclass of the type in the type column.
 *
 * Column 3: "type"
 *
 * The type of the feature (previously called the "method"). This is
 * constrained to be either: (a) a term from the "lite" sequence
 * ontology, SOFA; or (b) a SOFA accession number. The latter
 * alternative is distinguished using the syntax SO:000000.
 *
 * Columns 4 & 5: "start" and "end"
 *
 * The start and end of the feature, in 1-based integer coordinates,
 * relative to the landmark given in column 1. Start is always less than
 * or equal to end.
 *
 * For zero-length features, such as insertion sites, start equals end
 * and the implied site is to the right of the indicated base in the
 * direction of the landmark.
 *
 * Column 6: "score"
 *
 * The score of the feature, a floating point number. As in earlier
 * versions of the format, the semantics of the score are ill-defined.
 * It is strongly recommended that E-values be used for sequence
 * similarity features, and that P-values be used for ab initio gene
 * prediction features.
 *
 * Column 7: "strand"
 *
 * The strand of the feature. + for positive strand (relative to the
 * landmark), - for minus strand, and . for features that are not
 * stranded. In addition, ? can be used for features whose strandedness
 * is relevant, but unknown.
 *
 * Column 8: "phase"
 *
 * For features of type "CDS", the phase indicates where the feature
 * begins with reference to the reading frame. The phase is one of the
 * integers 0, 1, or 2, indicating the number of bases that should be
 * removed from the beginning of this feature to reach the first base of
 * the next codon. In other words, a phase of "0" indicates that the next
 * codon begins at the first base of the region described by the current
 * line, a phase of "1" indicates that the next codon begins at the
 * second base of this region, and a phase of "2" indicates that the
 * codon begins at the third base of this region. This is NOT to be
 * confused with the frame, which is simply start modulo 3.
 *
 * For forward strand features, phase is counted from the start
 * field. For reverse strand features, phase is counted from the end
 * field.
 *
 * The phase is REQUIRED for all CDS features.
 *
 * Column 9: "attributes"
 *
 * A list of feature attributes in the format tag=value. Multiple
 * tag=value pairs are separated by semicolons. URL escaping rules are
 * used for tags or values containing the following characters: ",=;".
 * Spaces are allowed in this field, but tabs must be replaced with the
 * %09 URL escape.
 *
 * These tags have predefined meanings:
 *
 *   ID            Indicates the name of the feature. IDs must be unique
 *                 within the scope of the GFF file.
 *
 *   Name          Display name for the feature. This is the name to be
 *                 displayed to the user. Unlike IDs, there is no requirement
 *                 that the Name be unique within the file.
 *
 *   Alias         A secondary name for the feature. It is suggested that
 *                 this tag be used whenever a secondary identifier for the
 *                 feature is needed, such as locus names and
 *                 accession numbers. Unlike ID, there is no requirement
 *                 that Alias be unique within the file.
 *
 *   Parent        Indicates the parent of the feature. A parent ID can be
 *                 used to group exons into transcripts, transcripts into
 *                 genes, and so forth. A feature may have multiple parents.
 *                 Parent can *only* be used to indicate a partof
 *                 relationship.
 *
 *   Target        Indicates the target of a nucleotide-to-nucleotide or
 *                 protein-to-nucleotide alignment. The format of the
 *                 value is "target_id start end [strand]", where strand
 *                 is optional and may be "+" or "-". If the target_id
 *                 contains spaces, they must be escaped as hex escape %20.
 *
 *   Gap           The alignment of the feature to the target if the two are
 *                 not collinear (e.g. contain gaps). The alignment format is
 *                 taken from the CIGAR format described in the
 *                 Exonerate documentation
 *                 (http://cvsweb.sanger.ac.uk/cgi-bin/cvsweb.cgi/exonerate?cvsroot=Ensembl).
 *                 See "THE GAP ATTRIBUTE" for a description of this format.
 *
 *   Derives_from  Used to disambiguate the relationship between one
 *                 feature and another when the relationship is a temporal
 *                 one rather than a purely structural "part of" one. This
 *                 is needed for polycistronic genes. See "PATHOLOGICAL CASES"
 *                 for further discussion.
 *
 *   Note          A free text note.
 *
 *   Dbxref        A database cross reference. See the section
 *                 "Ontology Associations and Db Cross References" for
 *                 details on the format.
 *
 *   Ontology_term A cross reference to an ontology term. See
 *                 the section "Ontology Associations and Db Cross References"
 *                 for details.
 *
 * Multiple attributes of the same type are indicated by separating the
 * values with the comma "," character, as in:
 *
 *   Parent=AF2312,AB2812,abc-3
 *
 * Note that attribute names are case sensitive. "Parent" is not the
 * same as "parent".
 *
 * All attributes that begin with an uppercase letter are reserved for
 * later use. Attributes that begin with a lowercase letter can be used
 * freely by applications.
 * </pre>
 *
 * @author leo
 *
 */
public class GFF3Reader {

    private final Reader input;

    /**
     * Creates a reader over the given GFF3 input. Parsing is not implemented yet.
     *
     * @param input character stream containing GFF3-formatted text
     */
    public GFF3Reader(Reader input) {
        this.input = input;
    }

}
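
// ---------------------------------------------------------------------------
// Illustrative sketch only, not part of the original GFF3Reader.
//
// The column layout documented above can be made concrete with a small
// helper that splits one feature line into its nine columns and then breaks
// column 9 into tag/value lists. The class name Gff3LineSketch and its
// methods parseColumns, parseAttributes and unescape are hypothetical names
// chosen for this example. Assumptions: the line is a feature line (not a
// "##" directive or a comment), and %XX escapes encode single ASCII
// characters. Fully qualified java.util names are used so that no extra
// import statements are needed.
// ---------------------------------------------------------------------------
final class Gff3LineSketch {

    private static final String[] COLUMN_NAMES = { "seqid", "source", "type", "start", "end", "score",
            "strand", "phase", "attributes" };

    private Gff3LineSketch() {
    }

    /** Splits a tab-separated feature line into its nine named columns. */
    static java.util.Map<String, String> parseColumns(String line) {
        // limit -1 keeps trailing empty fields so the column count is exact
        String[] fields = line.split("\t", -1);
        if (fields.length != COLUMN_NAMES.length) {
            throw new IllegalArgumentException("expected 9 tab-separated columns but found " + fields.length);
        }
        java.util.Map<String, String> columns = new java.util.LinkedHashMap<String, String>();
        for (int i = 0; i < COLUMN_NAMES.length; i++) {
            columns.put(COLUMN_NAMES[i], fields[i]);
        }
        return columns;
    }

    /** Parses column 9: tag=value pairs separated by ';', multiple values separated by ','. */
    static java.util.Map<String, java.util.List<String>> parseAttributes(String column9) {
        java.util.Map<String, java.util.List<String>> result = new java.util.LinkedHashMap<String, java.util.List<String>>();
        if (column9.equals(".")) {
            return result; // "." marks an undefined field
        }
        for (String pair : column9.split(";")) {
            if (pair.isEmpty()) {
                continue; // tolerate stray semicolons
            }
            int eq = pair.indexOf('=');
            if (eq < 0) {
                throw new IllegalArgumentException("attribute without '=': " + pair);
            }
            // split on the raw separators first, then undo the %XX escapes
            String tag = unescape(pair.substring(0, eq));
            java.util.List<String> values = new java.util.ArrayList<String>();
            for (String value : pair.substring(eq + 1).split(",", -1)) {
                values.add(unescape(value));
            }
            result.put(tag, values);
        }
        return result;
    }

    /** Replaces %XX hex escapes with the character they encode; '+' is left untouched. */
    private static String unescape(String s) {
        StringBuilder out = new StringBuilder(s.length());
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            if (c == '%' && i + 2 < s.length()) {
                int hi = Character.digit(s.charAt(i + 1), 16);
                int lo = Character.digit(s.charAt(i + 2), 16);
                if (hi >= 0 && lo >= 0) {
                    out.append((char) (hi * 16 + lo));
                    i += 2;
                    continue;
                }
            }
            out.append(c);
        }
        return out.toString();
    }
}

// Possible usage, with "line" holding one tab-separated feature line:
//
//   java.util.Map<String, String> columns = Gff3LineSketch.parseColumns(line);
//   java.util.Map<String, java.util.List<String>> attributes =
//           Gff3LineSketch.parseAttributes(columns.get("attributes"));
//
// Splitting on ';' and ',' before unescaping keeps escaped separators such as
// %3B inside a value from being mistaken for delimiters.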