/*--------------------------------------------------------------------------
 *  Copyright 2007 utgenome.org
 *
 *  Licensed under the Apache License, Version 2.0 (the "License");
 *  you may not use this file except in compliance with the License.
 *  You may obtain a copy of the License at
 *
 *     http://www.apache.org/licenses/LICENSE-2.0
 *
 *  Unless required by applicable law or agreed to in writing, software
 *  distributed under the License is distributed on an "AS IS" BASIS,
 *  WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
 *  See the License for the specific language governing permissions and
 *  limitations under the License.
 *--------------------------------------------------------------------------*/
//--------------------------------------
// UTGB Common Project
//
// Assembly.java
// Since: Jun 5, 2007
//
// $URL$
// $Author$
//--------------------------------------
package org.utgenome.format.agp;

/**
 * File Format: One feature of the AGP file is that column definitions change depending on whether the line is a component line or a gap line. There is a single column definition up to column 5, then
 * each column will have two definitions, depending on the value in column 5.
 * <table summary="AGP File format" border="1" cellpadding="1" cellspacing="1" width="700">
 * <tbody>
 * <tr>
 * <td width="47"><strong>column</strong></td>
 * <td width="142"><strong>content</strong></td>
 * <td width="493"><strong>description</strong></td>
 * </tr>
 * <tr>
 * <td>1</td>
 * <td>object</td>
 * <td> This is the identifier for the object being assembled. This can be a chromosome, scaffold or contig. If the object is a chromosome and an accession.version identifier is not used to describe
 * the object, then the naming convention is to precede the chromosome number with "chr" (if a chromosome) or "LG" (if a linkage group).
 * For example: chr1. If the object is a contig or scaffold, then
 * the identifier needs to be unique within the assembly. </td>
 * </tr>
 * <tr>
 * <td>2</td>
 * <td>object_beg</td>
 * <td> The starting coordinate of the component/gap on the object in column 1. This is a location in the object's coordinate system, not the component's. </td>
 * </tr>
 * <tr>
 * <td>3</td>
 * <td>object_end</td>
 * <td> The ending coordinate of the component/gap on the object in column 1. This is a location in the object's coordinate system, not the component's. </td>
 * </tr>
 * <tr>
 * <td>4</td>
 * <td>part_number</td>
 * <td> The line count for the components/gaps that make up the object described in column 1. </td>
 * </tr>
 * <tr>
 * <td>5</td>
 * <td>component_type</td>
 * <td> The sequencing status of the component. These typically correspond to keywords in the International Sequence Database (GenBank/EMBL/DDBJ) submission. Current acceptable values are:<br>
 * <strong> A</strong>=Active Finishing<br>
 * <strong> D</strong>=Draft HTG (often phase1 and phase2 are called Draft, whether or not they have the draft keyword).<br>
 * <strong> F</strong>=Finished HTG (phase 3)<br>
 * <strong> G</strong>=Whole Genome Finishing<br>
 * <strong> N</strong>=gap with specified size<br>
 * <strong> O</strong>=Other sequence (typically means no HTG keyword)<br>
 * <strong> P</strong>=Pre Draft<br>
 * <strong> U</strong>=gap of unknown size, typically defaulting to predefined values.<br>
 * <strong> W</strong>=WGS contig </td>
 * </tr>
 * <tr>
 * <td>6a</td>
 * <td>component_id</td>
 * <td> If column 5 not equal to N: This is a unique identifier for the sequence component contributing to the object described in column 1. Ideally this will be a valid accession.version identifier
 * assigned by GenBank/EMBL/DDBJ.
 * If the sequence has not been submitted to a public repository yet, a local identifier should be used. </td>
 * </tr>
 * <tr>
 * <td>6b</td>
 * <td>gap_length</td>
 * <td> If column 5 equal to N: This column represents the length of the gap. </td>
 * </tr>
 * <tr>
 * <td>7a</td>
 * <td>component_beg</td>
 * <td> If column 5 not equal to N: This column specifies the beginning of the part of the component sequence that contributes to the object in column 1 (in component coordinates). </td>
 * </tr>
 * <tr>
 * <td>7b</td>
 * <td>gap_type</td>
 * <td>
 * <p>
 * If column 5 equal to N: This column specifies the gap type. The combination of gap type and linkage (column 8b) indicates whether the gap is captured or uncaptured. In some cases, the gap types are
 * assigned a biological value (e.g. centromere).<br>
 * <br>
 * Accepted values: <strong> <br>
 * fragment:</strong> gap between two sequence contigs (also called a sequence gap). <strong> <br>
 * clone:</strong> a gap between two clones that do not overlap. <strong> <br>
 * contig:</strong> a gap between clone contigs (also called a "layout gap"). <strong><br>
 * centromere:</strong> a gap inserted for the centromere. <strong> <br>
 * short_arm:</strong> a gap inserted at the start of an acrocentric chromosome. <strong> <br>
 * heterochromatin:</strong> a gap inserted for an especially large region of heterochromatic sequence (may also include the centromere). <strong> <br>
 * telomere:</strong> a gap inserted for the telomere. <strong> <br>
 * repeat:</strong> an unresolvable repeat.
 * </p>
 * </td>
 * </tr>
 * <tr>
 * <td>8a</td>
 * <td>component_end</td>
 * <td> If column 5 not equal to N: This column specifies the end of the part of the component that contributes to the object in column 1 (in component coordinates).
 * </td>
 * </tr>
 * <tr>
 * <td>8b</td>
 * <td>linkage</td>
 * <td>If column 5 equal to N: This column indicates if there is evidence of linkage between the adjacent lines. <br>
 * Values: <strong><br>
 * yes </strong> <strong><br>
 * no</strong> </td>
 * </tr>
 * <tr>
 * <td height="135">9a</td>
 * <td>orientation</td>
 * <td>If column 5 not equal to N: This column specifies the orientation of the component relative to the object in column 1. <br>
 * Values:<br>
 * <strong>+ = plus<br>
 * </strong><strong>- = minus <br>
 * </strong><strong>0 (zero) = unknown<br>
 * na = irrelevant </strong> <br>
 * By default, components with unknown orientation (0 or na) are treated as if they had + orientation.</td>
 * </tr>
 * <tr>
 * <td height="42">9b</td>
 * <td> </td>
 * <td> If column 5 equal to N: This column is empty; there is no filler. A tab should still be inserted after the 8th column so that all lines have 9 columns. </td>
 * </tr>
 * </tbody></table>
 *
 *
 * Extended comments:
 * <ul>
 * <li>Columns should be tab delimited. Lines end with a new line (\n). There should be no extra space around the individual tokens.</li>
 * <li>All coordinates given in the file are 1-based inclusive (not 0-based), i.e. the first base of an object is 1 (not 0).</li>
 * <li>Evidence of linkage. In general, evidence of linkage is provided by end pairs (sometimes referred to as mate pairs), although other evidence, such as transcript alignments, could also be used. In
 * some cases, evidence of linkage may be indirect. For example, given the following scaffold:<br>
 * A--B--C--D<br>
 * Where A, B, C and D are components, there could be end pairs linking A and B and end pairs linking A and C.
 * There might be no pairs linking B and C but their linkage is implied.</li>
 * <li>If the object is a contig or scaffold, the object should not start with a gap line. A chromosome will frequently start or end with one or more biological gap types (e.g. telomere or
 * short_arm).</li>
 * <li>A gap of type fragment will usually be flanked by components and not by other gap lines. Typically, successive gap lines are not encouraged, except in the case of gaps implying some
 * biologically defined entity (such as centromere, heterochromatin, etc.).</li>
 * <li>Coordinates of the object are all with respect to the plus strand, no matter the orientation of the component.</li>
 * <li>object_beg (column 2) should always be less than or equal to object_end (column 3).</li>
 * <li>component_beg (column 7) should always be less than or equal to component_end (column 8).</li>
 * <li>Each object must start with a part_num of 1 (column 4) and an object_beg coordinate of 1 (column 2).</li>
 * <li>Gap lengths must be positive. Negative gaps and gap lines with zero length are not valid.</li>
 * <li>For negative gaps or gaps of unknown size, use 100 as the gap size, as that is the GenBank/EMBL/DDBJ standard for gaps of unknown size.</li>
 * <li>In the case of a GenBank/EMBL/DDBJ submission, the object identifier should be unique not only within the assembly but also across different versions of the assembly. For example,
 * chrUn01.0001 in the first version of a genome and chrUn02.0001 in the second version.</li>
 * <li>Any text after a # symbol is assumed to be a comment.</li>
 * <li>The use of comment lines at the head of the file is encouraged.
 * Useful information to include in such headers is:</li>
 * <ul>
 * <li>organism name</li>
 * <li>assembly name</li>
 * <li>a description of any non-standard object identifiers</li>
 * </ul>
 * </ul>
 *
 * @author leo
 *
 */
public class Assembly
{
    // Columns 1-5: shared by component lines and gap lines
    String objectName;
    int objectBegin;
    int objectEnd;
    int partNumber;
    String componentType;

    // Columns 6a-9a: component lines only
    String componentId; // accession.version or local identifier, so a String rather than an int
    int componentBegin;
    int componentEnd;
    String orientation;

    // Columns 6b-8b: gap lines only
    int gapLength;
    String gapType;
    boolean linkage;

}
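
// ---------------------------------------------------------------------------
// Illustrative sketch only; not part of the original UTGB source.
// It shows one possible way to apply the rules described in the Javadoc above:
// tab-delimited columns, '#' comments, nine columns per line, and 1-based
// inclusive coordinates. The class name AgpLineSketch and all of its method
// names are hypothetical. Treating component type "U" as a gap line is an
// assumption based on the component_type list (the column notes mention only N).
// ---------------------------------------------------------------------------
class AgpLineSketch
{
    /**
     * Splits one AGP line into its nine columns, dropping any trailing '#' comment.
     * Returns an empty array for blank or comment-only lines.
     */
    static String[] splitColumns(String line)
    {
        int hash = line.indexOf('#');
        String body = (hash >= 0) ? line.substring(0, hash) : line;
        if (body.trim().isEmpty())
            return new String[0];

        // A limit of -1 keeps the trailing empty column 9 of gap lines.
        String[] columns = body.split("\t", -1);
        if (columns.length != 9)
            throw new IllegalArgumentException("expected 9 columns but found " + columns.length);
        return columns;
    }

    /** Column 5 (index 4) decides whether columns 6-9 describe a gap or a component. */
    static boolean isGapLine(String[] columns)
    {
        return columns[4].equals("N") || columns[4].equals("U");
    }

    /** Number of bases covered on the object; coordinates are 1-based and inclusive. */
    static int objectSpan(String[] columns)
    {
        int begin = Integer.parseInt(columns[1]);
        int end = Integer.parseInt(columns[2]);
        if (begin > end)
            throw new IllegalArgumentException("object_beg must not exceed object_end");
        return end - begin + 1;
    }
}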