UTGB Toolkit Index

What is Silk?

Silk is a human-friendly and space-efficient text format, designed for describing tree-structured data. Silk is a replacement XML and JSON, well-known tree-structured data formats. Silk format does not use tags or brackets to organize tree-structures. Instead, indentation via spaces represents data hierarchies, which is far simpler than neatly opening and closing matching tags (or brackets). Silk is compatible with XML and JSON when each tree node in XML or JSON has at most one text value. This type of XML or JSON data can be automatically translated into Silk format. The conversion to the reverse direction, Silk to XML or JSON, is straightforward. Silk is designed to accommodate stream-style processing, and it is possible to translate Silk file into XML or JSON streams, so you can utilize existing XML or JSON processors to analyze data written in Silk format.

Silk Features

Silk text format has the following features:

  • Tree Compatible
  • Human-friendly format
    • In Silk, indentation with spaces is used for nesting data structures, so no need exists to wrap text data with tags. In addition, you can omit double quotations to describe text data, which are mandatory in JSON.
  • Space-efficient format
    • Tab-separated format or comma-separated values (CSV) can be embedded in Silk at any hierarchical position. XML and JSON have no support for such compact data formats.
  • Import function
    • Several data files (e.g. another Silk file, tab-separated data, CSV, text or binary files, etc.) can be imported into a Silk file to compose a large data set.
  • One-liner format
    • Silk data can be processed line-by-line, so it is very familiar with standard text-processing tools, such as grep, awk, Perl, Ruby, etc.

Silk Data Model

Data model that can be described with Silk format is a forest, that is, a list of trees consisting of nodes. Each tree node can have several child nodes and a text value.

(needs some illustrations)

Situations where Silk is useful

Silk format has no need to wrap data with tags or quotations. This feature is suitable for data logging, which needs to incrementally append data to the end of a file. Silk is also useful for accumulating large program outputs. One of the design goals of Silk is to provide a compact representation of biological data. If you do not like verbose data descriptions of XML or JSON formats, Silk will match your needs.

In the UTGB Toolkit, we use Silk as a standard data description format for describing biological data, several configuration files, etc. Several years of experiences of processing XML and JSON format under our belt, we studied these syntaxes can be simplified by removing unnecessary notations, such as tags or brackets. Silk is the result of these syntax optimizations. For example, in most cases, double quotation mark to indicate string data is unnecessary. This plain-style text data description increase the editablity and readability of the Silk file format. In addition, Silk's embedded tab-separated data description significantly reduces the data size.

Silk can be used to enhance existing tab-separated data or comma-separated value (CSV) files with node labels and structures. These flat files can be imported into a Silk file, and you can annotate each data with node labels, and also can organize them in a hierarchical data structure.

Situations where Silk is not useful

Silk is not a markup language such as HTML, so it doesn't suit to represent text decorations. For example, the following text data description, which mixes text values and tags cannot be described with Silk:

<p>This paragraph contains <b>bold</b> and <i>italic</i> fonts.

This is because Silk's data model allows only one text value for each tree node. However, this limitation does not mean Silk cannot describe HTML data. If necessary, you can embed HTML data as a text value. Here is an example:

-p: This paragraph contains <b>bold</b> and <i>italic</i> fonts.

or you can use double quotation to embed arbitrary text.

-p:"This paragraph contains <b>bold</b> and <i>italic</i> fonts."

Tree Compatibility

(To be written)

Preamble

Preamble line beginning with '%' symbol specifies version or encoding of Silk text files. Preamble line must be placed in the first line of Silk files in order to correctly change the behaviour of a Silk processor according to the specified version or character encoding. Preamble description can be ommitted, and in this case Silk uses utf-8 as the default encoding.

Version information

Current Silk format version is 1.0:

% silk(version:1.0)

Encoding

% silk(version:1.0, encoding:utf-8)

  • Supported encodings (To be written)

Comment line

A comment line is marked by a sharp '#' indicator:

# This line will be ignored.

Tree node with a text value

In Silk, a tree node begins with a hyphen '-' followed by a node name. The text value of the node follows a colon ':'. If the colon and text value are not present, the node value of the tree node will be set to null:

Silk

- title: hello world

White spaces around text values will be ignored:

JSON

{ "title":"hello world" }

And also, white spaces around node names will be ignored:

Silk

- first name : Andy

JSON

{ "first name":"Andy" }

Tree node with several child nodes with text values

Silk

- book(id:1, title: Database Management Systems, isbn:0071230572, year:2002)

JSON

{
 "book":
  {
   "id":1, 
   "title":"Database Management Systems", 
   "isbn":"0071230572", 
   "year":2002
  }
}

Nested tree nodes

Indentation before hyphen ('-') represents tree node depth. Only space characters (' ') are allowed before the indentation hyphen ('-'). Tab character ('\t') cannot be used for indentations.

Silk

- book
 - id: 1
 - title: Database Management Systems
 - isbn:0071230572
 - year:2002
- book		
 - id: 2
 - title: Compilers: Second Edition
 - isbn:0321547985
 - year:2007

Alternatively, you can write the same data as follows:

- book(id:1, title: Database Management Systems, isbn:0071230572, year:2002)
- book(id:2, title: Compilers: Second Edition, isbn:0321547985, year:2007)

JSON

[
 {"book":
  {
   "id":1, 
   "title":"Database Management Systems", 
   "isbn":"0071230572", 
   "year":2002
  }
 },
 {"book":
  {
   "id":2, 
   "title":"Compilers: Second Edition",
   "isbn":"03215479785", 
   "year":2007
  }
 }
]

XML

<book>
 <id>1</id>
 <title>Database Management Systems</title>
 <isbn>0071230572</isbn>
 <year>2002</year>
</book>
<book>
 <id>2</id>
 <title>Compilers: Second Edition</title>
 <isbn>0321547985</isbn>
 <year>2007</year>
</book>

Tab-separated data

The design concept of Silk format is to reduce the redundancy of XML or JSON data format in describing large data set. If node description ends with a bar '|', following lines are split by tabs, and each text component separated by tabs is assigned a corresponding node name. The node name of each tab-separated data can be specified in the preceding child node descriptions:

Silk

# A book node schema with 4 parameters. 
-book(id, title, isbn, year)|
1	Database Management Systems	0071230572	2002
2	Compilers: Second Edition	0321547985	2007

# Tab-separated data region ends when a new node is found
-updated: 2009/02/16

JSON

{
 "book":[
   {
     "id":1, 
     "title":"Database Management Systems", 
     "isbn":"0071230572", 
     "year":2002
   },
   {
     "id":2, 
     "title":"Compilers: Second Edition",
     "isbn":"0321547985", 
     "year":2007
   }
 ],
 "updated":"2009/02/16"
}

XML

<book>
 <id>1</id>
 <title>Database Management Systems</title>
 <isbn>0071230572</isbn>
 <year>2002</year>
</book>
<book>
 <id>2</id>
 <title>Compilers: Second Edition</title>
 <isbn>0321547985</isbn>
 <year>2007</year>
</book>
<updated>2009/02/16</updated>

Multi-line text values

Large text values can be split into multiple lines. To describe muti-line text values, use '>' symbol instead of ':' (colon). The following examples shows a gene sequence of NM_001005277:

Silk

-gene(name:NM_001005277)
 -sequence>
ATGGATGGAGAGAATCACTCAGTGGTATCTGAGTTTTTGTTTCTGGGACTCACTCATTCATGGGAGATCC
AGCTCCTCCTCCTAGTGTTTTCCTCTGTGCTCTATGTGGCAAGCATTACTGGAAACATCCTCATTGTGTT
TTCTGTGACCACTGACCCTCACTTACACTCCCCCATGTACTTTCTACTGGCCAGTCTCTCCTTCATTGAC
TTAGGAGCCTGCTCTGTCACTTCTCCCAAGATGATTTATGACCTGTTCAGAAAGCGCAAAGTCATCTCCT
TTGGAGGCTGCATCGCTCAAATCTTCTTCATCCACGTCGTTGGTGGTGTGGAGATGGTGCTGCTCATAGC
CATGGCCTTTGACAGATATGTGGCCCTATGTAAGCCCCTCCACTATCTGACCATTATGAGCCCAAGAATG
TGCCTTTCATTTCTGGCTGTTGCCTGGACCCTTGGTGTCAGTCACTCCCTGTTCCAACTGGCATTTCTTG
TTAATTTAGCCTTCTGTGGCCCTAATGTGTTGGACAGCTTCTACTGTGACCTTCCTCGGCTTCTCAGACT
AGCCTGTACCGACACCTACAGATTGCAGTTCATGGTCACTGTTAACAGTGGGTTTATCTGTGTGGGTACT
TTCTTCATACTTCTAATCTCCTACGTCTTCATCCTGTTTACTGTTTGGAAACATTCCTCAGGTGGTTCAT
CCAAGGCCCTTTCCACTCTTTCAGCTCACAGCACAGTGGTCCTTTTGTTCTTTGGTCCACCCATGTTTGT
GTATACACGGCCACACCCTAATTCACAGATGGACAAGTTTCTGGCTATTTTTGATGCAGTTCTCACTCCT
TTTCTGAATCCAGTTGTCTATACATTCAGGAATAAGGAGATGAAGGCAGCAATAAAGAGAGTATGCAAAC
AGCTAGTGATTTACAAGAGGATCTCATAA

Multi-line text values in Silk are connected into a single string. Leading white spaces and tail white spaces (including new line chracters \r and \n) of each text line will be trimmed down. For example, you can insert spaces to the head of lines to make the nesting of the data clear:

Silk

-gene(name:NM_001005277)
 -sequence>
  ATGGATGGAGAGAATCACTCAGTGGTATCTGAGTTTTTGTTTCTGGGACTCACTCATTCATGGGAGATCC
  AGCTCCTCCTCCTAGTGTTTTCCTCTGTGCTCTATGTGGCAAGCATTACTGGAAACATCCTCATTGTGTT
  TTCTGTGACCACTGACCCTCACTTACACTCCCCCATGTACTTTCTACTGGCCAGTCTCTCCTTCATTGAC
  TTAGGAGCCTGCTCTGTCACTTCTCCCAAGATGATTTATGACCTGTTCAGAAAGCGCAAAGTCATCTCCT
  TTGGAGGCTGCATCGCTCAAATCTTCTTCATCCACGTCGTTGGTGGTGTGGAGATGGTGCTGCTCATAGC
  CATGGCCTTTGACAGATATGTGGCCCTATGTAAGCCCCTCCACTATCTGACCATTATGAGCCCAAGAATG
  TGCCTTTCATTTCTGGCTGTTGCCTGGACCCTTGGTGTCAGTCACTCCCTGTTCCAACTGGCATTTCTTG
  TTAATTTAGCCTTCTGTGGCCCTAATGTGTTGGACAGCTTCTACTGTGACCTTCCTCGGCTTCTCAGACT
  AGCCTGTACCGACACCTACAGATTGCAGTTCATGGTCACTGTTAACAGTGGGTTTATCTGTGTGGGTACT
  TTCTTCATACTTCTAATCTCCTACGTCTTCATCCTGTTTACTGTTTGGAAACATTCCTCAGGTGGTTCAT
  CCAAGGCCCTTTCCACTCTTTCAGCTCACAGCACAGTGGTCCTTTTGTTCTTTGGTCCACCCATGTTTGT
  GTATACACGGCCACACCCTAATTCACAGATGGACAAGTTTCTGGCTATTTTTGATGCAGTTCTCACTCCT
  TTTCTGAATCCAGTTGTCTATACATTCAGGAATAAGGAGATGAAGGCAGCAATAAAGAGAGTATGCAAAC
  AGCTAGTGATTTACAAGAGGATCTCATAA

The above two silk data has the same semantics with the following JSON data:

JSON

{"gene":
 {"name":"NM_001005277",
  "sequence":"ATGGATGGAGAGAATCACTCAGTGGTATCTGAGTTTTTGTTTCTGGGACTCACTCATTCATGGGAGATCCAGCTCCTCCTCCTAGTGTTTTCCTCTGTGCTCTATGTGGCAAGCATTACTGGAAACATCCTCATTGTGTTTTCTGTGACCACTGACCCTCACTTACACTCCCCCATGTACTTTCTACTGGCCAGTCTCTCCTTCATTGACTTAGGAGCCTGCTCTGTCACTTCTCCCAAGATGATTTATGACCTGTTCAGAAAGCGCAAAGTCATCTCCTTTGGAGGCTGCATCGCTCAAATCTTCTTCATCCACGTCGTTGGTGGTGTGGAGATGGTGCTGCTCATAGCCATGGCCTTTGACAGATATGTGGCCCTATGTAAGCCCCTCCACTATCTGACCATTATGAGCCCAAGAATGTGCCTTTCATTTCTGGCTGTTGCCTGGACCCTTGGTGTCAGTCACTCCCTGTTCCAACTGGCATTTCTTGTTAATTTAGCCTTCTGTGGCCCTAATGTGTTGGACAGCTTCTACTGTGACCTTCCTCGGCTTCTCAGACTAGCCTGTACCGACACCTACAGATTGCAGTTCATGGTCACTGTTAACAGTGGGTTTATCTGTGTGGGTACTTTCTTCATACTTCTAATCTCCTACGTCTTCATCCTGTTTACTGTTTGGAAACATTCCTCAGGTGGTTCATCCAAGGCCCTTTCCACTCTTTCAGCTCACAGCACAGTGGTCCTTTTGTTCTTTGGTCCACCCATGTTTGTGTATACACGGCCACACCCTAATTCACAGATGGACAAGTTTCTGGCTATTTTTGATGCAGTTCTCACTCCTTTTCTGAATCCAGTTGTCTATACATTCAGGAATAAGGAGATGAAGGCAGCAATAAAGAGAGTATGCAAACAGCTAGTGATTTACAAGAGGATCTCATAA"}}

Multi-line text values keeping line breaks

Use >> indicator for keeping line break chars between multi-line text data:

Silk

-message>>
Hello World!
Nice to meet you.

JSON

{"message":"Hello World!\nNice to meet you.\n"}

Escape sequences for the head of a line

When the multi-line data contains a hyphen in the head of a line, escape it by using \- notation, because hyphen(-) is a special character in Silk for describing nodes:

Silk

- sequence>
ATGGATGGAGAGAATCACTCAGTGGTATCTGAGTTTTTGTTTCTGGGACTCACTCATTCAT--
\-AGCTCCTCCTCCTAGTGTTTTCCTCTGTGCTCTATGTGGCAAGCATTACTGGAAACATCCT

JSON

{"sequence":"ATGGATGGAGAGAATCACTCAGTGGTATCTGAGTTTTTGTTTCTGGGACTCACTCATTCAT---AGCTCCTCCTCCTAGTGTTTTCCTCTGTGCTCTATGTGGCAAGCATTACTGGAAACATCC"}

The escape symbol '\' will be removed when parsing the Silk data. To include '\' symbol to the data, use '\\'. These escape sequences '\-' and '\\' are effective only in the head of a line; do not escape hyphen and \ (backslash) characters after the first character of a line.

Import another file

Silk format has a built-in support of import function, which is useful for annotating existing data files (e.g., tab-separated data) with Silk. The following examples loads tab-separated data from the file book.tab, and annotates the loaded tab-separated data with the book schema:

- book(id, title, isbn, year)|
@import(book.tab)

book.tab

1	Database Management Systems	0071230572	2002
2	Compilers: Second Edition	0321547985	2007

The above data using two files are equivalent to the following Silk file:

- book(id, title, isbn, year)|
1	Database Management Systems	0071230572	2002
2	Compilers: Second Edition	0321547985	2007

Binary files also can be imported as a node value:

- photo
 - title: vacation
 - image: @import(myphoto.jpg)	# import myphoto.jpg as a node value (encoded with base64)
 - categories: holiday family

In-line JSON data (Array)

JSON data can be embedded as a text value by specifying data type description [json] after the node name.

Silk

-prime[json]: [2, 3, 5, 7, 11, 13, 17, 19, 23]

JSON

{"prime":[2, 3, 5, 7, 11, 13, 17, 19, 23]}

XML

<prime>2</prime>
<prime>3</prime>
<prime>5</prime>
<prime>7</prime>
<prime>11</prime>
<prime>13</prime>
<prime>17</prime>
<prime>19</prime>
<prime>23</prime>

In-line JSON data (Object)

When you have to describe several parameter values for each node, but appearance of these parameters may varies, you can use in-line json object description.

Silk

-book(id, title, isbn, year, _[json])|
1	Database Management Systems	0071230572	2002	{"star":5, "comment":"good book"}
2	Compilers: Second Edition	0321547985	2007	{"tags":["read later", "textbook"]}

If the node name is '_' (underscore), each component of the in-line json data is treated as a direct child node of the parent node (book node in the above example).

JSON

{
 "book":[
   {
     "id":1, 
     "title":"Database Management Systems", 
     "isbn":"0071230572", 
     "year":2002,
     "star":5,
     "comment":"good book"
   },
   {
     "id":2, 
     "title":"Compilers: Second Edition",
     "isbn":"0321547985", 
     "year":2007,
     "tags":["read later", "textbook"]
   }
 ]
}

You can wrap the in-line json data within a named node:

Silk

-book(id, title, isbn, year, param[json])|
1	Database Management Systems	0071230572	2002	{"star":5, "comment":"good book"}
2	Compilers: Second Edition	0321547985	2007	{"tags":["read later", "textbook"]}

JSON

{
 "book":[
   {
     "id":1, 
     "title":"Database Management Systems", 
     "isbn":"0071230572", 
     "year":2002,
     "param":{"star":5, "comment":"good book"}
   },
   {
     "id":2, 
     "title":"Compilers: Second Edition",
     "isbn":"0321547985", 
     "year":2007,
     "param":{"tags":["read later", "textbook"]}
   }
 ]
}

Comma-separated values (CSV)

Silk (single line, using json array)

-prime[json]: [2, 3, 5, 7, 11, 13, 17, 19, 23]

Silk (multi lines)

-prime*
 2,  3,  5
 7, 11, 13
17, 19, 23

JSON

{"prime":[2, 3, 5, 7, 11, 13, 17, 19, 23]}

XML

<prime>2</prime>
<prime>3</prime>
<prime>5</prime>
<prime>7</prime>
<prime>11</prime>
<prime>13</prime>
<prime>17</prime>
<prime>19</prime>
<prime>23</prime>

CSV structured

Silk

-plot(x, y)*
1,3
4,5
7,8
9,10

Multi-Line Block Data Representation (Here Document)

Instead of tab-spearated format, Silk allows block-style data representations, where each node value is spearated by -- (node separator) and == (entry separator).

Silk

-sequence(seq1, seq2)==
ABCD
EFGHI
--
JKL
MN
==
0000
--
1234
-message:hello

JSON

{ "sequence":
   [
    {"seq1":"ABCDEFGHI", "seq2":"JKLMN"},
    {"seq1":"0000", "seq2":"1234"}
   ],
  "message":"hello" 
}

Gene locus

% silk(version:1.0)

# track name
- track(name:"gene locus")

# specify a coordinate system of the genome
- coordinate(group:utgb, species:human, revison:hg18, name:chr1)
# named locus in the tab-separated data form
 - locus(name, strand, start, end)|
NM_001005277	+	357521	358460
NM_001005224	+	357521	358460
NM_001005221	+	357521	358460
NM_001005277	-	610958	611897
NM_001005224	-	610958	611897
NM_001005221	-	610958	611897

# move to another coordinate, chr2
- coordinate(group:utgb, species:human, revison:hg18, name:chr2)
 - locus(name, strand, start, end)|
NM_001005277	+	357521	358460
NM_001005224	+	357521	358460
NM_001005221	+	357521	358460
NM_001005277	-	610958	611897
NM_001005224	-	610958	611897
NM_001005221	-	610958	611897

Bar chart data

% silk(version:1.0)

- track(name:"Transcript Frequency")

- barchart
 - title:bar chart
 - yMin:0
 - yMax:100
 - xTitle: genome position (bp)
 - yTitle: number of transcripts (log scale)
 - yLogScale: true
# plot y beginning from x=1 (offsetX = 1)
- coordinate(group:utgb, species:human, revison:hg18, name:chr1) 
 - offsetX:1
 - plot*
0,0,0,0,0,0,0,3,5,10
2,0,8,4,0,23,0,0,0,0

# plot (x, y)
- coordinate(group:utgb, species:human, revison:hg18, name:chr1)
 - plot(x, y)|
8  3
9  5
10 10
11 2
12 8
13 4
15 23

More complex example

% silk(version:1.0)
# single comment line

# tree node description. node_name (child_name1[:value], ...)
- track(name:"refseq gene")
 - author: leo	     # author is a child node of the track node

# specify coordinates 
- coordinate(group:utgb, name:chr1, species:human, revision:hg18)
# gene data description with tab-seaprated data format. CDS and exon data use micro-data format 
 - gene(name, strand, start, end, cds(start, end), exon(start, end)*)|
NM_001005277	+	357521	358460	[357521, 358460]	[[357521, 358460]]
NM_001005224	+	357521	358460	[357521, 358460]	[[357521, 358460]]
NM_001005221	+	357521	358460	[357521, 358460]	[[357521, 358460]]
NM_001005277	-	610958	611897	[610958, 611897]	[[610958, 611897]]
NM_001005224	-	610958	611897	[610958, 611897]	[[610958, 611897]]
NM_001005221	-	610958	611897	[610958, 611897]	[[610958, 611897]]
NM_152486	+	850983	869824	[851184, 869396]	[[850983, 851043],[851164, 851256],[855397, 855579]] 

# indentation before tab-separated data can be used for readability
- coordinate(group:utgb, name:chr2, species:human, revision:hg18)
 - gene(name, strand, start, end, cds(start, end), exon(start, end)*)|
   NM_001005277	+	357521	358460	[357521, 358460]	[[357521, 358460]]
   NM_001005278	+	357521	358460	[357521, 358460]	[[357521, 358460]]
  
# flexible structure organization 
- coordinate(group:utgb, species:human, revision:hg18)
 - gene(coordinate.name, name, strand, start, end)| # coordinate names is pulled down from the parent node
chr1	gene1	+	357521	358460
chr2	gene2	+	357521	358460
chr10	gene3	+	357521	358460
chr3	gene4	+	357521	358460
chr1	gene5	+	357521	358460