This document is written by Taro L. Saito.
Silk is a human-friendly and space-efficient text format designed for describing tree-structured data. Silk is a replacement for XML and JSON, the well-known tree-structured data formats. Silk does not use tags or brackets to organize tree structures. Instead, indentation with spaces represents data hierarchies, which is far simpler than neatly opening and closing matching tags (or brackets). Silk is compatible with XML and JSON when each tree node in the XML or JSON data has at most one text value. This type of XML or JSON data can be automatically translated into Silk format. The conversion in the reverse direction, from Silk to XML or JSON, is straightforward. Silk is designed to accommodate stream-style processing, and a Silk file can be translated into XML or JSON streams, so you can use existing XML or JSON processors to analyze data written in Silk format.
Silk text format has the following features:
The data model that Silk can describe is a forest, that is, a list of trees consisting of nodes. Each tree node can have several child nodes and at most one text value.
(needs some illustrations)
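As a rough illustration of this data model, the following Python sketch represents a forest as a list of nodes. The class name `Node` and the helper `count_nodes` are hypothetical names chosen for this example; they are not part of any Silk library.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Node:
    """One tree node: a name, at most one text value, and child nodes."""
    name: str
    value: Optional[str] = None
    children: List["Node"] = field(default_factory=list)


# A forest is simply a list of root nodes.
forest = [
    Node("book", children=[Node("id", "1"),
                           Node("title", "Database Management Systems")]),
    Node("updated", "2009/02/16"),
]


def count_nodes(nodes: List[Node]) -> int:
    """Count every node in a forest, including nested children."""
    return sum(1 + count_nodes(n.children) for n in nodes)
```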
Silk format has no need to wrap data with tags or quotations. This feature is suitable for data logging, which needs to incrementally append data to the end of a file. Silk is also useful for accumulating large program outputs. One of the design goals of Silk is to provide a compact representation of biological data. If you do not like verbose data descriptions of XML or JSON formats, Silk will match your needs.
In the UTGB Toolkit, we use Silk as a standard format for describing biological data, several configuration files, etc. With several years of experience processing XML and JSON under our belt, we found that these syntaxes can be simplified by removing unnecessary notations, such as tags or brackets. Silk is the result of these syntax optimizations. For example, in most cases the double quotation marks that indicate string data are unnecessary. This plain-style text data description increases the editability and readability of the Silk file format. In addition, Silk's embedded tab-separated data description significantly reduces the data size.
Silk can be used to enhance existing tab-separated data or comma-separated value (CSV) files with node labels and structures. These flat files can be imported into a Silk file, and you can annotate each data item with node labels and organize the items in a hierarchical data structure.
Silk is not a markup language such as HTML, so it is not suited to representing text decorations. For example, the following text, which mixes text values and tags, cannot be described with Silk:
<p>This paragraph contains <b>bold</b> and <i>italic</i> fonts.
This is because Silk's data model allows only one text value for each tree node. However, this limitation does not mean Silk cannot describe HTML data. If necessary, you can embed HTML data as a text value. Here is an example:
-p: This paragraph contains <b>bold</b> and <i>italic</i> fonts.
Or you can use double quotation marks to embed arbitrary text:
-p:"This paragraph contains <b>bold</b> and <i>italic</i> fonts."
(To be written)
A preamble line beginning with the '%' symbol specifies the version or encoding of a Silk text file. The preamble must be placed on the first line of the file so that a Silk processor can adjust its behaviour according to the specified version or character encoding. The preamble can be omitted, in which case Silk uses utf-8 as the default encoding.
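A minimal Python sketch of reading such a first line is given here, assuming a preamble of the form `% silk(version:1.0)`. The function name `parse_preamble`, the regular expression, and the key name `encoding` are assumptions made for illustration; this text does not define the exact preamble grammar.

```python
import re

# Hypothetical pattern: '%', a format name, then key:value pairs in parentheses.
_PREAMBLE = re.compile(r"^%\s*(\w+)\s*\((.*)\)\s*$")


def parse_preamble(first_line: str) -> dict:
    """Return preamble settings, or only the defaults if the line is not a preamble."""
    settings = {"encoding": "utf-8"}  # default encoding stated in the text
    match = _PREAMBLE.match(first_line.strip())
    if match is None:
        return settings
    for pair in match.group(2).split(","):
        key, sep, value = pair.partition(":")
        if sep:
            settings[key.strip()] = value.strip()
    return settings
```

A processor built along these lines would call `parse_preamble` on the first line only and treat every other line as data.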
In Silk, a tree node begins with a hyphen '-' followed by a node name. The text value of the node follows a colon ':'. If the colon and text value are not present, the node value of the tree node will be set to null:
Silk
- title: hello world
White spaces around text values will be ignored:
JSON
{ "title":"hello world" }
And also, white spaces around node names will be ignored:
Silk
- first name : Andy
JSON
{ "first name":"Andy" }
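The trimming rules above can be sketched in Python as follows. The function name `parse_node_line` is hypothetical, and the sketch deliberately handles only the simple `- name: value` form; it ignores quoted values and the other node notations of Silk.

```python
from typing import Optional, Tuple


def parse_node_line(line: str) -> Tuple[str, Optional[str]]:
    """Split a '- name: value' line into (name, value)."""
    stripped = line.strip()
    if not stripped.startswith("-"):
        raise ValueError("a node line must start with '-'")
    # Split at the first colon only, so values may contain colons themselves.
    name, sep, value = stripped[1:].partition(":")
    # No colon means there is no text value, which maps to null (None here).
    return name.strip(), (value.strip() if sep else None)
```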
Silk
- book(id:1, title: Database Management Systems, isbn:0071230572, year:2002)
JSON
{
"book":
{
"id":1,
"title":"Database Management Systems",
"isbn":"0071230572",
"year":2002
}
}
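The in-line form above, where child nodes are listed in parentheses after the node name, could be read with a sketch like the following. The name `parse_inline_children` is hypothetical, all values are kept as strings (this text does not state how numbers are recognized), and nested parentheses or commas inside values are not handled.

```python
def parse_inline_children(node_line: str) -> dict:
    """Parse '- book(id:1, title: X)' into {'book': {'id': '1', 'title': 'X'}}."""
    body = node_line.strip().lstrip("-").strip()
    name, _, rest = body.partition("(")
    inner = rest.rsplit(")", 1)[0]
    children = {}
    for item in inner.split(","):
        if not item.strip():
            continue  # tolerate empty parentheses
        key, sep, value = item.partition(":")
        children[key.strip()] = value.strip() if sep else None
    return {name.strip(): children}
```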
Indentation before hyphen ('-') represents tree node depth. Only space characters (' ') are allowed before the indentation hyphen ('-'). Tab character ('\t') cannot be used for indentations.
Silk
- book
  - id: 1
  - title: Database Management Systems
  - isbn:0071230572
  - year:2002
- book
  - id: 2
  - title: Compilers: Second Edition
  - isbn:0321547985
  - year:2007
Alternatively, you can write the same data as follows:
- book(id:1, title: Database Management Systems, isbn:0071230572, year:2002)
- book(id:2, title: Compilers: Second Edition, isbn:0321547985, year:2007)
JSON
[
{"book":
{
"id":1,
"title":"Database Management Systems",
"isbn":"0071230572",
"year":2002
}
},
{"book":
{
"id":2,
"title":"Compilers: Second Edition",
"isbn":"0321547985",
"year":2007
}
}
]
XML
<book>
<id>1</id>
<title>Database Management Systems</title>
<isbn>0071230572</isbn>
<year>2002</year>
</book>
<book>
<id>2</id>
<title>Compilers: Second Edition</title>
<isbn>0321547985</isbn>
<year>2007</year>
</book>
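To illustrate how indentation depth determines the tree, here is a small Python sketch that turns indented node lines into nested lists. The function name `parse_indented` and the `[name, value, children]` representation are choices made for this example; lines that do not start with spaces followed by a hyphen (including tab-indented lines) are simply skipped.

```python
def parse_indented(text: str) -> list:
    """Build a forest of [name, value, children] lists from indented node lines."""
    forest = []
    stack = []  # (indent, children list) of the currently open ancestors
    for line in text.splitlines():
        stripped = line.lstrip(" ")
        if not stripped.startswith("-"):
            continue  # this sketch skips comments, blank lines, and data lines
        indent = len(line) - len(stripped)
        name, sep, value = stripped[1:].partition(":")
        node = [name.strip(), value.strip() if sep else None, []]
        # Close every open node that is at the same depth or deeper.
        while stack and stack[-1][0] >= indent:
            stack.pop()
        (stack[-1][1] if stack else forest).append(node)
        stack.append((indent, node[2]))
    return forest
```

A stack of open ancestors is enough here because a node's parent is always the nearest preceding node with a smaller indentation.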
The design concept of the Silk format is to reduce the redundancy of the XML and JSON formats when describing large data sets. If a node description ends with a bar '|', the following lines are split by tabs, and each tab-separated text component is assigned the corresponding node name. The node names for the tab-separated data are specified in the preceding child node descriptions:
Silk
# A book node schema with 4 parameters.
-book(id, title, isbn, year)|
1	Database Management Systems	0071230572	2002
2	Compilers: Second Edition	0321547985	2007
# Tab-separated data region ends when a new node is found
-updated: 2009/02/16
JSON
{
"book":[
{
"id":1,
"title":"Database Management Systems",
"isbn":"0071230572",
"year":2002
},
{
"id":2,
"title":"Compilers: Second Edition",
"isbn":"0321547985",
"year":2007
}
],
"updated":"2009/02/16"
}
XML
<book>
<id>1</id>
<title>Database Management Systems</title>
<isbn>0071230572</isbn>
<year>2002</year>
</book>
<book>
<id>2</id>
<title>Compilers: Second Edition</title>
<isbn>0321547985</isbn>
<year>2007</year>
</book>
<updated>2009/02/16</updated>
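The pairing of schema names with tab-separated fields can be sketched in Python as follows. The function name `parse_tab_rows` is hypothetical, and skipping comment lines inside the data region is an assumption of this sketch rather than a documented rule.

```python
def parse_tab_rows(schema: list, lines: list) -> list:
    """Pair each tab-separated field with the node name at the same position."""
    rows = []
    for line in lines:
        head = line.lstrip(" ")
        if head.startswith("#"):
            continue  # assumption: comment lines are ignored
        if head.startswith("-"):
            break  # the data region ends when a new node is found
        fields = line.strip(" \r\n").split("\t")
        rows.append(dict(zip(schema, fields)))
    return rows
```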
Large text values can be split into multiple lines. To describe multi-line text values, use the '>' symbol instead of ':' (colon). The following example shows the gene sequence of NM_001005277:
Silk
-gene(name:NM_001005277)
  -sequence>
ATGGATGGAGAGAATCACTCAGTGGTATCTGAGTTTTTGTTTCTGGGACTCACTCATTCATGGGAGATCC
AGCTCCTCCTCCTAGTGTTTTCCTCTGTGCTCTATGTGGCAAGCATTACTGGAAACATCCTCATTGTGTT
TTCTGTGACCACTGACCCTCACTTACACTCCCCCATGTACTTTCTACTGGCCAGTCTCTCCTTCATTGAC
TTAGGAGCCTGCTCTGTCACTTCTCCCAAGATGATTTATGACCTGTTCAGAAAGCGCAAAGTCATCTCCT
TTGGAGGCTGCATCGCTCAAATCTTCTTCATCCACGTCGTTGGTGGTGTGGAGATGGTGCTGCTCATAGC
CATGGCCTTTGACAGATATGTGGCCCTATGTAAGCCCCTCCACTATCTGACCATTATGAGCCCAAGAATG
TGCCTTTCATTTCTGGCTGTTGCCTGGACCCTTGGTGTCAGTCACTCCCTGTTCCAACTGGCATTTCTTG
TTAATTTAGCCTTCTGTGGCCCTAATGTGTTGGACAGCTTCTACTGTGACCTTCCTCGGCTTCTCAGACT
AGCCTGTACCGACACCTACAGATTGCAGTTCATGGTCACTGTTAACAGTGGGTTTATCTGTGTGGGTACT
TTCTTCATACTTCTAATCTCCTACGTCTTCATCCTGTTTACTGTTTGGAAACATTCCTCAGGTGGTTCAT
CCAAGGCCCTTTCCACTCTTTCAGCTCACAGCACAGTGGTCCTTTTGTTCTTTGGTCCACCCATGTTTGT
GTATACACGGCCACACCCTAATTCACAGATGGACAAGTTTCTGGCTATTTTTGATGCAGTTCTCACTCCT
TTTCTGAATCCAGTTGTCTATACATTCAGGAATAAGGAGATGAAGGCAGCAATAAAGAGAGTATGCAAAC
AGCTAGTGATTTACAAGAGGATCTCATAA
Multi-line text values in Silk are concatenated into a single string. Leading and trailing white spaces (including the new line characters \r and \n) of each text line are trimmed. For example, you can insert spaces at the head of lines to make the nesting of the data clear:
Silk
-gene(name:NM_001005277)
  -sequence>
    ATGGATGGAGAGAATCACTCAGTGGTATCTGAGTTTTTGTTTCTGGGACTCACTCATTCATGGGAGATCC
    AGCTCCTCCTCCTAGTGTTTTCCTCTGTGCTCTATGTGGCAAGCATTACTGGAAACATCCTCATTGTGTT
    TTCTGTGACCACTGACCCTCACTTACACTCCCCCATGTACTTTCTACTGGCCAGTCTCTCCTTCATTGAC
    TTAGGAGCCTGCTCTGTCACTTCTCCCAAGATGATTTATGACCTGTTCAGAAAGCGCAAAGTCATCTCCT
    TTGGAGGCTGCATCGCTCAAATCTTCTTCATCCACGTCGTTGGTGGTGTGGAGATGGTGCTGCTCATAGC
    CATGGCCTTTGACAGATATGTGGCCCTATGTAAGCCCCTCCACTATCTGACCATTATGAGCCCAAGAATG
    TGCCTTTCATTTCTGGCTGTTGCCTGGACCCTTGGTGTCAGTCACTCCCTGTTCCAACTGGCATTTCTTG
    TTAATTTAGCCTTCTGTGGCCCTAATGTGTTGGACAGCTTCTACTGTGACCTTCCTCGGCTTCTCAGACT
    AGCCTGTACCGACACCTACAGATTGCAGTTCATGGTCACTGTTAACAGTGGGTTTATCTGTGTGGGTACT
    TTCTTCATACTTCTAATCTCCTACGTCTTCATCCTGTTTACTGTTTGGAAACATTCCTCAGGTGGTTCAT
    CCAAGGCCCTTTCCACTCTTTCAGCTCACAGCACAGTGGTCCTTTTGTTCTTTGGTCCACCCATGTTTGT
    GTATACACGGCCACACCCTAATTCACAGATGGACAAGTTTCTGGCTATTTTTGATGCAGTTCTCACTCCT
    TTTCTGAATCCAGTTGTCTATACATTCAGGAATAAGGAGATGAAGGCAGCAATAAAGAGAGTATGCAAAC
    AGCTAGTGATTTACAAGAGGATCTCATAA
The above two Silk descriptions have the same semantics as the following JSON data:
JSON
{"gene":
{"name":"NM_001005277",
"sequence":"ATGGATGGAGAGAATCACTCAGTGGTATCTGAGTTTTTGTTTCTGGGACTCACTCATTCATGGGAGATCCAGCTCCTCCTCCTAGTGTTTTCCTCTGTGCTCTATGTGGCAAGCATTACTGGAAACATCCTCATTGTGTTTTCTGTGACCACTGACCCTCACTTACACTCCCCCATGTACTTTCTACTGGCCAGTCTCTCCTTCATTGACTTAGGAGCCTGCTCTGTCACTTCTCCCAAGATGATTTATGACCTGTTCAGAAAGCGCAAAGTCATCTCCTTTGGAGGCTGCATCGCTCAAATCTTCTTCATCCACGTCGTTGGTGGTGTGGAGATGGTGCTGCTCATAGCCATGGCCTTTGACAGATATGTGGCCCTATGTAAGCCCCTCCACTATCTGACCATTATGAGCCCAAGAATGTGCCTTTCATTTCTGGCTGTTGCCTGGACCCTTGGTGTCAGTCACTCCCTGTTCCAACTGGCATTTCTTGTTAATTTAGCCTTCTGTGGCCCTAATGTGTTGGACAGCTTCTACTGTGACCTTCCTCGGCTTCTCAGACTAGCCTGTACCGACACCTACAGATTGCAGTTCATGGTCACTGTTAACAGTGGGTTTATCTGTGTGGGTACTTTCTTCATACTTCTAATCTCCTACGTCTTCATCCTGTTTACTGTTTGGAAACATTCCTCAGGTGGTTCATCCAAGGCCCTTTCCACTCTTTCAGCTCACAGCACAGTGGTCCTTTTGTTCTTTGGTCCACCCATGTTTGTGTATACACGGCCACACCCTAATTCACAGATGGACAAGTTTCTGGCTATTTTTGATGCAGTTCTCACTCCTTTTCTGAATCCAGTTGTCTATACATTCAGGAATAAGGAGATGAAGGCAGCAATAAAGAGAGTATGCAAACAGCTAGTGATTTACAAGAGGATCTCATAA"}}
Use the >> indicator to keep line break characters between multi-line text data:
Silk
-message>>
Hello World!
Nice to meet you.
JSON
{"message":"Hello World!\nNice to meet you.\n"}
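The difference between the '>' and '>>' indicators can be sketched in Python as follows. The function name `join_multiline` and its `keep_line_breaks` flag are hypothetical. Normalizing every line break to "\n" and leaving leading spaces untouched in the '>>' case are assumptions of this sketch.

```python
def join_multiline(lines: list, keep_line_breaks: bool = False) -> str:
    """Join text lines as for '>' (default) or '>>' (keep_line_breaks=True)."""
    if keep_line_breaks:
        # '>>' keeps a line break after every line.
        return "".join(line.rstrip("\r\n") + "\n" for line in lines)
    # '>' trims white space around each line and concatenates the pieces.
    return "".join(line.strip() for line in lines)
```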
When the multi-line data contains a hyphen at the head of a line, escape it with the \- notation, because the hyphen (-) is a special character in Silk for describing nodes:
Silk
- sequence>
ATGGATGGAGAGAATCACTCAGTGGTATCTGAGTTTTTGTTTCTGGGACTCACTCATTCAT--
\-AGCTCCTCCTCCTAGTGTTTTCCTCTGTGCTCTATGTGGCAAGCATTACTGGAAACATCCT
JSON
{"sequence":"ATGGATGGAGAGAATCACTCAGTGGTATCTGAGTTTTTGTTTCTGGGACTCACTCATTCAT---AGCTCCTCCTCCTAGTGTTTTCCTCTGTGCTCTATGTGGCAAGCATTACTGGAAACATCCT"}
The escape symbol '\' is removed when parsing the Silk data. To include a '\' symbol in the data, use '\\'. The escape sequences '\-' and '\\' are effective only at the head of a line; do not escape hyphen and \ (backslash) characters after the first character of a line.
Silk format has built-in support for an import function, which is useful for annotating existing data files (e.g., tab-separated data) with Silk. The following example loads tab-separated data from the file book.tab and annotates it with the book schema:
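A Python sketch of this head-of-line unescaping rule follows; the function name `unescape_head` is hypothetical.

```python
def unescape_head(line: str) -> str:
    """Remove a leading escape backslash; only the head of the line is special."""
    if line.startswith("\\-") or line.startswith("\\\\"):
        return line[1:]
    return line
```

Backslashes and hyphens later in the line are returned unchanged, which matches the rule that escapes are effective only at the head of a line.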
- book(id, title, isbn, year)|
@import(book.tab)
book.tab
1	Database Management Systems	0071230572	2002
2	Compilers: Second Edition	0321547985	2007
The two files above are equivalent to the following Silk file:
- book(id, title, isbn, year)|
1	Database Management Systems	0071230572	2002
2	Compilers: Second Edition	0321547985	2007
Binary files can also be imported as a node value:
- photo
  - title: vacation
  - image: @import(myphoto.jpg) # import myphoto.jpg as a node value (encoded with base64)
  - categories: holiday family
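The base64 encoding mentioned in the comment above can be illustrated with Python's standard `base64` module. The function name `import_binary_as_value` is hypothetical, and how a Silk processor actually reads the file is not specified here; the sketch only shows the encoding step on raw bytes.

```python
import base64


def import_binary_as_value(data: bytes) -> str:
    """Encode raw file bytes as base64 text suitable for a node value."""
    return base64.b64encode(data).decode("ascii")
```

For a real file, the bytes would typically come from `open(path, "rb").read()`.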
JSON data can be embedded as a text value by specifying data type description [json] after the node name.
Silk
-prime[json]: [2, 3, 5, 7, 11, 13, 17, 19, 23]
JSON
{"prime":[2, 3, 5, 7, 11, 13, 17, 19, 23]}
XML
<prime>2</prime>
<prime>3</prime>
<prime>5</prime>
<prime>7</prime>
<prime>11</prime>
<prime>13</prime>
<prime>17</prime>
<prime>19</prime>
<prime>23</prime>
When you have to describe several parameter values for each node, but the set of parameters varies from node to node, you can use an in-line JSON object description.
Silk
-book(id, title, isbn, year, _[json])|
1 Database Management Systems 0071230572 2002 {"star":5, "comment":"good book"}
2 Compilers: Second Edition 0321547985 2007 {"tags":["read later", "textbook"]}
If the node name is '_' (underscore), each component of the in-line json data is treated as a direct child node of the parent node (book node in the above example).
JSON
{
"book":[
{
"id":1,
"title":"Database Management Systems",
"isbn":"0071230572",
"year":2002,
"star":5,
"comment":"good book"
},
{
"id":2,
"title":"Compilers: Second Edition",
"isbn":"0321547985",
"year":2007,
"tags":["read later", "textbook"]
}
]
}
You can wrap the in-line json data within a named node:
Silk
-book(id, title, isbn, year, param[json])|
1 Database Management Systems 0071230572 2002 {"star":5, "comment":"good book"}
2 Compilers: Second Edition 0321547985 2007 {"tags":["read later", "textbook"]}
JSON
{
"book":[
{
"id":1,
"title":"Database Management Systems",
"isbn":"0071230572",
"year":2002,
"param":{"star":5, "comment":"good book"}
},
{
"id":2,
"title":"Compilers: Second Edition",
"isbn":"0321547985",
"year":2007,
"param":{"tags":["read later", "textbook"]}
}
]
}
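The two behaviours above (flattening under '_' versus wrapping in a named node) can be sketched in Python with the standard `json` module. The function name `attach_inline_json` is hypothetical, and the sketch assumes that a '_' column always holds a JSON object.

```python
import json


def attach_inline_json(node: dict, column_name: str, json_text: str) -> dict:
    """Attach in-line JSON to a copy of the node, flattening it when the name is '_'."""
    parsed = json.loads(json_text)
    result = dict(node)
    if column_name == "_":
        result.update(parsed)  # members become direct children of the node
    else:
        result[column_name] = parsed  # wrapped in a named child node
    return result
```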
Silk (single line, using json array)
-prime[json]: [2, 3, 5, 7, 11, 13, 17, 19, 23]
Silk (multi lines)
-prime*
2, 3, 5
7, 11, 13
17, 19, 23
JSON
{"prime":[2, 3, 5, 7, 11, 13, 17, 19, 23]}
XML
<prime>2</prime>
<prime>3</prime>
<prime>5</prime>
<prime>7</prime>
<prime>11</prime>
<prime>13</prime>
<prime>17</prime>
<prime>19</prime>
<prime>23</prime>
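Judging only from the example above, the multi-line form collects comma-separated values from several lines into one list. A Python sketch of that reading follows; the function name `parse_value_lines` is hypothetical and values are kept as strings.

```python
def parse_value_lines(lines: list) -> list:
    """Collect comma-separated values from several lines into one flat list."""
    values = []
    for line in lines:
        values.extend(item.strip() for item in line.split(",") if item.strip())
    return values
```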
Instead of the tab-separated format, Silk also allows block-style data representations, where node values are separated by -- (node separator) and entries by == (entry separator).
Silk
-sequence(seq1, seq2)==
ABCD
EFGHI
--
JKL
MN
==
0000
--
1234
-message:hello
JSON
{ "sequence":
[
{"seq1":"ABCDEFGHI", "seq2":"JKLMN"},
{"seq1":"0000", "seq2":"1234"}
],
"message":"hello"
}
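The block-style layout can be sketched in Python as follows. The function name `parse_blocks` is hypothetical; the caller is assumed to pass only the lines between the node header and the next node, and lines within one value are concatenated after trimming, as in the JSON above.

```python
def parse_blocks(schema: list, lines: list) -> list:
    """Split block-style lines into entries ('==') and node values ('--')."""
    entries, values, current = [], [], []
    for line in lines + ["=="]:  # a sentinel closes the last entry
        text = line.strip()
        if text in ("--", "=="):
            values.append("".join(current))
            current = []
            if text == "==":
                entries.append(dict(zip(schema, values)))
                values = []
        else:
            current.append(text)
    return entries
```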
% silk(version:1.0)
# track name
- track(name:"gene locus")
# specify a coordinate system of the genome
- coordinate(group:utgb, species:human, revision:hg18, name:chr1)
# named locus in the tab-separated data form
- locus(name, strand, start, end)|
NM_001005277 + 357521 358460
NM_001005224 + 357521 358460
NM_001005221 + 357521 358460
NM_001005277 - 610958 611897
NM_001005224 - 610958 611897
NM_001005221 - 610958 611897
# move to another coordinate, chr2
- coordinate(group:utgb, species:human, revision:hg18, name:chr2)
- locus(name, strand, start, end)|
NM_001005277 + 357521 358460
NM_001005224 + 357521 358460
NM_001005221 + 357521 358460
NM_001005277 - 610958 611897
NM_001005224 - 610958 611897
NM_001005221 - 610958 611897
% silk(version:1.0)
- track(name:"Transcript Frequency")
- barchart
- title:bar chart
- yMin:0
- yMax:100
- xTitle: genome position (bp)
- yTitle: number of transcripts (log scale)
- yLogScale: true
# plot y beginning from x=1 (offsetX = 1)
- coordinate(group:utgb, species:human, revision:hg18, name:chr1)
- offsetX:1
- plot*
0,0,0,0,0,0,0,3,5,10
2,0,8,4,0,23,0,0,0,0
# plot (x, y)
- coordinate(group:utgb, species:human, revision:hg18, name:chr1)
- plot(x, y)|
8 3
9 5
10 10
11 2
12 8
13 4
15 23
% silk(version:1.0)
# single comment line
# tree node description. node_name (child_name1[:value], ...)
- track(name:"refseq gene")
  - author: leo # author is a child node of the track node
# specify coordinates
- coordinate(group:utgb, name:chr1, species:human, revision:hg18)
# gene data description with tab-separated data format. CDS and exon data use micro-data format
- gene(name, strand, start, end, cds(start, end), exon(start, end)*)|
NM_001005277 + 357521 358460 [357521, 358460] [[357521, 358460]]
NM_001005224 + 357521 358460 [357521, 358460] [[357521, 358460]]
NM_001005221 + 357521 358460 [357521, 358460] [[357521, 358460]]
NM_001005277 - 610958 611897 [610958, 611897] [[610958, 611897]]
NM_001005224 - 610958 611897 [610958, 611897] [[610958, 611897]]
NM_001005221 - 610958 611897 [610958, 611897] [[610958, 611897]]
NM_152486 + 850983 869824 [851184, 869396] [[850983, 851043],[851164, 851256],[855397, 855579]]
# indentation before tab-separated data can be used for readability
- coordinate(group:utgb, name:chr2, species:human, revision:hg18)
- gene(name, strand, start, end, cds(start, end), exon(start, end)*)|
NM_001005277 + 357521 358460 [357521, 358460] [[357521, 358460]]
NM_001005278 + 357521 358460 [357521, 358460] [[357521, 358460]]
# flexible structure organization
- coordinate(group:utgb, species:human, revision:hg18)
  - gene(coordinate.name, name, strand, start, end)| # the coordinate name is pulled down from the parent node
chr1 gene1 + 357521 358460
chr2 gene2 + 357521 358460
chr10 gene3 + 357521 358460
chr3 gene4 + 357521 358460
chr1 gene5 + 357521 358460