Data Definition Language

Introduction

This document is the manual for version 1 of the Data Definition Language. Programs of this language are descriptions of data as text for the purpose of storing and exchanging that data between entities (humans and machines alike). For the purpose of describing such data, the language offers built-in data types including scalar values (boolean, number, string, void) as well as aggregate values (maps and lists).

Remark: The language is inspired by JSON (see ECMA-404: The JSON data interchange syntax, 2nd edition, December 2017, for more information). It is designed as neither a superset nor a subset of JSON. However, a conversion between the two is possible without loss of data.

Grammars

This section describes context-free grammars used in this specification to define the lexical and syntactical structure of a program.

Context-free Grammars

A context-free grammar consists of a number of productions. Each production has an abstract symbol called a nonterminal as its left-hand side, and a sequence of one or more nonterminal and terminal symbols as its right-hand side. For each grammar, the terminal symbols are drawn from a specified alphabet.

Starting from a sentence consisting of a single distinguished nonterminal, called the goal symbol, a given context-free grammar specifies a language, namely, the set of possible sequences of terminal symbols that can result from repeatedly replacing any nonterminal in the sequence with a right-hand side of a production for which the nonterminal is the left-hand side.
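
Example:

The replacement process described above can be sketched in Python. This is an illustrative sketch only and not part of the specification: the function name language, the dictionary representation of productions, and the max_steps bound are assumptions made for this example. The sketch always replaces the leftmost nonterminal, which is sufficient to reach every sentence of the language.

```python
from collections import deque

# Hypothetical toy grammar: nonterminal -> list of right-hand sides.
SIGN_GRAMMAR = {
    "sign": [["plus_sign"], ["minus_sign"]],
    "plus_sign": [["+"]],
    "minus_sign": [["-"]],
}


def language(grammar, goal, max_steps=10):
    """Collect the terminal-only sentences derivable from the goal symbol."""
    results = set()
    start = (goal,)
    seen = {start}
    queue = deque([(start, 0)])
    while queue:
        form, steps = queue.popleft()
        # Find the leftmost nonterminal in the current sequence, if any.
        index = next((i for i, s in enumerate(form) if s in grammar), None)
        if index is None:
            results.add("".join(form))  # only terminals are left
            continue
        if steps == max_steps:
            continue  # give up on this derivation (bound is an assumption)
        for rhs in grammar[form[index]]:
            new_form = form[:index] + tuple(rhs) + form[index + 1:]
            if new_form not in seen:
                seen.add(new_form)
                queue.append((new_form, steps + 1))
    return results


print(sorted(language(SIGN_GRAMMAR, "sign")))
```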

Lexical Grammar

A lexical grammar for the Data Definition Language has as its terminal symbols the characters of the Unicode character set. It defines a set of productions, starting from the goal symbol `word`, that describe how sequences of Unicode characters are translated into a sequence of words. Only UTF-8 sequences of length 1 are supported in version 1 of this language.

Syntactical Grammar

A syntactical grammar for the Data Definition Language has as its terminal symbols the words defined by the lexical grammar. It defines a set of productions, starting from the goal symbol `sentence`, that describe how sequences of words are translated into a sentence.

Grammar Notation

Productions are written in a fixed-width font.

A production is defined by its left-hand side, followed by a colon (:), followed by its right-hand side. The left-hand side is the name of the non-terminal defined by the production.

Multiple alternative productions may be defined for the same non-terminal.

The right-hand side of a production consists of any sequence of terminals and non-terminals.

In certain cases the right-hand side is replaced by a comment describing the right-hand side. This comment is opened by /* and closed by */.

Example:

digit : /* A single Unicode character from the code point range U+0030 to U+0039. */

A terminal is a sequence of Unicode symbols. A Unicode symbol is denoted by a number sign (#) followed by a hexadecimal number denoting its code point.

Example:

The following productions denote the non-terminal for a sign as used in the definitions of numerals:

/* #2b is also known as "PLUS SIGN" */
plus_sign : #2b
/* #2d is also known as "MINUS SIGN" */
minus_sign : #2d
sign : plus_sign
sign : minus_sign
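
The code point notation can be checked mechanically. The following Python sketch is illustrative only; the function name decode_terminal is an assumption made for this example and is not defined by this specification. It turns a terminal written in the # notation into the characters it denotes.

```python
def decode_terminal(notation):
    """Turn a terminal such as '#74 #72 #75 #65' into the characters it denotes."""
    characters = []
    for token in notation.split():
        if len(token) < 2 or not token.startswith("#"):
            raise ValueError(f"not a code point token: {token!r}")
        # The digits after '#' are the hexadecimal code point.
        characters.append(chr(int(token[1:], 16)))
    return "".join(characters)


print(decode_terminal("#2b"))  # the terminal of plus_sign
print(decode_terminal("#2d"))  # the terminal of minus_sign
```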

The syntax {x} on the right-hand side of a production denotes zero or more occurrences of x.

Example:

The following production defines a possibly empty sequence of digits:

zero-or-more-digits : {digit}

The syntax [x] on the right-hand side of a production denotes zero or one occurrences of x.

Example:

The following production denotes a possible definition of an integer numeral. It consists of an optional sign followed by a digit and zero or more further digits (with sign and zero-or-more-digits as defined in the preceding examples):

integer : [sign] digit zero-or-more-digits

The empty string is denoted by ε.

Example:

The following productions denote a possibly empty list of integers (with integer as defined in the preceding example). Note that this list may include a trailing comma.

integer-list : integer integer-list-rest
integer-list : ε

integer-list-rest : comma integer integer-list-rest
integer-list-rest : comma
integer-list-rest : ε

/* #2c is also known as "COMMA" */
comma : #2c
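
The integer-list productions above can be read as a small recognizer. The following Python sketch is illustrative only; the function names match_integer and is_integer_list are assumptions made for this example. Like the example grammar, it does not permit whitespace between the elements.

```python
def match_integer(text, pos):
    """Return the position after one integer starting at pos, or None."""
    if pos < len(text) and text[pos] in "+-":  # [sign]
        pos += 1
    start = pos
    while pos < len(text) and text[pos] in "0123456789":  # digit {digit}
        pos += 1
    return pos if pos > start else None


def is_integer_list(text):
    """Check whether text can be derived from integer-list."""
    pos = match_integer(text, 0)
    if pos is None:
        return text == ""  # integer-list : ε
    while pos < len(text):
        if text[pos] != ",":
            return False
        pos += 1
        if pos == len(text):
            return True  # integer-list-rest : comma (trailing comma)
        pos = match_integer(text, pos)
        if pos is None:
            return False
    return True  # integer-list-rest : ε


print(is_integer_list("1,2,3,"))  # True
print(is_integer_list("1,,2"))    # False
```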

Lexical Structure

The lexical grammar describes the translation of Unicode characters into words. The single goal symbol of the lexical grammar is the word symbol.

goal symbol

The goal symbol word is defined by

word : delimiters
word : boolean
word : number
word : string
word : void
word : name
word : left_curly_bracket
word : right_curly_bracket
word : left_square_bracket
word : right_square_bracket
word : comma
word : colon
/* whitespace, newline, and comment are not considered by the syntactical grammar */
word : whitespace
word : newline
word : comment

whitespaces

The word whitespace is defined by

/* #9 is also known as "CHARACTER TABULATION" */
whitespace : #9
/* #20 is also known as "SPACE" */
whitespace : #20

line terminators

The word line_terminator is defined by

/* #a is also known as "LINE FEED (LF)" */
/* #d is also known as "CARRIAGE RETURN (CR)" */
line_terminator : #a {#d}
line_terminator : #d {#a}
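
The two line_terminator productions can be expressed as a regular expression. The following Python sketch is illustrative only; the names LINE_TERMINATOR and line_terminators are assumptions made for this example. It reads the productions literally, so {#d} and {#a} match zero or more characters, which means that a carriage return followed by several line feeds is recognized as a single line terminator.

```python
import re

# line_terminator : #a {#d}   and   line_terminator : #d {#a}
LINE_TERMINATOR = re.compile(r"\n\r*|\r\n*")


def line_terminators(text):
    """Return every line terminator found in text, in order."""
    return LINE_TERMINATOR.findall(text)


print(len(line_terminators("a\r\nb\n\nc")))  # 3
```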

comments

The language supports both single-line comments and multi-line comments. A comment is either a single_line_comment or a multi_line_comment and hence is defined by

comment : single_line_comment
comment : multi_line_comment

A single_line_comment starts with two solidus characters. It extends to the end of the line. Hence it is defined by

/* #2f is also known as SOLIDUS */
single_line_comment :
#2f #2f
/* any sequence of characters except for line_terminator */

The line_terminator is not considered as part of the comment text.

A multi_line_comment is opened by a solidus and an asterisk and closed by an asterisk and a solidus. Hence it is defined by

/* #2f is also known as SOLIDUS */
/* #2a is also known as ASTERISK */
multi_line_comment :
#2f #2a
/* any sequence of characters except for #2a #2f */
#2a #2f

The #2f #2a and #2a #2f sequences are not considered as part of the comment text.

This implies

  • // has no special meaning in either kind of comment.
  • /* and */ have no special meaning in single-line comments.
  • Multi-line comments do not nest.
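
The rules above can be sketched as a small scanner. The following Python sketch is illustrative only; the function name scan_comment is an assumption made for this example, and raising an error for an unterminated multi-line comment is also an assumption, as this specification does not state how that case is handled.

```python
def scan_comment(text, pos=0):
    """Return (comment_text, end_position) for a comment at pos, or None."""
    if text.startswith("//", pos):
        end = pos + 2
        # A single-line comment extends up to, but not including,
        # the line terminator.
        while end < len(text) and text[end] not in "\n\r":
            end += 1
        return text[pos + 2:end], end
    if text.startswith("/*", pos):
        # The first closing sequence ends the comment: no nesting.
        close = text.find("*/", pos + 2)
        if close == -1:
            raise ValueError("unterminated multi-line comment")  # assumption
        return text[pos + 2:close], close + 2
    return None


# The inner /* does not open a nested comment.
print(scan_comment("/* a /* b */ c */"))
```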

parentheses

The words left_parenthesis and right_parenthesis, respectively, are defined by

/* #28 is also known as "LEFT PARENTHESIS" */
left_parenthesis : #28
/* #29 is also known as "RIGHT PARENTHESIS" */
right_parenthesis : #29

curly brackets

The words left_curly_bracket and right_curly_bracket, respectively, are defined by

/* #7b is also known as "LEFT CURLY BRACKET" */
left_curly_bracket : #7b
/* #7d is also known as "RIGHT CURLY BRACKET" */
right_curly_bracket : #7d

colon

The word colon is defined by

/* #3a is also known as "COLON" */
colon : #3a

square brackets

The words left_square_bracket and right_square_bracket, respectively, are defined by

/* #5b is also known as "LEFT SQUARE BRACKET" */
left_square_bracket : #5b
/* #5d is also known as "RIGHT SQUARE BRACKET" */
right_square_bracket : #5d

alphanumeric

The word alphanumeric is reserved for future use.

comma

The word comma is defined by

/* #2c is also known as "COMMA" */
comma : #2c

name

The word name is defined by

name : {underscore} alphabetic {name_suffix_character}

/* #41 is also known as "LATIN CAPITAL LETTER A" */
/* #5a is also known as "LATIN CAPITAL LETTER Z" */
/* #61 is also known as "LATIN SMALL LETTER A" */
/* #7a is also known as "LATIN SMALL LETTER Z" */
name_suffix_character : /* The Unicode characters from #41 to #5a and from #61 to #7a. */

/* #30 is also known as "DIGIT ZERO" */
/* #39 is also known as "DIGIT NINE" */
name_suffix_character : /* The Unicode characters from #30 to #39. */

/* #5f is also known as "LOW LINE" */
name_suffix_character : #5f
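
The name production can be sketched as a regular expression. The following Python sketch is illustrative only. It assumes that underscore denotes #5f and that alphabetic denotes the letters #41 to #5a and #61 to #7a; neither non-terminal is defined in this section, so both readings are assumptions, as are the names NAME and is_name.

```python
import re

# name : {underscore} alphabetic {name_suffix_character}
NAME = re.compile(r"_*[A-Za-z][A-Za-z0-9_]*")


def is_name(text):
    """Check whether text is a name under the assumptions stated above."""
    return NAME.fullmatch(text) is not None


print(is_name("fontSize"))  # True
print(is_name("_9"))        # False: no alphabetic character after the underscore
```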

number

The word number is defined by

number : integer_number
number : real_number
integer_number : [sign] digit {digit}
real_number : [sign] period digit {digit} [exponent]
real_number : [sign] digit {digit} [period {digit}] [exponent]
exponent : exponent_prefix [sign] digit {digit}

/* #2b is also known as "PLUS SIGN" */
sign : #2b
/* #2d is also known as "MINUS SIGN" */
sign : #2d
/* #2e is also known as "FULL STOP" */
period : #2e
/* #65 is also known as "LATIN SMALL LETTER E" */
exponent_prefix : #65
/* #45 is also known as "LATIN CAPITAL LETTER E" */
exponent_prefix : #45
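
The number productions can be sketched as regular expressions. The following Python sketch is illustrative only; the names INTEGER, REAL, and classify_number are assumptions made for this example. Note that a digit sequence without a period or an exponent is derivable from both integer_number and real_number; preferring integer_number in that case is an assumption of this sketch.

```python
import re

# integer_number : [sign] digit {digit}
INTEGER = re.compile(r"[+-]?[0-9]+")
# real_number : [sign] period digit {digit} [exponent]
# real_number : [sign] digit {digit} [period {digit}] [exponent]
REAL = re.compile(r"[+-]?(?:\.[0-9]+|[0-9]+(?:\.[0-9]*)?)(?:[eE][+-]?[0-9]+)?")


def classify_number(text):
    """Return 'integer_number', 'real_number', or None for text."""
    if INTEGER.fullmatch(text):
        return "integer_number"  # preferred when both match (assumption)
    if REAL.fullmatch(text):
        return "real_number"
    return None


print(classify_number("12"))      # integer_number
print(classify_number("-1.5e3"))  # real_number
```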

string

The word string is defined by

string : single_quoted_string
string : double_quoted_string

double_quoted_string : double_quote {double_quoted_string_character} double_quote
double_quoted_string_character : /* any character except for newline and double_quote */
double_quoted_string_character : escape_sequence
double_quoted_string_character : #5c double_quote
/* #22 is also known as "QUOTATION MARK" */
double_quote : #22

single_quoted_string : single_quote {single_quoted_string_character} single_quote
single_quoted_string_character : /* any character except for newline and single_quote */
single_quoted_string_character : escape_sequence
single_quoted_string_character : #5c single_quote
/* #27 is also known as "APOSTROPHE" */
single_quote : #27

/* #5c is also known as "REVERSE SOLIDUS" */
escape_sequence : #5c #5c
/* #6e is also known as "LATIN SMALL LETTER N" */
escape_sequence : #5c #6e
/* #72 is also known as "LATIN SMALL LETTER R" */
escape_sequence : #5c #72

boolean, void

The words boolean and void, respectively, are defined by

boolean : true
boolean : false
true : #74 #72 #75 #65
false : #66 #61 #6c #73 #65
void : #76 #6f #69 #64

digit

The word digit is defined by

digit : /* A single Unicode character from the code point range U+0030 to U+0039. */

Syntactical Structure

The syntactical grammar describes the translation of the sequence of words that make up a program into sentences. The single goal symbol of the syntactical grammar is the sentence symbol.

The words whitespace, line_terminator, and comment are removed from the sequence of words before the translation to sentences is performed.
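
This removal step can be sketched as a simple filter. The following Python sketch is illustrative only; representing a word as a (kind, text) pair and the names IGNORED_KINDS and significant_words are assumptions made for this example.

```python
# Kinds of words that the syntactical grammar never sees.
IGNORED_KINDS = {"whitespace", "line_terminator", "comment"}


def significant_words(words):
    """Drop the words that are removed before translation to sentences."""
    return [word for word in words if word[0] not in IGNORED_KINDS]


# Hypothetical word sequence for the program "[ 1// one" followed by "]".
words = [
    ("left_square_bracket", "["),
    ("whitespace", " "),
    ("number", "1"),
    ("comment", "// one"),
    ("line_terminator", "\n"),
    ("right_square_bracket", "]"),
]
print(significant_words(words))
```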

The goal symbol sentence is defined by

sentence : value

The sentence value is defined by

value : map
value : list
value : string
value : number
value : boolean
value : void

The sentence map is defined by

map : left_curly_bracket
  map_body
  right_curly_bracket

map_body : map_body_element map_body_rest
map_body : ε

map_body_rest : comma map_body_element map_body_rest
map_body_rest : comma
map_body_rest : ε

map_body_element : name colon value

The sentence list is defined by

list : left_square_bracket
  list_body
  right_square_bracket

list_body : list_body_element list_body_rest
list_body : ε

list_body_rest : comma list_body_element list_body_rest
list_body_rest : comma
list_body_rest : ε

list_body_element : value
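
The productions for sentence, value, map, and list can be sketched as a recursive-descent parser. The following Python sketch is illustrative only and is not a reference implementation. It assumes that the words are given as (kind, text) pairs with whitespace, line terminators, and comments already removed; the class name Parser, the function name parse, and the tuple-based result are assumptions made for this example.

```python
class Parser:
    def __init__(self, words):
        self.words = list(words)
        self.pos = 0

    def peek(self):
        """Return the kind of the next word, or None at the end."""
        return self.words[self.pos][0] if self.pos < len(self.words) else None

    def take(self, kind):
        """Consume a word of the given kind and return its text."""
        if self.peek() != kind:
            raise SyntaxError(f"expected {kind}, found {self.peek()}")
        text = self.words[self.pos][1]
        self.pos += 1
        return text

    def parse_value(self):
        kind = self.peek()
        if kind == "left_curly_bracket":
            return self.parse_map()
        if kind == "left_square_bracket":
            return self.parse_list()
        if kind in ("string", "number", "boolean", "void"):
            return (kind, self.take(kind))
        raise SyntaxError(f"unexpected word {kind}")

    def parse_map(self):
        self.take("left_curly_bracket")
        pairs = []
        while self.peek() != "right_curly_bracket":
            name = self.take("name")  # map_body_element : name colon value
            self.take("colon")
            pairs.append((name, self.parse_value()))
            if self.peek() != "right_curly_bracket":
                self.take("comma")  # a trailing comma is permitted
        self.take("right_curly_bracket")
        return ("map", pairs)

    def parse_list(self):
        self.take("left_square_bracket")
        elements = []
        while self.peek() != "right_square_bracket":
            elements.append(self.parse_value())  # list_body_element : value
            if self.peek() != "right_square_bracket":
                self.take("comma")  # a trailing comma is permitted
        self.take("right_square_bracket")
        return ("list", elements)


def parse(words):
    """Parse a complete sentence; no words may remain afterwards."""
    parser = Parser(words)
    value = parser.parse_value()
    if parser.peek() is not None:
        raise SyntaxError(f"unexpected word {parser.peek()}")
    return value


# Hypothetical word sequence for the program [ 1, 2, ]
print(parse([
    ("left_square_bracket", "["), ("number", "1"), ("comma", ","),
    ("number", "2"), ("comma", ","), ("right_square_bracket", "]"),
]))
```

The sketch keeps every name/value pair of a map in the order of appearance; resolving duplicate names is left to a later step.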

Types and Values

The Data Definition Language knows six basic types: List and Map, which are the so-called aggregate types, and Boolean, Number, String, and Void, which are the so-called scalar types.

Scalar Types

Boolean Type

The type Boolean has two values, true and false, which are expressed in the language by the words true and false, respectively (as defined in the lexical grammar).

Number Type

The type Number represents both two's complement integer numbers and IEEE 754 floating-point numbers. A value of type Number is expressed in the language by the word `number` (as defined in the lexical grammar).

Note that the Data Definition Language does not impose restrictions on the range and precision of values of type Number. Implementations, however, may impose restrictions.

String Type

The type String represents UTF-8 strings. String values are expressed in the language by the word string (as defined in the lexical grammar). At the end of the lexical translation of a string word, its escape sequences are replaced by the Unicode characters they represent. Furthermore, the opening and closing quotes are removed.

Note that the Data Definition Language does not impose restrictions on the length of values of type String. Implementations, however, may impose restrictions.
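
The replacement of escape sequences and the removal of the quotes can be sketched as follows. This Python sketch is illustrative only. It assumes that the escape sequence ending in n denotes LINE FEED and the one ending in r denotes CARRIAGE RETURN; the lexical grammar names only the letters, so this reading is an assumption, as is the function name decode_string.

```python
def decode_string(word):
    """Turn a string word (including its quotes) into its String value."""
    if len(word) < 2 or word[0] not in "'\"" or word[-1] != word[0]:
        raise ValueError("not a quoted string word")
    quote = word[0]
    # Character after the reverse solidus -> character it stands for.
    escapes = {"\\": "\\", "n": "\n", "r": "\r", quote: quote}
    body = word[1:-1]
    result = []
    i = 0
    while i < len(body):
        ch = body[i]
        if ch == "\\":
            if i + 1 >= len(body) or body[i + 1] not in escapes:
                raise ValueError("invalid escape sequence")
            result.append(escapes[body[i + 1]])
            i += 2
        elif ch == quote or ch in "\n\r":
            raise ValueError("character not permitted inside this string")
        else:
            result.append(ch)
            i += 1
    return "".join(result)


print(decode_string("'Hello, World!'"))  # Hello, World!
```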

Void Type

The type Void has a single value, void, which is represented in the language by the word void (as defined in the lexical grammar).

Aggregate Types

List Type

The type List represents lists of values. A value of type List is expressed in the language by the sentence list (as defined in the syntactical grammar).

Example:

// A list with three numbers 1, 2, and 3.
[ 1, 2, 3 ]

Map Type

The type Map represents maps from names to values. A value of type Map is expressed in the language by the sentence map (as defined in the syntactical grammar).

Example:

// A map of
// text to 'Hello, World!',
// action to 'Print', and
// fontSize to 12.
{ text : 'Hello, World!', action : 'Print', fontSize : 12 }

If two name/value pairs with the same name are specified in a map, then the last specified name/value pair takes precedence.

Example:

The following Data Definition Language program defines a Map value that contains two name/value pairs with the same name x. The first name/value pair maps to the value 0 and the second name/value pair maps to the value 1.

{ x : 0, x : 1 }

The effective Map value defined by the program is hence

{ x : 1 }

as the name/value pair mapping to the value 0 is specified before the name/value pair mapping to the value 1.
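
The precedence rule can be sketched in a few lines. This Python sketch is illustrative only; the function name effective_map and the representation of a map as a sequence of (name, value) pairs in source order are assumptions made for this example.

```python
def effective_map(pairs):
    """Build the effective Map value from name/value pairs in source order."""
    result = {}
    for name, value in pairs:
        result[name] = value  # a later pair replaces an earlier one
    return result


# The program { x : 0, x : 1 } as a sequence of pairs.
print(effective_map([("x", 0), ("x", 1)]))  # {'x': 1}
```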