Creating a Language to define and read Complex Data Sources
DDL, The Data Definition Language

 

homepage : http://ddl.sscli.net 

 

Developers often face the challenge of reading data from complex file formats. Each file stores its data according to some predefined layout, and developing applications that read such layouts is usually a daunting task.

 

To cite an example, we were once trying to help a team that needed to read information from files that were data dumps of information collected by aircraft black boxes. The problem with black-box data formats is that they are densely packed: every bit carries information. Data values spread across byte or word boundaries. Some values have their bits scattered, which means that to read pitch or rotor speed one might have to read bits from various locations and put them together to form a single value. It therefore took a lot of effort to develop reliable code that could read a single black-box data format. To add to the team’s difficulty, the story didn’t end there. They needed to support the black boxes of different aircraft, which meant that their application had to carry code for many kinds of black-box formats. These formats varied widely from aircraft to aircraft, manufacturer to manufacturer, country to country and even model to model.

 

Out of meditations on the problem, we realized that there was no real modern language designed for data retrieval. Thus was born the need to write a language for data definition and retrieval: the DDL, the Data Definition Language.

 

 

 

The DDL is used to define arbitrary data formats. These formats can have multiple dependencies and bit-level intricacies. The DDL has a simple, intuitive syntax that lets you express a very large variety of data formats. Once the data format is coded as a DDL script, you can use the DDL interpreter (the engine) to read data from a data source.

 

The DDL is provided as a .NET assembly called System.DDL.dll. You can use it privately in your application or install it in the GAC. The assembly is the interpreter for the DDL language.

 

This article is a brief introduction to how the DDL works.

Approach to Data Interpretation with the DDL

The current implementation of the DDL cannot handle every conceivable file type. We don’t think we are theoretically sound enough to say whether such a language is even possible. It does, however, implement the functionality required to handle a large class of the common file types that a developer is likely to encounter.

 

The first concept to become familiar with is that of treating a data block as a stream of bits. A data block is nothing but a collection of bytes. Since what exactly constitutes a word is ambiguous, depending on little-endian or big-endian byte ordering, the DDL sees the data block as the bits of a collection of little-endian bytes. This version of the DDL was developed on the Intel platform and thus sticks to Intel’s little-endian byte ordering.

 

The input to the DDL engine is a stream of data from some source, typically a file. The stream of bytes can be represented as a series of bits. The examples in this article use the following 32-bit stream:

10101100101011001010110010101100

 

The DDL understands bits and groups of bits. The maximum size of a bit group is 32 bits (for the current implementation).
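To make the bit-stream view concrete, here is a minimal Python sketch (not part of the DDL itself) that turns a block of bytes into the kind of bit string used in this article's examples. The helper name `to_bit_string` is hypothetical, and the choice to write each byte with its most significant bit first is an assumption made for readability; the article does not spell out the bit order within a byte.

```python
def to_bit_string(data: bytes) -> str:
    """Render a block of bytes as a string of '0'/'1' characters.

    Assumption: each byte is written most significant bit first.
    """
    return "".join(f"{byte:08b}" for byte in data)


# Four bytes of 0xAC give the 32-bit stream used in this article's examples.
block = bytes([0xAC, 0xAC, 0xAC, 0xAC])
print(to_bit_string(block))  # 10101100101011001010110010101100
```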

 

A data file format specifies the layout of the data. It formally provides information such as:

·         which variables’ values are to be mapped from the data-source,

·         what their bit sizes are and

·         at which locations they exist.

 

For example, a variable called ‘xyz’ might be defined to be 5 bits in size and located at an offset of 16 bits from the start of the data block. Mapping the variable onto the data source then yields the value held in bits 16 through 20 of the stream.

 

 

 

In this case 10101 = 21 would be the expected value of the variable xyz. The DDL lets you define the existence of such variables in DDL source code in a manner similar to variable declaration in C. There are some important differences, however.
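The mapping of ‘xyz’ can be checked with a few lines of Python. This is only an illustrative sketch, not how the DDL engine is implemented: `read_bits` is a hypothetical helper, the stream is kept as a string of bits in the order it is written in this article, and the first bit of a group is assumed to be the most significant bit of the value (which is what makes 10101 come out as 21).

```python
STREAM = "10101100" * 4  # the 32-bit example stream, in reading order


def read_bits(stream: str, offset: int, size: int) -> int:
    """Return the unsigned value of `size` bits starting at bit `offset`."""
    if not 1 <= size <= 32:
        # The current DDL implementation limits a bit group to 32 bits.
        raise ValueError("bit groups are limited to 1..32 bits")
    if offset < 0 or offset + size > len(stream):
        raise ValueError("bit group lies outside the stream")
    return int(stream[offset:offset + size], 2)


xyz = read_bits(STREAM, offset=16, size=5)
print(xyz)  # 21
```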

 

The DDL is a definition language in the sense that it does not *do* anything. In a language like C, the code represents functions or operations; DDL scripts have no functional or procedural operations to carry out. These scripts are definitions of the data: they simply define what exists in the data source and how to read it. The choice of what is to be done with the data that is read, or whether the data should be read at all, is not made by the DDL. That is the domain of the application that uses the DDL. To call DDL scripts ‘static’ is not completely correct, because in some scripts a variable’s existence, size, location, etc. may only be determinable at runtime.

 

The declaration of a variable in C creates a memory location in the computer that is from then on referenced through that variable. The declaration of a DDL variable does no such thing: it simply tells the DDL engine that the data source contains a value that can be referred to by a certain name (in this example ‘xyz’) and that it can be read from a particular location in the data source (if it needs to be read).

The DDL specification is simply a specification of what data exists. It does not mean that the host program will, or needs to, use all of the data that is specified, or that the DDL will read these values. The DDL simply knows, from the script, that these values exist.
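The difference between declaring a value and reading it can be sketched in Python as follows. This is a toy model under our own assumptions, not the DDL engine: the `Field` and `LazyMapping` names are hypothetical, and the `reads` counter exists only to show that nothing is fetched from the data source until a value is asked for.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Field:
    """A declaration: a name plus where the value lives. It holds no value."""
    name: str
    offset: int  # bit offset from the start of the data block
    size: int    # size in bits


class LazyMapping:
    """Knows which fields exist; reads one only when it is queried."""

    def __init__(self, stream: str, fields):
        self._stream = stream
        self._fields = {field.name: field for field in fields}
        self.reads = 0  # number of values actually fetched so far

    def value(self, name: str) -> int:
        field = self._fields[name]
        self.reads += 1
        return int(self._stream[field.offset:field.offset + field.size], 2)


mapping = LazyMapping("10101100" * 4, [Field("xyz", 16, 5), Field("unused", 0, 8)])
print(mapping.reads)         # 0: declaring fields reads nothing
print(mapping.value("xyz"))  # 21
print(mapping.reads)         # 1: 'unused' was never read
```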

 

The script thus contains the locations and sizes of various elements in the data source. Take a look at this little DDL script snippet:

 

i2    alpha
i4    beta
i4    gamma
i2    delta

 

This snippet defines the existence of four variables. The number after the ‘i’ gives each variable’s size in bits, i.e.

alpha is 2 bits

beta is 4 bits, etc

 

Since there are no explicit addresses provided for these variables, they occupy locations in the data source that fall one after the other. ‘beta’ starts immediately after ‘alpha’ and so on.

 

When mapped to the earlier bit stream the variables will give the values:

 

 

alpha = 10   = 2
beta  = 1011 = 11
gamma = 0010 = 2
delta = 10   = 2

 

Addresses are usually calculated by the engine by incrementally adding the size of each sequential block; the example above assumes that alpha starts at offset 0 in the bit stream.
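The running-offset calculation can be sketched in Python like this. It is an illustration only: `map_sequential` and the `DECLARATIONS` list are hypothetical stand-ins for the i2/i4 snippet, and the first bit of each group is assumed to be the most significant bit of its value, as in the numbers quoted above.

```python
STREAM = "10101100" * 4  # the 32-bit example stream, in reading order

# (name, size in bits) in declaration order, mirroring the snippet above.
DECLARATIONS = [("alpha", 2), ("beta", 4), ("gamma", 4), ("delta", 2)]


def map_sequential(stream: str, declarations, start: int = 0) -> dict:
    """Place each variable right after the previous one and read its value."""
    values = {}
    offset = start
    for name, size in declarations:
        values[name] = int(stream[offset:offset + size], 2)
        offset += size  # the next variable starts where this one ends
    return values


print(map_sequential(STREAM, DECLARATIONS))
# {'alpha': 2, 'beta': 11, 'gamma': 2, 'delta': 2}
```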

Addresses of elements can also be given explicitly. This involves setting a variable called ‘word_length’. The usage of word_length will be discussed later. The following shows an example of explicit addressing with a word_length of 8 bits.

 

@ 0,2       i2 alpha
            i4 beta
@ 1,2       i4 gamma
            i2 delta

 

When mapped to 10101100101011001010110010101100, this gives the values:

 

 

alpha = 10   = 2
beta  = 1100 = 12
gamma = 1011 = 11
delta = 00   = 0
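The values above can be reproduced with a short Python sketch. One assumption needs flagging: the article has not yet defined the ‘@’ notation, so the sketch reads ‘@ w,b’ as "word w, bit b within that word", giving a bit offset of w * word_length + b. That reading is inferred from the values listed above, not taken from a DDL reference. `map_with_addresses` and the tuple layout are hypothetical.

```python
STREAM = "10101100" * 4  # the 32-bit example stream, in reading order
WORD_LENGTH = 8          # bits per word, as in the example

# (explicit address or None, size in bits, name), mirroring the snippet above.
# The (word, bit) reading of '@ w,b' is an assumption inferred from the results.
DECLARATIONS = [
    ((0, 2), 2, "alpha"),
    (None, 4, "beta"),
    ((1, 2), 4, "gamma"),
    (None, 2, "delta"),
]


def map_with_addresses(stream: str, declarations, word_length: int) -> dict:
    """Read each variable, jumping to an explicit address when one is given."""
    values = {}
    offset = 0
    for address, size, name in declarations:
        if address is not None:
            word, bit = address
            offset = word * word_length + bit
        values[name] = int(stream[offset:offset + size], 2)
        offset += size  # variables without an address follow on directly
    return values


print(map_with_addresses(STREAM, DECLARATIONS, WORD_LENGTH))
# {'alpha': 2, 'beta': 12, 'gamma': 11, 'delta': 0}
```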

 

Thus the DDL simply maps the given format specification onto the available data in the data source. If the DDL engine were given the above format definition and bit stream, and were asked for the value of ‘gamma’, it would return the equivalent of binary 1011 (which is 11) as the value.

 

Enabling file format support through the DDL goes only as far as being able to retrieve values from the data source. Each value should be accessed through a unique name. The DDL does not attach any special meanings to any of these values nor does it dictate what you do with the data it reads.

 

To be practically useful, however, the DDL must have many more features than this, and it does. This has been only a short, basic description of data mapping.

 

And how would you use this Language?

The main intention in creating the DDL is that it should be used as a Rapid-Application-Development tool for developing file handling applications.

 

The DDL engine, standalone, is not a useful program. It is useful as part of a developer’s toolkit. The engine is provided as a runtime library that is loaded by a host program written by the developer. The DDL can then be used to do the file handling for the host program.

 

The host program supplies the DDL engine with a DDL script file, which is the data definition, and a data source. The engine loads the specified script file and the specified data, and maps the script to the data. Once this is done, the host program can query the engine for values from the data file.

 

To enable the DDL and the host program to interact, the DDL exposes a simple API. This API will have methods for loading the script file, specifying the data file, making data related queries to the DDL etc.

 

The basic high-level steps to follow while developing an application that consumes the DDL are:

  1. Obtain a format specification of the data file that the developer intends to support.
  2. Formalize the specification and write a DDL script, so that the data format can be understood by the DDL interpreter/engine.
  3. Obtain a copy of the data source file from which you need to extract information.
  4. Test the specification file (the DDL script file you have written) using any of the DDL engine development tools provided: do sample runs on the data file and ensure that values are read as expected.
  5. Once the script has been tested, the developer can host the DDL engine in the application. The DDL engine will be available as a C++ library module, a dynamic link library or a .NET assembly.
  6. Once the integration is complete, the application can load the script file and data file into the engine programmatically and query it programmatically for values.
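The load-script, load-data, query flow described in these steps can be sketched with a toy stand-in written in Python. To be clear about what is assumed: `ToyEngine` and its `load_script`, `load_data` and `query` methods are hypothetical names invented for this illustration and are not the API of System.DDL.dll, which this article does not list. The toy understands only sequential ‘iN name’ declarations and assumes each byte is taken most significant bit first.

```python
import re


class ToyEngine:
    """Hypothetical stand-in for a DDL engine; not the System.DDL API."""

    _DECLARATION = re.compile(r"^i(\d+)\s+(\w+)$")

    def __init__(self):
        self._fields = {}  # name -> (bit offset, size in bits)
        self._bits = ""

    def load_script(self, text: str) -> None:
        """Parse sequential 'iN name' lines into offsets and sizes."""
        self._fields = {}
        offset = 0
        for line in text.splitlines():
            line = line.strip()
            if not line:
                continue
            match = self._DECLARATION.match(line)
            if match is None:
                raise ValueError(f"unsupported declaration: {line!r}")
            size = int(match.group(1))
            self._fields[match.group(2)] = (offset, size)
            offset += size

    def load_data(self, data: bytes) -> None:
        """Keep the data as a bit string (assumed MSB first per byte)."""
        self._bits = "".join(f"{byte:08b}" for byte in data)

    def query(self, name: str) -> int:
        """Read the named value from the data only when it is asked for."""
        offset, size = self._fields[name]
        if offset + size > len(self._bits):
            raise ValueError(f"{name!r} lies outside the loaded data")
        return int(self._bits[offset:offset + size], 2)


SCRIPT = """
i2    alpha
i4    beta
i4    gamma
i2    delta
"""

engine = ToyEngine()
engine.load_script(SCRIPT)
engine.load_data(bytes([0xAC, 0xAC]))  # in a host program: bytes read from a file
print(engine.query("gamma"))  # 2
```

A host application would follow the same shape with the real engine: load the tested script once, point the engine at a data file, and then ask only for the values it actually needs.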

 

 

 

 

The DDL can be considered an additional API layer above the basic file handling exposed by the operating system. While it does not do anything that is not possible with the underlying API, it makes the job substantially simpler and less error-prone.

 

 

The Homepage of the DDL project is:

http://ddl.sscli.net

Here you can download the latest versions of the DDL and sample applications, as well as find other documents about the DDL. There are links to the authors’ pages, where you can find additional material as well.

Email:

You can mail us about the DDL, future enhancements, etc. at spark@sscli.net and dolly@sscli.net