The Cambridge Programming Research Group Blog: A domain specific language for binary file formats

I recently wrote a parser for the Flash file format specification, in Haskell. I managed to build support for reading and writing all the Flash 10 specification (apart from the embedded multimedia formats, such as h.264 and MP3) in only three days of committed coding.

How did I implement this 250+ page specification in such a short amount of time? Simple - I took the specification and made it executable.

The specification is full of tables describing the on-disk format that look like this:

At first, I began by manually coding up my library from these tables. Unfortunately, each table corresponds to no less than three bits of Haskell:

A data type definition for the record (called FOCALGRADIENT in this case)

A function for reading that record from the file (of type SwfGet FOCALGRADIENT: SwfGet is a state monad encapsulating a ByteString)

A function for writing that thing back (of type FOCALGRADIENT -> SwfPut (), where SwfPut is another monad)

It quickly became obvious that writing this code (and keeping the three versions in sync!) was totally untenable.

My solution was to turn my parser into a Literate Haskell file which can contain sections like the following:

   p18: RGB color record
   \begin{record}
   RGB
   Field Type Comment
   Red   UI8  Red color value
   Green UI8  Green color value
   Blue  UI8  Blue color value
   \end{record}

I then have a preprocessor which replaces these wholesale with code implementing the data type, reader and writer. For our running example, we get this:

   data RGB = RGB{rGB_red :: UI8, rGB_green :: UI8, rGB_blue :: UI8}
            deriving (Eq, Show, Typeable, Data)
   getRGB
     = do rGB_red <- getUI8
          rGB_green <- getUI8
          rGB_blue <- getUI8
          return (RGB{..})
   putRGB RGB{..}
     = do putUI8 rGB_red
          putUI8 rGB_green
          putUI8 rGB_blue
          return ()

Choice and Repetition

Frequently, some part of the record is repeated a number of times that depends on an earlier part of the record. In other cases, the exact shape a field takes can depend on an earlier one. For example:

    p96: ActionConstantPool
    \begin{record}
    ActionConstantPool
    Field              Type               Comment
    ActionConstantPool ACTIONRECORDHEADER ActionCode = 0x88
    Count              UI16               Number of constants to follow
    ConstantPool       STRING[Count]      String constants
    \end{record}

This record contains a field-controlled repetition count. My generator is smart enough to:

Generate a Haskell expression from the repetition count within the square brackets. In this case it is just Count, but in general this can contain arithmetic and references to several fields.

Omit the Count field from the generated data type, and infer it from the length of the ConstantPool when we come to put this record back

There is similar support for choice. For example, look at the CodeTable field in DefineFontInfo:

    p177: DefineFontInfo
    \begin{record}
    DefineFontInfo
    Field              Type                                           Comment
    Header             RECORDHEADER                                   Tag type = 13
    FontID             UI16                                           Font ID this information is for.
    FontNameLen        UI8                                            Length of font name.
    FontName           UI8[FontNameLen]                               Name of the font (see following).
    FontFlagsReserved  UB[2]                                          Reserved bit fields.
    FontFlagsSmallText UB[1]                                          SWF 7 file format or later: Font is small. Character glyphs are aligned on pixel boundaries for dynamic and input text.
    FontFlagsShiftJIS  UB[1]                                          ShiftJIS character codes.
    FontFlagsANSI      UB[1]                                          ANSI character codes.
    FontFlagsItalic    UB[1]                                          Font is italic.
    FontFlagsBold      UB[1]                                          Font is bold.
    FontFlagsWideCodes UB[1]                                          If 1, CodeTable is UI16 array; otherwise, CodeTable is UI8 array.
    CodeTable          If FontFlagsWideCodes, UI16[] Otherwise, UI8[] Glyph to code table, sorted in ascending order.
    \end{record}

The field changes representation based on FontFlagWideCodes - this is represented in the corresponding Haskell record as an Either type, and the FontFlagsWideCodes flag is not put in the record at all, as it can be unambiguously inferred from whether we are Left or Right.

Bit Fields

The fields we saw in DefineFontInfo with the name UB were actually bit fields. UB[n] is actually an unsigned number with n bits in it. Here is another example:

    \begin{record}
    BLURFILTER
    Field    Type  Comment
    BlurX    FIXED Horizontal blur amount
    BlurY    FIXED Vertical blur amount
    Passes   UB[5] Number of blur passes
    Reserved UB[3] Must be 0
    \end{record}

Bit fields pose a problem - how should they be represented in Haskell? If I know the number of bits statically, there might be a suitable type I can choose. For example, my generator will use Bool for UB[1] fields. Unfortunately, in general I'll have to choose a Haskell type which is large enough to hold the maximum possible number of bits I might encounter.

What I have chosen to do is represent any UB[n] field (where n is not 1) with a Word32 (32 bits seems to be the upper limit on any variable-length bitfields I've encountered in practice, though this isn't part of the specification). What this means, however, is that my generator must produce runtime assertions in the writing code to ensure that we won't lose any information by truncating the Word32 to the number of bits we actually have available.

Abstraction

Sometimes, the structure of a record varies depending on the context in which it is used. For example, the GLYPHENTRY record can only be read and written if we know the number of bits used for its two fields. This is encoded as follows:

    p192: Glyph entry
    \begin{record}
    GLYPHENTRY(GlyphBits, AdvanceBits)
    Field        Type            Comment
    GlyphIndex   UB[GlyphBits]   Glyph index into current font.
    GlyphAdvance SB[AdvanceBits] x advance value for glyph.
    \end{record}

The "parameter list" after GLYPHENTRY in the header must be filled out with concrete arguments by any user of GLYPHENTRY, such as in the GlyphEntries field of the following:

    \begin{record}
    TEXTRECORD(TextVer, GlyphBits, AdvanceBits)
    Field                Type                                                      Comment
    TextRecordType       UB[1]                                                     Always 1.
    StyleFlagsReserved   UB[3]                                                     Always 0.
    StyleFlagsHasFont    UB[1]                                                     1 if text font specified.
    StyleFlagsHasColor   UB[1]                                                     1 if text color specified.
    StyleFlagsHasYOffset UB[1]                                                     1 if y offset specified.
    StyleFlagsHasXOffset UB[1]                                                     1 if x offset specified.
    FontID               If StyleFlagsHasFont, UI16                                Font ID for following text.
    TextColor            If StyleFlagsHasColor, If TextVer = 2, RGBA Otherwise RGB Font color for following text.
    XOffset              If StyleFlagsHasXOffset, SI16                             x offset for following text.
    YOffset              If StyleFlagsHasYOffset, SI16                             y offset for following text.
    TextHeight           If StyleFlagsHasFont, UI16                                Font height for following text.
    GlyphCount           UI8                                                       Number of glyphs in record.
    GlyphEntries         GLYPHENTRY(GlyphBits, AdvanceBits)[GlyphCount]            Glyph entry (see following).
    Padding              PADDING8                                                  Padding to byte boundary
    \end{record}

Records abstracted over parameters generate Haskell reading and writing code which is lambda-abstracted over those same paramaters, just as you might expect. Likewise, occurrences of argument lists correspond to applications of arguments to a nested reading or writing function.

Conclusion

Having a declarative description of the contents of SWF files helped enormously. It had the following benefits:

Very rapid development. I essentially just pasted in the contents of the specification for each record. This also led to a very low rate of bugs in the code (though I found some specification bugs, I can hardly be blamed for those :-)

Easy post-hoc changes when adding new features.
- I only added assertions to check bit fields lengths quite late on in development, and it was a painless generator change. If I had handwritten everything I would have had to do a lot of tedious and error-prone programming to add a small feature like this.
- I could make use of the declarative information to generate reasonable error messages for showing to the user when they try to write back an invalid SWF. Again, this is something I decided to do after transcribing the specification.

Very easy to read and spot bugs. Legibility was a goal when designing the DSL.

It's relatively easy to change to using custom Haskell code for dealing with those records that can't be expressed in the DSL - you can copy and paste the unsatisfactory generated Haskell and then mold it to fit your needs.

Using custom code (as per the last bullet point) is sometimes indispensable for making the Haskell records look nicer. For example, you might need to write it so you can stop generating this sort of record:

    data RECT = RECT { rECT_nbits :: UI8, rECT_xmin :: SB, rECT_xmax :: SB, rECT_ymin :: SB, rECT_ymax :: SB }

And, by observing that you can compute nbits by the maximum of the base-2 logarithms of the 4 SB fields, generate this one instead:

    data RECT = RECT { rECT_xmin :: SB, rECT_xmax :: SB, rECT_ymin :: SB, rECT_ymax :: SB }

The fact that you have to use copy and paste to get this done is something which I'm not totally happy with. I've been looking at approaches to turn my DSL into
much more of a combinator library (i.e. taking the focus away from using an external code generator).

A combinator based approach would make it easy to add extra primitive combinators that know about special cases like one - but more on that another time!

Interested souls can get the code for my SWF library on GitHub. It can successfully roundtrip all of the example files that I could gather from the Gnash and Flash Gordon projects.

The Cambridge Programming Research Group Blog

Tuesday, 9 February 2010

A domain specific language for binary file formats

Choice and Repetition

Bit Fields

Abstraction

Conclusion

21 comments:

Blog Archive

Contributors

Labels