Introducing PRONOM syntax
Last updated on 2025-10-29
Overview
Questions
- Why does PRONOM need syntax?
- What syntax exists?
- What does the syntax enable us to do?
Objectives
- Write our first PRONOM compliant signatures.
- Learn what a “BOF” is.
- PRONOM needs syntax to enable the expression of format identification signatures
- Needs to articulate specific byte patterns, at specific locations.
- Byte patterns use hexadecimal notation
- The syntax overlaps with ‘Regular Expressions’ (RegEx) but is distinct from the RegEx implementations in common programming languages such as Java or Python
- Highly flexible!
Signature positions
- BOF: Beginning Of File - the signature sequence starts at, or near the beginning of the file
- EOF: End Of File - the signature sequence starts at, or near the end of the file
- Var: Variable - the signature sequence may be found anywhere within the file
- Offset - the position, relative to the BOF, or EOF, where the sequence begins. 0 is default, meaning no offset. Since an offset of 0 means ‘starting from the first byte’, an offset of 4 means ‘starting from the 5th byte’, or ‘after the 4th byte’
- Maximum Offset - A further offset, relative to the initial Offset value described above. The default is 0, meaning no further possible offset.
Position and offset examples
BOF, Offset 0, Maximum offset 0: The signature sequence starts at the very beginning of the file
BOF, Offset 4, Maximum offset 0: The signature sequence starts at exactly position 0x04, the 5th byte
BOF, Offset 0, Maximum offset 4: The signature sequence may start anywhere within the first 5 bytes
BOF, Offset 4, Maximum Offset 4: The signature sequence may start anywhere from byte 5 through to byte 9
EOF, Offset 4, Maximum Offset 0: The signature sequence ends exactly 4 bytes from the end of the file
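The arithmetic behind these examples can be sketched in a few lines of Python. This is only an illustration of the rules as described above, not code from PRONOM or any identification tool: the function name `start_window`, the 100-byte file size and the 4-byte sequence length are all assumptions made for the example. Positions are zero-based, so position 4 is the 5th byte.

```python
def start_window(anchor, offset, max_offset, file_size, seq_len):
    """Return (first, last) zero-based positions where a sequence may start.

    Hypothetical helper: it follows the description in this lesson, where
    Maximum Offset is a further offset added on top of Offset.
    """
    if anchor == "BOF":
        first = offset
        last = offset + max_offset
    elif anchor == "EOF":
        # For EOF, offsets count backwards from the end of the file
        # to the END of the sequence.
        last = file_size - offset - seq_len
        first = last - max_offset
    else:
        raise ValueError("anchor must be 'BOF' or 'EOF'")
    return first, last


# The examples above, for an assumed 100-byte file and a 4-byte sequence
print(start_window("BOF", 0, 0, 100, 4))  # (0, 0)
print(start_window("BOF", 4, 0, 100, 4))  # (4, 4)
print(start_window("BOF", 0, 4, 100, 4))  # (0, 4)
print(start_window("BOF", 4, 4, 100, 4))  # (4, 8)
print(start_window("EOF", 4, 0, 100, 4))  # (92, 92)
```

In the EOF case the sequence occupies positions 92 to 95, leaving exactly 4 bytes after it.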
Questions
- Where can the byte sequence appear for BOF, Offset 16, Maximum offset 16?
- What do you think happens if you add an offset to a variably-positioned sequence?
Most common syntax
| Syntax element | Intended use | Example |
|---|---|---|
| Literal sequence | Just a plain signature sequence that appears as-is | A1B2C3D4 |
| Infinite wildcard: * | The following sequence will appear at any point further in the file | A1B2C3D4*E5F6A7B8 |
| Precise wildcard: {n} | The following sequence will appear after exactly the number of bytes specified | A1B2C3D4{4}E5F6A7B8 |
| Wildcard range: {m-n} | The following sequence will appear at some point between the number of bytes specified | A1B2C3D4{4-8}E5F6A7B8 |
| Either/Or: (a\|b) | The following sequence will be any of the sequences specified. Any number of sequences can be specified | A1B2C3D4(0D |
| Byte range: [a:b] | The next byte will be within the range specified | A1B2C3D4[A4:B0]E5 |
Most signatures will combine some or all of the above.
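To make the overlap with regular expressions concrete, here is a minimal sketch that translates the common syntax elements above into a Python `re` pattern over bytes. It is an illustration only: `pronom_to_regex` is a hypothetical helper written for this lesson, it covers just the elements in the table above, and it is not how PRONOM-based tools actually process signatures.

```python
import re

# One token per syntax element from the table above
TOKEN = re.compile(
    r"(?P<hex>[0-9A-Fa-f]{2})"                            # literal byte
    r"|(?P<star>\*)"                                      # infinite wildcard
    r"|\{(?P<n>\d+)\}"                                    # precise wildcard
    r"|\{(?P<lo>\d+)-(?P<hi>\d+)\}"                       # wildcard range
    r"|\[(?P<a>[0-9A-Fa-f]{2}):(?P<b>[0-9A-Fa-f]{2})\]"   # byte range
    r"|(?P<punct>[()|])"                                  # either/or
)


def esc(hex_pair):
    """Turn a hex pair such as 'A1' into the regex escape '\\xa1'."""
    return "\\x" + hex_pair.lower()


def pronom_to_regex(signature):
    """Translate a PRONOM-style sequence into a compiled bytes regex."""
    out = []
    pos = 0
    while pos < len(signature):
        m = TOKEN.match(signature, pos)
        if m is None:
            raise ValueError(f"unsupported syntax at position {pos}")
        if m.group("hex") is not None:
            out.append(esc(m.group("hex")))
        elif m.group("star") is not None:
            out.append(".*")
        elif m.group("n") is not None:
            out.append(".{%s}" % m.group("n"))
        elif m.group("lo") is not None:
            out.append(".{%s,%s}" % (m.group("lo"), m.group("hi")))
        elif m.group("a") is not None:
            out.append("[%s-%s]" % (esc(m.group("a")), esc(m.group("b"))))
        else:
            p = m.group("punct")
            out.append("(?:" if p == "(" else p)
        pos = m.end()
    # DOTALL lets '.' stand for any byte value, including 0x0A
    return re.compile("".join(out).encode("ascii"), re.DOTALL)


pattern = pronom_to_regex("A1B2C3D4{4-8}E5F6A7B8")
data = bytes.fromhex("A1B2C3D4" + "00" * 6 + "E5F6A7B8")
print(pattern.match(data) is not None)  # True
```

Using `match` rather than `search` anchors the pattern at the first byte, which roughly corresponds to a BOF sequence with Offset 0 and Maximum Offset 0.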
Less common syntax
| Syntax element | Intended use | Example |
|---|---|---|
| NOT sequence: [!a] | The following byte value is not this byte | A1B2C3D4[!E5]F6 |
| Wildcard with infinite range: {m-*} | The following sequence will appear minimally after the first value specified, but otherwise anywhere else in the file | A1B2C3D4{4-*}E5F6A7B8 |
| Single wildcard: ?? | The following byte may have any value. This is functionally equivalent to {1} | A1B2C3D4??E5F6A7B8 |
| NOT Byte range: [!a:b] | The next byte will not be within the range specified | A1B2C3D4[!A4:B0]E5 |
| Wildcards at a beginning of a BOF sequence, or end of an EOF sequence | This is functionally equivalent to specifying Offset/Maximum Offset, however this is not recommended | {4}A1B2C3D4 or: {0-4}A1B2C3D4 |
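For readers who know Python's `re` module, the less common elements also have rough regex analogues. The patterns below were written by hand for this illustration and are approximations of the examples in the table, not output from any PRONOM tool. The names `ANALOGUES` and `matches` are hypothetical.

```python
import re

# PRONOM example -> hand-written Python bytes regex (approximate analogue)
ANALOGUES = {
    "A1B2C3D4[!E5]F6": rb"\xa1\xb2\xc3\xd4[^\xe5]\xf6",
    "A1B2C3D4{4-*}E5F6A7B8": rb"\xa1\xb2\xc3\xd4.{4,}\xe5\xf6\xa7\xb8",
    "A1B2C3D4??E5F6A7B8": rb"\xa1\xb2\xc3\xd4.\xe5\xf6\xa7\xb8",
    "A1B2C3D4[!A4:B0]E5": rb"\xa1\xb2\xc3\xd4[^\xa4-\xb0]\xe5",
    "{4}A1B2C3D4": rb".{4}\xa1\xb2\xc3\xd4",
}


def matches(pronom_example, data):
    """Check data against the regex analogue, anchored at the first byte."""
    # DOTALL lets '.' stand for any byte value, including 0x0A
    return re.match(ANALOGUES[pronom_example], data, re.DOTALL) is not None


# [!E5]: any byte except E5 is accepted in that position
print(matches("A1B2C3D4[!E5]F6", bytes.fromhex("A1B2C3D400F6")))  # True
print(matches("A1B2C3D4[!E5]F6", bytes.fromhex("A1B2C3D4E5F6")))  # False

# A leading {4} behaves like BOF with Offset 4
print(matches("{4}A1B2C3D4", bytes.fromhex("00000000A1B2C3D4")))  # True
```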
PRONOM Simplified Cheatsheet
PRONOM terms, basic syntax and data model
Offset markers
BOF = Beginning of File.
EOF = End of File.
Var = Variable (anywhere in the file)
Offset/Max Offset = Exact or positional range in which a signature starts
Combining signatures and sequences
- A Format can have many Signatures - matching any Signature will return a hit.
- A Signature may consist of any number of BOF, EOF, and Var sequences. All sequences within a Signature must match to return a hit.
- Sequences within a Signature must be logically positioned differently. You couldn’t have two BOF sequences that both have offset 0, maximum offset 0. If two sequences both had BOF, offset 0, maximum offset 128, then both would have to appear within that window at the start of the file.
- Most commonly, a signature sequence will only have a BOF sequence - this is fine!
- Be wary of purely Variable-positioned sequences. In isolation they will cause the whole of your files to be scanned, so it’s always best to include either a BOF or EOF sequence as an ‘anchor’.
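The any/all relationship between Formats, Signatures and sequences can be sketched as a small Python data model. This is a simplified illustration under stated assumptions, not PRONOM's actual schema: the class names, fields, and the example format are hypothetical, and sequences are limited to literal bytes to keep the sketch short.

```python
from dataclasses import dataclass


@dataclass
class Sequence:
    anchor: str          # "BOF", "EOF" or "Var"
    pattern: bytes       # literal bytes only, to keep the sketch short
    offset: int = 0
    max_offset: int = 0

    def matches(self, data: bytes) -> bool:
        if self.anchor == "Var":
            return self.pattern in data  # scans the whole file
        n = len(self.pattern)
        for extra in range(self.max_offset + 1):
            skip = self.offset + extra
            if self.anchor == "BOF":
                start = skip
            else:  # "EOF": count back from the end of the file
                start = len(data) - skip - n
            if start >= 0 and data[start:start + n] == self.pattern:
                return True
        return False


@dataclass
class Signature:
    sequences: list

    def matches(self, data: bytes) -> bool:
        # ALL sequences within a Signature must match
        return all(seq.matches(data) for seq in self.sequences)


@dataclass
class Format:
    name: str
    signatures: list

    def matches(self, data: bytes) -> bool:
        # ANY matching Signature returns a hit
        return any(sig.matches(data) for sig in self.signatures)


# A made-up format with two signatures
example = Format("Example format", [
    Signature([Sequence("BOF", bytes.fromhex("A1B2C3D4")),
               Sequence("EOF", bytes.fromhex("E5F6"))]),
    Signature([Sequence("BOF", bytes.fromhex("CAFE"), offset=4)]),
])

print(example.matches(bytes.fromhex("A1B2C3D40000E5F6")))  # True
print(example.matches(bytes.fromhex("A1B2C3D40000")))      # False
print(example.matches(bytes.fromhex("00000000CAFE")))      # True
```

The second file fails because its first signature needs both the BOF and EOF sequences, and only the BOF sequence is present.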
PRONOM in Practice
The team at The National Archives have worked hard to create good resources for PRONOM research and development. The PRONOM in Practice guide is an important set of documents to follow up on after this tutorial.
- PRONOM syntax is a form of regular expression (regex), distinct from the regex implementations in common programming languages.
- PRONOM syntax can be combined in multiple ways.
- Sometimes there is more than one way to write a signature.