5. Binary Diffs¶
DiffX officially supports three types of binary diff formats:
VCDIFF binary diffs (recommended)
Git-compatible Literal binary diffs
Git-compatible Delta binary diffs
It’s recommended that new diffs make use of the VCDIFF format. The Git delta and literal binary diff formats are available for backward compatibility with Git-compatible diff implementations.
These formats are represented in similar ways. This page will discuss the common and diff-specific encoding and decoding rules for binary diffs.
5.1. Encoding/Decoding Basics¶
All three binary patch formats contain up to two payloads:
The application (forward) patch payload.
This is used to patch the binary file, transforming the original into the patched copy.
The reverse (backward) patch payload.
This is used to reverse the patch, transforming the patched copy back into the original.
For the VCDIFF binary diff, the reverse patch payload is optional. For the Git binary diffs, they are required, for backward compatibility.
Note
If the reverse payload is not available, then changes to a patched binary file cannot be undone.
Binary patches are in the following form:
<modified_type> <modified_raw_payload_length>
<modified_payload>
<original_type> <original_raw_payload_length>
<original_payload>
The contents will be in ASCII format.
Valid types and payload formats are dependent on the type of binary diff:
For VCDIFFs,
modified_typewill bevcdiff-applyandoriginal_typewill bevcdiff-reverse.For Git Literal binary diffs, both
modified_typeandoriginal_typewill beliteral.For Git Delta binary diffs, both
modified_typeandoriginal_typewill bedelta.
modified_raw_payload_length and original_raw_payload_length will be
the lengths of the raw modified and original payloads before compression and
encoding.
5.1.1. Binary Payloads¶
All three formats use the following format for payload data:
<len_c><data>
...
Each line represents up to 52 bytes of pre-encoded data. There may be an unlimited number of lines. They contain the following fields:
len_cis a line length character. This encodes the length of the (pre-encoded) data written on this line. See Line Length Characters below.datais Base85-encoded data for this line.
This format is memory-efficient, allowing data to be progressively written to or read from a DiffX file without needing to keep the entire contents in memory for the Base85-encode/decode steps.
Depending on the binary diff format, additional information may be included. This will be documented below for each binary diff format.
5.1.1.1. Encoding Logic¶
To encode data:
Prepare the data to encode.
Data preparation depends on the diff format:
For VCDIFFs, generate a VCDIFF of the file contents and then compress it with zlib.
For Git Literal binary diffs, read the file data and then compress it with zlib.
For Git Delta binary diffs, generate a delta set of instructions for patching the file, and then compress it with zlib.
For each 52-byte chunk of prepared data:
Generate and write a line length character representing the chunk length.
Generate and write Base85-encoded data for the chunk.
5.1.1.2. Decoding Logic¶
To decode data:
For each line:
Read and decode 1 byte for the line length character.
Read and Base85-decode remaining bytes for the line.
Process decoded data.
This depends on the diff format:
For VCDIFFs, zlib-decompress the data and read as a VCDIFF.
For Git Literal binary diffs, zlib-decompress the data and write to the patched binary file.
For Git Delta binary diffs, zlib-decompress the data, process the resulting delta instructions, and apply to the patched binary file.
5.1.2. Line Length Characters¶
Each encoded line in a binary diff payload is prefixed by a line length character. This encodes the length of the compressed (but not encoded) data for the line.
Line length characters always represent a value between 1 and 52:
A value of
A-Zrepresents a number between 1..26.A value of
a-zrepresents a number between 27..52.
5.1.2.1. Encoding Logic¶
To encode a length to a line length character:
If
line_lengthbetween 1..26:Set to ASCII character for
line_length + (ASCII value of 'A') - 1
Else if
line_lengthbetween 27..52:Set to ASCII character for
line_length + (ASCII value of 'a') - 1 - 26
Example in Python
LEN_LOWER = ord('Z') - ord('A') # 26
assert 1 <= unencoded_line_len <= 52
if unencoded_line_len <= LEN_LOWER:
len_c = chr(unencoded_line_len + ord('A') - 1)
else:
len_c = chr(unencoded_line_len + ord('a') - 1 - LEN_LOWER)
5.1.2.2. Decoding Logic¶
To decode a length from a line length character:
Set length to ASCII value of character.
If length is between 1..26:
Increment length by
1 - (ASCII value of 'A').
Else if length is between 27..52:
Increment length by
1 - (ASCII value of 'a') + 26
Example in Python
LEN_LOWER = ord('Z') - ord('A') # 26
unencoded_line_length = ord(len_c)
if unencoded_line_length <= LEN_LOWER:
unencoded_line_length += 1 - ord('A')
else:
unencoded_line_length += 1 - ord('a') + LEN_LOWER
5.2. VCDIFFs¶
DiffX recommends using VCDIFF files to represent binary diffs.
The VCDIFF format itself is covered under RFC 3284, and will not be documented here. We recommend using an existing VCDIFF implementation as part of your DiffX implementation.
A DiffX-stored VCDIFF is stored in the following format:
#...diff: length=<content_length>, type=binary, binary-format=vcdiff
vcdiff-apply <modified_raw_payload_length>
<len_c><data>
...
vcdiff-revert <original_raw_payload_length>
<len_c><data>
...
5.2.1. Encoding Logic¶
To encode a VCDIFF to a DiffX file:
Generate the binary VCDIFF data for the file using a VCDIFF implementation.
Compress the binary data with zlib.
For each 52-byte chunk of compressed data:
Generate and write a line length character representing the chunk length.
Generate and write Base85-encoded data for the chunk.
5.2.2. Decoding Logic¶
To decode an encoded VCDIFF from a DiffX file:
For each line of diff section content data:
Read and decode 1 byte for the line length character.
Read and Base85-decode remaining bytes for the line.
Cap the decoded data to the decoded line length.
This is important for ensuring we don’t treat any padding as zlib-compressed data.
zlib-decompress the resulting data.
Process decompressed data as a VCDIFF.
5.3. Git Literal Binary Diffs¶
Git Literal binary diffs contain the full contents of both the original and modified binary files, zlib-compressed and encoded as Base85.
A DiffX-stored Git Literal binary diff will be in the following format:
#...diff: length=<content_length>, type=binary, binary-format=git-literal
GIT binary patch
literal <modified_length>
<len_c><data>
...
literal <original_length>
<len_c><data>
...
5.3.1. Encoding Logic¶
To encode a Git Literal binary diff:
Compress the binary file using zlib.
For each 52-byte chunk:
Encode a line length character for the chunk size.
It may be < 52 for the final chunk in a file.
Encode the chunk data using Base85.
Write the line length character and encoded data.
Example in Python
LEN_A = ord('A') # 65
LEN_Z = ord('Z') # 90
LEN_a = ord('a') # 97
LEN_z = ord('z') # 122
LEN_LOWER = LEN_Z - LEN_A + 1 # 26
LEN_UPPER = LEN_z - LEN_a + 1 + LEN_LOWER # 52
data: bytes = b'<file data>'
compressed_data: bytes = zlib.compress(data)
compressed_len: int = len(compressed_data)
pos: int = 0
while pos < compressed_len:
chunk: bytes = compressed_data[pos:pos + LEN_UPPER]
chunk_len: int = len(chunk)
pos += chunk_len
if chunk_len <= LEN_LOWER:
len_c = chunk_len + LEN_A - 1
else:
len_c = chunk_len + LEN_a - 1 - LEN_LOWER
out.write('%c%s' % (len_c, base64.b85encode(chunk, pad=True)))
5.3.2. Decoding Logic¶
To decode a Literal:
For each line:
Read and decode 1 byte for the line length character.
Read and Base85-decode remaining bytes for the line.
Cap the decoded data to the decoded line length.
This is important for ensuring we don’t treat any padding as zlib-compressed data.
zlib-decompress the resulting data as the patched file.
Example in Python
LEN_A = ord('A') # 65
LEN_Z = ord('Z') # 90
LEN_a = ord('a') # 97
LEN_LOWER = LEN_Z - LEN_A + 1 # 26
lines_data: list[bytes] = [b'...', ...]
result_lines: list[bytes] = []
for line_data in lines_data:
length: int = line_data[0]
if length <= LEN_LOWER:
length += 1 - LEN_A
else:
length += 1 - LEN_a + LEN_LOWER
result_lines.append(base64.b85decode(line_data[1:])[:length])
result_data: bytes = zlib.decompress(b''.join(result_lines))
5.4. Git Delta Binary Diffs¶
Git Delta binary diffs contain instructions on applying patches to binary files. Unlike Git Literal Binary Diffs, these do not require embedding the full content of the new file in the patch.
There are two sets of delta lines, each starting with delta. The first
patches the original file, producing the patched file. The second reverts
the patched file, producing the original file.
A DiffX-stored Git Delta binary diff will be in the following format:
#...diff: length=<content_length>, type=binary, binary-format=git-delta
GIT binary patch
delta <modified_raw_payload_length>
<len_c><data>
...
delta <original_raw_payload_length>
<len_c><data>
...
5.4.1. Delta Payload Format¶
The Git delta diff format describes the changes made to binary files. It’s in the following format:
Offset (bytes) |
Length (bytes) |
Description |
|---|---|---|
0 |
Variable (>= 2) |
Header |
>= 2 |
Variable |
Instructions |
5.4.1.1. Header Format¶
The header describes the total file size of the original file and the total file size of the modified file. It’s in the following format:
Offset (bytes) |
Length (bytes) |
Description |
|---|---|---|
0 |
Variable (>= 1) |
The size of the original file in bytes. |
>= 1 |
Variable (>= 1) |
The size of the modified file in bytes. |
Each length is encoded as a series of bytes, from Most-Significant Byte to Least. Each length byte uses 1 bit (Most-Significant Bit) to indicate if there’s more bytes to read for the length, and 7 bits to encode a value for the length.
Logic for encoding and decoding lengths will be shown below.
5.4.1.2. Instructions¶
The Delta format has two types of instructions:
ADD: Adds new bytes following the instruction to the modified file.
COPY: Copies a range of bytes from the original file to the modified file.
Instructions must cover the entire contents of the file. Unchanged ranges of the file must be represented as COPY instructions.
5.4.1.2.1. ADD Instruction¶
The ADD instruction adds new bytes following the instruction to the modified file. These bytes are appended to the target object at the current write position.
This instruction is in the following form:
Field |
Length (bytes) |
Description |
|---|---|---|
control |
1 |
The ADD instruction in the form of |
data |
Variable (1..127) |
The bytes to write. |
For example, the following instruction adds 6 new bytes to the modified file:
control |
byte1 |
byte2 |
byte3 |
byte4 |
byte5 |
byte6 |
|---|---|---|---|---|---|---|
00000110 |
0x68 |
0x65 |
0x6C |
0x6C |
0x6F |
0x21 |
5.4.1.2.2. COPY Instruction¶
The COPY instruction copies a range of bytes from the original file to the modified file. These bytes are appended to the target object at the current write position.
This instruction is in the following form:
Field |
Length (bytes) |
Condition |
Description |
|---|---|---|---|
control |
1 |
Always present |
The COPY instruction in the form of |
offset1 |
1 |
Bit 1 is set |
Offset byte 1 to write (0x00 assumed if not present) |
offset2 |
1 |
Bit 2 is set |
Offset byte 2 to write (0x00 assumed if not present) |
offset3 |
1 |
Bit 3 is set |
Offset byte 3 to write (0x00 assumed if not present) |
offset4 |
1 |
Bit 4 is set |
Offset byte 4 to write (0x00 assumed if not present) |
size1 |
1 |
Bit 5 is set |
Size byte 1 to write (0x00 assumed if not present) |
size2 |
1 |
Bit 6 is set |
Size byte 2 to write (0x00 assumed if not present) |
size3 |
1 |
Bit 7 is set |
Size byte 3 to write (0x00 assumed if not present) |
The control bits indicate which of the offset and size bytes are present. An offset or size byte that is not present will be set to 0. This helps keep the COPY instructions as compact as possible.
The resulting offset is 4 bytes, and the resulting size is 3 bytes. Both are in little-endian order.
If the size is zero, it must be interpreted as 65535. Similarly, when encoding, a size of 65535 can be encoded to 0.
For example, the following instruction copies 2600 bytes of data at offset 123456:
control |
offset2 |
offset3 |
size1 |
size2 |
size3 |
|---|---|---|---|---|---|
10110111 |
0x40 |
0xE2 |
0x01 |
0x28 |
0x0A |
5.4.2. Encoding Logic¶
5.4.2.1. Outer Encoding¶
Write a header for the original file.
Write a header for the modified file.
For each hunk to write:
If COPY is optimal:
Write a COPY instruction.
Else if ADD is optimal:
Write an ADD instruction.
Compress the binary file using zlib.
For each 52-byte chunk:
Encode a line length character for the chunk size.
It may be < 52 for the final chunk in a file.
Encode the chunk data using Base85.
Write the line length character and encoded data.
5.4.2.2. Header Encoding¶
This takes in the length of the file the header represents.
Loop:
Set
header_bytetofile_len & 0x7F.Bit-shift
file_lenright 7 bits.If
file_lenis 0:Write
header_byte.Break loop.
Else if
file_lenis non-0:Write
header_byte | 0x80.
5.4.2.3. ADD Instruction Encoding¶
Verify size is <= 127 bytes.
Write size as byte.
Write new data bytes.
5.4.2.4. COPY Instruction Encoding¶
Set
controlbyte to 0x80.Compute offsets (up to 4 bytes):
If
(offset & 0xFF)is not 0:OR 0x01 to
control.Append
offset & 0xFFto arguments.
If
(offset >> 8 & 0xFF)is not 0:OR 0x02 to
control.Append
offset >> 8 & 0xFFto arguments.
If
(offset >> 16 & 0xFF)is not 0:OR 0x04 to
control.Append
offset >> 16 & 0xFFto arguments.
If
(offset >> 24 & 0xFF)is not 0:OR 0x08 to
control.Append
offset >> 24 & 0xFFto arguments.
Compute sizes (up to 3 bytes):
Set
sizeto 0 if set to 65535.If
(size & 0xFF)is not 0:OR 0x10 to
control.Append
size & 0xFFto arguments.
If
(size >> 8 & 0xFF)is not 0:OR 0x20 to
control.Append
size >> 8 & 0xFFto arguments.
If
(size >> 16 & 0xFF)is not 0:OR 0x40 to
control.Append
size >> 16 & 0xFFto arguments.
Write
controland arguments.
5.4.2.5. Example¶
Example in Python
LEN_A = ord('A') # 65
LEN_Z = ord('Z') # 90
LEN_a = ord('a') # 97
LEN_z = ord('z') # 122
LEN_LOWER = LEN_Z - LEN_A + 1 # 26
LEN_UPPER = LEN_z - LEN_a + 1 + LEN_LOWER # 52
orig_data: bytes = b'<file data>'
modified_data: bytes = b'<file data>'
result = io.Bytes()
for (mode,
modified_offset,
modified_hunk,
modified_size) in calc_hunks(orig_data, modified_data):
if mode == ADD:
assert 1 <= modified_size <= 127
result.write(bytes([modified_size, *modified_hunk]))
elif mode == COPY:
control = 0x80
args = bytearray()
for shift, mask in zip((0, 8, 16, 24),
(0x01, 0x02, 0x04, 0x08)):
value = (modified_offset >> shift) & 0xFF
if value:
control |= mask
args.append(value)
for shift, mask in zip((0, 8, 16),
(0x10, 0x20, 0x40)):
value = (modified_size >> shift) & 0xFF
if value:
control |= mask
args.append(value)
result.write(bytes([control, *args]))
compressed_data: bytes = zlib.compress(result.getvalue())
compressed_len: int = len(compressed_data)
pos: int = 0
while pos < compressed_len:
chunk: bytes = compressed_data[pos:pos + LEN_UPPER]
chunk_len: int = len(chunk)
pos += chunk_len
if chunk_len <= LEN_LOWER:
len_c = chunk_len + LEN_A - 1
else:
len_c = chunk_len + LEN_a - 1 - LEN_LOWER
out.write('%c%s' % (len_c, base64.b85encode(chunk, pad=True)))
5.4.3. Decoding Logic¶
5.4.3.1. Outer Decoding¶
For each line:
Read and decode 1 byte for the line length character.
Read and Base85-decode remaining bytes for the line.
Cap the decoded data to the decoded line length.
This is important for ensuring we don’t treat any padding as zlib-compressed data.
zlib-decompress the resulting data as the Delta instructions payload.
Read the original file header.
Read the modified file header
While there’s data to read:
Read 1 byte as
control.If left-most bit 1 is set (
control & 0x80):Decode as a COPY, giving
src_offsetandcopy_len.Write data from original file at
src_offsetof lengthcopy_len.
Else:
Decode as an ADD, giving
add_lengthandadd_bytes.
Write
add_bytes.
5.4.3.2. Header Decoding¶
Set
file_lenandshiftto 0.Loop:
Read
header_byte.OR
(header_byte & 0x7F) << shifttofile_len.If
header_byte & 0x80is set:Break loop.
Add 7 to
shift.
Return
file_len.
5.4.3.3. ADD Instruction Decoding¶
Use
controlasadd_length.Read and return
add_lengthbytes.
5.4.3.4. COPY Instruction Decoding¶
Compute
src_offset(up to 4 bytes):If
(control & 0x01)is not 0:Read 1 byte as
src_offset.
If
(control & 0x02)is not 0:Read 1 byte and bit-shift left 8.
OR result to
src_offset.
If
(control & 0x04)is not 0:Read 1 byte and bit-shift left 16.
OR result to
src_offset.
If
(control & 0x08)is not 0:Read 1 byte and bit-shift left 24.
OR result to
src_offset.
Compute
copy_len(up to 3 bytes):If
(control & 0x10)is not 0:Read 1 byte as
copy_len.
If
(control & 0x20)is not 0:Read 1 byte and bit-shift left 8.
OR result to
copy_len.
If
(control & 0x40)is not 0:Read 1 byte and bit-shift left 16.
OR result to
copy_len.
If
copy_lenis 0:Set to
65536.
Return
src_offsetandcopy_len.
5.4.3.5. Example¶
Example in Python
LEN_A = ord('A') # 65
LEN_Z = ord('Z') # 90
LEN_a = ord('a') # 97
LEN_z = ord('z') # 122
LEN_LOWER = LEN_Z - LEN_A + 1 # 26
LEN_UPPER = LEN_z - LEN_a + 1 + LEN_LOWER # 52
orig_data: bytes = '<original file data>'
delta_lines_data: list[bytes] = [b'...', ...]
# Decode the delta data.
result_lines: list[bytes] = []
for line_data in delta_lines_data:
length: int = line_data[0]
if length <= LEN_LOWER:
length += 1 - LEN_A
else:
length += 1 - LEN_a + LEN_LOWER
result_lines.append(base64.b85decode(line_data[1:])[:length])
# Decompress the decoded delta data.
delta_data: bytes = zlib.decompress(b''.join(result_lines))
delta_data_len = len(delta_data)
result = io.Bytes()
offset: int = 0
# Read headers.
for i in range(2):
file_len: int = 0
shift: int = 0
while True:
header_byte = delta_data[offset]
offset += 1
file_len |= (header_byte & 0x7F) << shift
if not (header_byte & 0x80):
break
shift += 7
while offset < delta_data_len:
op = delta_data[offset]
offset += 1
if op & 0x80:
# This is a COPY.
src_offset = 0
copy_len = 0
# Calculate the offset.
for shift, mask in zip((0, 8, 16, 24),
(0x01, 0x02, 0x04, 0x08)):
if op & mask:
src_offset |= delta_data[offset] << shift
offset += 1
# Calculate the length.
for shift, mask in zip((0, 8, 16),
(0x10, 0x20, 0x40)):
if op & mask:
copy_len |= delta_data[offset] << shift
offset += 1
if copy_len == 0:
copy_len = 65536
result.write(orig_data[src_offset:src_offset + copy_len])
else:
# This is an ADD.
add_len = op
result.write(delta_data[offset:offset + add_len])
offset += add_len
result_data = result.getvalue()