Magic Bytes in Python
Posted: April 23, 2025
Last Updated: April 23, 2025
I have a question. What if you want to figure out the file type of some bytes in Python? My guess is you would google “python get file type from bytes” or something similar, check this StackOverflow post, and land on the python-magic package. That’s great, but what if you are too lazy to install libmagic on Windows, yet not too lazy to write your own extensible file type guesser in Python? Then you would be me.
In my opinion, this approach is about as short and no-nonsense as you can get with file type guessing in Python. Essentially, we check the query byte-by-byte against each file signature, and we organize the signatures so that we can quickly filter out any that could not possibly apply. If you care about nothing else, the full code is below, and the details are described after.
Magic Bytes Full Code:
# * * * * *
# Original Author: Orcsune
# Date: April 23, 2025
# License: MIT License
# * * * * *
# Use the following methods to check your file type queries:
# format_from_file
# format_from_bytes
# extension_from_bytes
# mimetype_from_bytes
from typing import Dict, Iterable, Set, List, Union
from dataclasses import dataclass, field
from collections import defaultdict
from pathlib import Path
@dataclass
class FileFormat:
    """Class describing a file format with extension and MIME type."""
    # Either field may be None when a format has no extension or MIME type.
    extension: Union[str, None] = None
    mimetype: Union[str, None] = None

    @property
    def mime(self):
        return self.mimetype

    @property
    def ext(self):
        return self.extension

    @property
    def has_mimetype(self):
        return self.mimetype is not None

    @property
    def has_extension(self):
        return self.extension is not None
@dataclass
class SignatureBytes:
    data: bytes
    offset: int
    any_byte_idxs: Set[int] = field(default_factory=set)

    @property
    def length(self):
        """Total length of the signature including 'any'-bytes."""
        return len(self.data) + len(self.any_byte_idxs)

    def compare(self, other: bytes):
        """
        Compare an incoming byte sequence to see if it
        matches this signature. Compares byte-by-byte
        skipping any "any"-match indices, such as those
        in WAV: "52 49 46 46 ?? ?? ?? ?? 57 41 56 45".
        Args:
            other: Query bytes for guessing.
        Returns:
            match: True if query matches signature.
                False otherwise.
        """
        cut = other[self.offset:self.offset+self.length]
        # A query too short to hold the whole signature cannot match
        # (and would otherwise raise IndexError below).
        if len(cut) < self.length:
            return False
        comp_offset = 0
        for i in range(len(self.data)):
            # Skip over "any" bytes before comparing the next real byte
            while comp_offset + i in self.any_byte_idxs:
                comp_offset += 1
            if self.data[i] != cut[comp_offset + i]:
                return False
        return True

    def __str__(self) -> str:
        """Hex string representation of signature."""
        any_byte = "?? "
        final = any_byte*self.offset
        comp_offset = 0
        # Loop all bytes, replacing "any" bytes
        # with "?? " substring
        for i in range(self.length):
            if i in self.any_byte_idxs:
                final += any_byte
                comp_offset += 1
            else:
                final += self.data[i - comp_offset].to_bytes(1, "big").hex() + " "
        return final.strip()
class Signature:
    """
    Representation of a file type signature, including the
    signature bytes and the file type extension and MIME type.
    """
    def __init__(self, sig_bytes: SignatureBytes, extension="<no_extension>", mime="<no_mime>") -> None:
        self.signature = sig_bytes
        self.format = FileFormat(extension, mime)

    @property
    def data(self):
        """Signature bytes."""
        return self.signature.data

    @property
    def length(self):
        """Signature length (including 'any'-bytes)."""
        return self.signature.length

    @property
    def offset(self):
        """Signature offset."""
        return self.signature.offset

    @property
    def extension(self):
        """Signature extension."""
        return self.format.extension

    @property
    def mimetype(self):
        """Signature MIME type"""
        return self.format.mimetype

    def compare(self, other: bytes):
        """
        Compare an incoming byte sequence to see if it
        matches this signature. Compares byte-by-byte
        skipping any "any"-match indices, such as those
        in WAV: "52 49 46 46 ?? ?? ?? ?? 57 41 56 45".
        Args:
            other: Query bytes for guessing.
        Returns:
            match: True if query matches signature.
                False otherwise.
        """
        return self.signature.compare(other)

    def __str__(self) -> str:
        """Signature string, including extension and MIME type."""
        return f"Signature(ext={self.extension}, mime={self.mimetype}, signature={self.signature})"
class SignatureMap:
    """
    Special signature map that does some easy preprocessing
    that separates each signature based on their first bytes.
    This helps increase speed by pruning a large subset of
    signatures that do not start with the query's first byte.
    Also separated by offset.
    """
    def __init__(self, signatures: Iterable[Signature]) -> None:
        # The input is iterated twice below, so materialize it first;
        # a one-shot iterator would otherwise leave the map empty.
        signatures = list(signatures)
        # Separate the signatures based first on their offset values
        # Then separate them based on their first byte values.
        self.unique_offsets: List[int] = sorted(set(sig.offset for sig in signatures))
        self.signatures: Dict[int, Dict[bytes, List[Signature]]] = {
            offset: defaultdict(list) for offset in self.unique_offsets
        }
        # Fill the signature map
        for sig in signatures:
            self.signatures[sig.offset][sig.data[0].to_bytes(1, "big")].append(sig)

    def guess_format(self, other: bytes) -> Union[FileFormat, None]:
        """
        Guesses the file format of given bytes using the available
        file signatures.
        Args:
            other: Query bytes for guessing.
        Returns:
            format: FileFormat describing the file type,
                or None if no signature matches.
        """
        other_len = len(other)
        if other_len < 1:
            return None
        for offset in self.unique_offsets:
            # Ignore checking signatures where the offset
            # is out of bounds of the input comparison data
            if offset >= other_len:
                continue
            first_byte = other[offset].to_bytes(1, "big")
            if first_byte not in self.signatures[offset]:
                continue
            for signature in self.signatures[offset][first_byte]:
                if signature.compare(other):
                    return signature.format
        return None
# https://en.wikipedia.org/wiki/List_of_file_signatures
# gif (GIF87a): (offset 0), 47 49 46 38 37 61
# gif (GIF89a): (offset 0), 47 49 46 38 39 61
# png: (offset 0), 89 50 4E 47 0D 0A 1A 0A
# wav: (offset 0), 52 49 46 46 ?? ?? ?? ?? 57 41 56 45
# mp4 (ftypMSNV): (offset 4), 66 74 79 70 4D 53 4E 56
# mp4 (ftypisom): (offset 4), 66 74 79 70 69 73 6F 6D
SIG_GIF1 = Signature(SignatureBytes(b'\x47\x49\x46\x38\x37\x61', 0), "gif", "image/gif")
SIG_GIF2 = Signature(SignatureBytes(b'\x47\x49\x46\x38\x39\x61', 0), "gif", "image/gif")
SIG_PNG = Signature(SignatureBytes(b'\x89\x50\x4E\x47\x0D\x0A\x1A\x0A', 0), "png", "image/png")
SIG_WAV = Signature(SignatureBytes(b'\x52\x49\x46\x46\x57\x41\x56\x45', 0, set((4, 5, 6, 7))), "wav", "audio/wav")
SIG_MP41 = Signature(SignatureBytes(b'\x66\x74\x79\x70\x4D\x53\x4E\x56', 4), "mp4", "video/mp4")
SIG_MP42 = Signature(SignatureBytes(b'\x66\x74\x79\x70\x69\x73\x6F\x6D', 4), "mp4", "video/mp4")
signatures = [
    SIG_GIF1,
    SIG_GIF2,
    SIG_PNG,
    SIG_WAV,
    SIG_MP41,
    SIG_MP42
]
signature_map = SignatureMap(signatures)
def format_from_file(file_path: Union[str, Path]) -> Union[FileFormat, None]:
    """
    Guesses the file format of given file using the available
    file signatures from the SignatureMap.
    Args:
        file_path: Query file path for guessing.
    Returns:
        format: FileFormat describing the file type, or None
            if the file cannot be read or nothing matches.
    """
    chunk_size = 256
    try:
        with open(file_path, "rb") as f:
            data = f.read(chunk_size)
    except OSError:
        # Missing or unreadable file: treat as "unknown" instead
        # of swallowing every exception with a bare except.
        return None
    return format_from_bytes(data)
def format_from_bytes(file_bytes: bytes) -> Union[FileFormat, None]:
    """
    Guesses the file format of given bytes using the available
    file signatures from the SignatureMap.
    Args:
        file_bytes: Query bytes for guessing.
    Returns:
        format: FileFormat describing the file type.
    """
    if not isinstance(file_bytes, bytes):
        return None
    return signature_map.guess_format(file_bytes)

def extension_from_bytes(file_bytes: bytes) -> Union[str, None]:
    """
    Guesses the file extension of given bytes using the available
    file signatures from the SignatureMap.
    Args:
        file_bytes: Query bytes for guessing.
    Returns:
        extension: File extension describing the file type.
    """
    fformat = format_from_bytes(file_bytes)
    if fformat is None:
        return None
    return fformat.ext

def mimetype_from_bytes(file_bytes: bytes) -> Union[str, None]:
    """
    Guesses the MIME type of given bytes using the available
    file signatures from the SignatureMap.
    Args:
        file_bytes: Query bytes for guessing.
    Returns:
        mimetype: MIME type describing the file type.
    """
    fformat = format_from_bytes(file_bytes)
    if fformat is None:
        return None
    return fformat.mime
FileFormat:
FileFormat just describes the format of the file to be identified. My only use case for this is to retrieve the MIME type and the file extension, but this can easily be changed or augmented. I make FileFormat a basic dataclass and provide some simple properties for quick access to these values. If a file type lacks an extension or a MIME type, you can check for that with has_extension or has_mimetype, respectively.
@dataclass
class FileFormat:
    """Class describing a file format with extension and MIME type."""
    # Either field may be None when a format has no extension or MIME type.
    extension: Union[str, None] = None
    mimetype: Union[str, None] = None

    @property
    def mime(self):
        return self.mimetype

    @property
    def ext(self):
        return self.extension

    @property
    def has_mimetype(self):
        return self.mimetype is not None

    @property
    def has_extension(self):
        return self.extension is not None
SignatureBytes:
The SignatureBytes class is where the actual “magic byte” checking occurs. As you can see, it’s actually pretty short. The major design decision was how to cover the majority of file types without adding a ton of complexity. Looking through the list of file signatures on Wikipedia, I saw that the two major obstacles were: 1) Signature Offset and 2) Split Signatures.
- Signature Offset: The offset of a signature is the index of the byte where the signature begins in a file of that type. For example, one of mp4’s signatures (66 74 79 70 69 73 6F 6D) has an offset of 4, so in reality its signature looks like (?? ?? ?? ?? 66 74 79 70 69 73 6F 6D), where the ?? bytes represent “any” bytes: bytes that can be anything. To represent this, I just store a single integer for the offset. As we will see later, I can encode multiple different offsets for the same file type by having multiple Signatures in the map with different offsets. Of course, this is only viable for signatures with a small, finite number of possible offset values.
- Split Signature: The next problem was with split signatures, like wav’s (52 49 46 46 ?? ?? ?? ?? 57 41 56 45). Here, there are some “any” bytes stuck in the middle. To keep this extensible, I did not want to put any restrictions on the index or contiguity of these “any” bytes, so I chose to store each specific “any” byte index in a set. The indices are counted from the start of the signature, after the offset has been applied.
During comparison against some query bytes, it iterates over all of the signature’s bytes and compares each to the corresponding index in the query bytes. If it would be checking an index in the any_byte_idxs set, it increments a secondary index pointer until it reaches the next byte that is not one of the “any” bytes. If any byte does not match, it returns False; otherwise it returns True.
@dataclass
class SignatureBytes:
    data: bytes
    offset: int
    any_byte_idxs: Set[int] = field(default_factory=set)

    @property
    def length(self):
        """Total length of the signature including 'any'-bytes."""
        return len(self.data) + len(self.any_byte_idxs)

    def compare(self, other: bytes):
        """
        Compare an incoming byte sequence to see if it
        matches this signature. Compares byte-by-byte
        skipping any "any"-match indices, such as those
        in WAV: "52 49 46 46 ?? ?? ?? ?? 57 41 56 45".
        Args:
            other: Query bytes for guessing.
        Returns:
            match: True if query matches signature.
                False otherwise.
        """
        cut = other[self.offset:self.offset+self.length]
        # A query too short to hold the whole signature cannot match
        # (and would otherwise raise IndexError below).
        if len(cut) < self.length:
            return False
        comp_offset = 0
        for i in range(len(self.data)):
            # Skip over "any" bytes before comparing the next real byte
            while comp_offset + i in self.any_byte_idxs:
                comp_offset += 1
            if self.data[i] != cut[comp_offset + i]:
                return False
        return True

    def __str__(self) -> str:
        """Hex string representation of signature."""
        any_byte = "?? "
        final = any_byte*self.offset
        comp_offset = 0
        # Loop all bytes, replacing "any" bytes
        # with "?? " substring
        for i in range(self.length):
            if i in self.any_byte_idxs:
                final += any_byte
                comp_offset += 1
            else:
                final += self.data[i - comp_offset].to_bytes(1, "big").hex() + " "
        return final.strip()
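To make the “any”-byte handling concrete, here is a small standalone sketch of the same idea, checked against a WAV-style header and an AVI-style header (both start with RIFF). The function name matches and the sample headers are made up for illustration and are not part of the code above; it also assumes every wildcard index falls inside the signature.

```python
from typing import FrozenSet


def matches(query: bytes, data: bytes, offset: int = 0,
            any_idxs: FrozenSet[int] = frozenset()) -> bool:
    """Return True if `query` carries the signature at `offset`.

    `any_idxs` are wildcard positions counted from the start of the
    signature (after the offset), like any_byte_idxs in the post.
    Assumes every wildcard index lies inside the signature.
    """
    total = len(data) + len(any_idxs)
    window = query[offset:offset + total]
    if len(window) < total:
        return False  # query too short to contain the whole signature
    concrete = iter(data)
    for pos in range(total):
        if pos in any_idxs:
            continue  # wildcard: accept whatever byte is here
        if window[pos] != next(concrete):
            return False
    return True


# "RIFF", four size bytes (wildcards), then "WAVE"
WAV_DATA = b"RIFFWAVE"
WAV_ANY = frozenset({4, 5, 6, 7})

# Hypothetical sample headers, just for the demonstration
wav_header = b"RIFF" + (36).to_bytes(4, "little") + b"WAVEfmt "
avi_header = b"RIFF" + (36).to_bytes(4, "little") + b"AVI LIST"

print(matches(wav_header, WAV_DATA, 0, WAV_ANY))  # True
print(matches(avi_header, WAV_DATA, 0, WAV_ANY))  # False
print(matches(b"RIFF", WAV_DATA, 0, WAV_ANY))     # False (too short)
```

The AVI case is the reason the trailing WAVE bytes matter: checking only the leading RIFF would accept any RIFF container.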
Signature:
The Signature is just a wrapper that ties FileFormat and SignatureBytes instances together. It provides an easy interface to check for a match and return the resulting format information in one place. The implementation is basically just wrapper functions around the important encapsulated instances.
class Signature:
    """
    Representation of a file type signature, including the
    signature bytes and the file type extension and MIME type.
    """
    def __init__(self, sig_bytes: SignatureBytes, extension="<no_extension>", mime="<no_mime>") -> None:
        self.signature = sig_bytes
        self.format = FileFormat(extension, mime)

    @property
    def data(self):
        """Signature bytes."""
        return self.signature.data

    @property
    def length(self):
        """Signature length (including 'any'-bytes)."""
        return self.signature.length

    @property
    def offset(self):
        """Signature offset."""
        return self.signature.offset

    @property
    def extension(self):
        """Signature extension."""
        return self.format.extension

    @property
    def mimetype(self):
        """Signature MIME type"""
        return self.format.mimetype

    def compare(self, other: bytes):
        """
        Compare an incoming byte sequence to see if it
        matches this signature. Compares byte-by-byte
        skipping any "any"-match indices, such as those
        in WAV: "52 49 46 46 ?? ?? ?? ?? 57 41 56 45".
        Args:
            other: Query bytes for guessing.
        Returns:
            match: True if query matches signature.
                False otherwise.
        """
        return self.signature.compare(other)

    def __str__(self) -> str:
        """Signature string, including extension and MIME type."""
        return f"Signature(ext={self.extension}, mime={self.mimetype}, signature={self.signature})"
SignatureMap:
I could have basically stopped here and just created and iterated a list of Signatures, but I did want to provide a little bit of indexing support to (hopefully) speed queries up slightly. The SignatureMap is my attempt at this indexing. It creates 2 nested mappings. The first mapping separates Signatures by their offset fields. Each offset’s value is another mapping, from the first byte of the signature to the list of signatures that start with that byte.
Why do this? There seemed to be enough variety in the first bytes of signatures that they would not all land under the same key in the dictionary, so we can filter out a bunch of potential Signatures just by checking the first byte of the query. Why not index on more than 1 byte, so that we have even fewer Signatures to iterate? Because we don’t know that a signature has more than 1 byte before its first “any” byte (take the PGP format, for example), so sticking with the first byte is a safe bet.
Other than the indexing, when we actually perform a query, we iterate the unique offset keys, and for each sub-dictionary we look up the list of Signatures matching the query’s byte at that offset (if any). Once we find this list, we iterate those signatures until one matches and return its format (or None if nothing matches).
class SignatureMap:
    """
    Special signature map that does some easy preprocessing
    that separates each signature based on their first bytes.
    This helps increase speed by pruning a large subset of
    signatures that do not start with the query's first byte.
    Also separated by offset.
    """
    def __init__(self, signatures: Iterable[Signature]) -> None:
        # The input is iterated twice below, so materialize it first;
        # a one-shot iterator would otherwise leave the map empty.
        signatures = list(signatures)
        # Separate the signatures based first on their offset values
        # Then separate them based on their first byte values.
        self.unique_offsets: List[int] = sorted(set(sig.offset for sig in signatures))
        self.signatures: Dict[int, Dict[bytes, List[Signature]]] = {
            offset: defaultdict(list) for offset in self.unique_offsets
        }
        # Fill the signature map
        for sig in signatures:
            self.signatures[sig.offset][sig.data[0].to_bytes(1, "big")].append(sig)

    def guess_format(self, other: bytes) -> Union[FileFormat, None]:
        """
        Guesses the file format of given bytes using the available
        file signatures.
        Args:
            other: Query bytes for guessing.
        Returns:
            format: FileFormat describing the file type,
                or None if no signature matches.
        """
        other_len = len(other)
        if other_len < 1:
            return None
        for offset in self.unique_offsets:
            # Ignore checking signatures where the offset
            # is out of bounds of the input comparison data
            if offset >= other_len:
                continue
            first_byte = other[offset].to_bytes(1, "big")
            if first_byte not in self.signatures[offset]:
                continue
            for signature in self.signatures[offset][first_byte]:
                if signature.compare(other):
                    return signature.format
        return None
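To see what the two-level index buys us, here is a stripped-down sketch of just the bucketing step. It uses plain tuples instead of Signature objects and integer byte values as keys instead of 1-byte bytes objects; those simplifications, and the names build_index and candidates, are mine and only for illustration.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

# (label, offset, signature bytes): a simplified stand-in for Signature
Sig = Tuple[str, int, bytes]

SIGS: List[Sig] = [
    ("gif", 0, b"GIF87a"),
    ("gif", 0, b"GIF89a"),
    ("png", 0, b"\x89PNG\r\n\x1a\n"),
    ("mp4", 4, b"ftypMSNV"),
    ("mp4", 4, b"ftypisom"),
]


def build_index(sigs: List[Sig]) -> Dict[int, Dict[int, List[Sig]]]:
    """Group signatures by offset, then by the value of their first byte."""
    index: Dict[int, Dict[int, List[Sig]]] = defaultdict(lambda: defaultdict(list))
    for sig in sigs:
        _, offset, data = sig
        index[offset][data[0]].append(sig)
    return index


def candidates(index: Dict[int, Dict[int, List[Sig]]], query: bytes) -> List[Sig]:
    """Signatures whose first byte agrees with the query at their offset."""
    found: List[Sig] = []
    for offset in sorted(index):
        if offset < len(query):
            # .get avoids creating empty buckets in the defaultdict
            found.extend(index[offset].get(query[offset], []))
    return found


index = build_index(SIGS)
print([label for label, _, _ in candidates(index, b"GIF89a\x01\x00")])
# ['gif', 'gif']
print([label for label, _, _ in candidates(index, b"\x00\x00\x00\x18ftypisom")])
# ['mp4', 'mp4']
```

Each query only reaches the signatures in one bucket per offset; the full byte-by-byte comparison still has to run on those survivors to pick the actual match.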
Setting Up:
Once all the implementations are out of the way, it becomes really simple to use. Just instantiate a bunch of Signatures that meet your specifications, pass them into the SignatureMap, and create a few public functions for your queries.
My current use case only really needs GIFs, PNGs, WAVs, and MP4s. Notice that there are 2 different Signatures each for GIFs and MP4s.
When checking a file path directly, format_from_file only reads the first chunk_size (256) bytes. This is a heuristic: most formats do not have large offsets, so there is no need to read very far. You can remove this limit from format_from_file if you need to.
# https://en.wikipedia.org/wiki/List_of_file_signatures
# gif (GIF87a): (offset 0), 47 49 46 38 37 61
# gif (GIF89a): (offset 0), 47 49 46 38 39 61
# png: (offset 0), 89 50 4E 47 0D 0A 1A 0A
# wav: (offset 0), 52 49 46 46 ?? ?? ?? ?? 57 41 56 45
# mp4 (ftypMSNV): (offset 4), 66 74 79 70 4D 53 4E 56
# mp4 (ftypisom): (offset 4), 66 74 79 70 69 73 6F 6D
SIG_GIF1 = Signature(SignatureBytes(b'\x47\x49\x46\x38\x37\x61', 0), "gif", "image/gif")
SIG_GIF2 = Signature(SignatureBytes(b'\x47\x49\x46\x38\x39\x61', 0), "gif", "image/gif")
SIG_PNG = Signature(SignatureBytes(b'\x89\x50\x4E\x47\x0D\x0A\x1A\x0A', 0), "png", "image/png")
SIG_WAV = Signature(SignatureBytes(b'\x52\x49\x46\x46\x57\x41\x56\x45', 0, set((4, 5, 6, 7))), "wav", "audio/wav")
SIG_MP41 = Signature(SignatureBytes(b'\x66\x74\x79\x70\x4D\x53\x4E\x56', 4), "mp4", "video/mp4")
SIG_MP42 = Signature(SignatureBytes(b'\x66\x74\x79\x70\x69\x73\x6F\x6D', 4), "mp4", "video/mp4")
signatures = [
    SIG_GIF1,
    SIG_GIF2,
    SIG_PNG,
    SIG_WAV,
    SIG_MP41,
    SIG_MP42
]
signature_map = SignatureMap(signatures)
def format_from_file(file_path: Union[str, Path]) -> Union[FileFormat, None]:
    """
    Guesses the file format of given file using the available
    file signatures from the SignatureMap.
    Args:
        file_path: Query file path for guessing.
    Returns:
        format: FileFormat describing the file type, or None
            if the file cannot be read or nothing matches.
    """
    chunk_size = 256
    try:
        with open(file_path, "rb") as f:
            data = f.read(chunk_size)
    except OSError:
        # Missing or unreadable file: treat as "unknown" instead
        # of swallowing every exception with a bare except.
        return None
    return format_from_bytes(data)
def format_from_bytes(file_bytes: bytes) -> Union[FileFormat, None]:
    """
    Guesses the file format of given bytes using the available
    file signatures from the SignatureMap.
    Args:
        file_bytes: Query bytes for guessing.
    Returns:
        format: FileFormat describing the file type.
    """
    if not isinstance(file_bytes, bytes):
        return None
    return signature_map.guess_format(file_bytes)

def extension_from_bytes(file_bytes: bytes) -> Union[str, None]:
    """
    Guesses the file extension of given bytes using the available
    file signatures from the SignatureMap.
    Args:
        file_bytes: Query bytes for guessing.
    Returns:
        extension: File extension describing the file type.
    """
    fformat = format_from_bytes(file_bytes)
    if fformat is None:
        return None
    return fformat.ext

def mimetype_from_bytes(file_bytes: bytes) -> Union[str, None]:
    """
    Guesses the MIME type of given bytes using the available
    file signatures from the SignatureMap.
    Args:
        file_bytes: Query bytes for guessing.
    Returns:
        mimetype: MIME type describing the file type.
    """
    fformat = format_from_bytes(file_bytes)
    if fformat is None:
        return None
    return fformat.mime
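If you would rather not hard-code chunk_size, one option is to derive the number of bytes you need from the signatures themselves: the furthest byte any signature touches is its offset plus its length. The sketch below works on plain (offset, length) pairs so that it runs on its own; the helper name required_prefix is hypothetical and not part of the code above.

```python
from typing import Iterable, Tuple

# (offset, total signature length including "any" bytes)
SigSpan = Tuple[int, int]


def required_prefix(spans: Iterable[SigSpan]) -> int:
    """Smallest number of leading bytes that covers every signature."""
    return max((offset + length for offset, length in spans), default=0)


SPANS = [
    (0, 6),   # GIF87a / GIF89a
    (0, 8),   # PNG
    (0, 12),  # WAV: RIFF + 4 "any" bytes + WAVE
    (4, 8),   # MP4: ftyp.... at offset 4
]

print(required_prefix(SPANS))  # 12
```

With the classes from this post, the equivalent would be something like max(sig.offset + sig.length for sig in signatures). For the six signatures registered here that comes to 12 bytes, so the 256-byte default leaves plenty of headroom.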
Example:
In my case, I mostly need the type guesser to distinguish between two different image file types that can be stored in TensorBoard. In my reinforcement learning project, I can store either PNG images or entire GIFs in the backend TensorBoard log. Both are considered images and are retrieved as ImageEvents from the EventAccumulator. Unfortunately, I also need to know the exact MIME types of these records because I am sending them over the network through an API. I just created a little helper function to allow me to do just that:
# ImageEvent and AudioEvent are the record types returned by
# TensorBoard's EventAccumulator.
from tensorboard.backend.event_processing.event_accumulator import AudioEvent, ImageEvent

def tb_event_to_media_format(event) -> Union[FileFormat, None]:
    """
    Return the FileFormat for the contents of a media file
    logged in a tensorboard file. Returns None if no
    such event or if conversion cannot find suitable
    indicators of a particular type.
    """
    fformat = None
    if isinstance(event, ImageEvent):
        fformat = format_from_bytes(event.encoded_image_string)
    elif isinstance(event, AudioEvent):
        fformat = format_from_bytes(event.encoded_audio_string)
    return fformat
Conclusion:
Need a small, easy, and extensible way to check for file types in Python without libmagic? Try this.