A file format is a particular way to encode information for storage in a computer file.
Since a disk drive, or indeed any computer storage, can store only bits, the computer must have some way
of converting information to 0s and 1s and vice versa. There are different
kinds of formats for different kinds of information. Within any one kind, however (word processor documents, for example), there are typically several different, and sometimes competing, formats.
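As an illustration of converting information to 0s and 1s and back, the following minimal sketch encodes a short piece of text as ASCII and prints the resulting bits. The function names `to_bits` and `from_bits` are hypothetical, chosen only for this example, and ASCII is just one of many possible encodings.

```python
def to_bits(text: str, encoding: str = "ascii") -> str:
    """Encode text to bytes, then render each byte as eight binary digits."""
    return " ".join(format(byte, "08b") for byte in text.encode(encoding))


def from_bits(bits: str, encoding: str = "ascii") -> str:
    """Reverse to_bits: parse each group of eight digits back into a byte."""
    return bytes(int(group, 2) for group in bits.split()).decode(encoding)


bits = to_bits("Hi")
print(bits)             # 01001000 01101001
print(from_bits(bits))  # Hi
```

The same bits decoded under a different encoding could yield different text, which is why the choice of format must be known to both writer and reader.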
Generality
Some file formats are designed to store very particular sorts of data: the JPEG format,
for example, is designed only to store static images. Other file formats, however, are
designed for storage of several different types of data: the GIF format supports storage of
both still images and simple animations, and the AVI format can act as a container for many
different types of multimedia. A text file is simply one that stores any text, in a format such as ASCII
or Unicode, with few if any control characters. Some file formats, such as HTML, or the
source code of some particular programming language, are in fact also text
files, but adhere to more specific rules which allow them to be used for specific purposes.
It is sometimes possible to cause a program to read a file encoded in one format as if it were encoded in another. For
example, either by making minor modifications to a Microsoft Word
document or by using a music-playing program that deals in "headerless" audio files, one can play a Microsoft Word document as if it were a song. The result does not sound very
musical, however, because a sensible arrangement of bits in one format is almost
always nonsensical in another.
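The effect can be sketched in a few lines. The example below reinterprets arbitrary bytes as "headerless" audio, assuming signed 16-bit little-endian samples; that sample layout and the function name `as_pcm16` are assumptions made for illustration, not a description of any particular music player.

```python
import struct


def as_pcm16(data: bytes) -> list[int]:
    """Reinterpret raw bytes as signed 16-bit little-endian audio samples."""
    usable = len(data) - (len(data) % 2)  # drop a trailing odd byte, if any
    return [sample for (sample,) in struct.iter_unpack("<h", data[:usable])]


# Bytes that are sensible as text become arbitrary-looking sample values.
document_bytes = b"Dear Sir or Madam"
print(as_pcm16(document_bytes))
```

Nothing in the bytes themselves prevents this reading; the samples are simply meaningless as sound.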
It is very difficult to make a principled distinction between a file format and a programming language, or between a "normal program" and a
programming language interpreter.
A programming language can be seen as a file format for storing algorithms, while even a simple image file viewer can be seen as
an "interpreter" for, say, the GIF "language".
Specifications
Many file formats, including some of the best known, have a published specification document (often with a reference implementation) that describes exactly how the data is to be encoded, and which can be
used to determine whether or not a particular program treats a
particular file format correctly. There are, however, two reasons why a format may lack a published specification. First, some file format
developers view their specification documents as trade secrets, and
therefore do not release them to the public. Prominent examples are several of the formats used by the Microsoft Office suite of applications. Second, some file format developers
never spend time writing a separate specification document; rather, the format is defined only implicitly, through the program(s)
that manipulate data in the format.
Using file formats without a publicly available specification can be costly. Learning how the format works will
require either reverse-engineering it from a reference
implementation or acquiring the specification document for a fee from the format developers. This second approach is possible
only when there is a specification document, and typically requires the signing of a non-disclosure agreement. Both strategies require
significant time, money, or both. Therefore, as a general rule, file formats with publicly available specifications are supported
by a large number of programs, while non-public formats are supported by only a few programs.
The part of intellectual property law most useful for
protecting ownership of a file format appears to be patent law. Although patents for
file formats are not directly permitted under US law, some formats require the encoding of data with patented algorithms. For example, the GIF file format requires the use of a patented algorithm,
and although the patent owner initially did not enforce the patent, it later began collecting fees for use of the algorithm. This
resulted in a significant decrease in the use of GIFs, and is partly responsible for the
development of the alternative PNG format. The patent expired in the US in
mid-2003 and worldwide in mid-2004. Algorithms
themselves are not currently patentable under European law.
Identifying the type of a file
Since files are seen by programs as streams of data, a method is required to determine the format of a particular file within
the filesystem — an example of metadata. Different operating
systems have traditionally taken different approaches to this problem, with each approach having its own advantages and
disadvantages.
Of course, most modern operating systems, and individual applications, need to use all of these approaches to process various
files, at least to be able to read 'foreign' file formats, if not work with them completely.
By file extension
One popular method — used by several operating systems including some of those produced by DEC, CP/M,
and consequently DOS and Windows — is to determine the format of a file based on the section of its name following the final
appearance of "." - referred to as the filename extension. For
example, HTML documents are identified by names ending .htm or .html, and GIF images by .gif. In the
original FAT filesystem, this extension was limited to three characters, and thus many formats
still use three-character identifiers even though most operating systems and application programs no longer have this limitation.
Since there is no standard list of extensions, more than one relatively rare format may use the
same extension, which can lead the operating system to misidentify files and confuse users.
One advantage of this approach is that the system can easily be tricked into treating a file as a different format simply by
renaming it: an HTML file can, for instance, be treated as plain text by renaming it from filename.html to
filename.txt. Although this strategy was useful to expert users who could easily understand and manipulate this
information, it was frequently confusing to less technical users, who might accidentally make a file unusable (or 'lose' it) by
renaming it incorrectly. This led more recent operating
system shells, such as those of Windows 95 and Mac OS X, to hide the extension when displaying lists of files, thus negating the advantages of having the format
accessible in the name.
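Identification by extension can be sketched as a simple table lookup on the text after the final ".". The table `EXTENSION_TYPES` and the function `type_from_name` below are hypothetical and deliberately tiny; real systems keep far larger registries, and no standard list exists.

```python
import os.path

# Hypothetical extension table, for illustration only.
EXTENSION_TYPES = {
    ".htm": "HTML document",
    ".html": "HTML document",
    ".gif": "GIF image",
    ".txt": "plain text",
}


def type_from_name(filename: str) -> str:
    """Guess a format from the part of the name after the final '.'."""
    _, extension = os.path.splitext(filename)
    return EXTENSION_TYPES.get(extension.lower(), "unknown")


print(type_from_name("filename.html"))  # HTML document
print(type_from_name("filename.txt"))   # plain text
```

Because only the name is consulted, renaming a file changes the reported type without touching its contents.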
By "magic number"
An alternative method, often associated with Unix and its derivatives, is to store a
magic number inside the file itself.
Originally, this term was used for a specific set of 2-byte identifiers at the beginning of
a file, but since any undecoded binary sequence can be regarded as a number, any feature of a file format which uniquely
distinguishes it can be used for identification. GIF images, for instance, always begin with the ASCII representation of either GIF87a or GIF89a, depending on which standard they adhere to. HTML
files are harder to spot by this method: they might begin with the ASCII characters <html>, or an appropriate
document type definition beginning
<!DOCTYPE, or, for XHTML, the XML
identifier which begins <?xml; or they might launch straight in with some text and still be valid
HTML.
This approach offers better guarantees that the format will be identified correctly, and can often determine more precise
information about the file. This is only useful, however, if the interface used to access the files allows the user to easily
manipulate any file in a variety of ways, as opposed to
double-clicking automatically doing the "right" thing; it is therefore more often associated with command line interfaces than graphical ones. Since reliable
"magic number" tests can be fairly complex, and each file must be tested against every possibility in the "magic file", this
approach is also relatively inefficient, especially for displaying large lists of files (in contrast, filename- and metadata-based
methods need check only one piece of data and match it against a sorted index). And, as the example of HTML shows, some file types
do not lend themselves to recognition in this way. It is, however, the best way for a program to check whether a file it has been
told to process is of the correct format: while the file's name or metadata may be altered independently of its content, failing a
well-designed magic number test is a strong sign that the file is either corrupt or of the wrong type.
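A content-based check along these lines might look like the following sketch. It tests only the GIF signatures and HTML prefixes mentioned above; the function names, the 64-byte read size, and the result strings are assumptions for illustration, and real "magic file" databases contain many more, and more elaborate, tests.

```python
GIF_SIGNATURES = (b"GIF87a", b"GIF89a")
# Heuristic prefixes only: valid HTML may begin with none of these.
HTML_PREFIXES = (b"<html", b"<!doctype", b"<?xml")


def sniff(header: bytes) -> str:
    """Classify a file from its leading bytes."""
    if header.startswith(GIF_SIGNATURES):
        return "GIF image"
    if header.lstrip().lower().startswith(HTML_PREFIXES):
        return "probably HTML"
    return "unknown"


def sniff_file(path: str) -> str:
    """Read the first few bytes of a file and classify them."""
    with open(path, "rb") as handle:
        return sniff(handle.read(64))


print(sniff(b"GIF89a\x01\x00\x01\x00"))  # GIF image
print(sniff(b"<!DOCTYPE html><html>"))   # probably HTML
print(sniff(b"Just some text."))         # unknown
```

The GIF test is exact, whereas the HTML test can only be a guess, which mirrors the difference in reliability described above.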
So-called shebang lines in script files are a special case of magic numbers. Here, the magic number is
human-readable text that identifies a specific command interpreter and options to be passed to the command interpreter.
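Reading a shebang line can be sketched as follows. The function `parse_shebang` is hypothetical and simplified: it treats everything after the interpreter path as a single option string, whereas actual operating systems differ in how they split and limit that text.

```python
def parse_shebang(first_line: str):
    """Return (interpreter, options) for a shebang line, or None otherwise."""
    if not first_line.startswith("#!"):
        return None
    parts = first_line[2:].strip().split(None, 1)
    if not parts:
        return None  # "#!" with nothing after it
    interpreter = parts[0]
    options = parts[1] if len(parts) > 1 else ""
    return interpreter, options


print(parse_shebang("#!/bin/sh -e\n"))  # ('/bin/sh', '-e')
print(parse_shebang("echo hello\n"))    # None
```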
By explicit metadata
A final way of storing the format of a file is to store explicit information about the format on the disc, alongside the file rather than within it.
This approach keeps the metadata separate from both the main data and the name, but is also less portable than either file extensions or "magic numbers", since the format has to be converted from
filesystem to filesystem. While this is also true to an extent of filename extensions (for instance, for compatibility
with MS-DOS's three-character limit), most forms of storage have a roughly equivalent definition of a file's data and name,
but may have varying or no representation of further metadata.
Apple Macintosh type-codes
The Macintosh's Hierarchical File System stores codes for creator and type as part of the directory entry for
each file. These codes are referred to as OSTypes; for instance, an application
written by Apple would have a creator of AAPL and a
type of APPL. RISC OS uses a similar system, consisting of a
12-bit number which can be looked up in a table of descriptions; e.g. the hexadecimal
number FF5 is "aliased" to PoScript, representing a PostScript
file.
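Both schemes amount to small fixed-size codes. The sketch below assumes that an OSType is a four-character code packed into a 32-bit big-endian integer, as the examples AAPL and APPL suggest, and models the RISC OS table with a single entry taken from the text. All function and variable names are hypothetical, and ASCII is assumed for simplicity.

```python
def ostype_to_int(code: str) -> int:
    """Pack a four-character code into one 32-bit integer (assumed layout)."""
    raw = code.encode("ascii")
    if len(raw) != 4:
        raise ValueError("expected exactly four characters")
    return int.from_bytes(raw, "big")


def int_to_ostype(value: int) -> str:
    """Unpack a 32-bit integer back into its four-character code."""
    return value.to_bytes(4, "big").decode("ascii")


# One-entry stand-in for the RISC OS table of descriptions.
RISC_OS_TYPES = {0xFF5: "PoScript"}


def risc_os_name(filetype: int) -> str:
    """Look up a 12-bit file type, falling back to its hexadecimal form."""
    if not 0 <= filetype <= 0xFFF:
        raise ValueError("RISC OS file types are 12-bit numbers")
    return RISC_OS_TYPES.get(filetype, format(filetype, "03X"))


print(hex(ostype_to_int("APPL")))  # 0x4150504c
print(risc_os_name(0xFF5))         # PoScript
```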
IBM/Microsoft extended attributes
The HPFS, NTFS, FAT12, FAT16, and FAT32 filesystems allow the storage of extended
attributes with files. These comprise an arbitrary list of name and value pairs, where the names are unique. There are
standardized meanings for certain names. One such is that the ".type" extended attribute is used to determine the file type. Its
value comprises a list of one or more file types associated with the file, each of which is a string, such as "Plain Text" or
"HTML document". Thus a file may have several types.
Unix extended attributes
The ext2, ext3, ReiserFS version 3, XFS, JFS, and FFS filesystems allow the storage of extended attributes
with files. These comprise an arbitrary list of "name=value" strings, where the names are unique, which can be accessed by their
"name" parts.
MIME types
MIME types are widely used in many Internet-related applications, and increasingly elsewhere, although their usage for on-disc type information is
rare. They consist of a standardised system of identifiers, each made up of a type and a sub-type separated by a
slash: for instance, text/html or
image/gif. They were originally intended as a way of identifying what type of file was attached to an e-mail, independent of the source and target operating systems. In Macintosh terms, the MIME
type encodes the file type information but not the file creator.
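Splitting such an identifier into its two parts is straightforward, as the following minimal sketch shows. The function `parse_mime` is hypothetical; it lower-cases the identifier on the assumption that matching is case-insensitive, and it does not handle optional parameters such as a character set.

```python
def parse_mime(mime: str):
    """Split 'type/sub-type' into a (type, sub-type) pair."""
    main, slash, sub = mime.strip().lower().partition("/")
    if not slash or not main or not sub:
        raise ValueError(f"not a type/sub-type pair: {mime!r}")
    return main, sub


print(parse_mime("text/html"))  # ('text', 'html')
print(parse_mime("image/gif"))  # ('image', 'gif')
```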