Est. 1996 Intermediate

Bzip2

A free, open-source data compression utility that uses the Burrows-Wheeler algorithm to achieve better compression ratios than gzip

Created by Julian Seward

Paradigm: Compression
Typing: N/A
First Appeared: 1996
Latest Version: 1.0.8 (2019)

Bzip2 is a free, open-source file compression utility created by Julian Seward and first released in July 1996. Its core algorithm is the Burrows-Wheeler transform (BWT), which typically yields better compression ratios than gzip’s DEFLATE algorithm at the cost of slower compression and decompression. For over a decade, .tar.bz2 was the dominant archive format for distributing open-source software.

History & Origins

Julian Seward, a British software developer also known for creating the Valgrind memory debugging tool, originally wrote a compression tool called bzip (sometimes referred to as “bzip1”). That tool used arithmetic coding for its entropy encoding step. Because of patent concerns surrounding arithmetic coding, Seward rewrote the tool from scratch and replaced arithmetic coding with Huffman coding. The result was bzip2, first publicly released as version 0.15 in July 1996.

The core innovation that bzip2 leverages is the Burrows-Wheeler Transform (BWT), described in a 1994 technical report by Michael Burrows and David Wheeler at Digital Equipment Corporation. Wheeler had originally conceived the transform in 1983, but the joint publication with Burrows brought it to wider attention. Seward recognized that combining BWT with other well-known techniques could produce a practical compression tool with excellent compression ratios.

Design Philosophy

Bzip2 was designed to fill a gap between gzip (fast but modest compression) and more exotic compression methods. Its core design goals were:

  1. Better compression than gzip — Achieve meaningfully smaller files for general-purpose data
  2. Patent-free algorithms — Use only unencumbered techniques (Huffman coding rather than arithmetic coding)
  3. Simplicity — A clean command-line interface modeled after gzip’s conventions
  4. Robustness — Reliable compression and decompression with integrity checking

Key Features

Compression Pipeline

Bzip2 applies a multi-stage compression pipeline during encoding:

  1. Run-length encoding (RLE) — Initial pass to reduce runs of repeated bytes
  2. Burrows-Wheeler Transform (BWT) — Block-sorts the data to group similar characters together, making the output more compressible
  3. Move-to-front (MTF) transform — Converts the BWT output into a sequence of small integers
  4. Run-length encoding — Second RLE pass on the MTF output
  5. Huffman coding — Final entropy encoding using multiple Huffman tables selected per segment
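
The two middle stages are the least intuitive, so a toy illustration may help. The sketch below is not bzip2's implementation: it uses a naive rotation sort that is only practical for short, non-empty inputs, and the function names (bwt, inverse_bwt, move_to_front) are chosen here for illustration. It shows how the BWT groups equal bytes together and how move-to-front then turns those groups into small integers.

```python
def bwt(data: bytes) -> tuple[bytes, int]:
    """Naive Burrows-Wheeler transform of a non-empty byte string.

    Sorts every rotation of the input and keeps the last column, plus the
    row index of the original string (needed to invert the transform).
    """
    n = len(data)
    order = sorted(range(n), key=lambda i: data[i:] + data[:i])
    last_column = bytes(data[(i - 1) % n] for i in order)
    return last_column, order.index(0)


def inverse_bwt(last_column: bytes, index: int) -> bytes:
    """Invert the transform by repeatedly prepending and re-sorting."""
    n = len(last_column)
    table = [b""] * n
    for _ in range(n):
        table = sorted(bytes([last_column[i]]) + table[i] for i in range(n))
    return table[index]


def move_to_front(data: bytes) -> list[int]:
    """Replace each byte by its position in a recency list of all 256 bytes."""
    recency = list(range(256))
    out = []
    for byte in data:
        position = recency.index(byte)
        out.append(position)
        recency.insert(0, recency.pop(position))
    return out


if __name__ == "__main__":
    transformed, index = bwt(b"banana")
    print(transformed, index)          # b'nnbaaa' 3
    print(move_to_front(transformed))  # [110, 0, 99, 99, 0, 0]
    print(inverse_bwt(transformed, index))
```

Note how the repeated bytes in the transformed string become zeros after move-to-front; long stretches of zeros are what the second run-length pass and the Huffman stage exploit.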

Block-Based Compression

Bzip2 compresses data in independent blocks. The block size ranges from 100 KB to 900 KB of uncompressed data and is selected with the -1 through -9 command-line flags, where -N requests a block of N × 100 KB (-9 is the default). Larger blocks generally achieve better compression but require more memory during both compression and decompression.
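
The effect of the block-size setting can be explored with Python's standard bz2 module, whose compresslevel argument takes the same 1 through 9 range as the command-line flags. This is a minimal sketch: the helper name compressed_sizes and the sample data are made up for illustration, and the sizes it prints depend entirely on the input.

```python
import bz2


def compressed_sizes(data: bytes, levels=(1, 5, 9)) -> dict:
    """Map each compression level to the size of the resulting bzip2 stream."""
    sizes = {}
    for level in levels:
        # compresslevel=N plays the role of the -N command-line flag.
        compressed = bz2.compress(data, compresslevel=level)
        # Every level must decompress back to the same bytes.
        assert bz2.decompress(compressed) == data
        sizes[level] = len(compressed)
    return sizes


if __name__ == "__main__":
    # Hypothetical sample input, large enough to span several blocks.
    sample = b"".join(b"line %d: bzip2 block size demo\n" % i for i in range(40_000))
    for level, size in compressed_sizes(sample).items():
        print(f"-{level}: {len(sample)} -> {size} bytes")
```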

File Format

Bzip2 files use the .bz2 extension and are identified by a distinctive header:

  • Bytes 1-2: BZ (ASCII signature)
  • Byte 3: h (indicating Huffman coding; the deprecated bzip1 used 0)
  • Byte 4: A digit 1 through 9 indicating the block size in units of 100 KB

Compressed block headers contain a magic number derived from the digits of pi (0x314159265359), and the end-of-stream marker uses a constant derived from the square root of pi (0x177245385090).
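
The header layout can be checked against real output with a few lines of Python. The sketch below relies on the standard bz2 module to produce streams; describe_header is a hypothetical helper name. It inspects only the first ten bytes, where the first marker happens to be byte-aligned; later markers in a stream can start at any bit offset, so this is not a general parser.

```python
import bz2

BLOCK_MAGIC = bytes.fromhex("314159265359")          # digits of pi
END_OF_STREAM_MAGIC = bytes.fromhex("177245385090")  # digits of sqrt(pi)


def describe_header(stream: bytes) -> dict:
    """Decode the 4-byte stream header and classify the marker that follows."""
    if len(stream) < 10:
        raise ValueError("too short to be a bzip2 stream")
    if stream[0:2] != b"BZ":
        raise ValueError("missing 'BZ' signature")
    if stream[2:3] != b"h":
        raise ValueError("unexpected version byte")
    level = stream[3] - ord("0")
    if not 1 <= level <= 9:
        raise ValueError("block-size digit must be 1 through 9")

    marker = stream[4:10]
    if marker == BLOCK_MAGIC:
        first = "block"
    elif marker == END_OF_STREAM_MAGIC:
        first = "end-of-stream"  # happens when the input was empty
    else:
        first = "unknown"
    return {"level": level, "first_marker": first}


if __name__ == "__main__":
    print(describe_header(bz2.compress(b"hello", compresslevel=4)))
    print(describe_header(bz2.compress(b"")))
```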

No formal specification for the bzip2 format has been published — the reference implementation serves as the de facto specification.

Integrity Checking

Each compressed block includes a CRC32 checksum, and the entire file has a combined checksum. Bzip2 verifies these checksums during decompression, providing built-in data integrity validation.
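
One way to observe this validation is to damage a compressed stream deliberately and try to decompress it. The following sketch uses Python's standard bz2 module; the helper name detects_corruption is made up for this example. Depending on which field the flipped byte lands in, the failure may surface as a checksum mismatch or as a structural error, and Python reports both as exceptions.

```python
import bz2


def detects_corruption(original: bytes) -> bool:
    """Flip one byte in the middle of a bzip2 stream and report whether
    decompression rejects the damaged data."""
    damaged = bytearray(bz2.compress(original))
    damaged[len(damaged) // 2] ^= 0xFF  # invert all bits of one byte
    try:
        bz2.decompress(bytes(damaged))
    except (OSError, ValueError):
        # OSError: invalid data or checksum mismatch;
        # ValueError: the stream ended before its end-of-stream marker.
        return True
    return False


if __name__ == "__main__":
    text = b"The quick brown fox jumps over the lazy dog. " * 200
    print("corruption detected:", detects_corruption(text))
```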

Recovery Tool

Bzip2 ships with bzip2recover, a companion utility that can extract individual blocks from a damaged .bz2 file. Since blocks are compressed independently, undamaged blocks can be recovered even if other parts of the archive are corrupted.
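
Block recovery depends on being able to locate block boundaries inside the bit stream. The sketch below illustrates that idea by scanning for the 48-bit block magic number at every bit offset; it is a simplified illustration under that assumption, not the bzip2recover source, and find_block_offsets is a hypothetical name. A real recovery tool would also have to rewrap each located block as a standalone stream.

```python
import bz2

BLOCK_MAGIC = 0x314159265359  # 48-bit block header marker


def find_block_offsets(stream: bytes) -> list[int]:
    """Return the bit offsets at which the block magic number starts.

    Blocks are not byte-aligned in general, so the scan slides a 48-bit
    window over the stream one bit at a time (most significant bit first).
    """
    mask = (1 << 48) - 1
    window = 0
    bits_seen = 0
    offsets = []
    for byte in stream:
        for shift in range(7, -1, -1):
            window = ((window << 1) | ((byte >> shift) & 1)) & mask
            bits_seen += 1
            if bits_seen >= 48 and window == BLOCK_MAGIC:
                offsets.append(bits_seen - 48)
    return offsets


if __name__ == "__main__":
    # 256,000 bytes at level 1 (100 KB blocks) should span several blocks.
    data = bytes(range(256)) * 1000
    compressed = bz2.compress(data, compresslevel=1)
    print(find_block_offsets(compressed))
```

The first offset is always 32, because the first block header follows the 4-byte stream header directly.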

Comparison with Other Compression Tools

Feature             | gzip                     | bzip2            | xz (LZMA)
Algorithm           | DEFLATE (LZ77 + Huffman) | BWT + Huffman    | LZMA2
Compression ratio   | Good                     | Better than gzip | Best of the three
Compression speed   | Fast                     | Slower than gzip | Slowest
Decompression speed | Fast                     | Slower than gzip | Faster than bzip2
Memory usage        | Low                      | Moderate         | Higher
First released      | 1992                     | 1996             | 2009

Bzip2 generally achieves compression ratios approximately 10-20% better than gzip on typical data, though the exact improvement varies significantly depending on the nature of the input. The xz format, which emerged later, typically achieves even better compression than bzip2 and decompresses faster, which led to xz gradually replacing bzip2 for many use cases.
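
Because the results depend so heavily on the input, it is worth measuring on representative data. Python's standard library includes gzip, bz2, and lzma modules, so a rough size comparison takes only a few lines. This is a sketch with default settings for each codec: compare_sizes and the default file path are illustrative choices, and it measures output size only, not speed or memory.

```python
import bz2
import gzip
import lzma
import sys


def compare_sizes(data: bytes) -> dict:
    """Compress the same bytes with gzip, bzip2, and xz at default settings
    and return the compressed size for each format."""
    codecs = {
        "gzip": (gzip.compress, gzip.decompress),
        "bzip2": (bz2.compress, bz2.decompress),
        "xz": (lzma.compress, lzma.decompress),
    }
    sizes = {}
    for name, (compress, decompress) in codecs.items():
        packed = compress(data)
        assert decompress(packed) == data  # sanity check: lossless round trip
        sizes[name] = len(packed)
    return sizes


if __name__ == "__main__":
    # Pass a file path as the first argument; defaults to this script itself.
    path = sys.argv[1] if len(sys.argv) > 1 else __file__
    with open(path, "rb") as handle:
        payload = handle.read()
    for name, size in compare_sizes(payload).items():
        print(f"{name:>5}: {size} bytes ({size / max(len(payload), 1):.1%} of original)")
```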

Evolution

The Early Years (1996–2005)

After the initial release in 1996, bzip2 quickly gained adoption in the Unix and Linux communities. Version 1.0 arrived around 2000, adding 64-bit large file support. Through the early 2000s, .tar.bz2 became the preferred archive format for distributing open-source software source code, offering a meaningful compression improvement over the .tar.gz format that had been standard.

Maintenance and Security Fixes (2005–2010)

Between 2005 and 2010, development focused primarily on security fixes:

  • Version 1.0.3 (February 2005) addressed a decompression bomb vulnerability
  • Version 1.0.4 (December 2006) fixed a file permissions race condition
  • Version 1.0.5 (December 2007) patched a vulnerability reported by CERT-FI
  • Version 1.0.6 (September 2010) fixed an integer overflow in decompression (CVE-2010-0405)

The Nine-Year Hiatus (2010–2019)

After version 1.0.6, bzip2 saw no new releases for nine years. During this period, bzip2 continued to be widely used and shipped by default on most Linux distributions, but the lack of active maintenance raised concerns in the open-source community.

Revival (2019–Present)

In June 2019, Federico Mena-Quintero, co-founder of the GNOME project, took over as maintainer. He released version 1.0.7 (June 2019) and 1.0.8 (July 2019), fixing several accumulated security vulnerabilities including CVE-2016-3189 (a use-after-free in bzip2recover) and CVE-2019-12900 (an out-of-bounds write during decompression). Version 1.0.8 also added large file support on Windows.

In June 2021, Micah Snyder assumed maintainership for feature development on the 1.1+ branch. The project repository moved to GitLab, and the build system was modernized from the original Makefile to Meson and CMake.

Parallel Implementations

Because bzip2’s block-based design compresses blocks independently, it is well-suited to parallelization:

  • pbzip2 — A parallel implementation using POSIX threads, created in 2003. It can use multiple CPU cores for both compression and decompression, and its output is compatible with standard bzip2 (version 1.0.2 and later).
  • lbzip2 — Another parallel implementation focused on efficient multi-threaded decompression of single-stream .bz2 files.
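
The general idea behind parallel compression can be sketched by splitting the input into chunks, compressing each chunk as its own bzip2 stream on a worker thread, and concatenating the results. This is an illustration of the approach under simplifying assumptions, not the code of pbzip2 or lbzip2; parallel_compress, the chunk size, and the worker count are choices made for this example. Python's bz2.decompress accepts concatenated streams, which is what makes the output readable as one file.

```python
import bz2
from concurrent.futures import ThreadPoolExecutor


def parallel_compress(data: bytes, chunk_size: int = 900_000, workers: int = 4) -> bytes:
    """Compress fixed-size chunks independently and concatenate the streams."""
    chunks = [data[i:i + chunk_size] for i in range(0, len(data), chunk_size)]
    if not chunks:
        chunks = [b""]  # still emit one valid (empty) stream
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # map() yields results in input order, so the chunks stay in sequence.
        streams = list(pool.map(bz2.compress, chunks))
    return b"".join(streams)


if __name__ == "__main__":
    payload = b"parallel bzip2 demo line\n" * 200_000
    packed = parallel_compress(payload)
    assert bz2.decompress(packed) == payload
    print(f"{len(payload)} -> {len(packed)} bytes")
```

Splitting into separate streams costs a little compression ratio compared with a single stream, since each chunk carries its own header and trailer.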

Current Relevance

Bzip2 remains widely installed — it ships by default on virtually all Linux distributions and macOS. However, its role has diminished as xz (LZMA) has become the preferred compression format for many use cases:

  • The Linux kernel switched from .tar.bz2 to .tar.xz in 2013
  • Many software projects now distribute sources as .tar.xz or .tar.zst (Zstandard)
  • Newer compression tools like Zstandard offer better speed-to-ratio tradeoffs

Despite this shift, bzip2 remains relevant for decompressing existing archives, as a dependency in many software build systems, and in contexts where its specific compression characteristics are advantageous.

Licensing

Bzip2 is released under a BSD-style license (SPDX identifier: bzip2-1.0.6) that permits free use, modification, and redistribution. The license requires retaining the copyright notice and prohibits using the author’s name for endorsement without permission.

Why It Matters

Bzip2 played a pivotal role in the open-source software ecosystem during the 2000s and early 2010s. By providing meaningfully better compression than gzip using only patent-free algorithms, it became the de facto standard for source code distribution. Its block-based design influenced later parallel compression implementations, and its approach to combining multiple transform stages into a compression pipeline demonstrated the practical power of the Burrows-Wheeler Transform for real-world data compression.

Timeline

1994
Michael Burrows and David Wheeler publish a technical report describing the Burrows-Wheeler Transform at Digital Equipment Corporation
1996
Julian Seward releases bzip2 version 0.15 in July, replacing his earlier bzip tool which used patent-encumbered arithmetic coding
2000
Version 1.0 released with 64-bit large file support and improved decompression robustness
2005
Version 1.0.3 released on February 15 with security fix for decompression bomb vulnerability (CAN-2005-1260)
2007
Version 1.0.5 released on December 10 with security fix for CERT-FI 20469
2010
Version 1.0.6 released on September 6 with fix for integer overflow in decompression (CVE-2010-0405); last release for nine years
2013
Kernel.org drops .tar.bz2 format in favor of .tar.xz for Linux kernel source distribution
2019
Federico Mena-Quintero, co-founder of GNOME, takes over maintainership; releases version 1.0.7 (June) and 1.0.8 (July) fixing multiple CVEs
2021
Micah Snyder assumes maintainership for feature development on the 1.1+ branch

Notable Uses & Legacy

Linux Kernel Distribution

The Linux kernel source code was distributed as .tar.bz2 archives for over a decade before kernel.org switched to xz in 2013.

Open-Source Software Distribution

For many years .tar.bz2 was the dominant format for distributing free and open-source software source code, offering significantly better compression than gzip tarballs.

Debian Package System

Bzip2 was a supported compression format for .deb packages in the Debian and Ubuntu ecosystems.

FreeBSD Ports

FreeBSD packages used bzip2 compression with the .tbz extension for distributing compiled software packages.

RPM Package Management

RPM-based Linux distributions including Red Hat and Fedora supported bzip2 as a compression option for package payloads.

Language Influence

Influenced By

gzip, Burrows-Wheeler Transform

Influenced

pbzip2, lbzip2
