Stata
A commercial statistical software package and command-driven programming environment for data management, analysis, and reproducible research, widely used in economics, epidemiology, and the social sciences
Created by William Gould (with Sean Becketti in the early years); developed by StataCorp
Stata is a commercial, general-purpose statistical software package that doubles as a programming environment for data management, statistical analysis, and reproducible research. First released in 1985 and developed by StataCorp, it is command-driven at its core: a user works by issuing concise commands that read data, transform it, estimate models, and produce tables and graphics. Those same commands can be collected into scripts (called do-files) and extended with user-written programs (ado-files), which is what makes Stata as much a language as a piece of software. It is especially entrenched in economics, epidemiology, sociology, and political science, where its combination of terse syntax, strong data-management tools, and reproducibility features has made it a research standard.
A note on dating
The encyclopedia metadata lists 1985 as Stata’s first-appearance year, and this is well documented: Stata 1.0 was officially released in January 1985. Development began in 1984 under William Gould, and the product was first shown and sold around the American Economic Association meeting in late 1984, but the versioned 1.0 release is dated to January 1985. This page therefore uses 1985 as the first-appearance year. Where a specific month or precise detail is less certain, dates below should be read as approximate.
History & Origins
Stata was created by William (Bill) Gould, with Sean Becketti contributing in the early years, under the banner of the Computing Resource Center in California. The name “Stata” is a coinage that blends stat (from statistics) and data — deliberately not an acronym, and chosen to rhyme with “data.” The first release, Stata 1.0 (January 1985), was an MS-DOS program with roughly 44 commands covering basic regression, summary statistics, and data management. In a landscape then dominated by large mainframe-oriented packages, Stata was notable for running on ordinary personal computers and for its fast, interactive command loop.
A defining moment came with Stata 2.1 (1990), which introduced ado-files — a mechanism that let users write new commands in Stata’s own language and share them. This turned Stata from a fixed program into an extensible platform: much of Stata’s later breadth came from commands contributed by statisticians and econometricians in the user community, many distributed through the Statistical Software Components (SSC) archive and installable directly from within the program.
In 1993, the company relocated to College Station, Texas and was renamed Stata Corporation, today StataCorp. Through the 1990s Stata became a genuinely cross-platform product, with versions for Windows, Macintosh, and Unix. Since roughly the release of Stata 8 (2003), StataCorp has shipped a major new version approximately every two years.
Design Philosophy
Stata is built around a few consistent ideas that account for its enduring popularity in research settings:
- One command, one idea. Stata commands follow a regular grammar: a command name, a variable list, qualifiers such as if and in, and options after a comma. Once the pattern is learned, most of the language reads the same way, whether summarizing a variable or fitting a mixed model.
- A single dataset in focus. Historically Stata worked on one rectangular dataset held in memory at a time, which keeps the mental model simple; modern versions relax this with frames (multiple in-memory datasets).
- Reproducibility first. Because analyses are expressed as text commands, they can be saved in do-files and re-run exactly. This makes Stata a natural fit for the replication packages that accompany academic papers.
- Extensibility in the same language. The ado-file system means users extend Stata using the very language they already know, lowering the barrier to contributing new methods.
- Documentation as a first-class artifact. Stata is known for extensive, methodologically detailed reference manuals that explain not just syntax but the statistics behind each command.
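The first, third, and fourth ideas above can be sketched in a short do-file. This is an illustration under stated assumptions, not an excerpt from Stata's documentation: it assumes Stata 16 or newer (adjust the version line for older releases) and the auto dataset that ships with Stata, and the command name meanrange is hypothetical, invented for this sketch.

```stata
* Reproducibility: pin the language version so the do-file
* behaves the same way under later releases (assumes Stata 16+)
version 16
sysuse auto, clear

* Grammar: command varlist [if] [in] [, options]
summarize price mpg if foreign == 0, detail
list make price mpg in 1/5, clean
bysort foreign: summarize mpg

* Extensibility: a user-written command that accepts the same grammar.
* "meanrange" is a hypothetical name used only for illustration.
capture program drop meanrange
program define meanrange, rclass
    version 16
    syntax varname(numeric) [if] [in]
    marksample touse                      // 1 for observations selected by if/in
    quietly summarize `varlist' if `touse'
    display as text "mean of `varlist' = " as result %9.2f r(mean)
    return scalar range = r(max) - r(min)
end

* The new command is called like any built-in one
meanrange price if foreign == 1
display r(range)
```

Because syntax parses the if and in qualifiers for the program, the user-written command behaves like a built-in one, which is the sense in which extensions are written in the language users already know.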
Key Features
- Command-driven with an optional GUI. The command line and do-files are the heart of Stata, but since Stata 8 (2003) a full graphical interface with menus and dialog boxes has made commands discoverable for newcomers.
- Rich statistical coverage. Linear and generalized linear models, panel/longitudinal data, time series, survival analysis, survey estimation, multilevel/mixed models, structural equation modeling, Bayesian analysis, and causal-inference estimators are all built in.
- Strong data management. Reshaping, merging, collapsing, and by-group processing are concise and central to the language.
- Mata. Introduced in Stata 9 (2005), Mata is a compiled matrix programming language embedded inside Stata, with optional type declarations, used for writing fast numerical routines and the internals of many commands.
- ado-files and the SSC archive. User-written commands extend the language, with a large community catalog installable via ssc install.
- Python and other integration. Stata 16 (2019) added tight Python integration, letting users call Python from Stata and vice versa, alongside features such as multiple in-memory datasets.
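A few of the features listed above can be sketched together. The example below is a minimal illustration that assumes Stata 16 or newer (for frames) and the bundled auto dataset. The frame name summary and the Mata function name ols_sketch are hypothetical names chosen for this sketch, and estout is mentioned only as an example of an SSC package.

```stata
sysuse auto, clear

* By-group processing: attach the group mean of price to every observation
bysort foreign: egen mean_price = mean(price)

* Frames: collapse a copy of the data in a second frame,
* leaving the dataset in the current frame untouched
capture frame drop summary
frame put price mpg foreign, into(summary)
frame summary {
    collapse (mean) price mpg (count) n = price, by(foreign)
    list
}

* Mata: least-squares coefficients from the normal equations
capture mata: mata drop ols_sketch()
mata:
// hypothetical helper, named for this sketch only
real colvector ols_sketch(real colvector y, real matrix X)
{
    return(invsym(cross(X, X)) * cross(X, y))
}
y = st_data(., "price")
X = st_data(., ("mpg", "weight")), J(st_nobs(), 1, 1)   // append a constant column
ols_sketch(y, X)
end

* For comparison, the built-in estimator
regress price mpg weight

* Community commands install from SSC (requires internet access), for example:
* ssc install estout
```

The Mata result lists the coefficients in the order mpg, weight, constant, and can be compared with the regress output that follows it.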
A taste of the syntax
* Load an example dataset and summarize a variable
sysuse auto, clear
summarize price mpg
* Regression with a subset condition, then a graph
regress price mpg weight if foreign == 0
scatter price mpg
* A tiny user-written command (ado-style program)
* Drop any earlier definition so the snippet can be re-run
capture program drop hello
program define hello
    display "Hello from Stata"
end
hello
The example shows Stata’s regularity: summarize, regress, and scatter all follow the same command-verb-then-varlist shape, and if foreign == 0 restricts the regression to domestic cars, exactly as it reads.
Evolution
Stata’s growth has been steady and version-numbered. After the extensibility milestone of ado-files (1990), the biggest architectural additions were the GUI and modern graphics (Stata 8, 2003), the Mata matrix language (Stata 9, 2005), Unicode support and Bayesian analysis (Stata 14, 2015), and Python integration with multiple in-memory datasets (Stata 16, 2019). Recent releases — Stata 18 (April 2023) and Stata 19 (April 2025) — have continued to expand reporting/table tools, causal-inference methods, and machine-learning capabilities.
Stata is sold in editions that differ in dataset-size limits and processing speed rather than in available methods: the modern lineup is Stata/BE (basic), Stata/SE (standard, larger datasets), and Stata/MP (multiprocessor, the fastest edition). All editions share the same commands and statistical methods. According to StataCorp’s own Stata/MP performance report, Stata/MP runs meaningfully faster than the single-core editions on multicore hardware: on a dual-core machine, the company reports that commands overall run in roughly 71% of the time (about 1.4× faster) and estimation commands in roughly 59% of the time (a median of about 1.7× faster), with larger gains as more cores are added. These are StataCorp’s own benchmark figures and depend heavily on the specific commands, dataset, and hardware, so they should be read as illustrative rather than universal. A related, well-known subtlety is that splitting numerical work across processors can produce tiny differences in results: floating-point addition is not perfectly associative, so summing the same numbers in a different grouping can change the last few digits.
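The arithmetic behind those figures, and the round-off point, can be checked with a few display commands. This is a small sketch meant to be run from a do-file (the // comments are not accepted at the interactive prompt). The only inputs are the two percentages quoted above; the values 0.1, 0.2, and 0.3 are arbitrary illustrative numbers, not anything taken from StataCorp's report.

```stata
* A command that runs in fraction f of the original time is 1/f times faster
display %4.2f 1/0.71    // 1.41, about 1.4 times faster
display %4.2f 1/0.59    // 1.69, about 1.7 times faster

* Floating-point addition is not associative: regrouping changes the last bits
display (0.1 + 0.2) + 0.3 == 0.1 + (0.2 + 0.3)    // 0, the two sums differ
display %21.18f (0.1 + 0.2) + 0.3
display %21.18f 0.1 + (0.2 + 0.3)
```

A parallel sum that splits the data across cores is effectively choosing a different grouping, which is why results can differ in the final digits without either answer being wrong.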
Current Relevance
Stata remains one of the most widely used statistical packages in academic and applied research, particularly in economics, where replication packages for journal articles are very frequently written in it, and in epidemiology, public health, sociology, and political science. Its niche is quantitative research on tabular and survey data, where reproducibility, well-documented methods, and a concise data-management language matter more than raw general-purpose programming.
It competes with R (open source, enormously extensible), SAS (long dominant in pharma and enterprise), SPSS (menu-driven social science), and increasingly Python with its data-science libraries. Stata’s differentiators are its coherence, its documentation, its stability across versions, and a research community that has published thousands of user-written commands and a dedicated methodological journal, the Stata Journal. As a commercial product, it is licensed per user and per edition rather than freely distributed — a trade-off users accept for its polish and support.
Why It Matters
Stata occupies a distinctive place in the history of statistical computing: it showed that a command language and a statistics package could be the same thing, and that an ecosystem of user-written commands in that language could keep a commercial product at the research frontier for decades. Its emphasis on do-files and reproducibility anticipated today’s broad concern with replicable computational research, and its influence is visible in how empirical economics and quantitative social science are taught and practiced. For millions of research analyses over four decades, “the code” has meant a Stata do-file.
Notable Uses & Legacy
Academic economics research
Stata is a dominant tool in empirical economics; replication and data packages accompanying articles in major economics journals are very frequently written in Stata, and it is a standard part of graduate econometrics training
Epidemiology and public health
Researchers use Stata for survival analysis, epidemiological tables, survey-weighted estimation, and cohort and case-control studies, supported by a large biostatistics and epidemiology literature built around it
Sociology and political science
Quantitative social scientists rely on Stata for regression modeling, panel and longitudinal data analysis, and survey data, where its concise command syntax and strong data-management features are well suited to survey datasets
Policy and development institutions
Economists and analysts at research and policy organizations use Stata to clean survey data, run impact evaluations, and produce reproducible estimates for reports and working papers
The user-contributed command ecosystem
A large body of community-written commands (many distributed through the Statistical Software Components archive, SSC) extends Stata with specialized estimators and tools, installable directly from within the program