Every programming language has strings. They’re so universal — so unremarkable — that we rarely stop to ask how they actually work. You type some characters between quotation marks, hand them to a function, and move on.
But pull back the curtain and strings are one of the most philosophically contested corners of language design. What even is a string? Is it mutable or immutable? What encoding does it use? What does indexing mean? Is it a first-class type, or something the type system doesn’t acknowledge at all?
The answers vary wildly. And the disagreements aren’t arbitrary — each one reflects deeper choices about safety, performance, hardware constraints, and who the language was built for.
The Question Nobody Asks
Before we dig into languages, consider what we’re actually asking when we say “how does this language handle strings?”
There are at least four distinct decisions hidden in that question:
- Mutability — can you change a string after creation, or must you build a new one?
- Encoding — what unit does the language think a string is made of? Bytes? Code units? Code points?
- Indexing — what does str[0] return? A byte? A character? Is it even allowed?
- Memory model — where does the string live? Who owns it? Who frees it?
Most languages don’t make these decisions explicit. They’re baked into the design and quietly inherited by every program you write. Let’s make them explicit.
C: The Non-Type
C is the best place to start because it has the most honest answer to “what is a string?” — honest in the sense of being brutally minimal.
In C, a string is not a type. There is no String. There is only a null-terminated array of char, where the convention is that the array ends when you hit a byte with value zero ('\0'). That’s it.
The == operator on two char* values compares their memory addresses — it asks whether both pointers point to the same location, not whether they contain the same characters. This is a silent footgun that has caused bugs for fifty years. strcmp exists precisely because the type system provides no help.
C strings are mutable by default (though you can declare a pointer to a string literal as const). There is no automatic bounds checking. Writing past the end of a string buffer is the source of an entire category of security vulnerabilities — buffer overflows — that have shaped computing history.
C was designed in 1972 to write operating systems. The “strings are byte arrays with a null terminator” approach maps directly to how the hardware works. There is no abstraction overhead because there is no abstraction.
C++: Mutation by Default
C++ kept C’s char* for compatibility but added std::string — a proper heap-allocated string class that manages its own memory.
std::string is mutable and heap-allocated. You can modify characters in place, append to the end, replace substrings — all without creating new objects. This is the direct opposite of what Java, Python, and Go will do.
Java: Immutability and the Cost of +
Java strings are immutable. Every String object, once created, never changes. If you “modify” a string, what you actually do is create a new String object containing the modified content and let the old one become eligible for garbage collection.
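A small sketch of what "modifying" an immutable string really does (class name is illustrative):

```java
public class Immutability {
    public static void main(String[] args) {
        String s = "hello";
        String t = s.toUpperCase(); // returns a brand-new String
        System.out.println(s);      // hello -- the original is untouched
        System.out.println(t);      // HELLO
        System.out.println(s == t); // false -- two distinct objects
    }
}
```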
This has a consequence most Java developers learn the hard way: every += on a String inside a loop allocates a brand-new object, making repeated concatenation quadratic.
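A sketch of the two approaches side by side (loop bounds are illustrative):

```java
public class Concat {
    public static void main(String[] args) {
        // Each iteration copies everything so far into a fresh String:
        // quadratic time overall.
        String slow = "";
        for (int i = 0; i < 10000; i++) slow += "x";

        // One growable buffer, converted to a String once at the end: linear.
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < 10000; i++) sb.append("x");
        String fast = sb.toString();

        System.out.println(slow.length() == fast.length()); // true
    }
}
```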
StringBuilder exists specifically because Java strings are immutable. It’s a mutable buffer you build into, then convert to an immutable String at the end. The immutability of String is what forces the existence of a parallel mutable class.
Why make strings immutable at all? Immutable objects are thread-safe by definition — no synchronization needed when multiple threads read the same string. They can be safely used as keys in hash maps without defensive copies. They’re simpler to reason about. Java was designed for networked, multi-threaded enterprise applications where these properties matter.
Java stores strings as UTF-16 internally — each char is a 16-bit code unit, not a Unicode code point. Java has used 16-bit character storage since Java 1.0 (originally UCS-2), and added full UTF-16 supplementary character support with surrogate pairs in J2SE 5.0. For Basic Multilingual Plane characters, one char equals one code point. For characters outside the BMP — including most emoji — you need a surrogate pair: two consecutive char values that together encode a single code point.
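A sketch of the mismatch, using an emoji outside the BMP:

```java
public class Surrogates {
    public static void main(String[] args) {
        String s = "🙂"; // U+1F642, outside the BMP
        System.out.println(s.length());                      // 2 -- UTF-16 code units
        System.out.println(s.codePointCount(0, s.length())); // 1 -- actual code points
        System.out.println(Character.isHighSurrogate(s.charAt(0))); // true
    }
}
```

`length()` counts code units, not characters, so any code that treats the two as interchangeable breaks the moment an emoji appears.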
This is a persistent source of off-by-one errors in Java string manipulation.
Python: Code Points and Convenient Indexing
Python 3’s str is a sequence of Unicode code points. Not bytes. Not UTF-16 code units. Actual code points — the abstract Unicode identity of each character.
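A sketch of code-point semantics in practice (the sample string is illustrative):

```python
s = "héllo🙂"

print(len(s))   # 6 -- counted in code points, not bytes
print(s[0])     # 'h' -- indexing yields a length-1 str
print(s[5])     # '🙂' -- one code point, even outside the BMP
print(len(s.encode("utf-8")))  # 10 -- the UTF-8 byte length differs
```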
Python strings are immutable, like Java’s. Repeated concatenation with + has the same performance problem — use str.join() or io.StringIO for building large strings.
The design choice to make str[0] return a one-character string rather than a numeric code is deliberate. In Python, there is no separate char type. A character is just a length-1 string. This is convenient — you don’t need different comparison logic for characters versus strings — but it means Python treats strings as sequences of abstract code points, which requires internal bookkeeping to handle efficiently.
Go: Bytes by Default
Go makes a different choice that surprises many developers coming from Python or Java.
A Go string is a sequence of bytes. That’s the base definition. The bytes are conventionally UTF-8 encoded, but the type system doesn’t enforce this — a string can hold arbitrary binary data.
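A sketch of the byte-level view (the sample string is illustrative):

```go
package main

import "fmt"

func main() {
	s := "Héllo" // é occupies 2 bytes in UTF-8

	fmt.Println(len(s)) // 6 -- byte length, not character count
	fmt.Println(s[0])   // 72 -- the byte value of 'H'

	// range decodes UTF-8: rune values at their byte offsets.
	for i, r := range s {
		fmt.Printf("%d:%c ", i, r) // 0:H 1:é 3:l 4:l 5:o
	}
	fmt.Println()
}
```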
Index a string like s := "Héllo" with s[0] and you get 72 — the byte value of 'H'. Iterate with range and you get rune values (Go's name for Unicode code points) along with their byte positions. The byte index advances by the number of bytes in each code point, not by 1.
Go strings are immutable — you cannot modify a byte in place. But Go’s choice to expose byte-level indexing reflects Go’s design philosophy: stay close to how things actually work in memory. UTF-8 happens to be a byte-oriented encoding invented by Rob Pike and Ken Thompson (two of Go’s creators), which makes Go’s string model feel natural for UTF-8 text processing.
Rust: Two Types, One Problem
Rust has the most explicit string model of any mainstream language, and it reveals the fundamental tension clearly by splitting it into two types.
String is an owned, heap-allocated, mutable string. You own it; you’re responsible for it; it’s yours to change.
&str is a borrowed, immutable string slice — a reference to a sequence of UTF-8 bytes that someone else owns. It might point into a String, or into a string literal baked into the binary, or into any UTF-8 byte buffer.
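A sketch of the two types side by side (variable names are illustrative):

```rust
fn main() {
    // Owned, heap-allocated, growable.
    let mut owned: String = String::from("Hello");
    owned.push_str(", world");

    // Borrowed, immutable views into bytes someone else owns.
    let slice: &str = &owned[..5];           // borrows from the String
    let literal: &str = "in the binary";     // points into the executable

    println!("{} / {} / {}", owned, slice, literal);
    // owned[0] does not compile: no integer indexing on strings.
}
```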
Why does Rust refuse integer indexing entirely? Because a String is UTF-8 encoded, and UTF-8 characters are variable-width — 1 to 4 bytes each. owned[0] would have to return a byte, not a character, and a byte in the middle of a multi-byte sequence is not a valid character. Rust forces you to be explicit: use .chars().nth(0) for a code point, .as_bytes()[0] for a raw byte, or iterate with .chars().
The String / &str split encodes ownership semantics directly into the type system. When you see &str in a function signature, you know the function borrows without taking ownership and cannot outlive its input. When you see String, you know ownership is being transferred. This is not documentation — it is a compile-time guarantee enforced by the borrow checker.
Swift: The Indexing Refusal
Swift takes the most principled stance of any language on the question of integer indexing — and it’s one that surprises most developers the first time.
You cannot index a Swift string with an integer. There is no str[0]. You must use String.Index.
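A sketch of what the String.Index API looks like in practice (the sample string is illustrative):

```swift
let s = "café"
// s[0] is a compile-time error; positions are String.Index values.
let first = s[s.startIndex]                     // "c"
let second = s[s.index(after: s.startIndex)]    // "a"
let last = s[s.index(s.endIndex, offsetBy: -1)] // "é"
print(first, second, last)
print(s.count) // 4 -- grapheme clusters, however they are encoded
```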
Why? Because Swift’s Character type represents a “grapheme cluster” — what a human would call a single visible character, which may be composed of multiple Unicode code points. An accented character like é can be one code point (U+00E9) or two (e + combining accent U+0301). In Swift, both representations are considered a single Character.
Given this, there is no O(1) way to find “the nth character” without scanning from the beginning. Integer indexing would silently be O(n) — or worse, return a result in the middle of a multi-code-point grapheme cluster. Swift’s designers refused to pretend the operation is cheap when it isn’t. The awkward String.Index API is the language’s way of making the cost visible.
Ruby: Mutable by Default, Regretting It
Ruby strings are mutable by default. This was a natural choice for a language designed to be expressive and dynamic — you can modify strings in place without thinking about allocations.
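A sketch of in-place mutation and its aliasing consequences (variable names are illustrative):

```ruby
s = "hello"
s << ", world"   # << appends in place: same object, mutated
s[0] = "H"       # index assignment mutates too
t = s            # t names the same object...
t << "!"         # ...so the change shows through s as well

puts s           # Hello, world!
puts s.equal?(t) # true -- one object, two references

u = s + "?"      # + allocates a new string instead
puts u.equal?(s) # false
```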
Ruby’s << operator appends in place, mutating the original string. + creates a new string. The distinction matters for performance and for code that holds references to the original.
The problem with mutable strings in a modern context is concurrency. When multiple threads share a mutable string, you need locking. In Ruby, the Global VM Lock (GVL) historically provided some protection, but as Ruby improves its concurrency model (Ractor, introduced in Ruby 3.0, enables true parallelism), mutable strings become a hazard.
Ruby 3.x is gradually migrating toward frozen string literals. Add # frozen_string_literal: true at the top of a file and all string literals in that file become immutable — like Java or Python strings. The migration is opt-in, and it exists because the Ruby community recognized that the performance and safety tradeoffs of mutability have become a liability. This is a rare example of a mainstream language walking back a core design decision.
Lua: Immutable and Interned
Lua strings are immutable byte sequences — you cannot modify a string after creation. But Lua goes further than the partial literal pools of Java and Python: it interns strings systematically.
Interning means that if two strings contain the same bytes, they are the exact same object in memory — there is only ever one copy of any distinct string value. (Since Lua 5.2, this guarantee applies to short strings; long strings are kept separate and compared by content.)
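A sketch of interning in action (variable names are illustrative):

```lua
local a = "hello"
local b = "hel" .. "lo"  -- built at runtime, same bytes

print(a == b)     --> true (for interned strings, a pointer compare)
print(a:upper())  --> HELLO -- a new string; a itself is unchanged
print(a)          --> hello
```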
Equality comparison between interned Lua strings is O(1) — just a pointer comparison, because identical strings are the same pointer. This optimization falls out of immutability: you can only safely intern immutable values, because a mutable string could be changed after interning.
Lua is used heavily as an embedded scripting language (in games, in Nginx via OpenResty, in Redis via scripting). String interning makes dictionary lookups fast, which matters for the table-heavy idioms Lua programs rely on.
FORTRAN: Fixed Width and Space Padding
FORTRAN 77 had a string model that reflects its era: fixed-length character strings declared at compile time.
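A sketch of a fixed-form declaration and its space padding (the program and variable names are illustrative):

```fortran
      PROGRAM STRDEMO
C     Exactly 10 characters, fixed at compile time.
      CHARACTER*10 NAME
      NAME = 'ADA'
C     NAME now holds 'ADA' followed by 7 padding spaces.
      PRINT *, '[', NAME, ']'
      END
```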
CHARACTER*10 NAME allocates exactly 10 characters. If you assign a 5-character value, the remaining 5 positions are padded with spaces. If you assign an 11-character value, it’s truncated. Comparing strings required knowing — and accounting for — their declared length.
This is the string model of the batch-processing era: memory was precious and fixed-layout data structures made sense for the record-oriented I/O that scientific computing of the 1960s and 70s required. Modern Fortran (Fortran 90 and later) added allocatable character variables with dynamic length, but fixed-length strings with space padding remain a fixture in legacy FORTRAN 77 codebases that still run in production today.
The Comparison at a Glance
| Language | Mutable? | Encoding | str[0] returns | Ownership explicit? |
|---|---|---|---|---|
| C | Yes (no type) | Bytes (convention) | char (byte value) | Manual |
| C++ std::string | Yes | Bytes (convention) | char | Scoped/RAII |
| Java | No | UTF-16 code units | Compile error | GC |
| Python | No | Unicode code points | 1-char str | GC |
| Go | No | UTF-8 bytes | uint8 (byte value) | GC |
| Rust String | Yes (if owned) | UTF-8 bytes | Compile error | Borrow checker |
| Swift | No | Grapheme clusters | Compile error | ARC |
| Ruby | Yes (default) | Encoding-tagged bytes | 1-char String | GC |
| Lua | No (interned) | Bytes | string (1-char) | GC |
| FORTRAN 77 | Yes (fixed-width) | Bytes | CHARACTER*1 | Static |
What String Design Reveals
A language’s string model is not an isolated technical detail. It’s a window into the values and assumptions that shaped everything else.
C’s “strings are byte arrays” reflects a language that trusts the programmer completely and provides no safety net — because the cost of that safety net, in 1972, was unacceptable. The result is decades of buffer overflow vulnerabilities.
Java’s immutability reflects an enterprise language designed for multi-threaded servers, where shared mutable state is dangerous. The trade-off was the awkward StringBuilder dance and the existence of a parallel mutable type.
Rust’s String / &str split reflects a language whose entire reason for existing is to make ownership and lifetime explicit. You cannot use a string without knowing who owns it and how long it lives.
Swift’s refusal to support integer indexing reflects a language designed after Unicode grapheme clusters were well-understood, by designers willing to make an API awkward rather than let the performance model be misrepresented.
Ruby’s mutation-by-default and its gradual retreat toward frozen strings reflects the tension between expressiveness and the demands of modern concurrent applications — a tension that didn’t exist when Matz began designing Ruby in 1993.
Go’s byte-level indexing reflects the pragmatic, systems-oriented ethos of its creators: you should know what you’re actually indexing into, and the conversion to code points via range should be explicit.
When you pick up a language, its string handling tells you something about what its authors cared about and what trade-offs they considered acceptable. Strings aren’t solved. They’re a design problem every language answered differently — and the answer reveals the language’s soul.
Explore how these languages actually work: Rust, Go, Python, C, Swift, Ruby, and Lua all have runnable Hello World examples on CodeArchaeology — with Docker so you can test them today without installing anything.