Strings Are Not Simple: How 12 Languages Model Text (And Why They Disagree)

Every programming language has strings. They’re so universal — so unremarkable — that we rarely stop to ask how they actually work. You type some characters between quotation marks, hand them to a function, and move on.

But pull back the curtain and strings are one of the most philosophically contested corners of language design. What even is a string? Is it mutable or immutable? What encoding does it use? What does indexing mean? Is it a first-class type, or something the type system doesn’t acknowledge at all?

The answers vary wildly. And the disagreements aren’t arbitrary — each one reflects deeper choices about safety, performance, hardware constraints, and who the language was built for.


The Question Nobody Asks

Before we dig into languages, consider what we’re actually asking when we say “how does this language handle strings?”

There are at least four distinct decisions hidden in that question:

  1. Mutability — can you change a string after creation, or must you build a new one?
  2. Encoding — what unit does the language think a string is made of? Bytes? Code units? Code points?
  3. Indexing — what does str[0] return? A byte? A character? Is it even allowed?
  4. Memory model — where does the string live? Who owns it? Who frees it?

Most languages don’t make these decisions explicit. They’re baked into the design and quietly inherited by every program you write. Let’s make them explicit.


C: The Non-Type

C is the best place to start because it has the most honest answer to “what is a string?” — honest in the sense of being brutally minimal.

In C, a string is not a type. There is no String. There is only a null-terminated array of char, where the convention is that the array ends when you hit a byte with value zero ('\0'). That’s it.

#include <stdio.h>
#include <string.h>

int main() {
    char greeting[] = "Hello";
    // greeting is: {'H', 'e', 'l', 'l', 'o', '\0'}

    // THIS DOES NOT COMPARE STRINGS
    char other[] = "Hello";
    if (greeting == other) {  // Compares addresses, not content
        printf("equal\n");    // Never prints — the two arrays live at different addresses
    }

    // THIS compares strings correctly
    if (strcmp(greeting, other) == 0) {
        printf("equal\n");    // Correct
    }

    return 0;
}

The == operator on two char* values compares their memory addresses — it asks whether both pointers point to the same location, not whether they contain the same characters. This is a silent footgun that has caused bugs for fifty years. strcmp exists precisely because the type system provides no help.
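The identity-versus-content distinction is not unique to C. As a rough parallel, here is the same trap sketched in Python, where `is` compares object identity (like C's pointer `==`) and `==` compares content (like `strcmp` returning 0):

```python
# Build "Hello" at runtime so it is a distinct object from the literal.
a = "".join(["He", "llo"])
b = "Hello"

print(a == b)  # True -- content comparison, like strcmp(a, b) == 0
print(a is b)  # identity comparison, like C's pointer ==; usually False
               # here, though CPython may intern some strings
```

The difference is that Python makes `==` the content comparison by default, so the footgun requires going out of your way with `is`.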

C strings are mutable by default (though you can declare a pointer to a string literal as const). There is no automatic bounds checking. Writing past the end of a string buffer is the source of an entire category of security vulnerabilities — buffer overflows — that have shaped computing history.

C was designed in 1972 to write operating systems. The “strings are byte arrays with a null terminator” approach maps directly to how the hardware works. There is no abstraction overhead because there is no abstraction.


C++: Mutation by Default

C++ kept C’s char* for compatibility but added std::string — a proper heap-allocated string class that manages its own memory.

#include <iostream>
#include <string>

int main() {
    std::string greeting = "Hello";

    // Mutation is natural — this is an in-place modification
    greeting += ", World";
    greeting[0] = 'h';  // Direct character assignment works

    std::cout << greeting << std::endl;  // "hello, World"

    // == now compares content, not pointers
    std::string other = "hello, World";
    if (greeting == other) {
        std::cout << "equal" << std::endl;  // Correct
    }

    return 0;
}

std::string is mutable and heap-allocated. You can modify characters in place, append to the end, replace substrings — all without creating new objects. This is the direct opposite of what Java, Python, and Go will do.


Java: Immutability and the Cost of +

Java strings are immutable. Every String object, once created, never changes. If you “modify” a string, what you actually do is create a new String object containing the modified content and let the old one become eligible for garbage collection.

String s = "Hello";
s = s + ", World";  // s now points to a NEW String object
                    // "Hello" still exists in memory (until GC)

This has a consequence most Java developers learn the hard way:

// Catastrophically inefficient — O(n²) object allocations
String result = "";
for (int i = 0; i < 10000; i++) {
    result = result + i;  // Creates a new String on every iteration
}

// Correct approach — StringBuilder is mutable
StringBuilder sb = new StringBuilder();
for (int i = 0; i < 10000; i++) {
    sb.append(i);
}
String result2 = sb.toString();

StringBuilder exists specifically because Java strings are immutable. It’s a mutable buffer you build into, then convert to an immutable String at the end. The immutability of String is what forces the existence of a parallel mutable class.

Why make strings immutable at all? Immutable objects are thread-safe by definition — no synchronization needed when multiple threads read the same string. They can be safely used as keys in hash maps without defensive copies. They’re simpler to reason about. Java was designed for networked, multi-threaded enterprise applications where these properties matter.

Java stores strings as UTF-16 internally — each char is a 16-bit code unit, not a Unicode code point. Java has used 16-bit character storage since Java 1.0 (originally UCS-2), and added full UTF-16 supplementary character support with surrogate pairs in J2SE 5.0. For Basic Multilingual Plane characters, one char equals one code point. For characters outside the BMP — including most emoji — you need a surrogate pair: two consecutive char values that together encode a single code point.

String emoji = "😀";
System.out.println(emoji.length());       // 2 (two UTF-16 code units)
System.out.println(emoji.codePointCount(0, emoji.length())); // 1 (one code point)

This is a persistent source of off-by-one errors in Java string manipulation.
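You can reproduce Java's arithmetic from Python by encoding to UTF-16 explicitly — a sketch of what `String.length()` is actually measuring:

```python
# Java's String.length() counts UTF-16 code units. Encoding to
# UTF-16 (little-endian, no BOM) shows the same count from Python.
emoji = "\U0001F600"  # the grinning-face emoji, U+1F600

utf16 = emoji.encode("utf-16-le")
print(len(utf16) // 2)  # 2 -- two 16-bit code units (a surrogate pair)
print(len(emoji))       # 1 -- Python counts code points instead
```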


Python: Code Points and Convenient Indexing

Python 3’s str is a sequence of Unicode code points. Not bytes. Not UTF-16 code units. Actual code points — the abstract Unicode identity of each character.

greeting = "Hello"

# Indexing returns a one-character string, not a numeric code
print(greeting[0])   # "H"  (a str of length 1)
print(type(greeting[0]))  # <class 'str'>

# Some emoji are a single code point; others are sequences of several
flag = "🇨🇦"
print(len(flag))     # 2 (two regional indicator code points, but visually one flag)

smile = "😀"
print(len(smile))    # 1 (one code point)
print(smile[0])      # "😀"

# Strings are immutable
s = "hello"
# s[0] = "H"  # TypeError: 'str' object does not support item assignment

Python strings are immutable, like Java’s. Repeated concatenation with + has the same performance problem — use str.join() or io.StringIO for building large strings.
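The fix mirrors Java's StringBuilder: accumulate pieces, then concatenate in one pass. A minimal sketch of both idioms:

```python
import io

# O(n): collect the pieces, then join once
parts = [str(i) for i in range(10000)]
result = "".join(parts)

# Equivalent approach with a mutable in-memory text buffer
buf = io.StringIO()
for i in range(10000):
    buf.write(str(i))
result2 = buf.getvalue()

assert result == result2
```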

The design choice to make str[0] return a one-character string rather than a numeric code is deliberate. In Python, there is no separate char type. A character is just a length-1 string. This is convenient — you don’t need different comparison logic for characters versus strings — but treating strings as sequences of abstract code points requires internal bookkeeping: CPython uses a flexible representation (PEP 393) that stores each string with 1, 2, or 4 bytes per code point, depending on the widest character it contains.


Go: Bytes by Default

Go makes a different choice that surprises many developers coming from Python or Java.

A Go string is a sequence of bytes. That’s the base definition. The bytes are conventionally UTF-8 encoded, but the type system doesn’t enforce this — a string can hold arbitrary binary data.

package main

import "fmt"

func main() {
    s := "Hello"

    // Indexing yields a BYTE VALUE (uint8), not a character
    fmt.Println(s[0])         // 72 (ASCII/UTF-8 byte value of 'H')
    fmt.Printf("%T\n", s[0]) // uint8

    // To iterate over Unicode code points, use range
    for i, r := range "Hello, 世界" {
        fmt.Printf("%d: %c (%d)\n", i, r, r)
    }
    // 0: H (72)
    // 1: e (101)
    // 2: l (108)
    // 3: l (108)
    // 4: o (111)
    // 5: , (44)
    // 6:   (32)
    // 7: 世 (19990)   <- byte index 7
    // 10: 界 (30028)  <- byte index 10, not 8, because 世 is 3 bytes
}

When you index a Go string with s[0], you get 72 — the byte value. When you iterate with range, you get rune values (Go’s name for Unicode code points) along with their byte positions. The byte index advances by the number of bytes in each code point, not by 1.

Go strings are immutable — you cannot modify a byte in place. But Go’s choice to expose byte-level indexing reflects Go’s design philosophy: stay close to how things actually work in memory. UTF-8 happens to be a byte-oriented encoding invented by Rob Pike and Ken Thompson (two of Go’s creators), which makes Go’s string model feel natural for UTF-8 text processing.
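The byte arithmetic behind Go's jumping indices can be checked from any language by encoding to UTF-8 explicitly — here sketched in Python:

```python
# Go's range iterates code points but reports byte offsets.
# Encoding to UTF-8 shows why 世 sits at byte 7 and 界 at byte 10.
s = "Hello, 世界"

print(len(s))                    # 9  -- code points
print(len(s.encode("utf-8")))    # 13 -- bytes (7 ASCII + 3 + 3)
print(len("世".encode("utf-8")))  # 3  -- each of these CJK characters is 3 bytes
```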


Rust: Two Types, One Problem

Rust has the most explicit string model of any mainstream language, and it reveals the fundamental tension clearly by splitting it into two types.

String is an owned, heap-allocated, mutable string. You own it; you’re responsible for it; it’s yours to change.

&str is a borrowed, immutable string slice — a reference to a sequence of UTF-8 bytes that someone else owns. It might point into a String, or into a string literal baked into the binary, or into any UTF-8 byte buffer.

fn main() {
    // &str: a borrowed slice, lives in the binary (static lifetime)
    let literal: &str = "Hello, World";

    // String: owned, heap-allocated, mutable
    let mut owned: String = String::from("Hello");
    owned.push_str(", World");  // Mutation is fine — we own it
    owned.push('!');

    // A &str can borrow from a String
    let slice: &str = &owned[0..5];  // "Hello"

    println!("{}", owned);  // "Hello, World!"
    println!("{}", slice);  // "Hello"

    // You cannot index by integer — Rust refuses
    // let c = owned[0];  // Compile error: cannot index into `String`
}

Why does Rust refuse integer indexing entirely? Because a String is UTF-8 encoded, and UTF-8 characters are variable-width — 1 to 4 bytes each. owned[0] would be a byte, not a character. And a byte in the middle of a multi-byte character sequence is not a valid character. Rust makes you be explicit: use .chars().nth(0) for a code point, or [0..1] for a byte slice, or iterate with .chars().

The String / &str split encodes ownership semantics directly into the type system. When you see &str in a function signature, you know the function borrows without taking ownership and cannot outlive its input. When you see String, you know ownership is being transferred. This is not documentation — it is a compile-time guarantee enforced by the borrow checker.


Swift: The Indexing Refusal

Swift takes the most principled stance of any language on the question of integer indexing — and it’s one that surprises most developers the first time.

You cannot index a Swift string with an integer. There is no str[0]. You must use String.Index.

let greeting = "Hello, 世界!"

// This does NOT compile:
// let first = greeting[0]  // Error: subscript is unavailable

// You must use String.Index
let start = greeting.startIndex
let first = greeting[start]  // Character('H')

// To get the 8th character:
let eighthIndex = greeting.index(greeting.startIndex, offsetBy: 7)
let eighth = greeting[eighthIndex]  // Character('世')

// Iterating is straightforward
for char in greeting {
    print(char)
}

Why? Because Swift’s Character type represents a “grapheme cluster” — what a human would call a single visible character, which may be composed of multiple Unicode code points. An accented character like é can be one code point (U+00E9) or two (e + combining accent U+0301). In Swift, both representations are considered a single Character.

Given this, there is no O(1) way to find “the nth character” without scanning from the beginning. Integer indexing would silently be O(n) — or worse, return a result in the middle of a multi-code-point grapheme cluster. Swift’s designers refused to pretend the operation is cheap when it isn’t. The awkward String.Index API is the language’s way of making the cost visible.
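The one-code-point versus two-code-point é from the paragraph above can be produced directly with Unicode normalization — here sketched with Python's unicodedata module:

```python
import unicodedata

composed = "\u00e9"                                  # é as one code point
decomposed = unicodedata.normalize("NFD", composed)  # e + combining accent U+0301

print(len(composed))    # 1
print(len(decomposed))  # 2
print(composed == decomposed)  # False -- Python compares code point
                               # sequences; Swift's Character would treat
                               # both as the same grapheme cluster
```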


Ruby: Mutable by Default, Regretting It

Ruby strings are mutable by default. This was a natural choice for a language designed to be expressive and dynamic — you can modify strings in place without thinking about allocations.

greeting = "Hello"
greeting << ", World"   # In-place append — same object
greeting[0] = "h"       # In-place character replacement

puts greeting           # "hello, World"

# Strings are objects with methods
puts "hello world".split.map(&:capitalize).join(" ")
# "Hello World"

Ruby’s << operator appends in place, mutating the original string. + creates a new string. The distinction matters for performance and for code that holds references to the original.

The problem with mutable strings in a modern context is concurrency. When multiple threads share a mutable string, you need locking. In Ruby, the Global VM Lock (GVL) historically provided some protection, but as Ruby improves its concurrency model (Ractor, introduced in Ruby 3.0, enables true parallelism), mutable strings become a hazard.

Ruby 3.x is gradually migrating toward frozen string literals. Add # frozen_string_literal: true at the top of a file and all string literals in that file become immutable — like Java or Python strings. The migration is opt-in, and it exists because the Ruby community recognized that the performance and safety tradeoffs of mutability have become a liability. This is a rare example of a mainstream language walking back a core design decision.


Lua: Immutable and Interned

Lua strings are immutable byte sequences — you cannot modify a string after creation. But Lua adds a guarantee that Java and Python lack: strings are interned.

Interning means that if two strings contain the same bytes, they are guaranteed to be the exact same object in memory — there is only ever one copy of any distinct string value. (Older Lua versions intern every string; since the short/long string split, only short strings — up to 40 bytes by default — are interned, and long strings are compared by content.)

local a = "hello"
local b = "hel" .. "lo"  -- Concatenation

-- In Lua, a == b is O(1) because they are the same object
print(a == b)  -- true

-- String methods live in the string library
print(("hello"):upper())       -- "HELLO"
print(("hello world"):len())   -- 11
print(string.sub("hello", 2, 4))  -- "ell"

Equality comparison between interned Lua strings is O(1) — it’s just a pointer comparison, because identical strings are the same pointer. This optimization falls out of immutability: you can only safely intern immutable values, because a mutable string could be changed after interning.

Lua is used heavily as an embedded scripting language (in games, in Nginx via OpenResty, in Redis via scripting). String interning makes dictionary lookups fast, which matters for the table-heavy idioms Lua programs rely on.
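Python offers the same optimization on an opt-in basis through sys.intern, which is one way to see what Lua does automatically:

```python
import sys

# Two equal strings built independently at runtime...
a = "".join(["hello", " world"])
b = "hello world"

# ...become the same object once interned, so equality checks can
# short-circuit on identity -- the property Lua guarantees for free.
a = sys.intern(a)
b = sys.intern(b)
print(a is b)  # True
```

CPython uses exactly this trick internally for identifiers, which is why dictionary lookups on attribute names are fast.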


FORTRAN: Fixed Width and Space Padding

FORTRAN 77 had a string model that reflects its era: fixed-length character strings declared at compile time.

      CHARACTER*10 NAME
      CHARACTER*20 GREETING

      NAME = 'Alice'
      GREETING = 'Hello, ' // NAME

C     NAME is exactly 10 characters: 'Alice     ' (padded with spaces)
C     GREETING is exactly 20 characters: 'Hello, Alice        '

      IF (NAME .EQ. 'Alice     ') THEN
          PRINT *, 'Name matches'
      END IF

CHARACTER*10 NAME allocates exactly 10 characters. Assign a 5-character value and the remaining 5 positions are padded with spaces; assign an 11-character value and it is truncated. The .EQ. comparison blank-pads the shorter operand, so 'Alice' compares equal to 'Alice     ' — but the trailing blanks follow the value everywhere else, and LEN reports the declared length, not the length of the text you stored.

This is the string model of the batch-processing era: memory was precious and fixed-layout data structures made sense for the record-oriented I/O that scientific computing of the 1960s and 70s required. Modern Fortran (Fortran 90 and later) added allocatable character variables with dynamic length, but fixed-length strings with space padding remain a fixture in legacy FORTRAN 77 codebases that still run in production today.
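The fixed-width, space-padded semantics are easy to mimic in a modern language. A small illustrative sketch in Python (the `fixed` helper is hypothetical, not part of any library):

```python
WIDTH = 10

def fixed(value, width=WIDTH):
    # Emulate CHARACTER*10 assignment: truncate long values,
    # pad short ones with trailing spaces (illustration only).
    return value[:width].ljust(width)

name = fixed("Alice")
print(repr(name))                # 'Alice     '
print(name == "Alice")           # False -- trailing blanks matter here
print(name.rstrip() == "Alice")  # True
print(fixed("Christopher"))      # 'Christophe' -- truncated to 10
```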


The Comparison at a Glance

| Language         | Mutable?          | Encoding             | str[0] returns     | Memory management |
| ---------------- | ----------------- | -------------------- | ------------------ | ----------------- |
| C                | Yes (no type)     | Bytes (convention)   | char (byte value)  | Manual            |
| C++ std::string  | Yes               | Bytes (convention)   | char               | Scoped/RAII       |
| Java             | No                | UTF-16 code units    | Compile error      | GC                |
| Python           | No                | Unicode code points  | 1-char str         | GC                |
| Go               | No                | UTF-8 bytes          | uint8 (byte value) | GC                |
| Rust String      | Yes (if owned)    | UTF-8 bytes          | Compile error      | Borrow checker    |
| Swift            | No                | Grapheme clusters    | Compile error      | ARC               |
| Ruby             | Yes (default)     | Encoding-tagged bytes| 1-char String      | GC                |
| Lua              | No (interned)     | Bytes                | string (1-char)    | GC                |
| FORTRAN 77       | Yes (fixed-width) | Bytes                | CHARACTER*1        | Static            |

What String Design Reveals

A language’s string model is not an isolated technical detail. It’s a window into the values and assumptions that shaped everything else.

C’s “strings are byte arrays” reflects a language that trusts the programmer completely and provides no safety net — because the cost of that safety net, in 1972, was unacceptable. The result is decades of buffer overflow vulnerabilities.

Java’s immutability reflects an enterprise language designed for multi-threaded servers, where shared mutable state is dangerous. The trade-off was the awkward StringBuilder dance and the existence of a parallel mutable type.

Rust’s String / &str split reflects a language whose entire reason for existing is to make ownership and lifetime explicit. You cannot use a string without knowing who owns it and how long it lives.

Swift’s refusal to support integer indexing reflects a language designed after Unicode grapheme clusters were well-understood, by designers willing to make an API awkward rather than let the performance model be misrepresented.

Ruby’s mutation-by-default and its gradual retreat toward frozen strings reflects the tension between expressiveness and the demands of modern concurrent applications — a tension that didn’t exist when Matz began designing Ruby in 1993.

Go’s byte-level indexing reflects the pragmatic, systems-oriented ethos of its creators: you should know what you’re actually indexing into, and the conversion to code points via range should be explicit.

When you pick up a language, its string handling tells you something about what its authors cared about and what trade-offs they considered acceptable. Strings aren’t solved. They’re a design problem every language answered differently — and the answer reveals the language’s soul.


Explore how these languages actually work: Rust, Go, Python, C, Swift, Ruby, and Lua all have runnable Hello World examples on CodeArchaeology — with Docker so you can test them today without installing anything.
