Python generators are one of the most practically useful features in the language, and one of the most underused by engineers who learned Python from web tutorials. This post goes beyond “use yield instead of return” and covers the full picture: the iterator protocol, memory characteristics, yield from, generator pipelines, and real-world streaming use cases.
## Basic yield: Generators vs Regular Functions
When a function contains a `yield` statement, calling it does not execute the function body. Instead, it returns a generator object. The body executes lazily, pausing at each `yield` and resuming on the next call to `__next__`.
```python
def count_up_to(maximum: int):
    n = 1
    while n <= maximum:
        yield n
        n += 1

# Calling count_up_to() returns a generator object. Nothing executes yet.
gen = count_up_to(3)
print(type(gen))  # <class 'generator'>

# Each call to next() resumes execution until the next yield.
print(next(gen))  # 1
print(next(gen))  # 2
print(next(gen))  # 3
# next(gen) would raise StopIteration here
```

Compare this to a regular function that builds a list:
```python
def count_up_to_list(maximum: int) -> list[int]:
    result = []
    n = 1
    while n <= maximum:
        result.append(n)
        n += 1
    return result

# This allocates memory for all 10_000_000 integers at once.
numbers = count_up_to_list(10_000_000)

# This allocates memory for one integer at a time.
for n in count_up_to(10_000_000):
    process(n)
```

## The `__next__` Protocol and StopIteration
Any object with `__iter__` and `__next__` methods is an iterator. Generators implement this protocol automatically. Understanding it matters when you write custom iterables or integrate with frameworks that consume iterators.
```python
class CountUpTo:
    """A custom iterator that does what count_up_to() does, without yield."""

    def __init__(self, maximum: int) -> None:
        self.maximum = maximum
        self.current = 1

    def __iter__(self):
        return self

    def __next__(self) -> int:
        if self.current > self.maximum:
            raise StopIteration
        value = self.current
        self.current += 1
        return value

# A for loop is syntactic sugar over the iterator protocol:
# it calls __iter__ once, then calls __next__ until StopIteration.
for n in CountUpTo(3):
    print(n)  # 1, 2, 3
```

Generators produce the same behavior with far less boilerplate: StopIteration is raised automatically when the function returns.
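That automatic StopIteration is easy to verify with a minimal sketch (`one_shot` is a made-up example name, not from the post above):

```python
def one_shot():
    yield "only value"
    # Falling off the end of the function body raises StopIteration automatically.

gen = one_shot()
print(next(gen))  # only value
try:
    next(gen)
except StopIteration:
    print("exhausted")  # reached as soon as the body returns
```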
## Memory Comparison: File with readlines() vs Generator
The memory difference between list-based and generator-based approaches becomes critical when processing large files.
```python
import tracemalloc

def read_file_list(path: str) -> list[str]:
    with open(path) as f:
        return f.readlines()  # Loads all lines into memory at once

def read_file_generator(path: str):
    with open(path) as f:
        yield from f  # Yields one line at a time

# Benchmark against a 1M-line file
tracemalloc.start()
lines = read_file_list("big_file.txt")
current, peak = tracemalloc.get_traced_memory()
tracemalloc.stop()
print(f"readlines: peak={peak / 1_048_576:.1f} MB")  # ~500 MB for a 500 MB file

tracemalloc.start()
for line in read_file_generator("big_file.txt"):
    _ = line.strip()
current, peak = tracemalloc.get_traced_memory()
tracemalloc.stop()
print(f"generator: peak={peak / 1_048_576:.1f} MB")  # ~0.1 MB regardless of file size
```

The generator version has constant memory usage because only one line is in memory at any time.
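The same constant-memory property makes generators the right tool for aggregating over huge files. A small sketch (the temporary file exists only so the demo is runnable):

```python
import os
import tempfile

def count_lines(path: str) -> int:
    # sum() consumes the file lazily, so only one line is in memory at a time.
    with open(path) as f:
        return sum(1 for _ in f)

# Demo with a throwaway three-line file.
with tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False) as tmp:
    tmp.write("first\nsecond\nthird\n")
print(count_lines(tmp.name))  # 3
os.unlink(tmp.name)
```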
## Generator Expressions vs List Comprehensions
Generator expressions use parentheses and are lazy. List comprehensions use brackets and are eager.
```python
import sys

# List comprehension: allocates ~8 MB for 1M integers
doubled_list = [x * 2 for x in range(1_000_000)]
print(sys.getsizeof(doubled_list))  # ~8_697_464 bytes (~8.3 MB)

# Generator expression: allocates ~200 bytes regardless of range size
doubled_gen = (x * 2 for x in range(1_000_000))
print(sys.getsizeof(doubled_gen))  # ~200 bytes

# Use a generator expression when the result is consumed once.
# Use a list when you need random access, len(), or multiple iterations.
total = sum(x * 2 for x in range(1_000_000))  # sum() accepts any iterable
```

When passing a generator expression as the sole argument to a function like `sum()`, `any()`, `all()`, or `max()`, you can drop the inner parentheses: `sum(x * 2 for x in range(n))` is idiomatic Python.
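One caveat worth spelling out: the parentheses can only be dropped when the generator expression is the *sole* argument. With any additional argument, Python requires explicit parentheses:

```python
data = [3, 1, 4]

# Sole argument: inner parentheses are optional.
print(sum(x * 2 for x in data))  # 16

# Additional arguments: the generator expression needs its own parentheses.
print(max((x * 2 for x in data), default=0))  # 8

# max(x * 2 for x in data, default=0) would be a SyntaxError.
```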
## `yield from` for Generator Delegation
`yield from` delegates to a sub-generator (or any iterable), forwarding values, exceptions, and the return value transparently. It is the correct way to compose generators.
```python
import os
from collections.abc import Generator

def walk_files(root: str) -> Generator[str, None, None]:
    """Recursively yield all file paths under root."""
    for entry in os.scandir(root):
        if entry.is_dir(follow_symlinks=False):
            yield from walk_files(entry.path)  # Delegate to the recursive call
        else:
            yield entry.path

# Without yield from, you would need:
#     for path in walk_files(entry.path):
#         yield path
# yield from is not just syntactic sugar -- it also propagates .send() and .throw() correctly.

for path in walk_files("/var/log"):
    print(path)
```

## Generator Pipelines: Composing Processing Stages
Generators compose naturally into pipelines. Each stage receives an iterable, processes it lazily, and yields results. The entire pipeline consumes O(1) memory regardless of data size.
```python
from collections.abc import Generator, Iterable
import csv

def read_lines(path: str) -> Generator[str, None, None]:
    with open(path) as f:
        yield from f

def parse_csv(lines: Iterable[str]) -> Generator[dict, None, None]:
    reader = csv.DictReader(lines)
    yield from reader

def filter_active(rows: Iterable[dict]) -> Generator[dict, None, None]:
    for row in rows:
        if row.get("status") == "active":
            yield row

def normalize(rows: Iterable[dict]) -> Generator[dict, None, None]:
    for row in rows:
        yield {
            "id": int(row["id"]),
            "email": row["email"].lower().strip(),
            "name": row["name"].strip(),
        }

def process_file(path: str) -> Generator[dict, None, None]:
    lines = read_lines(path)
    parsed = parse_csv(lines)
    active = filter_active(parsed)
    yield from normalize(active)

# The entire pipeline is lazy. No data is read until we iterate.
for record in process_file("users.csv"):
    save_to_database(record)
```

Each generator in the pipeline holds one item in memory at a time. A 10 GB CSV file goes through this pipeline with roughly the same memory footprint as a 10 KB file.
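The "no data is read until we iterate" behavior is easy to observe with a tracing stage (`trace` is an illustrative helper, not part of the pipeline above):

```python
def trace(items):
    for item in items:
        print(f"producing {item}")
        yield item

# Building the pipeline produces no output: no stage has run yet.
pipeline = (x * 10 for x in trace([1, 2, 3]))
print("pipeline built, nothing produced yet")

# Only now does the first item flow through every stage.
print(next(pipeline))  # prints "producing 1", then 10
```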
## The `send()` Method: Generators as Coroutines
Generators support bidirectional communication via `.send(value)`. Before async/await was introduced in Python 3.5, this was the primary mechanism for coroutines.
```python
from collections.abc import Generator

def accumulator() -> Generator[float, float, str]:
    """Receives values via send(), yields the running average."""
    total = 0.0
    count = 0
    value = yield 0.0  # Initial yield to prime the generator
    while value is not None:
        total += value
        count += 1
        value = yield total / count
    return f"Final average over {count} items"

gen = accumulator()
next(gen)            # Prime the generator (advance to the first yield)
print(gen.send(10))  # 10.0
print(gen.send(20))  # 15.0
print(gen.send(30))  # 20.0
try:
    gen.send(None)   # Signals end of input
except StopIteration as e:
    print(e.value)   # "Final average over 3 items"
```

In modern Python (3.5+), `async def` and `await` supersede generator-based coroutines for concurrency. Use `send()` only when you specifically need a stateful data-processing coroutine that produces intermediate results, not for I/O concurrency.
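As noted in the yield from section, delegation forwards `.send()` to the sub-generator. A minimal sketch (`averager` and `delegator` are illustrative names):

```python
def averager():
    total, count = 0.0, 0
    value = yield 0.0
    while value is not None:
        total += value
        count += 1
        value = yield total / count

def delegator():
    # yield from forwards .send() (and .throw()) to the sub-generator.
    yield from averager()

gen = delegator()
next(gen)           # prime: runs to the sub-generator's first yield
print(gen.send(4))  # 4.0 -- the value reached averager through the delegation
print(gen.send(8))  # 6.0
```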
## Infinite Sequences
Generators can produce infinite sequences because they never materialize more than one value at a time.
```python
import itertools
from collections.abc import Generator

# itertools.count: infinite arithmetic sequence
for n in itertools.islice(itertools.count(start=0, step=2), 5):
    print(n)  # 0, 2, 4, 6, 8

# itertools.cycle: repeat a finite sequence forever
for color in itertools.islice(itertools.cycle(["red", "green", "blue"]), 7):
    print(color)

# Custom infinite Fibonacci generator
def fibonacci() -> Generator[int, None, None]:
    a, b = 0, 1
    while True:
        yield a
        a, b = b, a + b

# Take the first 10 Fibonacci numbers
first_ten = list(itertools.islice(fibonacci(), 10))
print(first_ten)  # [0, 1, 1, 2, 3, 5, 8, 13, 21, 34]
```

## Real-World Use Case: Streaming Large API Exports
The canonical production use case: a web API endpoint that exports data too large to buffer in memory.
```python
import csv
import io
from collections.abc import Generator

from django.http import StreamingHttpResponse
from django.db.models import QuerySet

# User is assumed to be a Django model with id, email, and created_at fields.

def iter_csv_rows(queryset: QuerySet) -> Generator[bytes, None, None]:
    """Yield CSV-encoded rows from a Django queryset without loading it all into memory."""
    buffer = io.StringIO()
    writer = csv.writer(buffer)

    # Header row
    writer.writerow(["id", "email", "created_at"])
    buffer.seek(0)
    yield buffer.read().encode("utf-8")
    buffer.truncate(0)
    buffer.seek(0)

    # Data rows: QuerySet.iterator() uses a server-side cursor
    for user in queryset.only("id", "email", "created_at").iterator(chunk_size=2000):
        writer.writerow([user.id, user.email, user.created_at.isoformat()])
        buffer.seek(0)
        yield buffer.read().encode("utf-8")
        buffer.truncate(0)
        buffer.seek(0)

def export_users_csv(request):
    queryset = User.objects.filter(active=True).order_by("id")
    response = StreamingHttpResponse(
        iter_csv_rows(queryset),
        content_type="text/csv",
    )
    response["Content-Disposition"] = 'attachment; filename="users.csv"'
    return response
```

StreamingHttpResponse in Django (or StreamingResponse in FastAPI) sends data to the client as it is produced. A 5-million-row export completes without ever holding more than chunk_size rows in memory.
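The buffer-recycling trick is framework-independent. A self-contained sketch with made-up in-memory rows standing in for the queryset:

```python
import csv
import io

def iter_csv_chunks(rows, header):
    buffer = io.StringIO()
    writer = csv.writer(buffer)
    writer.writerow(header)
    yield buffer.getvalue().encode("utf-8")
    buffer.truncate(0)
    buffer.seek(0)
    for row in rows:
        writer.writerow(row)
        # Emit the encoded row, then reset the buffer so it never grows.
        yield buffer.getvalue().encode("utf-8")
        buffer.truncate(0)
        buffer.seek(0)

rows = [(1, "a@example.com"), (2, "b@example.com")]
body = b"".join(iter_csv_chunks(rows, ["id", "email"]))
print(body.decode())
```

Using `getvalue()` instead of `seek(0)` + `read()` is equivalent here and slightly simpler; the key point is truncating the buffer after every yield.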
## Common Mistakes
### Iterating a generator twice
A generator is exhausted after one pass; iterating it a second time yields nothing. If you need to iterate multiple times, convert to a list first: `items = list(my_generator())`. Be aware of the memory trade-off.
```python
gen = (x * 2 for x in range(5))
print(list(gen))  # [0, 2, 4, 6, 8]
print(list(gen))  # [] -- already exhausted
```

### Not handling StopIteration in manual next() loops
Calling `next()` on an exhausted generator raises StopIteration. Use the two-argument form `next(gen, default)` to provide a sentinel value, or iterate with a for loop, which handles the exception automatically.
```python
gen = count_up_to(2)
print(next(gen, None))  # 1
print(next(gen, None))  # 2
print(next(gen, None))  # None -- no exception
```

### Using a list where a generator suffices
Writing `[process(x) for x in items]` when the resulting list is never used again is pure overhead: the allocation buys you nothing. Pass `(process(x) for x in items)` to the consuming function instead, or use a plain for loop when you only want the side effects.

### Forgetting to prime send()-based generators
Before calling `.send(value)` on a generator, you must advance it to the first yield with `next(gen)` or `gen.send(None)`. Calling `.send()` with a non-None value on a freshly created generator raises TypeError. A common pattern is a `@coroutine` decorator that primes the generator automatically.

If you want to go deeper on any of this, I offer 1:1 coaching sessions for engineers working on AI integration, cloud architecture, and platform engineering. Book a session (50 EUR / 60 min) or reach out at manuel.fedele+website@gmail.com.