python 28 lines · 5 steps

Streaming TSV records with Python generators

A lazy generator pipeline reads optionally-gzipped TSV files line by line and counts matching records without loading the whole file.

Explained by highlit
1import gzip
2from pathlib import Path
3from typing import Iterator
4 
5 
6def read_lines(path: str | Path, *, encoding: str = "utf-8") -> Iterator[str]:
7 path = Path(path)
8 opener = gzip.open if path.suffix == ".gz" else open
9 with opener(path, mode="rt", encoding=encoding) as handle:
10 for line in handle:
11 yield line.rstrip("\n")
12 
13 
14def iter_records(path: str | Path) -> Iterator[dict[str, str]]:
15 lines = read_lines(path)
16 header = next(lines).split("\t")
17 for line in lines:
18 if not line:
19 continue
20 yield dict(zip(header, line.split("\t")))
21 
22 
23def count_errors(path: str | Path) -> int:
24 return sum(
25 1
26 for record in iter_records(path)
27 if record.get("level") == "ERROR"
28 )
01 / 01
STEP 01

Walkthrough

Space play step click any line
Three takeaways
  1. 1Generators let you process arbitrarily large files with constant memory by yielding one item at a time.
  2. 2Selecting the opener by file suffix transparently handles both plain and gzipped inputs through one code path.
  3. 3Layering small generators into a pipeline keeps each stage focused and composable.

Related explainers

Share this explainer

Here's the card — post it anywhere.

Streaming TSV records with Python generators — share card
Made with highlit — turn any snippet into a walkthrough like this in about a minute.
Explain your code