most of what i build day-to-day is high-level stuff. frameworks, ORMs, abstractions on abstractions. which is fine. that's how you ship products quickly. but i wanted to understand what's actually happening underneath.
so i started writing C. for this kind of learning, C is the right tool: it forces you to confront what's really going on. memory management. pointers. how the OS actually works. no garbage collector hiding allocation patterns. no runtime doing magic behind your back.
these projects are all about understanding how things work by building them from scratch.
a basic HTTP/1.1 server. handles GET and POST, serves static files, does basic routing.
built by reading RFC 2616. raw sockets to HTTP responses. no web framework, no http library. just the specification.
parsing. reading from a socket doesn't give you complete requests. you get chunks of bytes. maybe half a header, maybe three requests, maybe nothing. you need to buffer, find request boundaries, handle partial reads.
the RFC tells you HTTP headers are text with \r\n line endings. implementing it teaches you the edge cases. i initially split requests on \r\n\r\n, treating that sequence as the end of each message. worked until a POST body happened to contain the same bytes. proper framing means finding the end of the headers, then reading exactly Content-Length bytes of body.
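the framing logic is small once you see it. a minimal sketch (function names are mine, not the server's actual code): find the header terminator, then look up Content-Length inside the header block only, so bytes in the body can never be mistaken for a header.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>
#include <string.h>
#include <strings.h>

/* search the accumulated receive buffer for the end of the header block.
 * returns the offset just past "\r\n\r\n", or -1 if headers are incomplete. */
static long find_header_end(const char *buf, size_t len) {
    for (size_t i = 0; i + 3 < len; i++)
        if (memcmp(buf + i, "\r\n\r\n", 4) == 0)
            return (long)(i + 4);
    return -1;
}

/* naive Content-Length lookup, restricted to the header block --
 * scanning past header_end would risk matching bytes in the body. */
static long content_length(const char *buf, long header_end) {
    const char *key = "Content-Length:";
    long klen = (long)strlen(key);
    for (long i = 0; i + klen < header_end; i++)
        if (strncasecmp(buf + i, key, (size_t)klen) == 0)
            return strtol(buf + i + klen, NULL, 10);
    return 0; /* no body */
}
```

a request is complete once the buffer holds header_end plus content_length bytes; anything after that belongs to the next request on the connection.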
connection handling. HTTP/1.1 supports keep-alive connections. one TCP connection, multiple requests. efficient but complicates the server. you need timeouts for idle connections, proper pipelining, handling clients that close connections mid-request.
implemented basic keep-alive but not pipelining. pipelining means accepting multiple requests before the first response has been sent and answering them strictly in order. significant complexity for minimal real-world benefit, since most clients don't use it anyway.
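the keep-alive loop can be sketched with poll() driving the idle timeout. serve_connection and its idle_ms parameter are illustrative names, not the repo's actual API:

```c
#include <assert.h>
#include <poll.h>
#include <unistd.h>

/* serve requests on one connection until the client goes quiet or hangs up.
 * idle_ms is an assumed tunable; the real server's timeout may differ. */
static void serve_connection(int fd, int idle_ms) {
    struct pollfd p = { .fd = fd, .events = POLLIN };
    for (;;) {
        int ready = poll(&p, 1, idle_ms);
        if (ready <= 0)           /* timeout or error: drop the idle client */
            break;
        char buf[4096];
        ssize_t n = read(fd, buf, sizeof buf);
        if (n <= 0)               /* client closed mid-request, or read error */
            break;
        /* ... buffer bytes, parse complete requests, send responses ... */
    }
    close(fd);
}
```

the same loop handles all three failure modes the text mentions: idle clients (poll times out), mid-request disconnects (read returns 0 or -1), and normal keep-alive reuse (loop back for the next request).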
lessons. RFCs force you to understand the protocol completely. no abstraction layer hiding the details. you see why certain design decisions were made because you hit the same problems the spec authors did.
frameworks are convenient. but building from the spec teaches you what the framework is actually doing. makes debugging production issues easier when you know what's underneath.
for example, chunked transfers allow trailing headers that arrive after the body. this is one reason some frameworks expose headers asynchronously: they can't assume every header is available before the body has been read.
github.com/karol-broda/http_server_c
a UDP DNS server that responds to queries. handles A records, CNAME records, proper recursion.
built this by reading RFC 1035. no libraries. just the specification and a lot of bit manipulation.
the protocol. DNS uses a binary format over UDP. everything is bit-packed. domain names use label compression to avoid repeating common suffixes. questions and answers have type codes and class codes and TTLs.
reading the RFC gives you the structure. implementing it teaches you why binary protocols are painful. off-by-one errors crash the server. incorrect length fields cause reads past buffer ends. wireshark becomes your best friend.
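the header unpacking looks roughly like this. a sketch of the 12-byte header from RFC 1035 section 4.1.1, with struct and field names of my choosing:

```c
#include <assert.h>
#include <stdint.h>

/* the 12-byte DNS header, unpacked from network byte order by hand.
 * field layout follows RFC 1035 section 4.1.1. */
struct dns_header {
    uint16_t id, qdcount, ancount, nscount, arcount;
    unsigned qr, opcode, aa, tc, rd, ra, rcode;
};

/* read a big-endian 16-bit field */
static uint16_t be16(const uint8_t *p) { return (uint16_t)(p[0] << 8 | p[1]); }

static void parse_header(const uint8_t *pkt, struct dns_header *h) {
    uint16_t flags = be16(pkt + 2);
    h->id      = be16(pkt);
    h->qr      = (flags >> 15) & 0x1;   /* query (0) or response (1) */
    h->opcode  = (flags >> 11) & 0xf;
    h->aa      = (flags >> 10) & 0x1;   /* authoritative answer */
    h->tc      = (flags >> 9)  & 0x1;   /* truncated */
    h->rd      = (flags >> 8)  & 0x1;   /* recursion desired */
    h->ra      = (flags >> 7)  & 0x1;   /* recursion available */
    h->rcode   = flags & 0xf;
    h->qdcount = be16(pkt + 4);
    h->ancount = be16(pkt + 6);
    h->nscount = be16(pkt + 8);
    h->arcount = be16(pkt + 10);
}
```

every field after the ID is bit-packed into one 16-bit flags word, which is exactly where the off-by-one shift errors live.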
recursion. when someone queries for a domain you don't know, you ask another DNS server. that server might redirect you to another one. you follow the chain until you get an answer or give up.
implemented basic recursion with root servers. works fine until you hit rate limits or need DNSSEC validation. then you realize why production DNS servers are complex.
what stuck. RFCs are dense but complete. everything you need is in there, you just have to extract it. reading protocol specifications makes you better at implementing protocols generally.
also learned that UDP is great for simple request/response but terrible for anything requiring reliability. TCP exists for good reasons.
github.com/karol-broda/udp_dns_server_c
spell checking and fuzzy string matching using Burkhard-Keller (BK) trees.
the problem. you have a dictionary of words. user types something close but not exact. you want to find near matches efficiently. brute force works but scales poorly.
BK-trees. a tree structure organized by edit distance. each node contains a word and children at various edit distances. searching the tree lets you prune entire branches that can't contain matches.
example: if your query has edit distance 5 from a node, and you want matches within distance 2, you only need to check children at distances 3-7 from that node. everything else is impossible to match.
implementation. computing edit distance is the Levenshtein algorithm. dynamic programming, comparing characters, building up a matrix. straightforward but easy to get wrong.
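the one-row DP version, as a sketch (the MAXW length cap is an assumption for the example, not a property of the real checker):

```c
#include <assert.h>
#include <string.h>

#define MAXW 64   /* assumed max word length for this sketch */

static int min3(int a, int b, int c) {
    return a < b ? (a < c ? a : c) : (b < c ? b : c);
}

/* classic dynamic-programming edit distance, kept to one row at a time.
 * O(len(a) * len(b)) time, O(len(b)) space. */
static int levenshtein(const char *a, const char *b) {
    int la = (int)strlen(a), lb = (int)strlen(b);
    int row[MAXW + 1];
    for (int j = 0; j <= lb; j++) row[j] = j;   /* distance from "" to b[0..j) */
    for (int i = 1; i <= la; i++) {
        int diag = row[0];                      /* previous row's [j-1] cell */
        row[0] = i;                             /* distance from a[0..i) to "" */
        for (int j = 1; j <= lb; j++) {
            int up = row[j];                    /* previous row's [j] cell */
            int cost = (a[i-1] == b[j-1]) ? 0 : 1;
            row[j] = min3(row[j] + 1,           /* deletion */
                          row[j-1] + 1,         /* insertion */
                          diag + cost);         /* substitution or match */
            diag = up;
        }
    }
    return row[lb];
}
```

the "easy to get wrong" part is the diag bookkeeping: each cell needs the previous row's diagonal neighbor, which the single-row layout would otherwise overwrite.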
the BK-tree construction is recursive. insert a word, compute distance to root, recurse into the appropriate child. building the tree is O(n log n) on average. searching is way faster than linear scan.
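insertion and pruned search fit in a few dozen lines. a self-contained sketch with illustrative names (edist, bknode), including a compact copy of the distance function so it stands alone:

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

#define MAXDIST 32   /* assumed cap on word length / edit distance */

/* compact edit distance (same DP as the full version, trimmed for the sketch) */
static int edist(const char *a, const char *b) {
    int la = (int)strlen(a), lb = (int)strlen(b), row[MAXDIST + 1];
    for (int j = 0; j <= lb; j++) row[j] = j;
    for (int i = 1; i <= la; i++) {
        int diag = row[0]; row[0] = i;
        for (int j = 1; j <= lb; j++) {
            int up = row[j];
            int sub = diag + (a[i-1] != b[j-1]);
            int del = row[j] + 1, ins = row[j-1] + 1;
            row[j] = sub < del ? (sub < ins ? sub : ins)
                               : (del < ins ? del : ins);
            diag = up;
        }
    }
    return row[lb];
}

struct bknode {
    const char *word;
    struct bknode *child[MAXDIST + 1];  /* child[d] holds words at distance d */
};

/* insert: compute distance to this node, recurse into the matching child */
static struct bknode *bk_insert(struct bknode *n, const char *w) {
    if (!n) {
        n = calloc(1, sizeof *n);
        n->word = w;
        return n;
    }
    int d = edist(w, n->word);
    if (d > 0)  /* d == 0 means duplicate word; ignore it */
        n->child[d] = bk_insert(n->child[d], w);
    return n;
}

/* collect matches within `tol` of `q`; the triangle inequality lets us skip
 * every child branch outside [d - tol, d + tol]. */
static void bk_search(const struct bknode *n, const char *q, int tol,
                      const char **out, int *count) {
    if (!n) return;
    int d = edist(q, n->word);
    if (d <= tol) out[(*count)++] = n->word;
    int lo = d - tol < 1 ? 1 : d - tol;
    int hi = d + tol > MAXDIST ? MAXDIST : d + tol;
    for (int c = lo; c <= hi; c++)
        bk_search(n->child[c], q, tol, out, count);
}
```

the pruning window in bk_search is exactly the distance-5, tolerance-2, children-3-to-7 rule from the example above.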
practical results. built a spell checker using an English dictionary. works well for typos and OCR errors. fast enough for interactive use.
the tree structure makes more sense when you build it yourself. it's not complicated, but it's clever. organizing data by distance rather than lexicographic order enables pruning that wouldn't work with a trie or hash table.
what i learned. sometimes the right data structure makes all the difference. brute force edit distance on a full dictionary is slow. BK-trees make it usable.
also that string metrics like Levenshtein distance are more nuanced than they seem. edit distance treats all operations equally, but in practice some typos are more common than others. phonetic matching and keyboard distance can improve results.
github.com/karol-broda/fuzzy-match-bktree
a relational database with B+ trees, ACID transactions, and crash recovery. supports SQL-like operations with multiple data types.
the immediate goal was understanding database internals. you can use postgres every day and still have query planners, MVCC, and WAL be just words in documentation. building a simple version makes them real.
what it does. basic SQL-like interface with CREATE TABLE, INSERT, UPDATE, DELETE, and SELECT. transactions with BEGIN/COMMIT/ROLLBACK. write-ahead logging for crash recovery. buffer pool with LRU page replacement.
supports INT, VARCHAR(n), FLOAT, DOUBLE, TEXT, DATE, TIMESTAMP, and BOOLEAN types. postgres-style meta commands like \d to describe tables, \dt to list tables.
B+ trees. everyone knows binary trees. B+ trees are similar but optimized for disk access. each node holds many keys (order 32 here), so the tree stays shallow and lookups touch few pages. all actual data lives in the leaf nodes, and the leaves are linked together, which turns range queries into a sequential walk.
implementing the split logic took longer than expected. when a node fills up, you split it and propagate keys upward. sounds simple until you're debugging why insertions corrupt the tree structure after the third split.
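the leaf split itself is short; the subtlety is everything around it (propagating the separator, keeping the sibling chain intact). a sketch at a toy order with hypothetical struct names:

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

#define ORDER 4   /* tiny order for illustration; the real tree uses 32 */

struct leaf {
    int nkeys;
    int keys[ORDER];
    struct leaf *next;   /* sibling link that makes range scans cheap */
};

/* split a full leaf in half. returns the new right sibling; *sep_key is the
 * separator that must be pushed up into the parent. in a B+ tree the
 * separator is COPIED up, not moved: leaves keep every key. */
static struct leaf *leaf_split(struct leaf *left, int *sep_key) {
    struct leaf *right = calloc(1, sizeof *right);
    int mid = left->nkeys / 2;
    right->nkeys = left->nkeys - mid;
    memcpy(right->keys, left->keys + mid, (size_t)right->nkeys * sizeof(int));
    left->nkeys = mid;
    right->next = left->next;   /* splice into the leaf chain */
    left->next = right;
    *sep_key = right->keys[0];
    return right;
}
```

the corruption-after-the-third-split class of bug usually hides in the parent update, not here: the split is easy, inserting the separator into an already-full parent (and splitting that, recursively) is where trees go wrong.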
transactions and logging. every modification writes to a log file first, then modifies pages in memory. if the process crashes, replaying the log rebuilds the correct state. this is write-ahead logging.
the tricky part is coordinating commits. you can't mark a transaction committed until its log entries are on disk. but flushing to disk on every write kills performance. buffering writes helps, but now you need to track which transactions are durable and which aren't.
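the commit path boils down to a write-then-fsync barrier. a sketch (log_commit is an illustrative name, not the repo's API):

```c
#include <assert.h>
#include <fcntl.h>
#include <string.h>
#include <unistd.h>

/* append a log record, then force it to disk before the transaction may be
 * reported committed. fsync() is the point of no return: until it succeeds,
 * a crash can lose the record, so the transaction must count as aborted. */
static int log_commit(int log_fd, const char *record, size_t len) {
    if (write(log_fd, record, len) != (ssize_t)len)
        return -1;
    if (fsync(log_fd) != 0)     /* durability barrier: never ack before this */
        return -1;
    return 0;                   /* only now may the client see "COMMIT ok" */
}
```

the performance problem described above is exactly this fsync: one per commit is safe but slow, so real systems batch several transactions' records into one flush (group commit) and track which transactions have actually reached disk.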
implementation details. pages are 4KB fixed size. buffer pool holds 100 pages in memory. simple table-level locking. values limited to 256 characters.
what i learned. why databases use page-based storage instead of just malloc'ing structs. why transaction isolation is hard. why query optimization matters even for simple selects.
also learned that writing a production database is significantly harder than this toy version. postgres has been refined for decades. this works for learning, but there's a reason people don't roll their own databases.
github.com/karol-broda/database_c
a privacy-preserving network protocol. this one is still rough, but the core idea works.
the problem. most network protocols leak metadata. even with encryption, observers can see who's talking to whom, when, and how much data is moving. traffic analysis is a real thing.
the approach. mix networking. clients send fixed-size packets through multiple intermediate nodes. each node only knows the previous and next hop, not the full path. messages get mixed with other traffic to make timing analysis harder.
this isn't tor. tor is way more sophisticated with directory authorities, onion routing, and a huge network. this is me trying to understand the core concepts by implementing a minimal version.
implementation challenges. fixed-size packets mean padding small messages and chunking large ones. route selection needs to be random but consistent for a session. timing needs to be somewhat randomized without making latency unbearable.
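padding and chunking can be sketched like this; the 512-byte cell size and the 2-byte length prefix are assumptions for the example, not the protocol's actual wire format:

```c
#include <assert.h>
#include <string.h>

#define CELL_SIZE 512   /* assumed fixed cell size */

/* split a message into fixed-size cells: a 2-byte big-endian payload length,
 * the payload, then zero padding. every cell on the wire is the same size,
 * so observers can't distinguish short messages from long ones by length. */
static int to_cells(const unsigned char *msg, size_t len,
                    unsigned char cells[][CELL_SIZE], int max_cells) {
    size_t per_cell = CELL_SIZE - 2, off = 0;
    int n = 0;
    do {
        if (n >= max_cells) return -1;
        size_t chunk = len - off < per_cell ? len - off : per_cell;
        memset(cells[n], 0, CELL_SIZE);            /* zero padding */
        cells[n][0] = (unsigned char)(chunk >> 8); /* length prefix */
        cells[n][1] = (unsigned char)(chunk & 0xff);
        memcpy(cells[n] + 2, msg + off, chunk);
        off += chunk;
        n++;
    } while (off < len);
    return n;
}
```

note the do-while: even an empty message emits one full-size padded cell, which is the point.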
also key exchange. how do you establish shared secrets with nodes you've never talked to? ended up using a simplified version of the station-to-station protocol. probably not cryptographically sound at scale, but good enough for learning.
what it taught me. why Tor's design is the way it is. why traffic analysis is so hard to prevent. why performance and privacy have fundamental tradeoffs.
mostly this project made me appreciate how hard it is to build actually private systems. adding crypto is easy. preventing metadata leakage is genuinely difficult.
github.com/karol-broda/betanet-c
these projects share a theme: taking something i use daily and building a simplified version to understand how it works.
memory management. C forces you to think about allocation, lifetimes, ownership. when does this memory get freed? who owns this pointer? what happens if this allocation fails?
after dealing with manual memory management, automatic garbage collection feels like magic. but you also understand why GC pauses happen and why languages like Rust exist.
error handling. in higher-level languages, exceptions handle errors for you. in C, every function that can fail needs explicit error checking. it's tedious but makes you think about failure modes.
this changed how i write code even in other languages. now i think about what can actually go wrong instead of hoping exceptions catch everything.
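the pattern in practice looks like this. a sketch (read_all is my name for it): every fallible call is checked, and every exit path cleans up after itself.

```c
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>

/* read a whole file into a heap buffer. returns 0 on success, -1 on any
 * failure -- the C style the post describes: no exceptions, every error
 * explicit, every exit path releasing what it acquired. */
static int read_all(const char *path, char **out, long *out_len) {
    FILE *f = fopen(path, "rb");
    if (!f) return -1;
    if (fseek(f, 0, SEEK_END) != 0) { fclose(f); return -1; }
    long len = ftell(f);
    if (len < 0) { fclose(f); return -1; }
    rewind(f);
    char *buf = malloc((size_t)len + 1);
    if (!buf) { fclose(f); return -1; }        /* allocation can fail too */
    if (fread(buf, 1, (size_t)len, f) != (size_t)len) {
        free(buf); fclose(f); return -1;       /* clean up before bailing */
    }
    buf[len] = '\0';
    fclose(f);
    *out = buf;
    *out_len = len;
    return 0;
}
```

tedious, as the post says, but each early return is a failure mode you were forced to name.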
performance intuition. you develop a feel for what's fast and what's slow. system calls are expensive. cache misses matter. memory allocation isn't free. abstractions have cost.
this doesn't mean you should optimize everything. but knowing where the costs are helps when you actually need performance.
these are learning projects, not production software. they work for their intended purpose but lack polish, proper error handling, security hardening, and all the other things real software needs.
the goal wasn't to build better versions of existing tools. it was to understand how those tools work by building simple versions myself.
if you're learning systems programming, i'd recommend the same approach. pick something you use, build a toy version, break it, fix it, learn from it.
all code is available on github under permissive licenses. use it, modify it, learn from it.