Node Turns Waiting Into Events. Go Moves Context Switching Into User Space.


Most discussions of Node vs Go concurrency stop at 'async vs threaded.' The real split is deeper — where does context switching happen, and what is the unit of scheduling?

April 27, 2026
Harrison Guo

Most discussions of TypeScript/Node vs Go concurrency stop at the surface: Node is async, Go is threaded. That framing isn’t wrong — it just isn’t deep enough to be useful when you’re picking a runtime, debugging a tail-latency problem, or explaining to your team why one of the services keeps falling over under CPU load.

The real difference is not async vs threaded. It’s a question about where, in the system, suspended work lives — and what shape it takes when it’s resumed.

tl;dr — Both Node and Go refuse to let the CPU sit idle while a request waits on I/O. They disagree on the unit of scheduling. Node’s unit is the continuation — the tail of an async function captured as a heap closure. Go’s unit is the goroutine — a full call stack the runtime can suspend and resume in user space. That single decision cascades into every other property of each runtime.


The Wrong Question

“Async vs threaded” is the wrong frame because it makes you think the choice is between paradigms. It isn’t. Both runtimes have already made the same fundamental decision: do not block an OS thread waiting for slow external work. The interesting choice is how they implement that.

The actually useful question is:

When a request is waiting for I/O — for a database, an HTTP call, a Redis round-trip, a file read — what does the CPU do, and where does the suspended state of that request live?

Once you frame it that way, Node and Go aren’t opposites. They’re two answers to the same question — and each answer cascades into a different language shape, a different library style, and a different failure mode under load.

The naive blocking model answers the question with “an OS thread waits for the syscall to return.” That model collapses around a few thousand concurrent connections — memory per thread, scheduler overhead, kernel context-switch cost. By 40,000 connections you’re out of RAM, not CPU. Node and Go both refuse to do this. They diverge on which resource gets freed up and how the suspended work is captured for later resumption.


Node’s Answer: Turn Waiting Into an Event

Node’s model can be summarized in one line: the JS main thread only executes code that’s already ready to run.

Look at this:

const user = await db.getUser(id);
return user;

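It reads as if the function is paused, blocking on the database. It isn’t. V8 compiles an async function much like a generator: each await is a suspend point where the function’s live registers and locals are saved to a heap object and the frame is popped, and a later microtask resumes it. Conceptually, the body has been rewritten into a state machine, with each await becoming a state transition.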

The function above gets transformed into something equivalent to:

function asyncFn(id) {                   // id was a free variable in the snippet above
  const promise = new Promise((resolve, reject) => {
    let state = 0;
    const closure = {};                  // heap object holding locals

    function step(value) {
      switch (state) {
        case 0:
          state = 1;
          // await → register continuation; a rejection propagates to the caller
          db.getUser(id).then(step, reject);
          return;                         // ← function POPS here
        case 1:
          closure.user = value;           // resume: locals live in closure
          resolve(closure.user);
          return;
      }
    }
    step();
  });
  return promise;
}

Three things to notice:

  1. await is not a pause. It’s the point at which V8 returns from the function and pops the JS stack frame. The “rest of the function” is captured as a continuation registered on the awaited Promise via .then.
  2. Local variables move to the heap. Because the stack frame is gone, locals (user here) live in a heap closure, accessible only when the state machine resumes.
  3. Each await slices the function into another state. A function with two awaits runs as three separate invocations, each with a freshly pushed JS frame, with all live state stored in heap closures between them. When each awaited Promise settles on real I/O, that means three event-loop turns.

That third point is the most non-obvious. A single async function is not one unit of execution — it’s a sequence of fresh frames separated by event-loop turns:

sequenceDiagram
    autonumber
    participant EL as Event Loop (libuv)
    participant JS as JS Main Thread (V8)
    participant H as Heap (closures)
    participant K as Kernel / I/O

    rect rgb(254, 243, 199)
    Note over EL,K: Turn 1
    EL->>JS: dispatch handler()
    activate JS
    Note over JS: const a = 1
    JS->>JS: call compute1() → returns Promise
    JS->>H: V8 stores closure {state:1, a}
    JS->>H: register step as .then handler
    JS-->>EL: handler frame POPPED, returns Promise
    deactivate JS
    end

    EL->>K: epoll_wait (no microtasks)
    Note over EL,K: ... time passes, OS thread parked ...
    K-->>EL: I/O ready (compute1 resolved)
    EL->>EL: enqueue step in V8 microtask queue

    rect rgb(219, 234, 254)
    Note over EL,K: Turn 2
    EL->>JS: invoke step(value) — NEW frame
    activate JS
    JS->>H: load closure {state:1, a}
    Note over JS: x = value, state → 2
    JS->>JS: call compute2() → returns Promise
    JS->>H: register step (next state)
    JS-->>EL: frame POPPED again
    deactivate JS
    end

    K-->>EL: compute2 resolved
    EL->>EL: enqueue step

    rect rgb(220, 252, 231)
    Note over EL,K: Turn 3
    EL->>JS: invoke step(value) — yet another new frame
    activate JS
    JS->>H: load closure {state:2, a, x}
    Note over JS: y = value, state → done
    JS->>JS: res.json(a + x + y)
    JS-->>EL: handler's Promise resolved
    deactivate JS
    end

There is no “paused” function. There are only captured continuations and fresh frames that resume them. The event loop is the dispatcher: it watches for I/O readiness via libuv, for resolved Promises (via V8’s microtask queue), for timers — and pulls the corresponding continuation onto the JS thread when it’s ready to run. One thread can manage tens of thousands of concurrent connections, because at any moment only a handful of them have work that’s actually ready.

This is event-driven concurrency in its precise sense — the runtime turns “waiting” into a registered event, and only resumes the captured continuation when the event fires.

The Visible Side Effect: Function Color

Because the suspension point has to be marked at compile time, async-ness becomes part of the function’s type. A function that does I/O returns Promise<T>. Its callers must await it. Once they await, they themselves return Promise<T>. The “color” propagates up the call stack until you hit an async-aware entry point — typically the top of an HTTP handler or the event loop itself.

Bob Nystrom named this the function color problem in 2015. It’s not a notation choice — it’s a logical consequence of the stackless coroutine model. V8 cannot save and restore arbitrary JS call stacks. The only way to express suspension is “return a Promise and be marked async,” and once one function does that, every function on the way up has to do the same.

flowchart LR
    subgraph Node["Node: Color Cascades Up the Call Stack"]
        direction TB
        n1["readFromDB() 🟥
→ Promise<Data>
does I/O"]
        n2["fetchUser() 🟥
→ Promise<User>
must await readFromDB"]
        n3["handleRequest() 🟥
→ Promise<Response>
must await fetchUser"]
        n4["route('/user', handler) 🟥
must accept Promise return"]
        n5["main() 🟥
→ Promise<void>
top-level needs await"]
        n1 -.color infects.-> n2
        n2 -.color infects.-> n3
        n3 -.color infects.-> n4
        n4 -.color infects.-> n5
    end
    subgraph Go["Go: No Color, No Cascade"]
        direction TB
        g1["readFromDB()
→ Data
blocks on I/O internally"]
        g2["fetchUser()
→ User
plain call"]
        g3["handleRequest()
→ Response
plain call"]
        g4["route('/user', handler)
handler is a plain func"]
        g5["main()
plain func"]
        g1 --> g2
        g2 --> g3
        g3 --> g4
        g4 --> g5
    end
    Node ~~~ Go
    classDef redClass fill:#fee2e2,stroke:#dc2626,stroke-width:2px,color:#7f1d1d
    classDef plainClass fill:#f3f4f6,stroke:#6b7280,stroke-width:2px,color:#111827
    class n1,n2,n3,n4,n5 redClass
    class g1,g2,g3,g4,g5 plainClass

The Hard Limit

The model fails the moment your code stops waiting. A single CPU-bound operation:

while (true) { /* heavy work */ }

…holds the JS main thread, and every other request on this process is stalled for as long as it runs. The event loop has nowhere else to go. Worker threads, child processes, or splitting CPU work into a separate service are real fixes, but they’re escape hatches: they exist because the core model has exactly one main thread executing JS.


Go’s Answer: Move Context Switching Into User Space

Go writes synchronous code:

user := db.GetUser(id)
sendResponse(user)

There is no await. There is no callback. The function looks like it blocks on the database. And yet the program scales to hundreds of thousands of concurrent operations on modest hardware.
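To make that claim concrete, here is a minimal sketch. It assumes a hypothetical fakeQuery that stands in for db.GetUser and simulates a 50 ms database wait with time.Sleep; the names runAll and fakeQuery are illustrative, not from any real library. Each of the 10,000 goroutines runs plain blocking-looking code.

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

// fakeQuery is a hypothetical stand-in for a blocking call like db.GetUser.
// time.Sleep parks the calling goroutine; the OS thread is free to run others.
func fakeQuery(id int) int {
	time.Sleep(50 * time.Millisecond)
	return id * 2
}

// runAll starts n goroutines, each making one "blocking" call, and returns
// the sum of the results once all of them have finished.
func runAll(n int) int {
	var (
		wg  sync.WaitGroup
		mu  sync.Mutex
		sum int
	)
	for i := 0; i < n; i++ {
		wg.Add(1)
		go func(id int) {
			defer wg.Done()
			v := fakeQuery(id) // straight-line code, no await
			mu.Lock()
			sum += v
			mu.Unlock()
		}(i)
	}
	wg.Wait()
	return sum
}

func main() {
	start := time.Now()
	sum := runAll(10000)
	fmt.Println("sum:", sum)
	fmt.Println("elapsed:", time.Since(start).Round(time.Millisecond))
}
```

On typical hardware the elapsed time should land much closer to one 50 ms wait than to 10,000 of them, because the waits overlap.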

The trick is that the scheduling boundary has been moved. Where Node has the programmer mark the suspension point with await and the runtime captures a continuation, Go lets the programmer write straight-line code and has the runtime suspend the entire goroutine when it hits a blocking I/O call.

This is the central insight, and the cleanest one-line statement of Go’s concurrency model:

Go’s essence is the user-space-ification of context switching.

A goroutine isn’t an OS thread. It’s a small (initially 2 KB) growable stack and a register snapshot, managed by the Go runtime. The runtime maps a large number of goroutines (G) onto a small number of OS threads (M) using scheduling contexts (P). This is the GMP model:

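  • G: a goroutine. The unit of scheduling. Cheap to create, cheap to suspend.
  • M: an OS thread. At most GOMAXPROCS of them execute Go code at any moment; the runtime may create more when threads are blocked in syscalls.
  • P: a logical processor, the scheduling context. There are GOMAXPROCS of them, each holds a local run queue of Gs, and an M must hold a P to run Go code.

many G  →  Go scheduler  →  few M  →  CPU cores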
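A small sketch of that G-to-P ratio, using only the standard runtime package. The helper name parkGoroutines is made up for illustration. It parks a large number of goroutines on a channel and then reports how many goroutines exist next to how many Ps the runtime is using.

```go
package main

import (
	"fmt"
	"runtime"
	"sync"
)

// parkGoroutines starts n goroutines that block on a channel. It returns the
// goroutine count observed once all have started and a func that releases them.
func parkGoroutines(n int) (int, func()) {
	stop := make(chan struct{})
	var started, finished sync.WaitGroup
	started.Add(n)
	finished.Add(n)
	for i := 0; i < n; i++ {
		go func() {
			defer finished.Done()
			started.Done()
			<-stop // once blocked here, the G holds a stack but no M and no P
		}()
	}
	started.Wait()
	release := func() {
		close(stop)
		finished.Wait()
	}
	return runtime.NumGoroutine(), release
}

func main() {
	count, release := parkGoroutines(10000)
	defer release()
	fmt.Println("CPU cores:       ", runtime.NumCPU())
	fmt.Println("Ps (GOMAXPROCS): ", runtime.GOMAXPROCS(0)) // 0 queries without changing
	fmt.Println("goroutines alive:", count)
}
```

The goroutine count should be in the thousands while the P count stays at the number of cores (or whatever GOMAXPROCS is set to).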

When a goroutine blocks on a channel, a mutex, a timer, or network I/O (which the runtime routes through its netpoller), the Go runtime parks it: it saves the goroutine’s program counter and stack pointer, leaves the stack where it is, detaches the goroutine from the current M, and schedules another runnable goroutine onto that M. When the original goroutine’s wait completes, it’s marked runnable again, and some M eventually picks it up and resumes execution from the suspension point. The switch itself never enters the kernel. No clone(2), no kernel-mediated thread switch, no kernel scheduler queue. The bookkeeping is all in user space. (A truly blocking syscall is the exception; more on that in the next section.)

That’s the user-space-ification. The CPU still has to switch contexts when work shifts between goroutines, but the cost is roughly a function call plus a stack swap — not a kernel-mediated thread switch.

The key contrast with Node’s model is in where the suspended state lives:

flowchart LR
    subgraph Node["Node: Stackless Coroutine"]
        direction TB
        nStack["JS Call Stack
(one frame at a time)
━━━━━━━━━━━━
currently empty
(all async fns popped,
waiting in event loop)"]
        nHeap["Heap"]
        nC1["continuation #1
{ state: 1,
  locals: {req, res, a},
  step: fn ptr }"]
        nC2["continuation #2
{ state: 0, ... }"]
        nC3["continuation #3
{ state: 2, ... }"]
        nHeap --> nC1
        nHeap --> nC2
        nHeap --> nC3
        nNote["Each await pops the frame.
State lives only in heap closures.
Stack is reused across all turns."]
    end
    subgraph Go["Go: Stackful Coroutine"]
        direction TB
        gM["OS Thread (M)
currently running G3 ▶"]
        gHeap["Heap"]
        gG1["goroutine G1 (2 KB stack)
━━━━━━━━━━━━
process()
  ↳ slowDouble()
    ↳ time.Sleep() ★parked"]
        gG2["goroutine G2 (2 KB stack)
━━━━━━━━━━━━
handler()
  ↳ db.Query() ★parked"]
        gG3["goroutine G3 (2 KB stack)
━━━━━━━━━━━━
currently on M ▶"]
        gHeap --> gG1
        gHeap --> gG2
        gHeap --> gG3
        gNote["Each goroutine owns its full stack.
Stack stays in place on suspend;
runtime saves only PC/SP. No frame pop needed."]
    end
    Node ~~~ Go
    classDef nodeAlert fill:#fee2e2,stroke:#dc2626,stroke-width:3px,color:#7f1d1d
    classDef nodeClass fill:#fef3c7,stroke:#d97706,color:#111827
    classDef goClass fill:#dbeafe,stroke:#2563eb,color:#111827
    classDef noteClass fill:#ffffff,stroke:#374151,stroke-width:1.5px,color:#111827
    class nStack nodeAlert
    class nHeap,nC1,nC2,nC3 nodeClass
    class gM,gHeap,gG1,gG2,gG3 goClass
    class nNote,gNote noteClass

In Node, the JS call stack is shared and almost always near-empty — every async function in flight has already popped, with its state sitting in a heap closure. In Go, every goroutine owns its full call chain on its own heap-allocated stack; suspended goroutines look like frozen frames waiting for the runtime to resume them on some OS thread.
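One way to see that a goroutine really does own a growable stack is to recurse far deeper than the initial 2 KB could hold. The sketch below is illustrative only; sumTo and sumInGoroutine are hypothetical names. The runtime grows the stack as the recursion deepens, and the program never has to ask for it.

```go
package main

import "fmt"

// sumTo recurses n levels deep. The addition happens after the recursive call
// returns, so every level keeps a frame on the goroutine's stack.
func sumTo(n int) int {
	if n == 0 {
		return 0
	}
	return n + sumTo(n-1)
}

// sumInGoroutine runs sumTo on a fresh goroutine and returns its result.
func sumInGoroutine(n int) int {
	out := make(chan int)
	go func() { out <- sumTo(n) }()
	return <-out
}

func main() {
	// 100000 nested frames need far more than the initial stack allocation;
	// the runtime grows the goroutine's stack on demand.
	fmt.Println(sumInGoroutine(100000))
}
```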

This is also why neither language can simply borrow the other’s model. Node runs on V8, which was designed in 2008 for browser JS — single call stack, synchronous semantics, no concept of saving stacks across yields. Adding stackful coroutines would mean rewriting the engine, which is roughly what Java’s Project Loom did to the JVM at huge cost. Go was designed from scratch with a runtime that owns stacks, can grow them, and can save them. The choice is locked in by runtime architecture, not language taste.


What “User-Space” Actually Buys You

The slogan only matters if user-space context switching is meaningfully cheaper than the kernel-mediated kind. It is — by more than an order of magnitude.

Two goroutines pinned to one OS thread (GOMAXPROCS=1), ping-ponging via runtime.Gosched() and via an unbuffered channel. Two pthreads pinned to one core (taskset -c 0), ping-ponging via pthread_mutex + pthread_cond. (Reproduction code at the end of the post.)

Measured on Intel N100, Ubuntu 24.04 (kernel 6.8.0), Go 1.23.4, gcc 13.3:

| Operation | ns / switch |
| --- | --- |
| Goroutine yield (runtime.Gosched, GOMAXPROCS=1) | ~102 ns |
| Goroutine round-trip via unbuffered channel | ~436 ns (≈218 ns per G-switch + channel coordination) |
| pthread switch (mutex+cond ping-pong, single core) | ~2,900 ns (range 2,818–3,611 across 5 runs of 2M iterations) |

Ratio: roughly 28× cheaper for the bare scheduler yield (2,900 / 102), ~13× cheaper per switch for the apples-to-apples synchronized hand-off (2,900 / 218).

Where the gap comes from:

  • Mode switch. The user → kernel → user round-trip alone is ~100 ns of entry/exit and ABI-mandated register save/restore. A goroutine switch never crosses that line.
  • Scheduler work in kernel space. The Linux fair scheduler (CFS, replaced by EEVDF as of kernel 6.6) keeps runnable threads in per-CPU red-black trees behind runqueue locks and balances load across CPUs. The Go scheduler does the same job in user space with per-P local runqueues and lock-free fast paths, and skips the kernel locks entirely.
  • Cache and TLB effects. A kernel scheduler may migrate a thread to a different core, costing you cold L1/L2 and an instruction-cache reload. Goroutines normally stay on the same P (and so the same M) unless another P steals them, so the cache stays warm.

What the model does not buy you: a goroutine that makes a real blocking syscall still pays for a real OS thread switch — the runtime detaches the G from its M and may spin up another M so the rest of the goroutines keep running. Async preemption (Go 1.14+, signal-based) is the runtime’s answer to tight loops that never yield, and it has its own cost. Once you saturate GOMAXPROCS, the user-space runqueue itself starts to show up in profiles.
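The async-preemption point can be observed with a small experiment. This is a sketch under stated assumptions: Go 1.14 or later on a platform with signal-based preemption (Linux, macOS, Windows), and the function name sleepBesideSpinner is made up. With a single P, a goroutine that spins without yielding would starve the caller forever unless the runtime preempts it.

```go
package main

import (
	"fmt"
	"runtime"
	"sync/atomic"
	"time"
)

// sleepBesideSpinner limits the runtime to one P, starts a goroutine that
// spins in a tight loop, then sleeps 20 ms on the caller's goroutine.
// It returns how many milliseconds passed before the caller ran again.
func sleepBesideSpinner() int64 {
	prev := runtime.GOMAXPROCS(1)
	defer runtime.GOMAXPROCS(prev)

	var stop int32
	done := make(chan struct{})
	go func() {
		defer close(done)
		for atomic.LoadInt32(&stop) == 0 {
			// Tight loop. The atomic load is typically compiled to a plain
			// load, so there is no cooperative yield point in here.
		}
	}()

	start := time.Now()
	time.Sleep(20 * time.Millisecond) // caller parks; the spinner takes the only P
	waited := time.Since(start)       // reached only once the spinner is preempted

	atomic.StoreInt32(&stop, 1)
	<-done
	return waited.Milliseconds()
}

func main() {
	fmt.Println("caller resumed after ms:", sleepBesideSpinner())
}
```

The wait is usually a little longer than the requested 20 ms. That gap is the preemption cost the paragraph above refers to, and the exact figure will vary by machine and Go version.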

The “user-space-ification” buys you cheap G-to-G switching on a hot M. That’s where the order-of-magnitude lives. The syscalls, the M-to-M handoffs, the actual kernel work — those are still as expensive as they always were. The model wins by making the common case — many concurrent goroutines, mostly waiting, occasionally running — almost free.

(The N100 is a low-power Alder Lake-N part built only from E-cores; absolute numbers will be smaller on a server-class Xeon or EPYC, but the ratio is expected to hold.)


The Unit of Scheduling

The cleanest comparison is to ask what each runtime actually schedules:

| | Node / TypeScript | Go |
| --- | --- | --- |
| Unit of scheduling | callback / Promise continuation | goroutine |
| What’s captured at suspension | tail of an async function as a heap closure | full call stack + registers |
| How code looks | explicit async/await | straight-line synchronous |
| Suspension marked by | the programmer (await) | the runtime (any blocking op) |
| Suspended state lives in | V8 microtask queue + heap closure | goroutine stack on the user-space heap |
| Kernel involvement | epoll/kqueue/IOCP via libuv | epoll/kqueue/IOCP via netpoller |
| CPU parallelism | one main JS thread; needs workers/cluster for cores | M:N scheduler runs goroutines across cores natively |
| Function color | yes (Promise infects up the call stack) | no (any function may block) |
| What breaks under CPU load | the entire event loop | nothing: scheduler runs another G on another M |

The two columns describe deeply different mental models, but they belong to the same family. They are both user-space concurrency runtimes that avoid kernel thread-per-request. They differ in where the suspension is captured (the language vs. the call stack) and how broad the scheduler’s mandate is.


Where the Boundaries Diverge: CPU-Bound Work

Node and Go look interchangeable on I/O-bound workloads. They diverge sharply the moment CPU work enters the picture.

Node’s event loop has one job: dispatch ready callbacks onto a single JS thread. If a callback runs for 200 ms doing JSON parsing or hashing, the loop is frozen for those 200 ms. Every other suspended continuation has to wait. Throughput collapses.

Go’s runtime has a different mandate. It doesn’t only manage waiting — it also manages execution. If you spawn:

go task1()
go task2()
go task3()

…the scheduler is happy to put each goroutine on a different M, run them on different cores in true parallel, and preempt long-running goroutines so they don’t starve the rest of the runtime. CPU-bound goroutines aren’t a special case to work around. They’re just goroutines.
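As a rough illustration of CPU-bound goroutines being ordinary goroutines, the sketch below splits a pure-CPU loop across one goroutine per P. The function names are hypothetical and the workload (a sum of squares) is chosen only because it is easy to check.

```go
package main

import (
	"fmt"
	"runtime"
	"sync"
)

// sumSquares adds i*i for every i in [lo, hi). Pure CPU work, no I/O.
func sumSquares(lo, hi uint64) uint64 {
	var s uint64
	for i := lo; i < hi; i++ {
		s += i * i
	}
	return s
}

// parallelSumSquares splits [0, n) into one chunk per worker goroutine.
// workers must be at least 1.
func parallelSumSquares(n uint64, workers int) uint64 {
	parts := make([]uint64, workers) // one slot per goroutine, so no lock needed
	chunk := n / uint64(workers)
	var wg sync.WaitGroup
	for w := 0; w < workers; w++ {
		lo := uint64(w) * chunk
		hi := lo + chunk
		if w == workers-1 {
			hi = n // last worker takes the remainder
		}
		wg.Add(1)
		go func(w int, lo, hi uint64) {
			defer wg.Done()
			parts[w] = sumSquares(lo, hi)
		}(w, lo, hi)
	}
	wg.Wait()
	var total uint64
	for _, p := range parts {
		total += p
	}
	return total
}

func main() {
	workers := runtime.GOMAXPROCS(0)
	fmt.Println("workers:", workers)
	fmt.Println("sum:", parallelSumSquares(1000000, workers))
}
```

Nothing here marks the work as CPU-bound. The same go statement used for I/O-bound handlers is what spreads the loop across cores.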

That’s why Go’s concurrency model covers more ground:

Node’s model mainly solves non-CPU-bound concurrency — network I/O, database waits, downstream API calls. Go’s model solves I/O waiting and CPU parallelism with the same primitive.

This isn’t a knock on Node. The event loop is brilliant at what it’s designed for: lots of slow waits, light per-request CPU. It’s the natural shape of API gateways, BFFs, websocket hubs, real-time aggregation, and most of the JSON-shuffling that makes up modern web backends. But sustained CPU work, mixed CPU + I/O pipelines, long-lived infrastructure services — those are workloads where Go’s scheduler-driven model has more headroom built in.


Two Answers to the Same Question

Strip away the implementation details and the two runtimes are answering the same question with different abstractions:

Concurrency at scale is the problem of what to do with the CPU while a request waits on I/O.

Node’s answer: turn the wait into an event, capture the rest of the function as a continuation, resume the continuation when the event fires. One thread cycling through ready continuations.

Go’s answer: run the request on a goroutine, suspend the goroutine in user space when it blocks, schedule another runnable goroutine onto the OS thread, resume the original when its wait completes.

Two ways of solving the same waste. One state-machines it. The other lowers the cost of context switching far enough that you can afford to keep one execution flow per request.

Two answers to one question: one is events, implemented as a state machine. The other is low-cost user-space context switching.

But there’s a deeper layer worth surfacing. The two answers also disagree about whether suspension should be visible in the type system. Node says yes — Promise<T> is part of the signature, async is part of the contract, function color propagates. Go says no — any function may block, and the type doesn’t carry that information.
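Here is what the Go side of that trade looks like in practice. This is a sketch with hypothetical names (fetchUser, greet, greetConcurrently), with time.Sleep standing in for a database round-trip. The blocking function has an ordinary signature, and whether it runs concurrently is decided at the call site.

```go
package main

import (
	"fmt"
	"time"
)

// fetchUser is a hypothetical lookup that blocks for a while. Its signature
// says nothing about suspension: it just returns a string.
func fetchUser(id int) string {
	time.Sleep(10 * time.Millisecond) // stand-in for a database round-trip
	return fmt.Sprintf("user-%d", id)
}

// greet calls fetchUser like any other function; it needs no async marker.
func greet(id int) string {
	return "hello, " + fetchUser(id)
}

// greetConcurrently runs the same greet on its own goroutine.
// Concurrency is chosen at the call site, not in greet's type.
func greetConcurrently(id int) <-chan string {
	out := make(chan string, 1)
	go func() { out <- greet(id) }()
	return out
}

func main() {
	fmt.Println(greet(1))
	fmt.Println(<-greetConcurrently(2))
}
```

The cost of this uniformity is exactly the one described above: nothing in greet’s type tells a caller that it may block.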

This visibility-vs-uniformity trade-off shows up far beyond Node and Go. It’s the same shape as monadic IO vs implicit IO in Haskell, checked vs unchecked exceptions in Java, capability-based security vs ambient authority. Each pair makes the same trade: composable static reasoning vs ergonomic uniform code. Node and Go are picking sides of a much bigger question.

You see the consequence in the libraries. Node libraries publish fs.readFile and fs.readFileSync, two retry helpers (one for sync ops, one for async), p-limit-style bounded-concurrency wrappers around Promise.all. Go libraries publish os.ReadFile (one function), one Retry(op func() error, n int) error, twenty lines of chan + WaitGroup for bounded concurrency. The Go versions aren’t simpler because Go developers are smarter — they’re simpler because the runtime hides the same complexity that Node’s type system insists on exposing.
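For a sense of scale, here is one possible shape of those Go helpers. It is a sketch, not any particular library’s code: Retry follows the signature quoted above, while ForEachBounded, flakyOp, and peakInFlight are hypothetical names added for the demo.

```go
package main

import (
	"errors"
	"fmt"
	"sync"
)

// Retry calls op up to n times. It returns nil on the first success, or the
// last error if every attempt fails. With n <= 0 it never calls op and returns nil.
func Retry(op func() error, n int) error {
	var err error
	for i := 0; i < n; i++ {
		if err = op(); err == nil {
			return nil
		}
	}
	return err
}

// ForEachBounded runs fn(i) for every i in [0, count), with at most limit
// calls in flight at once. limit must be at least 1.
func ForEachBounded(count, limit int, fn func(i int)) {
	sem := make(chan struct{}, limit) // buffered channel as a counting semaphore
	var wg sync.WaitGroup
	for i := 0; i < count; i++ {
		sem <- struct{}{} // blocks while limit goroutines are running
		wg.Add(1)
		go func(i int) {
			defer wg.Done()
			defer func() { <-sem }() // free the slot when fn returns
			fn(i)
		}(i)
	}
	wg.Wait()
}

// flakyOp returns an op that fails its first `failures` calls, plus a pointer
// to its call counter. Hypothetical helper for the demo.
func flakyOp(failures int) (func() error, *int) {
	calls := 0
	return func() error {
		calls++
		if calls <= failures {
			return errors.New("transient failure")
		}
		return nil
	}, &calls
}

// peakInFlight reports the highest number of fn calls that overlapped.
func peakInFlight(count, limit int) int {
	var mu sync.Mutex
	cur, peak := 0, 0
	ForEachBounded(count, limit, func(int) {
		mu.Lock()
		cur++
		if cur > peak {
			peak = cur
		}
		mu.Unlock()
		mu.Lock()
		cur--
		mu.Unlock()
	})
	return peak
}

func main() {
	op, calls := flakyOp(2)
	fmt.Println("retry error:", Retry(op, 5), "calls:", *calls)
	fmt.Println("peak in flight:", peakInFlight(100, 4))
}
```

Retry works for any operation, blocking or not, because op is a plain func() error. There is no second variant for an async flavor.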


The Closing Line

If you remember one thing from this:

Node turns waiting into events. Go turns execution flows into schedulable units. Both refuse to let the CPU sit idle while I/O blocks — they just disagree on what the unit of scheduling should be.

Or, if you want the deeper layer:

Node makes “this function might suspend” visible at the type level. Go makes it invisible.

That’s the whole story. Everything else — await vs go, libuv vs the netpoller, V8’s microtask queue vs GMP, single-thread bottleneck vs CPU-bound resilience, libraries that look complicated vs libraries that look simple — falls out of that one disagreement.


Appendix: Reproduce the Benchmark

goroutine_switch_test.go (run with GOMAXPROCS=1 go test -bench=. -benchtime=5s -count=5):

package bench

import (
	"runtime"
	"sync"
	"testing"
)

// Channel ping-pong: each iter is a full round-trip = 2 G-switches.
func BenchmarkGoroutineSwitchChannel(b *testing.B) {
	ch := make(chan struct{})
	done := make(chan struct{})
	go func() {
		for {
			select {
			case <-done:
				return
			case <-ch:
				ch <- struct{}{}
			}
		}
	}()
	b.ResetTimer()
	for i := 0; i < b.N; i++ {
		ch <- struct{}{}
		<-ch
	}
	b.StopTimer()
	close(done)
}

// Bare scheduler yield. Each iter ≈ 1 G-switch.
func BenchmarkGoroutineSwitchGosched(b *testing.B) {
	var wg sync.WaitGroup
	wg.Add(1)
	half := b.N / 2
	go func() {
		for i := 0; i < half; i++ {
			runtime.Gosched()
		}
		wg.Done()
	}()
	b.ResetTimer()
	for i := 0; i < half; i++ {
		runtime.Gosched()
	}
	wg.Wait()
}

pthread_switch.c (build and run with gcc -O2 -o pthread_switch pthread_switch.c -lpthread && taskset -c 0 ./pthread_switch 2000000):

#define _GNU_SOURCE
#include <pthread.h>
#include <stdio.h>
#include <stdlib.h>
#include <time.h>
#include <stdint.h>

static pthread_mutex_t mu  = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t  cv  = PTHREAD_COND_INITIALIZER;
static volatile int    turn = 0;
static long            iters;

static void *worker(void *arg) {
    int my_turn = (int)(intptr_t)arg;
    pthread_mutex_lock(&mu);
    for (long i = 0; i < iters; i++) {
        while (turn != my_turn) pthread_cond_wait(&cv, &mu);
        turn = 1 - my_turn;
        pthread_cond_broadcast(&cv);
    }
    pthread_mutex_unlock(&mu);
    return NULL;
}

static double now_ns(void) {
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return (double)ts.tv_sec * 1e9 + (double)ts.tv_nsec;
}

int main(int argc, char **argv) {
    iters = (argc > 1) ? atol(argv[1]) : 1000000L;
    pthread_t t0, t1;
    double start = now_ns();
    pthread_create(&t0, NULL, worker, (void *)(intptr_t)0);
    pthread_create(&t1, NULL, worker, (void *)(intptr_t)1);
    pthread_join(t0, NULL); pthread_join(t1, NULL);
    double end = now_ns();
    printf("ns / switch: %.1f\n", (end - start) / (2.0 * iters));
    return 0;
}

GOMAXPROCS=1 limits the runtime to a single P, so only one goroutine runs at a time and we measure pure G-to-G switching, not cross-core migration. taskset -c 0 pins both pthreads to one CPU so they actually have to context-switch (otherwise they run in parallel on two cores and there is nothing to measure). Both benches do the simplest possible synchronized hand-off, with no I/O and no real work, so what is left is the cost of the switch itself.
