Updated March 20, 2026

Cloud Run SIGILL: CPU Swap Broke llama.cpp

cloud-run · debugging · llama-cpp · gcp · ai-infrastructure
HAIT Cloud & DevOps Consulting

Problem: Cloud Run crashes with SIGILL (signal 4) during cold starts after months of stable operation - no code changes, no dependency updates.

Root cause: Google silently swapped CPUs in europe-west2 to Sapphire Rapids. These CPUs report AVX-512 in CPUID but have it disabled at the hypervisor level. Libraries like ggml pick AVX-512 code paths and hit an illegal instruction.

Fix: Rebuild llama-cpp-python with AVX-512 disabled:

CMAKE_ARGS='-DGGML_NATIVE=OFF -DGGML_AVX512=OFF' pip wheel llama-cpp-python

Affected: Any library using CPUID-based dispatch (ggml, catboost, FAISS, OpenBLAS) on Cloud Run or any serverless platform where hardware changes without notice.

We had an AI chat service running on Google Cloud Run. Python, FastAPI, llama.cpp with a Llama 3.2 3B model, ChromaDB for vectors, the usual stack. It ran fine for six months. Nobody touched it.

Then one day, every cold start began crashing with SIGILL (signal 4). Illegal instruction.

No code changes. No dependency updates. Nothing.

What the crash looked like

The logs showed the LLM engine starting its warm-up inference, and then… nothing. No Python traceback, no error message. The process just died. Cloud Run killed it after the startup probe timed out.

The exit signal was SIGILL - the CPU hit an instruction it couldn’t execute. That’s not something you see in normal Python code. It means something went wrong deep in a C/C++ extension.
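This is easy to reproduce in miniature. A sketch (not the service's code) showing how a signal death surfaces to the parent process: no traceback, just a terminated child whose return code encodes the signal.

```python
import signal
import subprocess
import sys

# Spawn a child that raises SIGILL in itself, mimicking a crash inside a
# native extension: no Python traceback, just a signal-terminated process.
proc = subprocess.run(
    [sys.executable, "-c", "import os, signal; os.kill(os.getpid(), signal.SIGILL)"]
)

# A signal-killed child reports the negated signal number (SIGILL is 4),
# which shells surface as exit status 128 + 4 = 132.
print(proc.returncode)  # -4 on Linux
```

That 128 + 4 = 132 exit status is what shows up in Cloud Run's termination logs.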

Wrong guesses

Corrupted model? No. The model loaded fine every time. The crash happened during the first actual inference call, not during loading.

Out of memory? Service had 8Gi allocated, peak usage was around 4Gi. Not even close.

Dependency changed? The base Docker image was built back in August 2025 and never rebuilt. Same wheel, same versions.

Thread safety? This was the interesting one. The crash happened in a worker thread. But when I ran the exact same inference code in a separate subprocess, it worked. Same container, same model, same code. Different process.

That last bit was the clue.
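The isolation test itself is generic and worth keeping around. A minimal sketch (the `probe` function is a stand-in for the real inference call, not the production code):

```python
import multiprocessing as mp


def run_isolated(fn, *args):
    """Run fn in a fresh process and report its exit code.

    Exit code 0 means it survived; a negative value is the signal that
    killed it (-4 would be SIGILL).
    """
    p = mp.Process(target=fn, args=args)
    p.start()
    p.join()
    return p.exitcode


def probe():
    # Stand-in for the real warm-up inference call.
    pass


if __name__ == "__main__":
    print(run_isolated(probe))  # 0: the call survives in a clean process
```

If the same call dies in the main process but survives in a fresh one, the trigger is something the main process has already loaded.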

What was actually happening

The difference between the main process and the subprocess was what else was loaded. The main process had PyTorch loaded (for a toxicity detection model). The subprocess didn’t.

Both PyTorch and llama.cpp use OpenMP and both query CPU features at startup to pick optimized code paths. When I checked the CPUID flags inside the container:

grep -o 'avx512[a-z_]*' /proc/cpuinfo | sort -u
avx512bw
avx512cd
avx512dq
avx512f
avx512vl

Intel Sapphire Rapids. Google had swapped the hardware in europe-west2 sometime between September 2025 and February 2026.

Here’s the problem: Sapphire Rapids physically has AVX-512 and correctly reports it in CPUID. But Google Cloud Run disables AVX-512 execution on these CPUs.

So ggml (the compute library inside llama.cpp) reads CPUID, sees AVX-512 flags, picks the AVX-512 code path for matrix math, and then… SIGILL. The CPU says “I can do this” but when you actually try, it throws an illegal instruction.

This is a known issue that hits multiple projects. Any library doing CPUID-based dispatch (catboost, FAISS, ggml, OpenBLAS) can run into this on Cloud Run.

Why it worked before

Before the hardware swap, europe-west2 probably ran older CPUs that either lacked AVX-512 entirely (Haswell, Broadwell) or had it actually enabled (Cascade Lake and Ice Lake both support it). Google swapped hardware without any notification. No changelog entry, no email, nothing. The service just started dying.

The fix

Rebuild llama_cpp_python with AVX-512 explicitly disabled:

docker run --platform linux/amd64 \
  -v "$(pwd)/assets:/out" \
  python:3.10-slim bash -c "
    apt-get update && apt-get install -y build-essential cmake gcc g++ &&
    CMAKE_ARGS='-DGGML_NATIVE=OFF -DGGML_AVX512=OFF' \
    pip wheel llama-cpp-python==0.3.5 --no-deps -w /out
  "

Two flags matter here:

  • -DGGML_NATIVE=OFF - don’t optimize for the build machine’s CPU
  • -DGGML_AVX512=OFF - skip AVX-512 paths even if CPUID says they’re available

Then drop the wheel into the Docker image:

COPY assets/llama_cpp_python-0.3.5-cp310-cp310-linux_x86_64.whl /tmp/
RUN pip install --no-deps --force-reinstall /tmp/llama_cpp_python-*.whl

Now ggml falls back to AVX2, which Sapphire Rapids fully supports. Performance difference for a 3B model is negligible.

Why serverless CPU changes break things

This is not a Cloud Run-specific problem. Every major cloud provider reserves the right to swap underlying hardware at any time. AWS, GCP, and Azure all guarantee vCPU count and memory, but none of them give you an SLA on specific CPU features. The CPUID flags you see today might not be the same ones you see next month.

This affects any workload that relies on CPU-specific instruction sets. ML inference libraries like ggml and ONNX Runtime use AVX-512 for matrix operations. Cryptography libraries use AES-NI for hardware-accelerated encryption. Video encoding tools like FFmpeg use SSE4 and AVX2 for codec optimizations. If your code auto-detects and dispatches to these instruction sets at runtime, a hardware swap can break it silently.

AWS customers hit the blunt version of this when migrating from Intel-based instances to Graviton: an x86 binary simply cannot run on ARM, so the failure was immediate and obvious. The serverless version is worse because the failure is subtle - you don’t control the migration, and you don’t get notified.
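One cheap way to notice a silent swap is to record the CPU model at every cold start and diff it across deploys. A minimal Linux-only sketch (illustrative, not the service's code):

```python
import re


def cpu_model() -> str:
    """Return the CPU model string from /proc/cpuinfo (Linux only)."""
    with open("/proc/cpuinfo") as f:
        match = re.search(r"model name\s*:\s*(.+)", f.read())
    return match.group(1).strip() if match else "unknown"


# Log this once per cold start; when "Cascade Lake" becomes
# "Sapphire Rapids" in your logs, you'll know before the crashes start.
print(cpu_model())
```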

Takeaways

If you’re running native code (llama.cpp, PyTorch, numpy with MKL, whatever) on serverless, always compile with -DGGML_NATIVE=OFF or equivalent. Don’t let the binary auto-detect CPU features at build time. The machine you build on and the machine you run on are not the same, and the cloud provider can swap hardware whenever they want.

SIGILL with no Python traceback almost always means the crash is in a C/C++ extension. Python never gets a chance to raise, so there is nothing to catch.

Pin your build flags. Don’t rely on auto-detection to pick the right code paths. Explicitly disable CPU features you don’t need. -DGGML_AVX512=OFF is better than hoping every machine in the fleet supports AVX-512. The same applies to any library with compile-time feature flags - always set them explicitly for serverless targets rather than letting the build system probe the build machine’s capabilities.

Monitor CPUID at startup. Add a startup check that logs which instruction sets are actually available - a quick read of /proc/cpuinfo flags or an equivalent CPUID query. When the hardware changes, you’ll see it in your logs immediately on the next cold start instead of debugging a crash after the fact. This takes five lines of code and gives you an early warning system for exactly this class of failure.
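A sketch of that startup check, reading the advertised SIMD flags from /proc/cpuinfo (Linux only; names are illustrative):

```python
import re


def simd_flags() -> list[str]:
    """List the SIMD instruction-set flags /proc/cpuinfo advertises (Linux only)."""
    with open("/proc/cpuinfo") as f:
        text = f.read()
    # Collect sse*, avx*, and fma flags; dedupe because the flags line
    # repeats once per logical CPU.
    return sorted(set(re.findall(r"\b(?:sse[0-9_]*|avx[0-9a-z_]*|fma)\b", text)))


# Log at startup. Note the caveat from this incident: an avx512* flag
# appearing here does not guarantee the hypervisor will actually let
# you execute those instructions.
print(simd_flags())
```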

And if your service suddenly breaks after months of stability with no code changes - check if the underlying hardware changed. On serverless, you don’t control that. And nobody will tell you when it happens.

Frequently Asked Questions

What causes SIGILL on Cloud Run?

SIGILL (signal 4) means the CPU encountered an instruction it cannot execute. On Cloud Run, this typically happens when Google swaps the underlying hardware to CPUs that report AVX-512 in CPUID but have it disabled at the hypervisor level.

How do I fix SIGILL crashes in llama-cpp-python on Cloud Run?

Rebuild the wheel with AVX-512 explicitly disabled: CMAKE_ARGS='-DGGML_NATIVE=OFF -DGGML_AVX512=OFF' during the pip wheel build. This forces ggml to use AVX2, which works on all Cloud Run CPUs.

Does Google notify you when Cloud Run hardware changes?

No. Google can swap CPU hardware at any time without notification. There is no changelog, email, or API to detect it. This affects any service using CPU-specific optimizations on serverless platforms.

Which libraries are affected by the Cloud Run AVX-512 issue?

Any library doing CPUID-based dispatch: llama.cpp/ggml, catboost, FAISS, OpenBLAS, PyTorch with MKL, and numpy with MKL. Always compile with -DGGML_NATIVE=OFF or equivalent for serverless deployments.


Need help debugging cloud-native AI deployments? Get in touch.