OpenAI-compatible API for local LLMs
2025-11-01
Overview
This project provides a fully self-hosted, OpenAI-compatible API for running large language models locally.
It exposes the same /v1/chat/completions interface used by OpenAI, allowing any existing client or application to switch to a local LLM simply by changing the base URL.
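For example, the official OpenAI Python SDK can talk to the local server just by overriding its base URL. The URL, port, API key, and model name below are placeholders for illustration; adjust them to match your deployment:

```python
from openai import OpenAI

# Point the standard OpenAI client at the local server instead of api.openai.com.
# Base URL, key, and model name are placeholders, not fixed values of this project.
client = OpenAI(
    base_url="http://localhost:8000/v1",
    api_key="not-needed",  # the local server does not require a real key by default
)

response = client.chat.completions.create(
    model="local-model",  # whichever GGUF model the server has loaded
    messages=[{"role": "user", "content": "Say hello in one sentence."}],
)
print(response.choices[0].message.content)
```

No other code changes are needed: the request and response shapes match the OpenAI API, so existing tooling keeps working.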
The implementation is built around llama-cpp-python, with careful optimization for different hardware backends. I created three specialized Docker images to ensure the server runs efficiently on:
- GPU-accelerated NVIDIA instances (CUDA + cuBLAS)
- Standard x86-64 CPUs
- ARM64 systems (servers with ARM processors)
Each image is built from source with the correct flags, libraries, and quantization support to minimize startup time and maximize inference throughput.
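As a rough illustration, the CUDA variant looks something like the sketch below. The base image tag, package list, and entrypoint are assumptions rather than the exact production Dockerfile, and the CUDA/cuBLAS switch depends on the llama-cpp-python version: `-DGGML_CUDA=on` on newer releases, `-DLLAMA_CUBLAS=on` on older ones.

```dockerfile
# Sketch of the GPU image; details are illustrative, not the exact build.
FROM nvidia/cuda:12.2.0-devel-ubuntu22.04

RUN apt-get update && apt-get install -y --no-install-recommends \
    python3 python3-dev python3-pip build-essential cmake git \
    && rm -rf /var/lib/apt/lists/*

# Build llama-cpp-python from source with CUDA/cuBLAS enabled.
ENV CMAKE_ARGS="-DGGML_CUDA=on"
RUN pip3 install --no-cache-dir "llama-cpp-python[server]"

EXPOSE 8000
# Serve the OpenAI-compatible API; the model path is supplied at run time.
ENTRYPOINT ["python3", "-m", "llama_cpp.server", "--host", "0.0.0.0", "--port", "8000"]
```

The CPU and ARM64 images follow the same pattern with different base images and BLAS backends, so the only build-time difference is which compile flags get passed to llama.cpp.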
Project Objectives
- Reliability across hardware: build a server that starts quickly and delivers predictable performance, whether it runs on a small laptop or a GPU-attached cloud instance.
- Drop-in compatibility: allow any OpenAI client (Python, JS, cURL, Postman, LangChain, etc.) to work immediately, without code changes.
- Self-custody of data: enable developers and teams to run inference without sending prompts to external APIs, which matters for privacy, prototyping, controlled environments, and compliance-sensitive applications.
- Portable deployment: package everything in Docker so the API behaves consistently on local machines, remote servers, and cloud GPU instances (example run commands below). I also tested AMI-based distribution on AWS as a possible lightweight PaaS.
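For reference, running the images looks roughly like this. The image tags, port, and model file name are hypothetical; the model is mounted from the host as a GGUF file:

```bash
# GPU (NVIDIA) host: pass the GPU through and mount a local models directory.
docker run --gpus all -p 8000:8000 \
  -v "$PWD/models:/models" \
  local-llm:cuda --model /models/model.Q4_K_M.gguf

# CPU-only host (x86-64 or ARM64 image, depending on the machine):
docker run -p 8000:8000 \
  -v "$PWD/models:/models" \
  local-llm:cpu --model /models/model.Q4_K_M.gguf
```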