OpenAI-compatible API for local LLMs
2025-11-01
Overview
This project provides a fully self-hosted, OpenAI-compatible API for running large language models locally.
It exposes the same /v1/chat/completions interface used by OpenAI, allowing any existing client or application to switch to a local LLM simply by changing the base URL.
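For example, the official OpenAI Python SDK can talk to the local server just by overriding its base URL. The URL, port, API key, and model name below are placeholders for illustration; adjust them to match your deployment:

```python
from openai import OpenAI

# Point the standard OpenAI client at the local server instead of api.openai.com.
# Base URL, key, and model name are placeholders, not fixed values of this project.
client = OpenAI(
    base_url="http://localhost:8000/v1",
    api_key="not-needed",  # the local server does not require a real key by default
)

response = client.chat.completions.create(
    model="local-model",  # whichever GGUF model the server has loaded
    messages=[{"role": "user", "content": "Say hello in one sentence."}],
)
print(response.choices[0].message.content)
```

No other code changes are needed: the request and response shapes match the OpenAI API, so existing tooling keeps working.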
The implementation is built around llama-cpp-python, with careful optimization for different hardware backends. I created three specialized Docker images to ensure the server runs efficiently on:
- GPU-accelerated NVIDIA instances (CUDA + cuBLAS)
- Standard x86-64 CPUs
- ARM64 systems (servers with ARM processors)
Each image is built from source with the correct flags, libraries, and quantization support to minimize startup time and maximize inference throughput.
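As a rough illustration, the CUDA variant looks something like the sketch below. The base image tag, package list, and entrypoint are assumptions rather than the exact production Dockerfile, and the CUDA/cuBLAS switch depends on the llama-cpp-python version: `-DGGML_CUDA=on` on newer releases, `-DLLAMA_CUBLAS=on` on older ones.

```dockerfile
# Sketch of the GPU image; details are illustrative, not the exact build.
FROM nvidia/cuda:12.2.0-devel-ubuntu22.04

RUN apt-get update && apt-get install -y --no-install-recommends \
    python3 python3-dev python3-pip build-essential cmake git \
    && rm -rf /var/lib/apt/lists/*

# Build llama-cpp-python from source with CUDA/cuBLAS enabled.
ENV CMAKE_ARGS="-DGGML_CUDA=on"
RUN pip3 install --no-cache-dir "llama-cpp-python[server]"

EXPOSE 8000
# Serve the OpenAI-compatible API; the model path is supplied at run time.
ENTRYPOINT ["python3", "-m", "llama_cpp.server", "--host", "0.0.0.0", "--port", "8000"]
```

The CPU and ARM64 images follow the same pattern with different base images and BLAS backends, so the only build-time difference is which compile flags get passed to llama.cpp.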
Project Objectives
- Reliability across hardware: build a server that starts quickly and delivers predictable performance, whether it runs on a small laptop or a GPU-attached cloud instance.
- Drop-in compatibility: allow any OpenAI client (Python, JS, cURL, Postman, LangChain, etc.) to work immediately, without code changes.
- Self-custody of data: enable developers and teams to run inference without sending prompts to external APIs, which matters for privacy, prototyping, controlled environments, and compliance-sensitive applications.
- Portable deployment: package everything in Docker so the API behaves consistently on local machines, remote servers, and cloud GPU instances (example run commands below). I also tested AMI-based distribution on AWS as a possible lightweight PaaS.
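For reference, running the images looks roughly like this. The image tags, port, and model file name are hypothetical; the model is mounted from the host as a GGUF file:

```bash
# GPU (NVIDIA) host: pass the GPU through and mount a local models directory.
docker run --gpus all -p 8000:8000 \
  -v "$PWD/models:/models" \
  local-llm:cuda --model /models/model.Q4_K_M.gguf

# CPU-only host (x86-64 or ARM64 image, depending on the machine):
docker run -p 8000:8000 \
  -v "$PWD/models:/models" \
  local-llm:cpu --model /models/model.Q4_K_M.gguf
```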