OpenAI-compatible API for local LLMs

2025-11-01

Overview

This project provides a fully self-hosted, OpenAI-compatible API for running large language models locally. It exposes the same /v1/chat/completions interface used by OpenAI, allowing any existing client or application to switch to a local LLM simply by changing the base URL.
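
For example, a client built with the official openai Python package only needs a different base URL. The host, port, model name, and API key below are placeholders for whatever the local server is configured with, not values defined by this project:

    from openai import OpenAI

    # Point the standard OpenAI client at the local server instead of api.openai.com.
    # The API key must be non-empty for the client, but its value only matters
    # if the server is configured to check it.
    client = OpenAI(base_url="http://localhost:8000/v1", api_key="not-needed")

    response = client.chat.completions.create(
        model="local-model",  # placeholder; use the model name the server exposes
        messages=[{"role": "user", "content": "Say hello from a local LLM."}],
    )
    print(response.choices[0].message.content)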

The implementation is built around llama-cpp-python, with careful optimization for different hardware backends. I created three specialized Docker images to ensure the server runs efficiently on:

  • GPU-accelerated NVIDIA instances (CUDA + cuBLAS)
  • Standard x86-64 CPUs
  • ARM64 systems (servers with ARM processors)

Each image is built from source with the correct flags, libraries, and quantization support to minimize startup time and maximize inference throughput.
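
The build details differ per image, but at run time the backend choice largely comes down to a handful of llama-cpp-python parameters. A minimal sketch, with an illustrative model path and example values rather than the shipped defaults:

    from llama_cpp import Llama

    # The same code path serves all three images; only the offload and
    # threading parameters differ between the CUDA, x86-64, and ARM64 builds.
    llm = Llama(
        model_path="/models/model.gguf",  # example path to a quantized GGUF model
        n_gpu_layers=-1,   # CUDA image: offload all layers to the GPU
        # n_gpu_layers=0,  # CPU/ARM64 images: keep everything on the CPU
        n_threads=8,       # tune to the available CPU cores
        n_ctx=4096,        # context window size
    )

    out = llm.create_chat_completion(
        messages=[{"role": "user", "content": "ping"}],
        max_tokens=16,
    )
    print(out["choices"][0]["message"]["content"])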

Project Objectives

  • Reliability across hardware: Build a server that starts quickly and delivers predictable performance, whether running on a small laptop or a GPU-attached cloud instance.

  • Drop-in compatibility: Allow any OpenAI client (Python, JS, cURL, Postman, LangChain, etc.) to work immediately without code changes; see the sketch after this list.

  • Self-custody of data: Enable developers and teams to run inference without sending prompts to external APIs, which is useful for privacy, prototyping, controlled environments, or compliance-sensitive applications.

  • Portable deployment: Package everything in Docker so the API behaves consistently on local machines, remote servers, or cloud GPU instances. I also tested AMI-based distribution on AWS as a possible lightweight PaaS.
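
As an illustration of the drop-in objective, a LangChain application might target the server as sketched below; only the base URL changes relative to the hosted API. The langchain-openai package and the placeholder values are assumptions about the client setup, not part of this project:

    from langchain_openai import ChatOpenAI

    # Same LangChain code as for the hosted OpenAI API; only the base URL differs.
    llm = ChatOpenAI(
        base_url="http://localhost:8000/v1",  # local server instead of api.openai.com
        api_key="not-needed",                 # accepted unless the server enforces keys
        model="local-model",                  # placeholder model name
    )

    print(llm.invoke("Summarize why local inference helps with data privacy.").content)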