On AI from Serbia by Marko Radojčić

Civil engineer and experienced IT guy reflecting on AI. Personal blog

Elements of Local LLM Running & Inference

Published on January 17, 2025

Locally run Large Language Models (LLMs) are neural networks for natural language processing, usually based on the Transformer architecture. The major breakthrough was the publication of the “Attention Is All You Need” paper on arXiv.

The comparatively small size of models published for local inference has led to a proliferation of Large Language Models and brought them to users worldwide who run them on their own hardware.

These models come in various shapes (model architectures) and sizes.

Starting with Meta’s LLaMA, many others have followed and released their own weights for pretrained models (Generative Pretrained Transformer, or GPT for short).

With the emergence of projects such as llama.cpp, which supports multiple platforms (all major operating systems: Windows, Linux, macOS), it became possible to create more “out of the box” solutions that package llama.cpp functionality and its interface into easy-to-use tools. The main examples are LM Studio and Ollama, both of which come with server capabilities for running LLMs locally. This means a single reasonably powerful computer is enough: any system with a relatively recent CPU and 16 GB of RAM can run these LLMs, and the quantization techniques built into the inference engines allow for smaller models, faster downloads, and faster inference on low-powered hardware.

Both LM Studio and Ollama can fetch model weights and set up an LLM-based AI solution with a single command or a few clicks. Ollama has a list of supported models, while LM Studio can download models in various quantizations directly from Hugging Face 🤗 in the appropriate format.
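
As a minimal sketch of what the server capability gives you, the Python snippet below queries a locally running Ollama server over its default HTTP API on port 11434. It assumes Ollama is already running and that the model named in it (here llama3.2, purely as an example) has been pulled:

```python
# Minimal sketch: query a local Ollama server over its HTTP API.
# Assumes Ollama is running on the default port 11434 and that the
# model named below has already been pulled (the name is an example).
import json
import urllib.request

payload = {
    "model": "llama3.2",  # substitute a model you have actually pulled
    "prompt": "Explain quantization in one sentence.",
    "stream": False,      # request a single JSON response, not a stream
}

req = urllib.request.Request(
    "http://localhost:11434/api/generate",
    data=json.dumps(payload).encode("utf-8"),
    headers={"Content-Type": "application/json"},
)

with urllib.request.urlopen(req) as resp:
    body = json.load(resp)

print(body["response"])   # the generated text
```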

What is quantization?

Model weights can be stored and processed with different levels of precision. Encoding the original model weights (usually 32-bit floating-point tensors, as used for training) as 16-, 8-, or 4-bit numbers reduces model size (although it usually degrades generation quality during inference).
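
As a rough illustration of what lowering precision does (not how llama.cpp actually stores quantized weights, just the basic idea), the NumPy sketch below re-encodes 32-bit weights as 16-bit, halving the memory at the cost of a small rounding error:

```python
# Minimal sketch of precision reduction with NumPy (illustrative only;
# real inference engines use dedicated quantized formats, e.g. GGUF).
import numpy as np

rng = np.random.default_rng(0)
w32 = rng.standard_normal(1_000_000).astype(np.float32)  # "original" fp32 weights

w16 = w32.astype(np.float16)  # re-encode at lower precision

print(f"fp32 size: {w32.nbytes / 1e6:.1f} MB")  # 4.0 MB
print(f"fp16 size: {w16.nbytes / 1e6:.1f} MB")  # 2.0 MB

# The price: a small rounding error on every weight.
err = np.abs(w32 - w16.astype(np.float32)).mean()
print(f"mean absolute rounding error: {err:.2e}")
```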

Integer quantizations also exist, and the 1-bit LLM architecture with ternary weights (three possible values: -1, 0, +1) is making an impact on the landscape of increasingly popular small models.
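
For a flavor of how such extreme quantization works, below is a minimal sketch of ternary weight quantization in the spirit of BitNet b1.58. The function names and the per-tensor absmean scaling shown here are an illustrative simplification, not the full training recipe:

```python
# Minimal sketch of ternary weight quantization (BitNet b1.58 style):
# each weight is mapped to -1, 0, or +1, plus one per-tensor scale.
import numpy as np

def ternary_quantize(w: np.ndarray):
    """Quantize weights to {-1, 0, +1} using the mean absolute value as scale."""
    scale = np.abs(w).mean() + 1e-8          # per-tensor scale (absmean)
    q = np.clip(np.round(w / scale), -1, 1)  # ternary values
    return q.astype(np.int8), scale

def ternary_dequantize(q: np.ndarray, scale: float) -> np.ndarray:
    return q.astype(np.float32) * scale

rng = np.random.default_rng(0)
w = rng.standard_normal((4, 4)).astype(np.float32)
q, s = ternary_quantize(w)
print(q)                         # only -1, 0, +1 entries
print(ternary_dequantize(q, s))  # coarse reconstruction of w
```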

These models (1B, 1.5B, 2B, 3B) are a solution for local inference on lower-spec systems or for agentic infrastructure deployments.

The line between LLMs and smaller Language Models (LMs), both based on the Transformer architecture, is blurring progressively.

What used to count as an LLM (7B+) is now almost considered a medium-size LM.

The ?B in a model’s name denotes the number of billions of weight parameters the model has.
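
As a back-of-the-envelope sketch, the weight memory footprint can be estimated as parameter count times bits per weight. The helper below is a hypothetical illustration; real model files add overhead for metadata and layers kept at higher precision:

```python
# Rough weight-memory estimate: parameters * bits-per-weight / 8 bytes.
# Real model files add overhead (metadata, embeddings kept at higher
# precision, etc.), so treat these as ballpark figures.
def weight_gb(params_billion: float, bits: int) -> float:
    return params_billion * 1e9 * bits / 8 / 1e9

for params in (1, 3, 7):
    for bits in (32, 16, 8, 4):
        print(f"{params}B @ {bits}-bit ≈ {weight_gb(params, bits):.1f} GB")
```

For example, a 7B model at 4-bit quantization weighs in at roughly 3.5 GB, which is what makes such models practical on a machine with 16 GB of RAM.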