The Weekend I Disconnected from the Cloud

In the previous post I wrote about the Fable 5 ban and why we in Europe need to own a piece of our AI stack. That was the strategic argument. This is the practical one: the weekend I actually sat down, turned off the cloud, and forced myself to do real work with models running on my own machine.

I expected it to be frustrating. It was, sometimes. But mostly it was surprising.

Start with the engine, not the car

The most common mistake people make with local models is starting by chasing the perfect model. They read benchmarks, compare parameter counts, debate Qwen versus Llama on Reddit. Then they realise they do not even have a program that can run the model.

That is the wrong order. First you need a runtime: a program that takes a model file and runs it on your hardware. There are really only two names worth knowing.

Ollama is the favourite among developers. It runs from the terminal. One command and the model starts. If you are comfortable with the terminal, it is the fastest way to get going. I run it myself.
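Once Ollama is running, it also listens on a local HTTP port, which is how you reach the model from your own scripts. Here is a minimal sketch using only the Python standard library. It assumes the default address (localhost, port 11434) and a model you have already pulled. The function names are mine, purely for illustration.

```python
import json
import urllib.request

# Assumption: Ollama is running locally on its default port.
OLLAMA_URL = "http://localhost:11434/api/generate"


def build_request(model: str, prompt: str) -> urllib.request.Request:
    """Build a non-streaming generate request for a local Ollama server."""
    payload = {"model": model, "prompt": prompt, "stream": False}
    return urllib.request.Request(
        OLLAMA_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def ask(model: str, prompt: str, timeout: float = 120.0) -> str:
    """Send the prompt to the local server and return the generated text."""
    request = build_request(model, prompt)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read())["response"]
```

With a model pulled (for example via `ollama pull mistral`), calling `ask("mistral", "Summarise this paragraph: ...")` should return the answer as a plain string. Nothing leaves your machine.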

LM Studio has a graphical interface. A model browser where you click and it runs. No terminal. If the terminal scares you, start here. It works just as well.

In both cases, you have a model running in 15 minutes. It really is no harder than installing an app.

Parameters, RAM, and where the line is

A model's size is measured in billions of parameters. Bigger generally means more capable, but bigger also means more memory. The only thing you really need to understand about local AI hardware is this breakdown.

4 billion parameters runs on practically anything. A laptop with 8 GB of RAM. Many phones. Results are limited, but it works as proof of concept.

12 billion parameters is the sweet spot for machines with 16 GB of RAM. This is where most people should live. It is enough for summaries, writing assistance, simpler coding tasks, and most daily questions. This is the level where I was surprised. A 12-billion-parameter model on my Mac handled an estimated 80 percent of what I normally use Claude or ChatGPT for.

27 to 35 billion parameters needs a good Mac with 32 GB or more, or a dedicated GPU. Here it starts feeling genuinely capable. Code generation, longer documents, more nuanced reasoning.

70 billion and above requires serious hardware. A maxed-out Mac Studio or a dedicated AI box. Nvidia's DGX Spark is built specifically for this: 128 GB of unified memory, designed to run around the clock, running Linux. It effectively becomes a small data centre on your desk. You start the model, leave it running, and connect from your phone.

My setup for this weekend was a MacBook Pro with 36 GB of RAM. That was enough for 27-billion-parameter models with quantisation, and more than sufficient to do real work.
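If you want to sanity-check those tiers yourself, the arithmetic is simple: parameters times bits per weight, divided by eight, gives bytes. The sketch below is a rough estimate only. The 4.5 bits per weight for a Q4-style file and the 6 GB of headroom for the operating system and context are my assumptions, not measured values.

```python
def weights_gb(params_billion: float, bits_per_weight: float) -> float:
    """Approximate memory for the weights alone, in GB (10^9 bytes)."""
    # billions of parameters * bits per weight / 8 bits per byte = billions of bytes
    return params_billion * bits_per_weight / 8


def fits(params_billion: float, bits_per_weight: float,
         ram_gb: float, headroom_gb: float = 6.0) -> bool:
    """Rough check: weights plus assumed headroom must fit in RAM."""
    return weights_gb(params_billion, bits_per_weight) + headroom_gb <= ram_gb


if __name__ == "__main__":
    # Compare unquantised 16-bit weights with an assumed 4.5 bits per weight.
    for size in (4, 12, 27, 70):
        full = weights_gb(size, 16)
        quantised = weights_gb(size, 4.5)
        print(f"{size}B: {full:.1f} GB at 16-bit, {quantised:.1f} GB at ~4.5-bit")
```

On these assumptions a 27-billion-parameter model needs about 54 GB at 16 bits but only about 15 GB quantised, which is why it fits on a 36 GB laptop.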

The five models you need to know

There are dozens of open models. I tested the ones people actually recommend and can confirm the reputation holds for most of them.

Qwen 3.6 from Alibaba is the best all-rounder right now. Strong on coding, strong on multilingual work, clean commercial licence. The 27- and 35-billion-parameter variants perform at levels surpassing previous-generation models four times their size. If you are only going to learn one model, learn this one.

DeepSeek is best for harder reasoning and coding problems. One thing to know: the reasoning models think for 10 to 30 seconds before responding. That is normal. It is not a performance problem, it is the model actually reasoning.

Gemma from Google is remarkably compact. There is a version that fits in 16 GB of RAM. It writes beautifully. That Google gives it away for free is actually remarkable. It would not be surprising if Google doubles down on Gemma now that local models have suddenly become a strategic question.

Mistral from Paris is the European challenger. Mixtral with its mixture-of-experts architecture has become a standard for local installations. Mistral Large performs in the same class as GPT-4 and Claude Sonnet. Clean commercial licence, European law, and no conflict of interest with the US Department of Defence. If the European dependency concerns you, and it should, Mistral is the natural first choice.

Llama from Meta has the largest ecosystem. Enormous community, masses of fine-tunings and tutorials. Runs on almost anything. When you do not know which model fits, there is likely a Llama variant for your situation.

Quantisation: the trick that makes it all possible

Quantisation is what actually makes local models practically usable. The technique shrinks a model so it runs on weaker hardware with minimal quality loss.

Think of it as the difference between an uncompressed image file and a high-quality JPEG. The file is much smaller, and the eye can barely tell the difference.

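When you download models you will see labels like Q4 or Q5. That is the compression level: roughly the number of bits used to store each weight. Q4 stores each weight in about four bits instead of sixteen, which cuts the memory requirement to roughly a third of the full-precision size with fairly minimal quality loss. This is what makes a model that "really" needs a server suddenly run smoothly on your laptop.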

This concept is the key. It is what lets your hardware run models it otherwise could not.
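To make the idea concrete, here is a toy version of 4-bit quantisation: a block of weights is stored as small integers plus one shared scale factor, then multiplied back out when needed. This is a deliberately simplified sketch of the principle, not the actual scheme used in the model files you download, and the example weights are made up.

```python
def quantise_4bit(weights):
    """Symmetric 4-bit quantisation: integers in [-7, 7] plus one scale."""
    max_abs = max(abs(w) for w in weights)
    scale = max_abs / 7 if max_abs > 0 else 1.0
    # Round each weight to the nearest step and clamp to the 4-bit range.
    quantised = [max(-7, min(7, round(w / scale))) for w in weights]
    return quantised, scale


def dequantise(quantised, scale):
    """Recover approximate weights from the integers and the shared scale."""
    return [q * scale for q in quantised]


if __name__ == "__main__":
    weights = [0.12, -0.53, 0.07, 0.91, -0.34, 0.0, 0.48, -0.88]  # made-up values
    q, scale = quantise_4bit(weights)
    restored = dequantise(q, scale)
    worst = max(abs(a - b) for a, b in zip(weights, restored))
    print("stored integers:", q)
    print(f"scale: {scale:.3f}, worst rounding error: {worst:.3f}")
```

Each restored weight lands within half a step of the original. That small rounding error is the "quality loss", and the four bits per integer are the memory saving.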

Connecting to an agent

Running a model and chatting with it in the terminal is one thing. The real leap comes when you point an agent at your local model.

Hermes is currently the most widely used AI agent in the world, and it is specifically built to run locally and never stop. You point a Hermes profile at your local model and suddenly you have an agent that runs for free, runs offline, remembers everything, writes its own tools, and that you can message via Telegram or your preferred messaging app while the heavy work runs on the box on your desk.

It is roughly the difference between having a calculator and having an assistant. The calculator answers what you ask. The assistant takes initiative, runs tasks, and reports back.

Five things I learned this weekend

Here is what I take away from forcing myself to work locally for two days.

The context window is the real limitation. Cloud models give you an enormous context window without you thinking about it. Local models make you pay for it in memory. The more context, the more RAM consumed. Keep your sessions tight. Do not dump your entire life into a thread or the machine will run short of memory, slow to a crawl, and you will conclude that local models do not work. They work. But they require discipline.
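Why does context cost memory? For every token in the session, the model keeps a key and a value vector per layer (the KV cache). A back-of-the-envelope sketch follows. The layer count, head count, and head size are hypothetical round numbers I picked for illustration, not the configuration of any particular model.

```python
def kv_cache_bytes(context_tokens: int,
                   n_layers: int = 40,       # hypothetical
                   n_kv_heads: int = 8,      # hypothetical
                   head_dim: int = 128,      # hypothetical
                   bytes_per_value: int = 2  # 16-bit cache entries
                   ) -> int:
    """Estimate the memory the KV cache needs for a given context length."""
    # Factor 2: one key vector and one value vector per token, per layer.
    per_token = 2 * n_layers * n_kv_heads * head_dim * bytes_per_value
    return per_token * context_tokens


if __name__ == "__main__":
    for tokens in (8_192, 32_768, 131_072):
        gb = kv_cache_bytes(tokens) / 1e9
        print(f"{tokens:>7} tokens: about {gb:.1f} GB of cache")
```

With these made-up numbers, an 8k-token session costs roughly 1.3 GB on top of the weights, while a 128k-token session costs roughly 21 GB. The cost grows linearly with context, which is why tight sessions matter.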

Tools compensate for model size. A small local model with web search, file access, and code execution beats a large model without tools. The capability gap closes fast when you plug in the right tools. The model is the engine. The tools are the wheels.

Sometimes it forgets its tools. This is a quirk I did not expect. Sometimes the local model loses track of the fact that it has access to tools mid-session. It is not a showstopper, but it is something to be aware of. As of June 2026, it is still one of local models' teething problems.

Run local and cloud side by side for a week. That is the fastest way to build the instinct. You will be shocked at how often the free local model is good enough. You stop reaching for the expensive option for things a 12-billion-parameter model handles without issue. That instinct, knowing what should run where, is the most important AI skill you can build right now.

The privacy angle is a business opportunity. Everything runs offline. Your data never leaves the machine. That opens entire markets that cloud-based AI tools cannot reach. Healthcare, legal, finance, every industry that legally cannot send its data to a third-party API. The constraint that blocks cloud providers is your competitive advantage if you build locally.

Where the line actually is

I want to be honest. Local models do not replace frontier models for everything. There are tasks where I still need Claude or GPT-5. Long, complex code generations. Deep analyses of large documents. Situations where I need the absolute best model available.

But for 80 percent of daily work? Summaries, drafts, code reviews, data extraction, brainstorming, email suggestions, quick research? The local model handles it. And it does so without my data being sent to a server in Virginia, without API costs, and without the risk of it disappearing with a government letter.

The gap between free-and-local and expensive-and-cloud closed faster than most people expected. It closed faster than I expected. And it keeps closing. Every new version of Qwen, Gemma, Llama, and DeepSeek is better. Every new quantisation technique means the same model runs on cheaper hardware.

The best time to start with local models was six months ago. The second best is now.

Download Ollama. Pull down Mistral or Qwen 3. Run it on a real task. Not as an experiment. As a working tool. And if you want your AI stack to operate under European law, start with Mistral.

And next time something gets shut off, and it will happen again, you can still run your business, still deliver, still build. From your own machine. With your own generator in the garage.