AI that is all yours…
AI models tend to live in large data centres and run on powerful, power-hungry GPUs.
However, since there’s an ecosystem of openly available models, we can download and run them privately on our own machines. Unsurprisingly, there’s a subreddit for this as well.
I don’t want my LLM tokens soullessly mass-produced in some AI Factory and shipped to me from the other side of the world! I want organic, locally produced, farm-to-table artisanal tokens. So I’m growing them (slowly) myself!
Let’s dig into the software:
Software Setup
Note: All tools are cross platform and I’m using the macOS versions.
Ollama 0.9.3 – Link to Installer
Ollama lets you download and run “AI” models locally. It looks like this:

I thought they were sycophantic?
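Under the hood, Ollama also exposes a local REST API on port 11434, so anything else on your machine can talk to your models. Here’s a minimal sketch of calling it from TypeScript, assuming you’ve already pulled qwen3:1.7b; the wrapper function is mine, not Ollama’s:

```typescript
// Minimal sketch: ask a locally running Ollama for a single, non-streamed reply.
// Assumes Ollama is on its default port and qwen3:1.7b has already been pulled.
async function askOllama(prompt: string): Promise<string> {
  const res = await fetch("http://localhost:11434/api/generate", {
    method: "POST",
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify({
      model: "qwen3:1.7b", // any model you've pulled with `ollama pull`
      prompt,
      stream: false, // one JSON blob instead of a token stream
    }),
  });
  if (!res.ok) throw new Error(`Ollama returned ${res.status}`);
  const data = await res.json();
  return data.response; // the generated text
}

askOllama("hello").then(console.log);
```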
If you want a more familiar interface, install hollama (or any other Ollama frontend/GUI):

To automate running multiple prompts against multiple models, I used Ollama Grid Search.
To build AI Agents using a GUI, there’s Flowise (Install Guide):
Testing Device
Apple iMac (Retina 5K, 27-inch, 2020)
- 3.1GHz 6‑core 10th-generation Intel Core i5, Turbo Boost up to 4.5GHz
- 32 GB (4 x 8GB) of 2666MHz DDR4 memory
- AMD Radeon Pro 5300 with 4GB of GDDR6 memory (unused: Ollama doesn’t support GPU compute for AMD on macOS – GitHub Issue – so the models run on the CPU!)
Model Lineup
For testing, I’m using Alibaba Cloud’s Qwen3 model family. It comes in a range of dense models (0.6b, 1.7b, 4b, 8b, 14b, 32b) and MoE models (30b-a3b and 235b-a22b). For simplicity, I’m not considering quantized versions of these models (or any others) in this post.
Here’s the Ollama page for Qwen3.
Let’s have a play…
What’s the biggest model I can run?
Bigger is better, or smarter, or something. Let’s establish a baseline.
I’ll be using the ollama CLI, running specific commands while monitoring RAM usage with Activity Monitor. Note: Memory Pressure refers to the health of the memory system. Under higher pressure, macOS leans on memory compression and eventually swap. We want green charts.
Here’s my iMac on startup:

After running ollama run qwen3:1.7b --verbose "hello":

After killing ollama and running ollama run qwen3:4b --verbose "hello":

After killing ollama and running ollama run qwen3:8b --verbose "hello":

After killing ollama and running ollama run qwen3:14b --verbose "hello":

After killing ollama and running ollama run qwen3:30b-a3b --verbose "hello":

After killing ollama and running ollama run qwen3:32b --verbose "hello":

Looks like running 30b-a3b forced my machine to compress memory and lean on swap. Running 32b afterwards seemed to fare better. I’m not even going to bother trying 235b!
Looking at the RAM usage:
My 32GB is fine for 32b unless I’m running other RAM-hungry apps at the same time.
How quickly do these models run?
Using the prompt, “Design a web page with a white background that only contains a single Bootstrap 5 outline success button in the centre. When clicked it should shoot green confetti using https://github.com/catdad/canvas-confetti. Only output a single HTML file. I do not want any explanations or extraneous writing.”, running it ten times (using Ollama Grid Search) and averaging:

What is a token anyway? (Answer here)
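If you’d rather script this than use Ollama Grid Search, here’s a rough sketch of the same measurement against Ollama’s API. The eval_count (output tokens) and eval_duration (nanoseconds) fields come back on a non-streamed generate call; the ten-run averaging loop is my own:

```typescript
// Rough sketch: run one prompt N times against a model and average tokens/second.
// eval_count and eval_duration are part of Ollama's /api/generate response;
// the loop and averaging are mine.
async function averageTokensPerSecond(model: string, prompt: string, runs = 10) {
  const rates: number[] = [];
  for (let i = 0; i < runs; i++) {
    const res = await fetch("http://localhost:11434/api/generate", {
      method: "POST",
      headers: { "Content-Type": "application/json" },
      body: JSON.stringify({ model, prompt, stream: false }),
    });
    const data = await res.json();
    // eval_duration is in nanoseconds, so convert to seconds first.
    rates.push(data.eval_count / (data.eval_duration / 1e9));
  }
  return rates.reduce((a, b) => a + b, 0) / rates.length;
}

// "Design a web page…" stands in for the full prompt quoted above.
averageTokensPerSecond("qwen3:30b-a3b", "Design a web page…").then((tps) =>
  console.log(`${tps.toFixed(1)} tokens/s`)
);
```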
For those wondering why 30b-a3b is so performant in tokens/s, it’s thanks to its MoE design. To summarize (by my understanding): it has 30b parameters but only activates 3b at a time, so it uses RAM like a 30b model but generates tokens like a 3b model. Neat.
For the visual learners, here’s the fastest vs the slowest model responding to “hello”:
Do larger models consume more tokens?
Since these are thinking models, do the larger models think more? Let’s use tokens consumed as a loose proxy for thinking, since I expect the answer lengths to be similar. This one’s just to satisfy my curiosity.
Using the “Design a web page…” prompt from above, running it ten times for each model and plotting:

(Chart: tokens consumed per run for each model; y-axis in tokens)
By eyeballing (no deep statistical analysis here!), there’s a similar spread of tokens regardless of model size. 14b showed some odd behaviour where multiple runs had no “thinking” text, hence the lower token counts. Maybe it followed the “I do not want any explanations or extraneous writing.” part of the prompt better than the others.
Using the prompt, “write me a haiku about trees”, running it ten times for each model and plotting:

(Chart: tokens consumed per run for each model; y-axis in tokens)
This is more interesting. The 4b and larger models show a much larger spread of tokens spent on thinking and “checking” that the number of syllables was correct. 14b was again a bit weird.
The elephant in the room
I haven’t mentioned whether any of the outputs were correct! 😃. Did any of the websites work? Did any of the haikus have the correct number of syllables? Is there a correlation between model size and correctness for these tasks?
These are questions for future me! I’ll be back (don’t hold your breath) with an automated model evaluation system based on my actual LLM usage at some time in the future.
For now, let’s gloss over this and move on…
Agent things
Now we can take these local LLMs and plug them into Flowise to build more advanced workflows. So far I’ve been following Leon Van Zyl’s FlowiseAI v3 Tutorials Playlist. Here’s my version of his “Agent Teams” tutorial, using 30b-a3b locally:

I’m assembling a team (to write code for me)
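Once a flow is built, Flowise also exposes it over its own local API, so the agent team can be called from a script. Here’s a minimal sketch, assuming a default local Flowise install; the chatflow ID is a placeholder for whatever your flow’s ID is:

```typescript
// Minimal sketch: send a question to a Flowise chatflow's prediction endpoint.
// The chatflow ID is a placeholder; copy the real one from the Flowise UI.
const FLOWISE_URL = "http://localhost:3000"; // default local Flowise port
const CHATFLOW_ID = "<your-chatflow-id>";

async function askAgentTeam(question: string): Promise<string> {
  const res = await fetch(`${FLOWISE_URL}/api/v1/prediction/${CHATFLOW_ID}`, {
    method: "POST",
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify({ question }),
  });
  const data = await res.json();
  return data.text; // Flowise returns the flow's answer in the `text` field
}

askAgentTeam("Write me a function that reverses a string").then(console.log);
```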
I still need to play with RAG, MCP, and so many more things here.
An aside, AI in the Browser
With YTNT, the default speech-to-text engine is the Web Speech API. Since it isn’t supported in Firefox, and Chrome uses Google’s servers for transcription (not ideal), I’ve been quietly testing in-browser, local speech-to-text. If you want to have a play, click the 🐞 (ladybug) and choose Vosk-Browser:

It uses a Vosk model (specifically vosk-model-small-en-us-0.15, ~39 MB) and vosk-browser to perform the conversion. As for which is better, I blame my iMac’s microphones for both being pretty bad:

< Chrome Web Speech | vosk-browser >
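For the curious, the vosk-browser side looks roughly like this. It’s a sketch based on my reading of the library’s README, not YTNT’s actual code; the model URL and the microphone plumbing that produces an AudioBuffer are assumptions I’ve left out:

```typescript
import { createModel } from "vosk-browser";

// Rough sketch from the vosk-browser README, not YTNT's actual code.
// The model archive path is an assumption: host vosk-model-small-en-us-0.15 yourself.
async function transcribe(audioBuffer: AudioBuffer) {
  const model = await createModel("/models/vosk-model-small-en-us-0.15.tar.gz");
  const recognizer = new model.KaldiRecognizer(audioBuffer.sampleRate);

  recognizer.on("result", (message: any) => {
    console.log("Final:", message.result.text);
  });
  recognizer.on("partialresult", (message: any) => {
    console.log("Partial:", message.result.partial);
  });

  // Feed decoded audio in; in a real app this would come from the
  // microphone via the Web Audio API rather than a pre-decoded buffer.
  recognizer.acceptWaveform(audioBuffer);
}
```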
Moving to transformers.js would let me use other ASR models in YTNT. Since it supports many model types, I may also use it for other in-browser local AI things.
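For reference, the transformers.js route would look roughly like this: a minimal sketch using the library’s pipeline API, with a small Whisper model picked purely as an example (not a decision I’ve made for YTNT yet):

```typescript
import { pipeline } from "@xenova/transformers";

// Minimal sketch: in-browser speech-to-text with transformers.js.
// The Whisper model below is just an example choice, not YTNT's final model.
const transcriber = await pipeline(
  "automatic-speech-recognition",
  "Xenova/whisper-tiny.en"
);

// Accepts an audio URL (or raw Float32Array samples) and resolves to { text }.
const output = await transcriber("/audio/sample.wav");
console.log(output.text);
```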
My Thoughts and The Future
There is so much fun to be had in this space, and things are moving rapidly, so this post is unfortunately a bit of a mishmash.
What’s important to me is the overarching idea of running things locally, on my hardware and under my control. Even if it’s slow, that’s OK! I can have long-running async tasks chugging along in the background if it makes sense for them to run that way.
In the future (and perhaps future blog posts) I’ll be looking at models outside the Qwen family, exploring Flowise more deeply and delving into browser-based local AI. Maybe I’ll build a dedicated AI server to run them on!
If you have any questions/would like to share your experience with local AI, leave a reply below. To receive an email when I publish a post, subscribe here.
Note: This blog post was generated by a multimodal neural network trained for 3+ decades on Outside.
(Featured image remixed from this photo)

