Phase 1 Complete — Foundation Locked
2026-04-29, Day 2

Phase 1 of the local AI experiment is done. We picked a foundation model and learned something important about what production actually needs to look like.
What we did
Over four sessions, we tested five different AI models against real battery data from the shop system. We pulled two weeks of power measurements and ran each model through the same test: give it a week of history, ask it to predict the next 24 hours, and measure how close it gets.
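The test protocol above can be sketched in a few lines. This is a simplified reconstruction, not the actual harness: the data is synthetic, and `forecast_fn` is a hypothetical stand-in for whatever wrapper we put around each candidate model.

```python
import numpy as np

def evaluate_forecaster(series, forecast_fn, context_len=7 * 24, horizon=24):
    """Score a forecaster on a single hold-out window.

    series: 1-D array of hourly power readings (two weeks = 336 points).
    forecast_fn: any callable mapping a history array to a horizon-length
    forecast; each candidate model gets wrapped in one of these.
    """
    history = series[:context_len]                      # one week of context
    actual = series[context_len:context_len + horizon]  # the next 24 hours
    predicted = np.asarray(forecast_fn(history))[:horizon]
    return np.mean(np.abs(predicted - actual))          # mean absolute error

# Toy check on synthetic "battery power": a daily sine cycle plus noise.
rng = np.random.default_rng(0)
hours = np.arange(336)
power = np.sin(2 * np.pi * hours / 24) + 0.1 * rng.standard_normal(336)

# A trivial forecaster: just repeat the last 24 hours of history.
mae = evaluate_forecaster(power, lambda h: h[-24:])
print(round(float(mae), 3))
```

Because every model sees the same history window and is scored against the same 24 hours, the MAE numbers are directly comparable across candidates.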
A "foundation model" is a pre-trained AI model designed to work with time-series data (sequences of measurements over time). Instead of building a model from scratch, you start with one that already understands general patterns in sequential data and adapt it to your specific use case.
We tested Chronos-T5-Tiny, Chronos-T5-Small, Chronos-Bolt-Small, TimesFM, and Chronos-2. As a baseline, we also compared against "persistence," which just predicts that tomorrow will look like today.
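The persistence baseline is worth writing out, because it's the bar every model has to clear. A minimal version, assuming hourly readings and a 24-hour cycle:

```python
import numpy as np

def persistence_forecast(history, horizon=24, period=24):
    """'Tomorrow looks like today': repeat the most recent full period."""
    last_cycle = np.asarray(history)[-period:]
    reps = int(np.ceil(horizon / period))
    return np.tile(last_cycle, reps)[:horizon]

history = np.array([1.0, 2.0, 3.0, 4.0])
print(persistence_forecast(history, horizon=6, period=4))
# repeats the last 4 readings: [1. 2. 3. 4. 1. 2.]
```

It looks almost too dumb to bother with, but on strongly daily-cyclic data it's hard to beat, which is exactly why it makes an honest baseline.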
What we found
Chronos-Bolt-Small looked like the winner at first: it was the fastest and most accurate in early tests.
Then we checked whether it could do the thing we actually need for production. It couldn't. Bolt is strictly univariate: it can only look at one measurement series at a time. It can predict future battery power based on past battery power, but it can't factor in other information like voltage levels or time of day. For a real deployment, the model needs to use multiple inputs at once.
We switched to Chronos-2, which can take in additional context beyond just the measurement you're trying to predict. It also has built-in tools for customizing the model on your own data. We tested it on battery data from a period when the system was actively charging and discharging, and confirmed that giving it extra context actually improved its predictions.
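To see why extra context can matter, here is a toy illustration (deliberately not the Chronos-2 API): synthetic data where "power" is driven by a "voltage" covariate rather than by its own past. A univariate model fit on lagged power alone has nothing to work with; the same model plus the covariate does much better. The contemporaneous covariate here stands in for inputs whose future values are known or measurable, like time of day.

```python
import numpy as np

rng = np.random.default_rng(42)
n = 500
voltage = rng.standard_normal(n)                       # stand-in covariate
power = 0.8 * voltage + 0.1 * rng.standard_normal(n)   # power driven by voltage

y = power[1:]                                          # predict next step
split = 400                                            # chronological split

# Model A (univariate): intercept + previous power reading only.
X_uni = np.column_stack([np.ones(n - 1), power[:-1]])
coef_uni, *_ = np.linalg.lstsq(X_uni[:split], y[:split], rcond=None)
err_uni = np.mean(np.abs(X_uni[split:] @ coef_uni - y[split:]))

# Model B (with covariate): same, plus the voltage reading.
X_cov = np.column_stack([np.ones(n - 1), power[:-1], voltage[1:]])
coef_cov, *_ = np.linalg.lstsq(X_cov[:split], y[:split], rcond=None)
err_cov = np.mean(np.abs(X_cov[split:] @ coef_cov - y[split:]))

print(err_cov < err_uni)   # the covariate model should win on this data
```

The gap is large by construction here; the Phase 1 question was whether real voltage and time-of-day inputs buy a similar (if smaller) edge on real battery data.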
Why it matters
The bigger takeaway: a model that only looks at past battery behavior can only go so far, because the real drivers are external. Solar production, customer electricity usage, and grid signals determine what the battery does. Looking only at battery history captures recurring patterns but misses the events that matter.
For production, the system needs to take in data from sensors on the customer side, not just the battery itself. This isn't a setback. It's the architectural answer we needed, and it's backed by what we actually saw in the data.
Where we go next
Phase 2 is about fine-tuning: taking the base Chronos-2 model and training it specifically on our battery data to see if that makes its predictions better. We need to test across two different operating modes: quiet periods when the battery is parked, and active periods when it's cycling.
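Splitting the data into those two operating modes can be done mechanically. A sketch of one way to do it, using rolling variability of the power signal (window size and threshold here are illustrative placeholders, not tuned values):

```python
import numpy as np

def split_by_activity(power, window=24, threshold=0.05):
    """Label each reading as active (cycling) or quiet (parked).

    A reading counts as active when the rolling standard deviation of
    power over the surrounding window exceeds the threshold.
    """
    power = np.asarray(power, dtype=float)
    pad = window // 2
    padded = np.pad(power, pad, mode="edge")
    rolling_std = np.array(
        [padded[i:i + window].std() for i in range(len(power))]
    )
    return rolling_std > threshold   # True = active, False = quiet

quiet = np.zeros(48)                              # parked: flat power
active = np.sin(np.linspace(0, 8 * np.pi, 48))    # cycling
labels = split_by_activity(np.concatenate([quiet, active]))
print(labels[0], labels[70])
```

Readings near the quiet/active boundary get smeared by the rolling window, which is fine for this purpose: the goal is two clean evaluation pools, not a precise changepoint.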
The shop system ships within a couple weeks. Once it does, we'll have real cycling data flowing in.
Update — 2026-04-29 (later)
When we re-ran the same evaluations with stricter testing rules, the earlier results fell apart. The issue: our test data was too close to our training data, so the model was partly seeing answers it had already been shown. This is called data leakage.
Under clean testing (training and test periods fully separated, with six times more training data):
- Fine-tuning the model on our data produced no measurable improvement over using it out of the box
- The benefit of giving the model extra context (voltage, time of day) hasn't been confirmed yet and needs to be re-tested on active cycling data
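The stricter protocol behind these re-runs comes down to one rule: the test period must come strictly after the training period, with a buffer so that no test window overlaps context the model saw in training. A minimal sketch of that split:

```python
import numpy as np

def chronological_split(series, test_frac=0.2, gap=24):
    """Split a time series into train/test with a buffer gap.

    The gap (here one day of hourly readings) keeps test windows from
    overlapping context used in training, which is exactly the leakage
    that inflated the earlier results.
    """
    n = len(series)
    test_start = int(n * (1 - test_frac))
    train = series[: test_start - gap]
    test = series[test_start:]
    return train, test

series = np.arange(1000)
train, test = chronological_split(series)
print(len(train), len(test), test[0] - train[-1])  # → 776 200 25
```

A random shuffle-split, the default in many ML tools, is precisely what you must not do with time series: neighboring hours are so correlated that a shuffled test set is effectively already in the training set.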
What still holds: the base Chronos-2 model, without any customization, still predicts significantly better than the "tomorrow looks like today" baseline on quiet data. The model choice is solid. What's now an open question: whether customizing it actually helps, and whether extra context inputs make a difference.
This is the experiment working as intended. Better to catch a testing error on Day 2 than to build on bad numbers for 100 days. Next session: re-test the extra-context approach on cycling data with clean methodology.