Roadmap

100 days · 5 phases · Started 2026-04-27

Phase 1 · Setup + First Run

Weeks 1–2

Get hands on the actual technology. Stop theorizing.

Deliverable: Foundation model selected based on real testing. Production architecture direction identified.

  • [done] Set up M4 Mac Mini with Python, ML libraries, and model access
  • [done] Connected to the battery database over Tailscale and built a reusable data loader (see the sketch after this list)
  • [done] Pulled first slice of real battery data and saved it locally for testing
  • [done] First AI model running predictions on real battery data
  • [done] Tested five models head-to-head using the same methodology
  • [done] Picked Chronos-2 as the foundation model based on what it can actually do with multiple inputs
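
A minimal sketch of what the reusable loader could look like, assuming a Postgres-style database reachable at its Tailscale address; the hostname, table, and column names are placeholders, not our real schema:

```python
import pandas as pd
import sqlalchemy

# Over Tailscale the database is reachable like any LAN host.
# "battery-host.tailnet", "telemetry", "battery_readings", and "power_w"
# are placeholder names.
ENGINE = sqlalchemy.create_engine(
    "postgresql://reader@battery-host.tailnet:5432/telemetry"
)

QUERY = sqlalchemy.text("""
    SELECT ts, power_w
    FROM battery_readings
    WHERE ts BETWEEN :start AND :end
    ORDER BY ts
""")

def load_battery_power(start: str, end: str) -> pd.Series:
    """Pull battery power for a time range as a time-indexed Series."""
    df = pd.read_sql(QUERY, ENGINE, params={"start": start, "end": end},
                     parse_dates=["ts"], index_col="ts")
    return df["power_w"]

# Save a slice locally so experiments never hit the live system:
# load_battery_power("2026-04-13", "2026-04-27").to_frame().to_parquet("slice.parquet")
```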

Phase 2 · First Fine-Tune

Weeks 3–5

See whether training the model on our specific data makes it better. Close the biggest data gaps.

Deliverable: Clear comparison showing what custom training improved and what it didn't.

  • [todo] Train Chronos-2 on our battery data and compare against the base model
  • [todo] Compare predictions across quiet periods and active charge/discharge cycles (see the scoring sketch after this list)
  • [todo] Configure Cerbo GX to log individual cell voltage and temperature (closes the biggest data gap)
  • [todo] Reconfigure Lynx Shunt to log every second instead of only on events
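
For the quiet-vs-active comparison, one way to score forecasts per regime; the 250 W threshold is an arbitrary placeholder to tune against real cycles:

```python
import numpy as np
import pandas as pd

def error_by_regime(target: pd.Series, forecast: np.ndarray,
                    active_threshold_w: float = 250.0) -> dict:
    """Score a forecast separately for quiet and active periods.

    A point counts as "active" when absolute battery power exceeds the
    threshold; 250 W is a placeholder, not a measured cutoff.
    """
    actual = target.to_numpy()
    err = np.abs(actual - forecast)
    active = np.abs(actual) > active_threshold_w
    return {
        "mae_quiet": float(err[~active].mean()) if (~active).any() else None,
        "mae_active": float(err[active].mean()) if active.any() else None,
    }
```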

Phase 3 · Iterate

Weeks 6–9

Find the model that actually works for our use case.

Deliverable: Honest evaluation showing which approach works for which type of prediction.

  • [todo] Try larger models and test whether extra context inputs help
  • [todo] Move longer training runs to the rack server (more memory, can handle bigger models)
  • [todo] Define what a "good" prediction looks like for charge level, voltage issues, and load patterns (see the illustrative checks after this list)
  • [todo] Let individual cell voltage and temperature data accumulate from the Cerbo GX change
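
Illustrative acceptance checks for the three prediction types; every threshold below is a placeholder to argue about, not a spec:

```python
import numpy as np

def soc_ok(y: np.ndarray, yhat: np.ndarray, tol_pp: float = 5.0) -> bool:
    """Charge level: mean error within `tol_pp` percentage points."""
    return float(np.mean(np.abs(y - yhat))) <= tol_pp

def voltage_event_recall(y: np.ndarray, yhat: np.ndarray,
                         limit_v: float = 3.65):
    """Voltage issues: share of true over-limit points the forecast also flags."""
    true_events = y >= limit_v
    if not true_events.any():
        return None  # nothing to detect in this window
    return float(np.mean(yhat[true_events] >= limit_v))

def load_ok(y: np.ndarray, yhat: np.ndarray, max_mape: float = 0.15) -> bool:
    """Load patterns: mean absolute percentage error under `max_mape`."""
    mask = np.abs(y) > 1e-6  # skip idle points to avoid dividing by zero
    mape = float(np.mean(np.abs((y[mask] - yhat[mask]) / y[mask])))
    return mape <= max_mape
```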

Phase 4 · Define Production

Weeks 10–13

Answer the questions that determine the next investment.

Deliverable: Written proposal for what a production system would look like.

  • [todo] Decide which types of prediction actually worked on our data
  • [todo] Figure out the hardware limits: do we need a real GPU, a bigger machine, or data from multiple systems?
  • [todo] Sketch how local AI fits into the broader system
  • [todo] Document the data collection improvements needed before the next phase

Phase 5 · Synthesis

Week 14

Land the experiment. Hand off to whatever comes next.

Deliverable: Architecture recommendation. Clear written record of what was tried.

  • [todo] Final architecture recommendation
  • [todo] Hardware budget discussion if the experiment proved out
  • [todo] Clear written record of what was tried, what worked, what didn't, and why

Decisions

2026-04-27

Mac Mini selected as primary local AI workstation

The M4 Mac Mini has enough compute to train and test models of the size we're working with. No need for expensive dedicated AI hardware yet. If the experiment shows we need more, we'll invest then.

2026-04-28

Testing methodology locked

Standardized how we test every model: take two weeks of battery power data, give the model one week of history, ask it to predict the next 24 hours, and measure how far off it is. Every model gets the same test so comparisons are fair.
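
The locked test as a sketch; the error metric here is mean absolute error, and the data resolution is whatever the harness standardizes on:

```python
import numpy as np
import pandas as pd

CONTEXT = pd.Timedelta(days=7)    # one week of history as model input
HORIZON = pd.Timedelta(hours=24)  # ask for the next 24 hours

def make_backtest_window(series: pd.Series, cutoff: pd.Timestamp):
    """Split a time-indexed series into model input and held-out target.

    Pandas label slices are inclusive at both ends, so the boundary
    point appears in both; trim it if that matters for the metric.
    """
    context = series[cutoff - CONTEXT : cutoff]
    target = series[cutoff : cutoff + HORIZON]
    return context, target

def mean_abs_error(target: pd.Series, forecast: np.ndarray) -> float:
    """How far off the forecast is, on average, in the series' own units."""
    return float(np.mean(np.abs(target.to_numpy() - forecast)))
```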

2026-04-29

Foundation model: Chronos-2

Started with Chronos-Bolt-Small because it was the fastest and most accurate in early tests. Then found out Bolt can only look at one measurement at a time. It can't factor in extra context like voltage or time of day. Switched to Chronos-2, which can. Tested on real battery data and confirmed it actually uses the extra information. The choice is based on what worked, not what looked good on paper.
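
"Confirmed it actually uses the extra information" can be expressed as an ablation: run the same forecast with and without the covariates and compare error. `forecast_fn` below is a stand-in, since the exact Chronos-2 pipeline call depends on the released package:

```python
import numpy as np

def mae(y: np.ndarray, yhat: np.ndarray) -> float:
    return float(np.mean(np.abs(y - yhat)))

def covariate_ablation(forecast_fn, history, covariates, target) -> dict:
    """Compare one model's error with and without extra inputs.

    `forecast_fn(history, covariates=...)` is a placeholder signature,
    not the real Chronos-2 API; the comparison logic is the point.
    """
    with_cov = forecast_fn(history, covariates=covariates)
    without_cov = forecast_fn(history, covariates=None)
    return {"mae_with": mae(target, with_cov),
            "mae_without": mae(target, without_cov)}
```

If the "with" error is consistently lower across test windows, the model is genuinely using the covariates.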

2026-04-29

Single-input forecasting isn't enough for production

A model that only looks at past battery behavior can't predict what the battery will do next, because the real drivers are external: solar production, customer electricity usage, grid signals. For production, the system needs to take in data from multiple sources, including sensors on the customer side.
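
Concretely, "data from multiple sources" means one time-aligned frame joining battery power with its external drivers. A sketch, with placeholder column names and a placeholder 1-minute grid:

```python
import pandas as pd

def build_feature_frame(battery: pd.Series, solar: pd.Series,
                        load: pd.Series, freq: str = "1min") -> pd.DataFrame:
    """Align battery power with its external drivers on one time index.

    Inputs are time-indexed Series from different loggers; the names and
    resampling grid stand in for whatever the sensors actually provide.
    """
    frame = (
        battery.rename("battery_power_w").to_frame()
        .join(solar.rename("solar_production_w"), how="outer")
        .join(load.rename("customer_load_w"), how="outer")
        .resample(freq).mean()
        .interpolate(limit=5)  # bridge short sensor gaps only
    )
    return frame.dropna()
```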

2026-04-29

Earlier results had a testing error

When we re-ran tests with stricter rules (making sure training data and test data don't overlap), the earlier improvements from custom training and extra-context inputs disappeared. The model was partly seeing answers it had already been shown. The base model is still solid on its own. Whether customizing it or giving it extra inputs actually helps is now an open question to re-test.
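
The fix going forward, sketched as a strict chronological split with the no-overlap rule made explicit:

```python
import pandas as pd

def chronological_split(series: pd.Series, test_start: pd.Timestamp):
    """Everything before `test_start` trains; everything from it on evaluates.

    The assertion encodes the stricter rule so a future re-test can't
    silently repeat the earlier leakage.
    """
    train = series[series.index < test_start]
    test = series[series.index >= test_start]
    assert train.index.max() < test.index.min(), "train/test overlap"
    return train, test
```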