> Build identity is per-release; check grove version or /api/version.
| Cell | Host | Class | Notes |
|---|---|---|---|
| mac | localhost | dev | M4 16 GB; cutting edge, direct deploy.sh (local cp) |
| cell2 | 100.121.6.45 | dev | Pi 4 GB; real Pi/systemd (catches systemd-only bugs); Tailscale SSH |
| nookman | 100.96.243.69 | prod | Pi 8 GB; jump host for cell2/palooza; signed releases |
| familynook (FN) | 100.126.143.83 | prod | x86 WAN gateway; grove.nook.li; nginx TLS; canary-first on --promote |
| funbook | 100.81.15.45 | prod | Linux laptop; relay-connected |
| nook | 100.96.150.107 | prod · first-line | Tucker's dogfood cell; updates via self-update / dashboard UI ONLY, never deploy.sh |
| palooza | 100.86.29.8 | external | A friend's cell (cell1's old 8 GB chassis); self-update path; stays peered/friended |
cell1 (Pi 8 GB, 100.104.249.123) is shelved: SD pulled, chassis repurposed as palooza.
Port 5678 everywhere. Port 5679 = watchdog panic page. Port 5680 = relay WebSocket.
See CLAUDE.md "Fleet Nodes & config" for the canonical class/topology table.
curl -s https://grove.nook.li/invite/<token> | bash
mkdir -p ~/.grove
pip3 install cryptography flask flask-socketio requests pynacl websocket-client websockets zeroconf
# Copy grove.py, web.py, acme.py, acme_jws.py, watchdog.py, relay.py + assets/ to ~/.grove/
cd ~/.grove && nohup python3 web.py </dev/null >/tmp/grove-web.log 2>&1 &
# Dashboard at http://localhost:5678
Use </dev/null on nohup invocations (especially over SSH) to prevent SIGHUP.
The canonical fleet deploy script. Ships 5 Python files + relay.py + assets/ dir.
# Deploy to mac (dev, default)
bash deploy.sh mac
# Deploy to all nodes
bash deploy.sh
# Skip test suite
bash deploy.sh --skip-tests
# Roll back to the previous build on a node
bash deploy.sh --rollback mac
# Promote the soaked build to production (FN → nookman → funbook, canary order)
bash deploy.sh --promote
--promote requires a clean working tree, a full gauntlet pass on mac, and refuses --skip-tests. It deploys familynook first (x86 canary), waits (default 30 s, override with CANARY_SOAK_SEC=N), then nookman, then funbook. nook + palooza are never deploy-pushed; they pick up the signed release via self-update.
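The promote sequence described above can be sketched in Python as a small planning function. This is only an illustration, not deploy.sh itself: the function name `promote_plan`, its keyword arguments, and the step tuples are hypothetical, and only the host order, the 30 s default, and the CANARY_SOAK_SEC override come from the text.

```python
import os

# Canary order from the text: familynook (x86 canary), then nookman, then funbook.
PROMOTE_ORDER = ["familynook", "nookman", "funbook"]
DEFAULT_SOAK_SEC = 30


def promote_plan(env=None, clean_tree=True, gauntlet_passed=True, skip_tests=False):
    """Return the ordered promote steps, or raise if a precondition fails.

    Hypothetical helper: the real checks live in deploy.sh.
    """
    env = os.environ if env is None else env
    if skip_tests:
        raise ValueError("--promote refuses --skip-tests")
    if not clean_tree:
        raise ValueError("--promote requires a clean working tree")
    if not gauntlet_passed:
        raise ValueError("--promote requires a full gauntlet pass on mac")
    # The soak only follows the canary; later hosts deploy back to back.
    soak = int(env.get("CANARY_SOAK_SEC", DEFAULT_SOAK_SEC))
    steps = [("deploy", PROMOTE_ORDER[0]), ("soak", soak)]
    steps += [("deploy", host) for host in PROMOTE_ORDER[1:]]
    return steps
```

Calling `promote_plan(env={"CANARY_SOAK_SEC": "90"})` yields the same host order with a 90 s soak after familynook.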
Build hash = SHA-256 of grove.py + web.py + acme.py + acme_jws.py + watchdog.py + 12 frontend assets (assets/), first 12 hex chars. Both deploy.sh and web.py compute the same hash; mismatches surface in /api/version.
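A minimal sketch of how such a build hash could be computed follows. The order in which files are fed to the hash and the choice to hash every file under assets/ (rather than a fixed list of 12) are assumptions; the real deploy.sh and web.py may concatenate differently, so this sketch is not guaranteed to reproduce the value shown in /api/version. The name `build_hash` is hypothetical.

```python
import hashlib
from pathlib import Path

# The five Python files named in the text; the hashing order is an assumption.
PY_FILES = ["grove.py", "web.py", "acme.py", "acme_jws.py", "watchdog.py"]


def build_hash(root, py_files=PY_FILES, assets_dir="assets"):
    """Return the first 12 hex chars of a SHA-256 over the build inputs."""
    root = Path(root)
    digest = hashlib.sha256()
    # Python sources first, in the listed order.
    for name in py_files:
        digest.update((root / name).read_bytes())
    # Then every regular file under assets/, sorted for a stable order.
    assets = root / assets_dir
    for path in sorted(p for p in assets.rglob("*") if p.is_file()):
        digest.update(path.read_bytes())
    return digest.hexdigest()[:12]
```

Because the inputs are read in a fixed order, two checkouts with identical files produce the same 12-character prefix, and any change to a source file or asset changes it.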
FN runs under a dedicated grove system account. deploy.sh scp's files to ~/ (nook's home), then calls bash deploy-grove.sh which uses sudo to copy them into /home/grove/.grove/. This requires a sudoers line on FN:
nook ALL=(grove) NOPASSWD: /bin/cp, /bin/bash
(See fn-grant-nopasswd.sh in the repo root for the exact grant.)
There is no GitHub Actions; origin is a bare repo on familynook. CI is a single
script, ci.sh, that runs the full hermetic suite (no live fleet needed):
bash ci.sh # py_compile · ruff crash-class · render-smoke · pytest · render-diff (info)
HARD gates (fail the build): py_compile, ruff crash-class (F821/E9), render-smoke,
pytest tests/. render-diff runs informationally: template renders change
legitimately; re-baseline with python3 render-diff.py --update.
It's wired as an automatic pre-push gate (hooks/pre-push, active once you run
git config core.hooksPath hooks): a red suite blocks the push. Bypass a WIP push with
git push --no-verify. The fast per-commit hook only does crash-class checks; the pytest
suite runs at the push boundary.
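The split between hard gates and informational checks can be sketched as a small runner. This is an illustration in Python, not the contents of ci.sh: the function `run_gates` and the `(name, check, hard)` tuple shape are hypothetical, and whether ci.sh stops at the first failure or runs every step is not stated here, so the sketch simply runs them all.

```python
def run_gates(gates):
    """Run (name, check, hard) gates in order; return (exit_code, report).

    `check` is any zero-argument callable returning a truthy value on success.
    Only a failing hard gate turns the exit code non-zero.
    """
    report = []
    exit_code = 0
    for name, check, hard in gates:
        ok = bool(check())
        report.append((name, ok, hard))
        if not ok and hard:
            exit_code = 1
    return exit_code, report
```

With this shape, a failing render-diff entry registered with `hard=False` shows up in the report but leaves the exit code at 0, while a failing pytest entry with `hard=True` makes it 1.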
Optional independent safety net (a cron runner on the dev box):
*/30 * * * * cd ~/grove && bash ci.sh >> /tmp/grove-ci.log 2>&1 || \
osascript -e 'display notification "Grove CI is RED" with title "Grove CI"'
Truly non-bypassable upgrade (future): a server-side pre-receive hook on familynook's
grove.git that runs ci.sh and rejects a broken push; it needs pytest/ruff installed
on FN, and a buggy hook can block all pushes, so it's deferred.
These unit files live in systemd/ and are installed via sudo bash systemd/install.sh.
sudo bash systemd/install.sh # grove + watchdog
sudo bash systemd/install.sh --with-relay # also installs grove-relay
[Unit]
Description=Grove Distributed Storage
After=network-online.target tailscaled.service
[Service]
Type=simple
User=grove
WorkingDirectory=/home/grove/.grove
ExecStart=/usr/bin/python3 /home/grove/.grove/web.py
Restart=always
RestartSec=5
MemoryMax=512M
ReadWritePaths=/home/grove/.grove /home/grove/GroveHome /tmp
[Service]
Type=simple
ExecStart=/usr/bin/python3 /home/grove/.grove/watchdog.py --no-redirect
Restart=always
RestartSec=3
Watchdog panic page at :5679. When Grove (5678) is down this page lets you restart it from a browser.
Managed via the web UI or:
python3 ~/.grove/web.py ai-service install
python3 ~/.grove/web.py ai-service status
python3 ~/.grove/web.py ai-service uninstall
Type=forking; spawns the llama-server and exits. The watchdog defers to this unit when it is active. Do NOT enable on a Metal Mac: a full-offload server wires unevictable memory.
Only needed on a dedicated relay host. Most cells auto-connect to the fleet relay; this unit is not required for normal operation.
Pis run grove-watchdog.service instead of grove.service. A */5 cron provides the reboot bootstrap and proactive memory management:
# crontab -e (as the grove user)
*/5 * * * * ~/.grove/grove-health-local.sh >> ~/.grove/health.log 2>&1
deploy.sh ships grove-health-local.sh to each Pi automatically.
# Install the watchdog service on a Pi
python3 ~/.grove/web.py watchdog-service install
sudo systemctl enable --now grove-watchdog
Any cell can act as a WAN gateway. Two options:
Set public_url in ~/.grove/config.json and ensure port 80 is reachable. acme.py handles cert issuance and renewal automatically.
nginx-grove.conf in the repo root is the reference config for grove.nook.li.
SECURITY-CRITICAL: on a public gateway, nginx must inject X-Grove-Public-Edge: 1 on every request proxied to :5678, overwriting any client-supplied value (since :5678 isn't directly reachable from the internet, this header is a trustworthy public-vs-tailnet signal). Grove's check_dashboard_auth uses this header to block the owner dashboard session and admin /api/* routes over the public edge; they remain tailnet-only. Without this tag, a valid owner session cookie obtained over Tailscale could be replayed over the public internet to unlock the full control panel.
# Required addition to every proxy_pass block that reaches :5678 from the public edge:
proxy_set_header X-Grove-Public-Edge "1";
The public edge serves: /portal, /site/, /invite/, /relay, /.well-known/, /login, /api/version, /api/health, and peer-sync APIs. The owner dashboard (/dashboard, /files, admin /api/*) is tailnet-only.
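The public-versus-tailnet decision can be illustrated with a small allowlist check. This is a sketch under stated assumptions, not Grove's check_dashboard_auth: the function `public_edge_allows` and the matching rules are hypothetical, the peer-sync API paths are not named in this document and are therefore left out, and normal session auth for tailnet requests is not modelled.

```python
# Paths the text lists as served on the public edge (peer-sync APIs omitted
# because their paths are not given here).
PUBLIC_RULES = (
    "/portal", "/site/", "/invite/", "/relay",
    "/.well-known/", "/login", "/api/version", "/api/health",
)


def _matches(path, rule):
    # Rules ending in "/" are prefixes; others match exactly or as a parent path.
    if rule.endswith("/"):
        return path.startswith(rule)
    return path == rule or path.startswith(rule + "/")


def public_edge_allows(path, headers):
    """Return True if the request may proceed to the normal auth checks."""
    # HTTP header names are case-insensitive, so normalise before lookup.
    lowered = {k.lower(): v for k, v in headers.items()}
    if lowered.get("x-grove-public-edge") != "1":
        return True  # tailnet request: not restricted by this check
    return any(_matches(path, rule) for rule in PUBLIC_RULES)
```

The check is only as trustworthy as the header, which is why nginx has to overwrite any client-supplied value before proxying to :5678.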
Relay WebSocket passes through to :5680 (relay.py) โ no X-Grove-Public-Edge needed there.
# Relay: no grove auth, separate port
location /relay {
proxy_pass http://127.0.0.1:5680;
proxy_http_version 1.1;
proxy_set_header Upgrade $http_upgrade;
proxy_set_header Connection "upgrade";
proxy_read_timeout 86400;
}
sudo apt install certbot python3-certbot-nginx
sudo certbot --nginx -d grove.nook.li
# Renewal is automatic via certbot.timer
~/.grove/config.json key fields:
{
"self_name": "my-cell",
"web_port": 5678,
"web_host": "0.0.0.0",
"public_url": "https://grove.nook.li",
"desired_factor": 2,
"relay_url": "ws://100.96.243.69:5680",
"relay_host": true,
"storage_cap_gb": 50,
"peers": [],
"ai_enabled": true,
"peer_model_policy": "friends",
"disable_auto_update": false
}
peer_model_policy controls which peers may use this cell's local AI: "friends" (default), "all", "public", "active-only", or "none".
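A hedged sketch of validating this field when reading the config follows. The helper `load_peer_model_policy` is hypothetical and is not part of Grove; only the five allowed values and the "friends" default come from the text.

```python
import json

# Allowed values and default, as listed in the text.
ALLOWED_POLICIES = {"friends", "all", "public", "active-only", "none"}
DEFAULT_POLICY = "friends"


def load_peer_model_policy(config_text):
    """Parse config JSON text and return a validated peer_model_policy."""
    cfg = json.loads(config_text)
    policy = cfg.get("peer_model_policy", DEFAULT_POLICY)
    if policy not in ALLOWED_POLICIES:
        raise ValueError(f"unknown peer_model_policy: {policy!r}")
    return policy
```

For example, passing the contents of ~/.grove/config.json returns the configured policy, or "friends" when the key is absent.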
Production = nookman + familynook + funbook (signed releases via --release/--promote). Dev = mac + cell2 (cutting edge, direct deploy.sh). nook is the first-line dogfood cell and palooza a friend's external cell; both update via the self-update / dashboard path, never deploy.sh. cell1 is shelved.
Normal workflow:
1. Deploy to mac: bash deploy.sh mac
2. Let it soak; run gauntlet: python3 gauntlet.py
3. Promote to prod: bash deploy.sh --promote
If a build passes health checks but misbehaves:
bash deploy.sh --rollback familynook nookman
This restores the .bak snapshot that deploy.sh captured before the deploy.
Grove uses llama.cpp (not ollama). Models live in ~/.grove/.
- grove-ai.service supervises the llama-server on Linux/systemd hosts
- The */5 keepalive cron also starts the AI server if it's configured but not running (ensure_ai in grove-health-local.sh)

Manage via CLI:
python3 ~/.grove/web.py ai-service install # install grove-ai.service
python3 ~/.grove/web.py ai-service status
- GET /api/version: build hash, version, uptime
- GET /api/health: disk, chunk count, peer count
- GET /api/replication-status: per-peer replication health
- GET /api/route-speeds: route performance data
- :5679: visible when Grove is down

Logs:
- /tmp/grove-web.log: main daemon stdout/stderr
- /tmp/grove-watchdog.log: watchdog log
- journalctl -u grove -f: on systemd boxes
- ~/.grove/health.log: keepalive cron log (Pis)
# Via dashboard: Settings → Backup Keys (downloads ZIP)
# Via API:
curl -sf http://localhost:5678/api/backup-keys > grove-keys.zip
If you lose ~/.grove/.key, your encrypted files are unrecoverable.
# Via dashboard: Settings → Restore Keys (upload ZIP)
# Via API:
curl -sf -X POST http://localhost:5678/api/restore-keys -F "file=@grove-keys.zip"
Three paths, in order of friction:
1. Logged in elsewhere? Settings → Recovery → Generate recovery link. Single-use, valid 24 h, resets only this cell.
2. Locked out everywhere? From a terminal on the device:
grove auth set-password
3. No grove CLI? Visit /setup from localhost to set a new password.
grove apoptosis
Type APOPTOSIS to confirm. Signed tombstone propagates to peers. Irreversible.
Never run network-severing commands over SSH on a remote node. If you sever your own connection, the Pi nodes may be unreachable for days.
Dangerous over SSH (requires a recovery timer or on-box scheduled job):
- nmcli con down/modify, systemctl restart NetworkManager
- iptables -F or any firewall flush
- tailscale down, tailscale logout
- ip link set … down, reboots without a recovery at job

pkill web.py over SSH kills your own shell (the SSH session process matches the pattern). Restart by PID instead:
# Find the PID
cat ~/.grove/grove-web.pid
# Kill just that PID
kill <pid>
# Restart
cd ~/.grove && nohup python3 web.py </dev/null >/tmp/grove-web.log 2>&1 &
Before restarting Grove, check for queued image-gen jobs:
python3 -c "import json; jobs=json.load(open('$HOME/.grove/ai_image_jobs.json')); print([j for j in jobs if j.get('status') in ('queued','running')])"
A blind restart kills the in-process worker; queued/running jobs become errors and must be re-submitted.
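The one-liner above can also be written as a small reusable function, for example to call before a scripted restart. This is a sketch: the name `pending_image_jobs` is hypothetical, and it assumes, as the one-liner does, that ai_image_jobs.json holds a JSON list of objects with a "status" field. Treating a missing file as "no jobs" is an added assumption.

```python
import json
from pathlib import Path

# Statuses the text treats as in-flight work.
ACTIVE_STATUSES = {"queued", "running"}


def pending_image_jobs(path="~/.grove/ai_image_jobs.json"):
    """Return the jobs that a restart would interrupt (queued or running)."""
    path = Path(path).expanduser()
    if not path.exists():
        return []  # assumption: no jobs file means nothing is queued
    jobs = json.loads(path.read_text())
    return [job for job in jobs if job.get("status") in ACTIVE_STATUSES]
```

An empty list means the restart will not orphan any image-gen work; a non-empty list is the set of jobs that would need to be re-submitted.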
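- ~/.grove/.key is chmod 600
- ~/.grove directory is chmod 700
- Dashboard password is set (grove auth set-password)
- nginx injects X-Grove-Public-Edge: 1 on every request proxied to :5678
- Owner dashboard is not reachable over the public edge (curl -I https://grove.nook.li/dashboard should redirect to login, not render the dashboard)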