> Build identity is per-release; check grove version or /api/version.
1. Check if the peer is running: curl http://localhost:5678/api/version on that node
2. Check network: can you reach the peer's Tailscale IP?
3. Check all known routes: Tailscale, LAN, relay
4. Port 5678 must be reachable from peers (firewall check)
5. If direct connection fails, the relay path is the fallback; check relay connectivity below
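The first checks above can be scripted when several peers need testing. This is a minimal sketch, not part of Grove: it assumes only that a healthy cell answers /api/version with JSON on port 5678, as described in the steps. The function names and the three result labels are hypothetical.

```python
import http.client
import json
import socket
import urllib.request

GROVE_PORT = 5678  # port named in the steps above


def port_open(host, port=GROVE_PORT, timeout=3.0):
    """Return True if a TCP connection to host:port succeeds (firewall check)."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


def peer_version(host, port=GROVE_PORT, timeout=3.0):
    """Return the parsed /api/version payload, or None if the peer does not answer."""
    url = f"http://{host}:{port}/api/version"
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return json.load(resp)
    except (OSError, ValueError, http.client.HTTPException):
        return None


def diagnose(host, port=GROVE_PORT):
    """Classify a peer as 'ok', 'port-open-no-api', or 'unreachable' (labels are illustrative)."""
    if peer_version(host, port) is not None:
        return "ok"
    if port_open(host, port):
        return "port-open-no-api"
    return "unreachable"
```

Calling diagnose() with a peer's Tailscale IP and then its LAN IP shows which route is failing. An "unreachable" result on every route points at the firewall or at Grove being down on that node.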
- ~/.grove/config.json: peers array has entries
- peer_secret matches between cells

If a peer shows connected but sync always fails, its route entry may be stale (old IP, rotated identity):
# Dry run: see what would be pruned
grove peer-prune-stale
# Apply
grove peer-prune-stale --apply
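To see which entries are candidates before pruning, the config can also be read directly. This is a rough sketch under stated assumptions: that ~/.grove/config.json is a JSON object with a peers array, and that ghost entries carry a ghost_since field. The name key used for display is hypothetical and may differ in your config.

```python
import json
import os


def load_config(path="~/.grove/config.json"):
    """Read the Grove config file; return {} if it does not exist."""
    full = os.path.expanduser(path)
    if not os.path.exists(full):
        return {}
    with open(full) as fh:
        return json.load(fh)


def ghost_peers(config):
    """Return peer entries that carry a non-empty ghost_since timestamp."""
    peers = config.get("peers", [])
    return [p for p in peers if isinstance(p, dict) and p.get("ghost_since")]


if __name__ == "__main__":
    for peer in ghost_peers(load_config()):
        # "name" is an assumed key; fall back to a placeholder if it is absent
        print(peer.get("name", "<unnamed>"), "ghost since", peer["ghost_since"])
```

This only reads the file. Removal is still best left to grove peer-prune-stale --apply.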
Dashboard → Peers → route diagnostics also surfaces ghost routes (entries that have repeatedly failed probes). A "ghost" peer has a ghost_since timestamp in config; run peer-prune-stale to clean it up.
- "relay_url": "ws://host:5680" or "wss://domain/relay"
- curl http://relay-host:5680 should respond (or connect via ws)
- /relay location block with WebSocket upgrade headers and proxy_read_timeout 86400
grove auth set-password
grove CLI - visit /setup from localhost to set a new password
See ADMIN-GUIDE.md § "Forgot dashboard password" for three recovery paths.
On a public gateway (FN / grove.nook.li), the owner dashboard is intentionally tailnet-only. nginx tags public-edge requests with X-Grove-Public-Edge: 1; Grove blocks owner sessions and admin /api/* over that edge. This is correct behavior; access the dashboard via the Tailscale IP (http://100.126.143.83:5678) or the watchdog page at :5679/watchdog if Grove itself is down.
If the dashboard is incorrectly accessible over the public internet, check that your nginx config injects proxy_set_header X-Grove-Public-Edge "1" on every proxy_pass block pointing to :5678.
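The header check can be automated against an nginx config file. This is a simplified sketch, not a full nginx parser: it only understands location blocks without nested braces, does not strip comments, and ignores nginx's inheritance of proxy_set_header from the enclosing server block. The function name is hypothetical.

```python
import re

# Matches "location <path> { ... }" where the body contains no nested braces.
LOCATION_RE = re.compile(r"location\s+([^{]+?)\s*\{([^{}]*)\}")
# A proxy_pass directive whose target ends in port 5678.
PROXY_TO_GROVE_RE = re.compile(r"proxy_pass\s+[^;]*:5678\b")
# The header injection described above, with or without quotes around the value.
EDGE_HEADER_RE = re.compile(r'proxy_set_header\s+X-Grove-Public-Edge\s+"?1"?\s*;')


def locations_missing_edge_header(nginx_conf):
    """Return location paths that proxy to :5678 without setting X-Grove-Public-Edge."""
    missing = []
    for match in LOCATION_RE.finditer(nginx_conf):
        path, body = match.group(1), match.group(2)
        if PROXY_TO_GROVE_RE.search(body) and not EDGE_HEADER_RE.search(body):
            missing.append(path)
    return missing
```

Pass it the text of the public-facing site config. Any path it returns is a proxy_pass block that may expose the dashboard and is worth reviewing by hand.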
- curl http://localhost:5678/api/version
- tail -20 /tmp/grove-web.log
- lsof -i :5678
- The watchdog on :5679 may have more info
- systemctl status grove and journalctl -u grove -n 50
kill $(cat ~/.grove/grove-web.pid)
sleep 2
cd ~/.grove && nohup python3 web.py </dev/null >/tmp/grove-web.log 2>&1 &
Do not use pkill web.py over SSH: it matches the SSH session and kills your shell.
- /socket.io location block with Upgrade + Connection: upgrade headers

The portal (/portal) is the owner's WAN-facing view. Portal accounts are provisioned via Settings → Remote Access. A portal admin session grants access to /portal/* routes only, not the native dashboard. If the portal login page shows an error after a valid invite, check that the cell is running and reachable (portal auth hits /api/version during handshake).
grove heal

This is almost always test-debris dead manifests, not replication lag. Dead manifests are manifests where factor=0 and no peer holds a copy of any chunk. They accumulate from test cycles and old file-removal patterns.
To clean them up:
# Dry run (see what would be removed)
grove gc-manifests --dry-run
# Apply fleet-wide
grove gc-manifests
If the count stays high after cleanup, check replication factor with GET /api/replication-status. A real lag shows chunks with factor > 0 but below desired_factor.
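The distinction between dead manifests and real lag can be written as a small classifier. This is an illustrative sketch only: the response shape of /api/replication-status is not documented here, so the entry format (a dict with factor and a holders list) is an assumption, as are the category labels.

```python
from collections import Counter


def classify(entry, desired_factor):
    """Label one manifest entry using the rules described above.

    entry is assumed to look like {"factor": int, "holders": [peer, ...]}.
    """
    factor = entry.get("factor", 0)
    holders = entry.get("holders") or []
    if factor == 0 and not holders:
        return "dead"      # test debris: safe target for gc-manifests
    if 0 < factor < desired_factor:
        return "lagging"   # real replication lag
    if factor >= desired_factor:
        return "healthy"
    return "other"         # factor 0 but some peer still holds chunks


def summarize(entries, desired_factor):
    """Count entries per category."""
    return dict(Counter(classify(e, desired_factor) for e in entries))
```

A summary dominated by "dead" means cleanup is the fix. A large "lagging" count means sync itself needs attention.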
- grove replication or GET /api/replication-status
- df -h ~/.grove
- tail -f /tmp/grove-web.log

All cells must be on the same build hash for seamless sync. Check:
# Per-cell (repeat for each Tailscale IP)
curl -sf http://100.96.243.69:5678/api/version | python3 -c 'import sys,json; print(json.load(sys.stdin)["build"])'
If hashes differ, deploy.sh was run without shipping assets/, or only some files were updated. Fix with a full redeploy:
bash deploy.sh <node>
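Repeating the curl per cell gets tedious on a larger fleet. The sketch below collects the build field from each cell and groups cells by hash. It assumes only what the curl command above shows: /api/version returns JSON with a "build" key. The helper names are hypothetical.

```python
import http.client
import json
import urllib.request
from collections import defaultdict


def fetch_build(ip, port=5678, timeout=5.0):
    """Return the build hash reported by one cell, or None if it cannot be read."""
    url = f"http://{ip}:{port}/api/version"
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return json.load(resp)["build"]
    except (OSError, ValueError, KeyError, TypeError, http.client.HTTPException):
        return None


def group_by_build(builds):
    """Map each build hash to the list of cells reporting it."""
    groups = defaultdict(list)
    for cell, build in builds.items():
        groups[build].append(cell)
    return dict(groups)


def fleet_in_sync(builds):
    """True only if every cell answered and all report the same build."""
    values = list(builds.values())
    return bool(values) and None not in values and len(set(values)) == 1


# Example usage (substitute your own Tailscale IPs):
#   builds = {ip: fetch_build(ip) for ip in ["100.96.243.69", "100.126.143.83"]}
#   print(group_by_build(builds), fleet_in_sync(builds))
```

Cells grouped under None did not answer at all, which is a reachability problem rather than a build mismatch.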
- rm ~/.grove/placement.db and restart Grove
- desired_factor in config
- GET /api/route-speeds
- desired_factor to sync fewer copies
- MemoryMax=512M in grove.service
- grove-health-local.sh manages proactive restarts in the yellow/red memory zone during idle windows

Expected: the Pi builds a local RAG index at boot. Wait for /api/version to respond before calling the node healthy.
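Waiting for /api/version can be done with a simple polling loop instead of retrying by hand. This is a minimal sketch: the 600-second ceiling and 5-second interval are arbitrary assumptions, not Grove defaults, and should be tuned to how long indexing takes on your hardware.

```python
import http.client
import time
import urllib.request


def api_responds(base_url, timeout=3.0):
    """Return True if GET <base_url>/api/version answers with HTTP 200."""
    try:
        with urllib.request.urlopen(f"{base_url}/api/version", timeout=timeout) as resp:
            return resp.status == 200
    except (OSError, http.client.HTTPException):
        return False


def wait_until_ready(probe, max_wait=600.0, interval=5.0,
                     sleep=time.sleep, clock=time.monotonic):
    """Call probe() until it returns True or max_wait seconds have passed.

    sleep and clock are injectable so the loop can be tested without waiting.
    """
    deadline = clock() + max_wait
    while True:
        if probe():
            return True
        if clock() >= deadline:
            return False
        sleep(interval)


# Example usage:
#   ready = wait_until_ready(lambda: api_responds("http://localhost:5678"))
```

A False result after the full wait suggests something other than index building is wrong, and the logs are the next place to look.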
1. Check if a model is configured: Settings → AI → model file should be set
2. Check if the llama-server is running:
curl http://localhost:8090/health
3. If not running, start from the AI tab ("Start AI server") or:
python3 ~/.grove/web.py ai-start
4. On systemd nodes: systemctl status grove-ai
5. Log: /tmp/grove-ai-start.log
A blind restart kills the in-process image worker. Jobs that were queued or running at restart time are marked error. Check before restarting:
python3 -c "
import json, os
f = os.path.expanduser('~/.grove/ai_image_jobs.json')
jobs = json.load(open(f)) if os.path.exists(f) else []
pending = [j for j in jobs if j.get('status') in ('queued', 'running')]
print(f'{len(pending)} pending jobs:', [j.get('id') for j in pending])
"
If jobs are pending, wait for them to finish before restarting.
- GET /api/ai/status on each peer to see what they advertise

| Error | Meaning | Fix |
|---|---|---|
| Invalid or missing X-Grove-Secret | Peer auth failed | Check peer_secret matches between cells |
| 403 Forbidden | Not authorized | Log in, or check owner-session / public-edge config |
| 429 Too Many Attempts | Login lockout | Wait 5 minutes |
| ECONNREFUSED on port 5678 | Grove not running | Start Grove; check watchdog at :5679 |
| SSL certificate verify failed | Bad/expired cert | certbot renew or check acme.py logs |
| MemoryError | Out of RAM | Restart Grove; check keepalive cron |
| No space left on device | Disk full | Free space or reduce storage_cap_gb |
| InvalidTag on file decrypt | Wrong key for chunk | May indicate a shared file with multiple grant keys; upgrade to current build |
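The table can double as a quick log triage aid. The sketch below counts how often each known error string appears in a log. It assumes these strings show up verbatim in /tmp/grove-web.log, which may not hold for every error; treat a miss as inconclusive.

```python
import os

# Error substrings and fixes copied from the table above.
KNOWN_ERRORS = {
    "Invalid or missing X-Grove-Secret": "Check peer_secret matches between cells",
    "403 Forbidden": "Log in, or check owner-session / public-edge config",
    "429 Too Many Attempts": "Wait 5 minutes",
    "ECONNREFUSED": "Start Grove; check watchdog at :5679",
    "SSL certificate verify failed": "certbot renew or check acme.py logs",
    "MemoryError": "Restart Grove; check keepalive cron",
    "No space left on device": "Free space or reduce storage_cap_gb",
    "InvalidTag": "Upgrade to current build",
}


def triage(lines):
    """Return {error substring: number of lines containing it}."""
    counts = {}
    for line in lines:
        for needle in KNOWN_ERRORS:
            if needle in line:
                counts[needle] = counts.get(needle, 0) + 1
    return counts


if __name__ == "__main__":
    log_path = "/tmp/grove-web.log"
    if os.path.exists(log_path):
        with open(log_path, errors="replace") as fh:
            for needle, count in triage(fh).items():
                print(f"{count:5d}  {needle}  ->  {KNOWN_ERRORS[needle]}")
```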
1. Check this doc
2. Logs: /tmp/grove-web.log or journalctl -u grove
3. Dashboard Activity tab for error entries
4. grove doctor: diagnoses common issues (deps, ports, peers, certs)
5. Ask GroveAI (AI tab); it has access to these docs