
Troubleshooting

> Build identity is per-release — check grove version or /api/version.

Connection Issues

Peer shows red dot (offline)

1. Check if the peer is running: curl http://localhost:5678/api/version on that node

2. Check network: can you reach the peer's Tailscale IP?

3. Check all known routes — Tailscale, LAN, relay

4. Port 5678 must be reachable from peers (firewall check)

5. If direct connection fails, the relay path is the fallback; check relay connectivity below
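
Steps 1 and 4 above both come down to whether TCP port 5678 answers from where you are standing. A minimal sketch of that check, assuming only the port number given above; the function name port_open is illustrative and not part of Grove:

```python
import socket

def port_open(host: str, port: int = 5678, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Refused, unreachable, and timed-out connections all land here.
        return False
```

Running port_open with the peer's Tailscale IP and then its LAN IP from another cell shows which route, if any, is blocked. A False on every route while the local curl on the peer succeeds points at a firewall rather than at Grove.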

"0 peers connected" but peers are configured

Check ~/.grove/config.json → peers array has entries
Check that peer_secret matches between cells
Opportunistic sync triggers on page load; to trigger it manually, use dashboard → Sync
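
The first two checks can be done without pasting the secret anywhere by comparing a short hash instead. A sketch, assuming peers and peer_secret are top-level keys of config.json (the text names both keys but not their exact position); summarize_config and load_config are hypothetical helpers, not Grove APIs:

```python
import hashlib
import json
import os

def summarize_config(cfg: dict) -> dict:
    """Return the peer count and a short fingerprint of peer_secret."""
    peers = cfg.get("peers") or []
    secret = cfg.get("peer_secret") or ""
    # Hash the secret so it can be compared across cells without exposing it.
    fingerprint = (
        hashlib.sha256(secret.encode("utf-8")).hexdigest()[:12] if secret else None
    )
    return {"peer_count": len(peers), "secret_fingerprint": fingerprint}

def load_config(path: str = "~/.grove/config.json") -> dict:
    """Load the Grove config file from its documented location."""
    with open(os.path.expanduser(path), encoding="utf-8") as fh:
        return json.load(fh)
```

Run summarize_config(load_config()) on each cell. A peer_count of 0 means the peers array is empty or missing; differing fingerprints mean the secrets do not match.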

Stale routes / ghost peer

If a peer shows connected but sync always fails, its route entry may be stale (old IP, rotated identity):


# Dry run — see what would be pruned
grove peer-prune-stale

# Apply
grove peer-prune-stale --apply

Dashboard → Peers → route diagnostics also surfaces ghost routes (entries that have repeatedly failed probes). A "ghost" peer has a ghost_since timestamp in config; run grove peer-prune-stale --apply to clean it up.

Relay won't connect

Check relay URL in config: "relay_url": "ws://host:5680" or "wss://domain/relay"
Check that the relay is running: curl http://relay-host:5680 should respond (or connect with a WebSocket client)
For WSS: nginx must have a /relay location block with WebSocket upgrade headers and proxy_read_timeout 86400
Check that the peer's pubkey is registered with the relay
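
A malformed relay_url is easy to miss by eye. The sketch below checks a value against the two example forms shown above. The port and path checks are heuristics taken from those examples, not rules Grove is known to enforce, and check_relay_url is a hypothetical helper:

```python
from urllib.parse import urlsplit

def check_relay_url(relay_url: str) -> list:
    """Return a list of problems found in a relay_url value (empty if none)."""
    problems = []
    parts = urlsplit(relay_url)
    if parts.scheme not in ("ws", "wss"):
        problems.append(f"scheme is {parts.scheme!r}, expected 'ws' or 'wss'")
    if not parts.hostname:
        problems.append("no host in URL")
    # Heuristics based on the documented examples only.
    if parts.scheme == "ws" and parts.port is None:
        problems.append("ws:// URL has no explicit port (the example uses 5680)")
    if parts.scheme == "wss" and not parts.path.strip("/"):
        problems.append("wss:// URL has no path (the example uses /relay)")
    return problems
```

An empty list means the value has the expected shape; it says nothing about whether the relay is actually reachable.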

Dashboard / Auth Issues

Can't log in

Wrong password — use the password set during setup
Brute-force lockout — wait 5 minutes or restart Grove
Reset password (terminal on the same machine):


  grove auth set-password

No grove CLI — visit /setup from localhost to set a new password

Dashboard locked out everywhere (owner session)

See ADMIN-GUIDE.md § "Forgot dashboard password" — three recovery paths.

Owner dashboard inaccessible from outside Tailscale (expected)

On a public gateway (FN / grove.nook.li), the owner dashboard is intentionally tailnet-only. nginx tags public-edge requests with X-Grove-Public-Edge: 1; Grove blocks owner sessions and admin /api/* over that edge. This is correct behavior — access the dashboard via Tailscale IP (http://100.126.143.83:5678) or the watchdog page at :5679/watchdog if Grove itself is down.

If the dashboard is incorrectly accessible over the public internet, check that your nginx config injects proxy_set_header X-Grove-Public-Edge "1" on every proxy_pass block pointing to :5678.
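
Auditing the nginx config by hand gets error-prone once there are several location blocks. A rough sketch of an automated check, assuming location blocks contain no nested braces and ignoring nginx's inheritance of proxy_set_header from outer levels; the function name is illustrative:

```python
import re

# Matches a location block whose body contains no nested braces.
LOCATION_RE = re.compile(r"location\s+([^{]+?)\s*\{([^{}]*)\}")
HEADER_RE = re.compile(r"proxy_set_header\s+X-Grove-Public-Edge\s+\"?1\"?\s*;")
PROXY_RE = re.compile(r"proxy_pass\s+\S*:5678\b")

def locations_missing_edge_header(nginx_conf: str) -> list:
    """Return the location paths that proxy to :5678 without the edge header."""
    missing = []
    for match in LOCATION_RE.finditer(nginx_conf):
        path, body = match.group(1), match.group(2)
        if PROXY_RE.search(body) and not HEADER_RE.search(body):
            missing.append(path)
    return missing
```

Pass it the text of the public-edge server config. Any path it returns is a block that forwards to Grove without tagging the request, so treat the result as a list of places to inspect rather than a verdict.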

Dashboard blank / won't load

Check if Grove is running: curl http://localhost:5678/api/version
Check logs: tail -20 /tmp/grove-web.log
Check port conflict: lsof -i :5678
Watchdog page at :5679 may have more info

"Grove server lost" banner

Dashboard lost contact with backend (3 missed heartbeats)
Check the watchdog page at :5679
systemd: systemctl status grove and journalctl -u grove -n 50
Restart by PID (safer than pkill, especially over SSH):


  kill $(cat ~/.grove/grove-web.pid)
  sleep 2
  cd ~/.grove && nohup python3 web.py </dev/null >/tmp/grove-web.log 2>&1 &

Do not use pkill -f web.py over SSH: the pattern also matches the remote shell that is running the command (its command line contains web.py), so it kills your own session.
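
The sleep 2 in the restart sequence assumes the old process has exited by then. A small sketch for confirming that before starting a new one, assuming a POSIX system and the pid file path used above; pid_alive and read_pid are hypothetical helpers:

```python
import os

def pid_alive(pid: int) -> bool:
    """Return True if a process with this PID exists (POSIX only)."""
    try:
        os.kill(pid, 0)  # Signal 0 checks existence without sending anything.
    except ProcessLookupError:
        return False
    except PermissionError:
        return True  # The process exists but belongs to another user.
    return True

def read_pid(path: str = "~/.grove/grove-web.pid"):
    """Return the PID stored in the pid file, or None if missing or invalid."""
    try:
        with open(os.path.expanduser(path), encoding="utf-8") as fh:
            return int(fh.read().strip())
    except (OSError, ValueError):
        return None
```

Read the PID before sending kill, then poll pid_alive until it returns False. If it stays True for more than a few seconds, the process is not shutting down cleanly and the logs are worth checking before forcing it.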

Toasts / real-time updates broken

SocketIO may be blocked by proxy or firewall
Check browser console for WebSocket errors
nginx needs a /socket.io location block with Upgrade + Connection: upgrade headers

Portal login issues

The portal (/portal) is the owner's WAN-facing view. Portal accounts are provisioned via Settings → Remote Access. A portal admin session grants access to /portal/* routes only, not the native dashboard. If the portal login page shows an error after a valid invite, check that the cell is running and reachable (portal auth hits /api/version during handshake).

File Issues

File shows "Under-replicated"

Normal after upload — chunks sync in background (allow up to 5 min)
All peers offline — will heal when peers reconnect
Manual heal: click the file → Sync, or run grove heal

"N to go" backup warning on the Home screen is stuck

This is almost always test-debris dead manifests, not replication lag. Dead manifests are manifests where factor=0 and no peer holds a copy of any chunk. They accumulate from test cycles and old file-removal patterns.

To clean them up:


# Dry run (see what would be removed)
grove gc-manifests --dry-run

# Apply fleet-wide
grove gc-manifests

If the count stays high after cleanup, check replication factor with GET /api/replication-status. A real lag shows chunks with factor > 0 but below desired_factor.
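
The distinction between dead entries and real lag can be made mechanical. The sketch below buckets records by factor. The shape of the /api/replication-status response is not documented here, so the list-of-dicts input with a factor key is an assumption, and classify_chunks is a hypothetical helper:

```python
def classify_chunks(chunks: list, desired_factor: int) -> dict:
    """Bucket chunk records by replication factor."""
    buckets = {"dead": [], "lagging": [], "healthy": []}
    for chunk in chunks:
        factor = chunk.get("factor", 0)
        if factor == 0:
            # No copies anywhere: a candidate for gc-manifests.
            buckets["dead"].append(chunk)
        elif factor < desired_factor:
            # Some copies exist but fewer than wanted: real replication lag.
            buckets["lagging"].append(chunk)
        else:
            buckets["healthy"].append(chunk)
    return buckets
```

A large dead bucket with an empty lagging bucket matches the stuck-warning case described above; a non-empty lagging bucket means replication is genuinely behind.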

"Grove Only" file won't restore

Need at least one online peer that holds the chunks
Check grove replication or GET /api/replication-status
If no peers hold the chunks, the file may be unrecoverable — restore from an external backup

Upload stuck / progress bar frozen

Check disk space: df -h ~/.grove
Check logs: tail -f /tmp/grove-web.log
Large files chunk in the background — check the Activity tab
Refresh and retry

Sync / Replication Issues

Build-hash mismatch across fleet

All cells must be on the same build hash for seamless sync. Check:


# Per-cell (repeat for each Tailscale IP)
curl -sf http://100.96.243.69:5678/api/version | python3 -c 'import sys,json; print(json.load(sys.stdin)["build"])'

If hashes differ, the most likely cause is that deploy.sh was run without shipping assets/, or that only some files were updated. Fix with a full redeploy:


bash deploy.sh <node>
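
With more than a few cells, repeating the curl per IP gets tedious. A sketch that collects the build field from each cell and groups hosts by hash, assuming only the /api/version endpoint and its build key as used in the command above; the function names are illustrative:

```python
import json
import urllib.request
from collections import defaultdict

def fetch_build(host: str, port: int = 5678, timeout: float = 5.0) -> str:
    """Return the "build" field from a cell's /api/version response."""
    url = f"http://{host}:{port}/api/version"
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        return json.load(resp)["build"]

def group_by_build(builds: dict) -> dict:
    """Map each build hash to the sorted list of hosts reporting it."""
    groups = defaultdict(list)
    for host, build in builds.items():
        groups[build].append(host)
    return {build: sorted(hosts) for build, hosts in groups.items()}
```

Build the input with {host: fetch_build(host) for host in hosts}, where hosts is your own list of Tailscale IPs. More than one key in the result means the fleet is split, and the smaller group is usually the one to redeploy.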

Chunks exist but file shows "incomplete"

Placement DB may be stale
Trigger sync: click the file → Sync
Or rebuild placement: rm ~/.grove/placement.db and restart Grove

"At risk" / single-copy chunks

Some chunks exist on only one peer
Fix: ensure more peers are online and trigger opportunistic sync
Long-term: increase desired_factor in config

Performance Issues

Slow sync

Check route speeds: GET /api/route-speeds
Relay paths are usually slower than direct Tailscale; set a peer to relay-preferred only if the relay is actually faster for it
Reduce desired_factor to sync fewer copies

High memory / OOM

On systemd nodes, grove.service caps memory with MemoryMax=512M
On Pis, the keepalive cron (grove-health-local.sh) manages proactive restarts in the yellow/red memory zone during idle windows
Manual restart: kill and restart by PID (see above); do NOT use pkill over SSH

Pi startup is slow (15–25 s)

Expected — the Pi builds a local RAG index at boot. Wait for /api/version to respond before calling the node healthy.
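
Scripts that restart a Pi and then act on it immediately should wait for readiness rather than sleep for a fixed time. A minimal polling sketch, assuming /api/version is the readiness signal as described above; the 60-second default and the function name are illustrative choices, and the fetch parameter exists only so the loop can be exercised without a network:

```python
import time
import urllib.request

def wait_until_ready(url: str, timeout: float = 60.0, interval: float = 2.0,
                     fetch=urllib.request.urlopen) -> bool:
    """Poll url until it answers or timeout seconds pass; True on success."""
    deadline = time.monotonic() + timeout
    while True:
        try:
            with fetch(url, timeout=interval):
                return True
        except OSError:
            # URLError and connection errors are OSError subclasses: not up yet.
            pass
        if time.monotonic() >= deadline:
            return False
        time.sleep(interval)
```

Call it with the node's http://<ip>:5678/api/version URL. A False after a timeout well beyond the expected 15 to 25 seconds suggests a real startup failure, not a slow index build.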

AI Issues

AI tab not responding / "no AI available"

1. Check if a model is configured: Settings → AI → model file should be set

2. Check if the llama-server is running:


   curl http://localhost:8090/health

3. If not running, start from the AI tab ("Start AI server") or:


   python3 ~/.grove/web.py ai-start

4. On systemd nodes: systemctl status grove-ai

5. Log: /tmp/grove-ai-start.log

Image generation jobs stuck / errored after a restart

A blind restart kills the in-process image worker. Jobs that were queued or running at restart time are marked error. Check before restarting:


python3 -c "
import json, os
f = os.path.expanduser('~/.grove/ai_image_jobs.json')
jobs = json.load(open(f)) if os.path.exists(f) else []
pending = [j for j in jobs if j.get('status') in ('queued', 'running')]
print(f'{len(pending)} pending jobs:', [j.get('id') for j in pending])
"

If jobs are pending, wait for them to finish before restarting.

AI routes to wrong peer / cold cache

Route-aware probing self-heals stale AI peer hosts automatically
Force re-probe: restart Grove; the probe cycle runs on startup
Check GET /api/ai/status on each peer to see what they advertise

Common Error Messages

| Error | Meaning | Fix |
| --- | --- | --- |
| `Invalid or missing X-Grove-Secret` | Peer auth failed | Check `peer_secret` matches between cells |
| `403 Forbidden` | Not authorized | Log in, or check owner-session / public-edge config |
| `429 Too Many Attempts` | Login lockout | Wait 5 minutes |
| `ECONNREFUSED` on port 5678 | Grove not running | Start Grove; check watchdog at `:5679` |
| `SSL certificate verify failed` | Bad/expired cert | `certbot renew` or check acme.py logs |
| `MemoryError` | Out of RAM | Restart Grove; check keepalive cron |
| `No space left on device` | Disk full | Free space or reduce `storage_cap_gb` |
| `InvalidTag` on file decrypt | Wrong key for chunk | May indicate a shared-file with multiple grant keys; upgrade to current build |
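
The table can double as a lookup when skimming a long log. The sketch below counts a subset of the error strings above in a list of log lines and returns the matching fix. It assumes the strings appear verbatim in the log, which may not hold for every message; KNOWN_ERRORS and suggest_fixes are hypothetical names:

```python
from collections import Counter

# Subset of the table above: rows whose error text plausibly appears in logs.
KNOWN_ERRORS = {
    "Invalid or missing X-Grove-Secret": "Check peer_secret matches between cells",
    "SSL certificate verify failed": "certbot renew or check acme.py logs",
    "MemoryError": "Restart Grove; check keepalive cron",
    "No space left on device": "Free space or reduce storage_cap_gb",
    "InvalidTag": "Upgrade to current build",
}

def suggest_fixes(log_lines) -> dict:
    """Count known error strings in log lines; return (count, fix) per error seen."""
    seen = Counter()
    for line in log_lines:
        for needle in KNOWN_ERRORS:
            if needle in line:
                seen[needle] += 1
    return {needle: (count, KNOWN_ERRORS[needle]) for needle, count in seen.items()}
```

Feed it the lines of /tmp/grove-web.log, for example via open(path).read().splitlines(). An empty result means none of these strings occurred, not that the log is clean.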

Getting Help

1. Check this doc

2. Logs: /tmp/grove-web.log or journalctl -u grove

3. Dashboard Activity tab for error entries

4. grove doctor — diagnoses common issues (deps, ports, peers, certs)

5. Ask GroveAI (AI tab) — it has access to these docs