
Troubleshooting

> Build identity is per-release; check grove version or /api/version.


Connection Issues

Peer shows red dot (offline)

1. Check that the peer is running: run curl http://localhost:5678/api/version on that node

2. Check network: can you reach the peer's Tailscale IP?

3. Check all known routes: Tailscale, LAN, relay

4. Port 5678 must be reachable from peers (firewall check)

5. If direct connection fails, the relay path is the fallback; check relay connectivity below
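The checks above can be scripted as a quick route probe. This is a minimal sketch, not Grove's own diagnostics: the function names and the route table are hypothetical, and it only assumes that a healthy peer answers /api/version with JSON on port 5678, as described in step 1.

```python
import http.client
import json
import urllib.request


def probe_version(base_url, timeout=3.0):
    """Return the parsed /api/version payload, or None if the route fails."""
    try:
        with urllib.request.urlopen(f"{base_url}/api/version", timeout=timeout) as resp:
            return json.load(resp)
    except (OSError, ValueError, http.client.HTTPException):
        # URLError and socket timeouts are OSError subclasses;
        # ValueError covers a body that is not valid JSON.
        return None


def first_reachable(routes, probe=probe_version):
    """Try each (label, base_url) pair in order; return the first label that answers."""
    for label, base_url in routes:
        if probe(base_url) is not None:
            return label
    return None


# Example usage (addresses are placeholders, replace with the peer's real routes):
# routes = [("tailscale", "http://100.96.243.69:5678"), ("lan", "http://192.168.1.20:5678")]
# print(first_reachable(routes) or "no direct route; check the relay path")
```

If no label comes back, every direct route failed and the relay path is the next thing to check.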

"0 peers connected" but peers are configured

Stale routes / ghost peer

If a peer shows connected but sync always fails, its route entry may be stale (old IP, rotated identity):


# Dry run: see what would be pruned
grove peer-prune-stale

# Apply
grove peer-prune-stale --apply

Dashboard → Peers → route diagnostics also surfaces ghost routes (entries that have repeatedly failed probes). A "ghost" peer has a ghost_since timestamp in config; run peer-prune-stale to clean it up.

Relay won't connect


Dashboard / Auth Issues

Can't log in

Dashboard locked out everywhere (owner session)

See ADMIN-GUIDE.md § "Forgot dashboard password" for the three recovery paths.

Owner dashboard inaccessible from outside Tailscale (expected)

On a public gateway (FN / grove.nook.li), the owner dashboard is intentionally tailnet-only. nginx tags public-edge requests with X-Grove-Public-Edge: 1; Grove blocks owner sessions and admin /api/* over that edge. This is correct behavior; access the dashboard via Tailscale IP (http://100.126.143.83:5678) or the watchdog page at :5679/watchdog if Grove itself is down.

If the dashboard is incorrectly accessible over the public internet, check that your nginx config injects proxy_set_header X-Grove-Public-Edge "1" on every proxy_pass block pointing to :5678.
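A rough way to audit an nginx config for this is sketched below. It is a heuristic, not a real nginx parser: the function name is hypothetical, it inspects only innermost `{ ... }` blocks, and it strips anything after a `#` as a comment. It also ignores nginx's inheritance of proxy_set_header from outer levels, so a header set only at server level will be reported as missing. That matches the "every proxy_pass block" rule above, but treat the results as hints.

```python
import re

# Innermost blocks only: a label (e.g. "location /api/") followed by a brace body.
BLOCK = re.compile(r"([^{};]*)\{([^{}]*)\}")
PROXY_TO_GROVE = re.compile(r"proxy_pass\s+[^;]*:5678\b[^;]*;")
EDGE_HEADER = re.compile(r'proxy_set_header\s+X-Grove-Public-Edge\s+"?1"?\s*;')


def blocks_missing_edge_header(nginx_conf):
    """Return labels of blocks that proxy to :5678 without setting the edge header."""
    text = re.sub(r"#[^\n]*", "", nginx_conf)  # drop comments (heuristic)
    missing = []
    for label, body in BLOCK.findall(text):
        if PROXY_TO_GROVE.search(body) and not EDGE_HEADER.search(body):
            missing.append(label.strip())
    return missing


sample_conf = """
server {
  location / {
    proxy_set_header X-Grove-Public-Edge "1";
    proxy_pass http://127.0.0.1:5678;
  }
  location /api/ {
    proxy_pass http://127.0.0.1:5678;
  }
}
"""
print(blocks_missing_edge_header(sample_conf))  # ['location /api/']
```

Any label in the output is a block to review by hand before reloading nginx.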

Dashboard blank / won't load

"Grove server lost" banner

Do not use pkill web.py over SSH: it matches the SSH session and kills your shell.

Toasts / real-time updates broken

Portal login issues

The portal (/portal) is the owner's WAN-facing view. Portal accounts are provisioned via Settings → Remote Access. A portal admin session grants access to /portal/* routes only, not the native dashboard. If the portal login page shows an error after a valid invite, check that the cell is running and reachable (portal auth hits /api/version during the handshake).


File Issues

File shows "Under-replicated"

"N to go" backup warning on the Home screen is stuck

This is almost always test-debris dead manifests, not replication lag. Dead manifests are manifests where factor=0 and no peer holds a copy of any chunk. They accumulate from test cycles and old file-removal patterns.

To clean them up:


# Dry run (see what would be removed)
grove gc-manifests --dry-run

# Apply fleet-wide
grove gc-manifests

If the count stays high after cleanup, check replication factor with GET /api/replication-status. A real lag shows chunks with factor > 0 but below desired_factor.
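The distinction between dead manifests and real lag can be expressed as a small classifier. This is only a sketch: the shape of the /api/replication-status response is an assumption here (a list of objects with integer factor and desired_factor fields, plus a hypothetical id), so adapt the field access to the real payload.

```python
def classify_replication(entries):
    """Bucket entries by replication state.

    ASSUMPTION: each entry is a dict with integer "factor" and
    "desired_factor" keys; the real payload may be shaped differently.
    """
    buckets = {"dead_candidates": [], "lagging": [], "healthy": []}
    for entry in entries:
        factor = entry.get("factor", 0)
        desired = entry.get("desired_factor", 0)
        if factor == 0:
            # No copies anywhere: a candidate for gc-manifests, not real lag.
            buckets["dead_candidates"].append(entry)
        elif factor < desired:
            # Some copies exist but fewer than wanted: genuine replication lag.
            buckets["lagging"].append(entry)
        else:
            buckets["healthy"].append(entry)
    return buckets


# Hypothetical sample data for illustration only.
sample = [
    {"id": "a", "factor": 0, "desired_factor": 3},
    {"id": "b", "factor": 1, "desired_factor": 3},
    {"id": "c", "factor": 3, "desired_factor": 3},
]
result = classify_replication(sample)
print({name: [e["id"] for e in items] for name, items in result.items()})
# {'dead_candidates': ['a'], 'lagging': ['b'], 'healthy': ['c']}
```

A non-empty "lagging" bucket after cleanup points to a real replication problem rather than leftover manifests.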

"Grove Only" file won't restore

Upload stuck / progress bar frozen


Sync / Replication Issues

Build-hash mismatch across fleet

All cells must be on the same build hash for seamless sync. Check:


# Per-cell (repeat for each Tailscale IP)
curl -sf http://100.96.243.69:5678/api/version | python3 -c 'import sys,json; print(json.load(sys.stdin)["build"])'

If hashes differ, the likely cause is that deploy.sh was run without shipping assets/, or that only some files were updated. Fix with a full redeploy:


bash deploy.sh <node>
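To check the whole fleet in one pass, before or after a redeploy, the per-cell curl can be wrapped in a loop. The sketch below reads the same "build" field from /api/version as the one-liner above; the function names and the list of cell IPs are hypothetical.

```python
import json
import urllib.request
from collections import defaultdict


def fetch_build(ip, port=5678, timeout=3.0):
    """Return the "build" field from a cell's /api/version, or None on failure."""
    try:
        with urllib.request.urlopen(f"http://{ip}:{port}/api/version", timeout=timeout) as resp:
            return json.load(resp)["build"]
    except (OSError, ValueError, KeyError, TypeError):
        return None


def group_by_build(builds):
    """Map each build hash (None for unreachable cells) to the cells reporting it."""
    groups = defaultdict(list)
    for cell, build in builds.items():
        groups[build].append(cell)
    return dict(groups)


def fleet_is_consistent(builds):
    """True only when every cell answered and all report the same hash."""
    groups = group_by_build(builds)
    return len(groups) == 1 and None not in groups


# Example usage (replace with your own Tailscale IPs):
# cells = ["100.96.243.69", "100.126.143.83"]
# builds = {ip: fetch_build(ip) for ip in cells}
# print(group_by_build(builds), fleet_is_consistent(builds))
```

Cells grouped under None did not answer at all, which is a connectivity problem rather than a build mismatch.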

Chunks exist but file shows "incomplete"

"At risk" / single-copy chunks


Performance Issues

Slow sync

High memory / OOM

Pi startup is slow (15–25 s)

Expected: the Pi builds a local RAG index at boot. Wait for /api/version to respond before calling the node healthy.
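In a deploy or health-check script, that wait can be a bounded poll instead of a fixed sleep. This is a minimal sketch with hypothetical function names and default timings; the only thing it takes from Grove is that /api/version answers once the node is up.

```python
import http.client
import time
import urllib.request


def version_responds(base_url, timeout=2.0):
    """True if GET /api/version returns HTTP 200 within the timeout."""
    try:
        with urllib.request.urlopen(f"{base_url}/api/version", timeout=timeout) as resp:
            return resp.status == 200
    except (OSError, http.client.HTTPException):
        return False


def wait_until_healthy(base_url, deadline_s=60.0, interval_s=2.0,
                       probe=version_responds, sleep=time.sleep):
    """Poll until the probe succeeds or deadline_s elapses; return the final verdict."""
    end = time.monotonic() + deadline_s
    while True:
        if probe(base_url):
            return True
        if time.monotonic() >= end:
            return False
        sleep(interval_s)


# Example usage (a 60 s deadline leaves headroom over the 15-25 s boot):
# ok = wait_until_healthy("http://localhost:5678")
```

The probe and sleep parameters exist so the loop can be exercised without a live node.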


AI Issues

AI tab not responding / "no AI available"

1. Check if a model is configured: Settings → AI → model file should be set

2. Check if the llama-server is running:


   curl http://localhost:8090/health

3. If not running, start from the AI tab ("Start AI server") or:


   python3 ~/.grove/web.py ai-start

4. On systemd nodes: systemctl status grove-ai

5. Log: /tmp/grove-ai-start.log

Image generation jobs stuck / errored after a restart

A blind restart kills the in-process image worker. Jobs that were queued or running at restart time are marked error. Check before restarting:


python3 -c "
import json, os
f = os.path.expanduser('~/.grove/ai_image_jobs.json')
jobs = json.load(open(f)) if os.path.exists(f) else []
pending = [j for j in jobs if j.get('status') in ('queued', 'running')]
print(f'{len(pending)} pending jobs:', [j.get('id') for j in pending])
"

If jobs are pending, wait for them to finish before restarting.

AI routes to wrong peer / cold cache


Common Error Messages

| Error | Meaning | Fix |
|---|---|---|
| Invalid or missing X-Grove-Secret | Peer auth failed | Check peer_secret matches between cells |
| 403 Forbidden | Not authorized | Log in, or check owner-session / public-edge config |
| 429 Too Many Attempts | Login lockout | Wait 5 minutes |
| ECONNREFUSED on port 5678 | Grove not running | Start Grove; check watchdog at :5679 |
| SSL certificate verify failed | Bad/expired cert | certbot renew or check acme.py logs |
| MemoryError | Out of RAM | Restart Grove; check keepalive cron |
| No space left on device | Disk full | Free space or reduce storage_cap_gb |
| InvalidTag on file decrypt | Wrong key for chunk | May indicate a shared-file with multiple grant keys; upgrade to current build |
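The table above can double as a quick log triage pass. The sketch below matches a subset of its error strings against log lines and reports the suggested fix for each. It assumes the strings appear verbatim in the log, which is not guaranteed, and the function name is hypothetical.

```python
# (signature to search for, fix from the table above)
KNOWN_ERRORS = [
    ("Invalid or missing X-Grove-Secret", "Check peer_secret matches between cells"),
    ("ECONNREFUSED", "Start Grove; check watchdog at :5679"),
    ("certificate verify failed", "certbot renew or check acme.py logs"),
    ("MemoryError", "Restart Grove; check keepalive cron"),
    ("No space left on device", "Free space or reduce storage_cap_gb"),
    ("InvalidTag", "Upgrade to current build"),
]


def suggest_fixes(log_lines):
    """Return (signature, fix, count) for each known error seen, in table order."""
    counts = {}
    for line in log_lines:
        for signature, _fix in KNOWN_ERRORS:
            if signature in line:
                counts[signature] = counts.get(signature, 0) + 1
    return [(sig, fix, counts[sig]) for sig, fix in KNOWN_ERRORS if sig in counts]


# Example usage against the log path listed under Getting Help:
# with open("/tmp/grove-web.log", errors="replace") as fh:
#     for signature, fix, count in suggest_fixes(fh):
#         print(f"{count:4d}x {signature} -> {fix}")
```

Plain substring matching keeps the sketch simple; it will miss errors that the log words differently.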

Getting Help

1. Check this doc

2. Logs: /tmp/grove-web.log or journalctl -u grove

3. Dashboard Activity tab for error entries

4. Run grove doctor, which diagnoses common issues (deps, ports, peers, certs)

5. Ask GroveAI (AI tab); it has access to these docs