How startup-kg works
A knowledge graph of the DACH startup ecosystem — companies, investors, and the relationships between them.
The Ecosystem Galaxy
The Galaxy view renders every company and investor as a node. A physics simulation pulls connected nodes toward each other and pushes unconnected nodes apart — so the layout encodes real relationship density.
Clusters form around confirmed INVESTED_IN and CO_INVESTS_WITH edges (from VC portfolio text) and COMPETITOR_OF edges (parsed per-company from enrichment prose — named competitors only, not category heuristics). A dense cluster = a sector with real, cited relationships.
Some dots sit alone. These are companies with no edges yet — either too niche for category matching, added by the discovery pipeline but not yet linked, or genuinely without close competitors in the dataset.
Other nodes are investors: VC firms and angels. They float near companies they have a POTENTIAL_MATCH edge to (investor focus area overlaps company category). Click one to see its portfolio focus.
Why are some dots larger than others?
Node size encodes funding stage — the further along a company is in its funding journey, the bigger its dot. This lets you spot late-stage anchors in a cluster at a glance.
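As a sketch of this encoding, a renderer might map funding stage to radius like so. The stage names, base size, and step are hypothetical, not the Galaxy's actual values:

```python
# Hypothetical stage ordering — the real graph's stage labels may differ.
STAGE_ORDER = ["pre-seed", "seed", "series-a", "series-b", "series-c", "growth", "ipo"]

def node_radius(stage, base=4.0, step=1.5):
    """Later funding stages get a larger dot; unknown or missing stages fall back to base."""
    try:
        rank = STAGE_ORDER.index((stage or "").lower())
    except ValueError:
        return base
    return base + step * rank
```

So a seed-stage company renders slightly larger than a pre-seed one, and a late-stage anchor dominates its cluster visually.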
Edge types
- COMPETITOR_OF — named competitor of this company. Materialised from c.competitors + c.competitive_landscape prose. If the peer isn't in the graph yet, a stub Company node is created so the link survives future enrichment.
- INVESTED_IN — confirmed investment from investor into company. Materialised from notable_exits prose on investor nodes; every edge carries source_field + evidence for audit.
- CO_INVESTS_WITH — two investors named as co-investors. Materialised from co_investors prose on investor nodes. Undirected.
- POTENTIAL_MATCH — investor mandate aligns with company category. Investor focus_areas overlap with the company category. Signal only — not an actual investment.
- FOUNDED / WORKS_AT — person founded / works at a company. The person enricher extracts founders + C-level from public sources; cited URLs are stored on the Person node.
- CO_INVESTED_WITH / SECTOR_PEER — inferred co-investment / sector-peer relationships. Derived automatically by edge inference after each ingest batch.

Every materialised edge carries provenance (source_field, evidence, materialized_at) so we can explain where any link came from. Legacy category-only COMPETITOR_OF edges from the initial import are filtered out of the UI — they had no content-level support.
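For illustration, a materialiser emitting a COMPETITOR_OF edge with its provenance fields might build a Cypher query like this (Memgraph speaks Cypher). The property names source_field, evidence, and materialized_at come from this page; the query shape and function are assumptions:

```python
from datetime import datetime, timezone

def competitor_edge_query(company: str, peer: str, evidence: str):
    """Build a parameterised Cypher MERGE: creates a stub peer node if needed
    and a COMPETITOR_OF edge carrying provenance. Illustrative only."""
    params = {
        "company": company,
        "peer": peer,
        "source_field": "competitors",
        "evidence": evidence,
        "materialized_at": datetime.now(timezone.utc).isoformat(),
    }
    query = (
        "MATCH (c:Company {name: $company}) "
        "MERGE (p:Company {name: $peer}) "      # stub node if the peer is unknown
        "MERGE (c)-[r:COMPETITOR_OF]->(p) "
        "SET r.source_field = $source_field, "
        "r.evidence = $evidence, "
        "r.materialized_at = $materialized_at"
    )
    return query, params
```

Because the peer is created with MERGE rather than CREATE, re-running the materialiser after enrichment upgrades the stub in place instead of duplicating it.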
Company page anatomy — 47 slots
Every company page exposes the same 47-slot canvas so you can see at a glance what we know and what's still pending. Grouped into seven sections:
- Classification (#1–5): name, short description, tags, category, stage. Source: seed import, refined by enrichment.
- Management (#6 C-level + #27 team). Source: person enricher, LinkedIn + web sources.
- Business Model Canvas (#12–20): the nine BMC blocks. Source: enrichment engine with cited URLs.
- Strategic Analysis (#21–26): competitive landscape, market pains, etc. Source: enrichment with explicit citations.
- Finance (#30–37): business model, funding, valuation, ARR, MRR. Source: press releases, PitchBook-style signals.
- Compliance + Signals (#38–47): certifications, licenses, patents, customers, competitors, investors, hiring, GitHub, ratings. Source: mix of enrichment + materialisers + public data.

Empty slots render as "pending enrichment" rather than being hidden — so you always see the full canvas. Every card also has an ↑ Enrich Now button to bump that company to the front of the queue.
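The enrichment queue's "lowest completion first" ordering (described under the engine widget) can be sketched as a simple score over the slot canvas. The dict shape is an assumption; only the 47-slot total comes from this page:

```python
TOTAL_SLOTS = 47

def completion(company: dict) -> float:
    """Fraction of the 47-slot canvas that is filled (non-empty values)."""
    filled = sum(1 for v in company.values() if v not in (None, "", [], {}))
    return filled / TOTAL_SLOTS

def enrich_queue(companies: list[dict]) -> list[dict]:
    """Order companies lowest-completion first, so the emptiest pages get enriched next."""
    return sorted(companies, key=completion)
```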
Web-verified rule
Every fact on every page is backed by a URL retrieved during enrichment. If the pipeline can't cite a source for a field, the field stays empty — it never invents a plausible value from model training data. The "web verified" label in the Sources & References block at the bottom of each page is the baseline, not a feature.
When something looks wrong, open the Sources block and you'll see exactly where it came from. If you catch a hallucination, flag it — it's a bug in the prompt, not expected behaviour.
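A minimal sketch of the web-verified guard, assuming facts arrive as (field, value, sources) triples — the function name and shapes are illustrative:

```python
def accept_fact(field: str, value, sources: list[str]):
    """Keep a field only if at least one http(s) source URL backs it.
    Otherwise the slot stays empty ('pending enrichment') rather than guessing."""
    if value and any(s.startswith(("http://", "https://")) for s in sources):
        return {"field": field, "value": value, "sources": sources}
    return None  # no citation, no fact
```

The point of the guard is asymmetry: an empty slot is recoverable on the next enrichment pass, while an uncited plausible-looking value would silently poison the page.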
Engine widget (top bar)
The widget at the top of /galaxy and /admin shows the three pipeline engines at a glance.
- Scout (+N) — candidates discovered in the most recent Scout run. Shows the loop interval (e.g. 6h); the ▶ / ⏸ toggle starts/stops the loop.
- Ingest (N) — candidates in Postgres awaiting promotion to graph nodes. Shows the loop interval; runs dedup + the classifier before creating nodes.
- Enrich (N+M) — N pending + M stuck (in-flight longer than the timeout). Shows the stale cutoff (e.g. 2w); continuous mode fills the 47 slots from cited web sources. Priority = lowest completion first.

All / Infra on the right pauses or resumes every engine in one click. Live numbers come from GET /stats.
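For illustration, the widget's N+M split for the Enrich engine (pending vs. stuck) could be computed like this. The one-hour timeout and the job shape are assumptions, not the pipeline's real values:

```python
from datetime import datetime, timedelta, timezone

ENRICH_TIMEOUT = timedelta(hours=1)  # illustrative; the actual timeout isn't documented here

def enrich_counts(jobs: list[dict], now=None):
    """Return (pending, stuck): in-flight jobs older than the timeout count
    as stuck, mirroring the widget's N+M display."""
    now = now or datetime.now(timezone.utc)
    pending = sum(1 for j in jobs if j["status"] == "pending")
    stuck = sum(1 for j in jobs if j["status"] == "in_flight"
                and now - j["started_at"] > ENRICH_TIMEOUT)
    return pending, stuck
```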
Pipeline control (Galaxy panel)
The right-hand panel in the Galaxy view controls the 4-layer data pipeline. Each tab has two buttons:
- ▶ Once — fires a single run immediately, non-repeating. Doesn't start or stop the loop.
- ⟳ Loop — starts the auto-repeating loop at the selected interval. Runs continuously until ■ Stop.
- ■ Stop — signals the loop to quit after the current run finishes. The in-flight batch always completes; this is by design.
- ▶ Resume / ⏸ Pause — starts or stops all three engines (Scout + Ingest + Enrich) at once. Top-right of the Galaxy header.

The tabs:

- 🔭 Scout — runs 4 SearXNG queries for DACH startup terms, extracts names via Ollama, queues results as candidates. A 1h loop is a good default.
- ⬆ Ingest — promotes pending/confirmed candidates to Memgraph nodes. Deduplicates by exact name + 88% fuzzy match. Run after Scout; 30m loop.
- ⚙ Enrich — fills empty fields (description, tags, stage, BMC) via Ollama; optionally enriches founders/C-level (toggle). Disable "Enrich persons" to focus on company data first.
- 📥 Import — bulk import from registries (Handelsregister, Zefix, OpenCorporates) and edge inference. Run Edge Inference after bulk imports to connect nodes.

Data sources
- Seed data — imported from the 32dots AI tools catalog and DACH company lists (~6,700 companies, ~3,500 VCs)
- Discovery pipeline — Scout (SearXNG) → Ingest → Enrich (Ollama qwen3.6) → Edge Inference running continuously
- Registries — Handelsregister (DE daily feed), Zefix (CH), OpenCorporates (AT)
- VC portfolio scraping — portfolio pages scraped per investor to build INVESTED_IN edges
Data reflects public information. No proprietary sources. Funding stages and categories are inferred, not verified.
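As a concrete sketch of the Ingest dedup rule described in the pipeline controls (exact name plus 88% fuzzy match), using Python's SequenceMatcher as a stand-in for whatever similarity metric the pipeline actually uses:

```python
from difflib import SequenceMatcher

def normalise(name: str) -> str:
    """Lowercase and collapse whitespace before comparing names."""
    return " ".join(name.lower().split())

def is_duplicate(candidate: str, existing: list[str], threshold: float = 0.88) -> bool:
    """True if the candidate exactly matches an existing name after
    normalisation, or is at least 88% similar to one."""
    c = normalise(candidate)
    for name in existing:
        e = normalise(name)
        if c == e or SequenceMatcher(None, c, e).ratio() >= threshold:
            return True
    return False
```

The 0.88 threshold is deliberately strict: legal-form suffixes (GmbH, AG) make near-identical names common in DACH registries, so a looser cutoff would merge distinct companies.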
How Aura runs this graph — the agent team
startup-kg is maintained by a nine-agent team. Each agent owns one rubric and one rubric only — they don't step on each other's territory. Aura (CTO) routes work; the specialists execute.
- Opus 4.7 — pre-push code + infra gates. Routes every runtime concern to the right specialist. Owns the unrouted-issue inbox.
- Sonnet 4.6 — page-level data quality: what a visitor actually sees. Runs a 60s quality loop, auto-enqueues fixes, reports HAS / TRIED-NOT-FOUND / QUEUED.
- Haiku 4.5 — retrieves any saved fact: agent roles, past decisions, project state. Never improvises; always cites source file + line.
- Sonnet 4.6 — secrets, auth bypass, CVEs, spend deltas, GDPR / PII. Blocks any P1 before merge. Weekly compliance sweep.
- Haiku 4.5 — post-deploy health probes, Kuma + SLO checks, GlitchTip error-rate spikes, container restart loops.
- Sonnet 4.6 — walks the L3 human happy path on prod with a real browser. Catches "users locked out" before users notice.
- Haiku 4.5 — daily liveness probes against every upstream data source (Wikidata, GLEIF, SearXNG, Brave …). Blocks new importer scoping against dead sources.
- Haiku 4.5 — bronze → silver → gold throughput, queue depth, enrichment lag. Catches "container up but queue growing 10×" that Iris misses.
- Sonnet 4.6 — picks what to build next, kills speculative work, runs the 4-question gate before any feature starts.
Ticket lifecycle
Work items flow through a coordinator ticket system. Each ticket moves through these states:
The dashed branch from diagnosis represents tickets that get parked as blocked or question_for_user before a fix can proceed.
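The legal transitions form a small state machine. A sketch, with state names taken from the lifecycle table below (the transition map and helper are illustrative; question_for_user would park a ticket the same way blocked does):

```python
# Allowed forward transitions per state; "blocked" loops back to "diagnosis".
TRANSITIONS = {
    "open": {"diagnosis"},
    "diagnosis": {"fix", "blocked"},
    "blocked": {"diagnosis"},
    "fix": {"resolved"},
    "resolved": {"deployed"},
    "deployed": {"tested"},
    "tested": {"done"},
    "done": set(),  # terminal
}

def advance(state: str, new_state: str) -> str:
    """Move a ticket to new_state, rejecting transitions the lifecycle doesn't allow."""
    if new_state not in TRANSITIONS.get(state, set()):
        raise ValueError(f"illegal transition {state} -> {new_state}")
    return new_state
```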
- open — anyone files a coordinator ticket (sk-NNN). Exit: Aura picks it up.
- diagnosis — Aura reads the ticket, routes it to a specialist or handles it inline. Exit: root cause identified.
- blocked — ticket parked; missing input from the user or upstream. Exit: blocker resolved → back to diagnosis.
- fix — Aura or the named specialist writes the code / config change. Exit: PR opened + self-reviewed.
- resolved — Felix runs the security gate; PR merged to master. Exit: Dokploy auto-deploys.
- deployed — Iris runs the post-deploy health probe (/health, /version, GlitchTip). Exit: all probes green.
- tested — Pia walks the affected L3 user path on prod. Exit: path passes end-to-end.
- done — ticket closed; an EPISODIC entry is written if non-trivial.

FAQ
Why are some companies clustered together?
Clusters form around companies connected by COMPETITOR_OF edges. The physics simulation pulls linked nodes toward each other. A tight cluster means many companies in that sector are catalogued and cross-linked. Lone dots have no edges yet — either a niche with few peers in the dataset, or newly discovered companies not yet linked.
Why does the graph look different every time I load it?
The force simulation starts from random positions and runs until it cools. The final layout is deterministic given enough time, but we stop the simulation early for performance — so exact positions vary slightly between loads. The clustering pattern (which nodes are near each other) is always the same.
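A toy version of such a force simulation: springs pull along edges, all pairs repel, and a cooling temperature caps per-step movement so the layout settles. The constants are arbitrary, not the Galaxy's actual parameters:

```python
import math
import random

def force_layout(nodes: list, edges: list, iters=300, seed=None):
    """Minimal Fruchterman–Reingold-style layout: random start, spring
    attraction on edges, pairwise repulsion, linear cooling."""
    rng = random.Random(seed)
    pos = {n: [rng.uniform(-1, 1), rng.uniform(-1, 1)] for n in nodes}
    k = 0.3  # ideal edge length
    for step in range(iters):
        temp = 0.1 * (1 - step / iters)  # cooling: smaller moves each iteration
        disp = {n: [0.0, 0.0] for n in nodes}
        # repulsion between every pair of nodes
        for i, a in enumerate(nodes):
            for b in nodes[i + 1:]:
                dx = pos[a][0] - pos[b][0]; dy = pos[a][1] - pos[b][1]
                d = math.hypot(dx, dy) or 1e-9
                f = k * k / d
                disp[a][0] += dx / d * f; disp[a][1] += dy / d * f
                disp[b][0] -= dx / d * f; disp[b][1] -= dy / d * f
        # spring attraction along edges
        for a, b in edges:
            dx = pos[a][0] - pos[b][0]; dy = pos[a][1] - pos[b][1]
            d = math.hypot(dx, dy) or 1e-9
            f = d * d / k
            disp[a][0] -= dx / d * f; disp[a][1] -= dy / d * f
            disp[b][0] += dx / d * f; disp[b][1] += dy / d * f
        # apply displacement, capped by the current temperature
        for n in nodes:
            dx, dy = disp[n]
            d = math.hypot(dx, dy) or 1e-9
            pos[n][0] += dx / d * min(d, temp)
            pos[n][1] += dy / d * min(d, temp)
    return pos
```

Run it twice with different seeds and the coordinates differ, but connected nodes end up near each other both times — the same behaviour the Galaxy shows across page loads.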
How current is the data?
The discovery pipeline runs every 6 hours and the enrichment pipeline runs daily. New companies are added continuously. Funding stage and insolvency status lag reality by days to weeks — we rely on public signals, not real-time filings.
How are competitors determined?
The primary source is enrichment prose: competitors named in a company's competitors / competitive_landscape fields become COMPETITOR_OF edges directly (see Edge types above). Edge inference can additionally link two companies that share the same category AND at least one tag. The latter is a structural heuristic — it identifies companies in overlapping spaces, not verified competitive intelligence — so false positives are possible.
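The structural heuristic is essentially a one-liner; the company dict shape here is an assumption:

```python
def sector_peers(a: dict, b: dict) -> bool:
    """True when two companies share the same category and at least one tag —
    the structural (heuristic) peer check, not a verified competitor claim."""
    return a["category"] == b["category"] and bool(set(a["tags"]) & set(b["tags"]))
```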
What does a POTENTIAL_MATCH edge mean?
It means an investor's declared focus areas overlap with a company's category. It is not an actual investment or expression of interest — just a signal that the investor's mandate aligns with the company's space.
Why is my company missing?
Coverage focuses on DACH (Germany, Austria, Switzerland) startups. The gap-analysis pipeline finds companies in sparse categories, but smaller or B2B-only companies may not appear in public sources. If you want a company added, contact us.
Why does the galaxy look sparse / why are there so many lone dots?
Edges are created in two ways: (1) COMPETITOR_OF requires same category AND shared tags — nodes added by the scout pipeline have no tags until the enrichment engine fills them in. (2) CO_INVESTED_WITH and SECTOR_PEER edges are created by Edge Inference, which runs automatically after each ingest batch but can also be triggered manually from the Import tab. Run Enrich first to fill tags, then trigger Edge Inference to connect the dots.
What does "Enrich persons" do and when should I disable it?
When enabled, the enricher runs a second Ollama pass per company to identify founders and C-level people, then creates Person nodes and FOUNDED edges. This doubles enrichment time. Disable it from the Enrich tab toggle when you want to prioritise filling company fields (description, tags, stage, BMC) first.
The Galaxy is slow on my machine — what can I do?
Reduce the node count using the country or category filters in the top bar. Filtering to a single country (e.g. CH) loads ~300–600 nodes instead of ~2000 and runs much faster.
What is the NUC / Ollama mentioned in the pipeline?
We run a local AMD Ryzen AI box (NUC) with Ollama serving qwen3.6 — a 6B-parameter open-weight model — for structured company extraction from raw web text. It avoids sending raw scraped content to external APIs.
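A sketch of what a structured-extraction call to a local Ollama instance can look like, using Ollama's generate endpoint with JSON-constrained output. The prompt and extracted fields are illustrative, not the pipeline's real prompt:

```python
import json

OLLAMA_URL = "http://localhost:11434/api/generate"  # Ollama's default local endpoint

def extraction_request(raw_text: str, model: str = "qwen3.6"):
    """Build the (url, body) pair for a JSON-constrained extraction call."""
    payload = {
        "model": model,
        "prompt": ("Extract the company name, category and funding stage as JSON "
                   "from the following text:\n\n" + raw_text),
        "format": "json",   # ask Ollama to emit valid JSON only
        "stream": False,    # one complete response instead of a token stream
    }
    return OLLAMA_URL, json.dumps(payload).encode()

# To send it (requires a running Ollama):
#   url, body = extraction_request(raw_text)
#   req = urllib.request.Request(url, body, {"Content-Type": "application/json"})
#   result = json.load(urllib.request.urlopen(req))
```

Because the request never leaves localhost, raw scraped pages stay on the box — only the structured result enters the graph.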