How startup-kg works

A knowledge graph of the DACH startup ecosystem — companies, investors, and the relationships between them.

The Ecosystem Galaxy

The Galaxy view renders every company and investor as a node. A physics simulation pulls connected nodes toward each other and pushes unconnected nodes apart — so the layout encodes real relationship density.
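The push and pull described above can be illustrated with a toy force step. This is a minimal sketch, not the renderer's actual code: the function name layout_step and the attract, repel, and min_dist constants are hypothetical, chosen only to show how edges act as springs while every pair of nodes repels.

```python
import math

def layout_step(pos, edges, attract=0.05, repel=0.5, min_dist=0.01):
    """Advance a toy force-directed layout by one step; returns new positions."""
    force = {n: [0.0, 0.0] for n in pos}
    nodes = list(pos)
    # Repulsion: every pair of nodes pushes apart, weaker with distance.
    for i, a in enumerate(nodes):
        for b in nodes[i + 1:]:
            dx = pos[a][0] - pos[b][0]
            dy = pos[a][1] - pos[b][1]
            dist = max(math.hypot(dx, dy), min_dist)
            push = repel / (dist * dist)
            force[a][0] += push * dx / dist
            force[a][1] += push * dy / dist
            force[b][0] -= push * dx / dist
            force[b][1] -= push * dy / dist
    # Attraction: each edge behaves like a spring between its endpoints.
    for a, b in edges:
        dx = pos[b][0] - pos[a][0]
        dy = pos[b][1] - pos[a][1]
        force[a][0] += attract * dx
        force[a][1] += attract * dy
        force[b][0] -= attract * dx
        force[b][1] -= attract * dy
    return {n: (pos[n][0] + force[n][0], pos[n][1] + force[n][1]) for n in pos}
```

Repeating this step many times is what makes linked companies drift into clusters while unlinked ones spread out.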

Tight clusters

Clusters form around confirmed INVESTED_IN and CO_INVESTS_WITH edges (from VC portfolio text) and COMPETITOR_OF edges (parsed per-company from enrichment prose — named competitors only, not category heuristics). A dense cluster = a sector with real, cited relationships.

Lone dots

Companies with no edges yet — either too niche for category matching, added by the discovery pipeline but not yet linked, or genuinely without close competitors in the dataset.

Green nodes = Investors

VC firms and angels. They float near companies they have a POTENTIAL_MATCH edge to (investor focus area overlaps company category). Click to see their portfolio focus.

Why are some dots larger than others?

Node size encodes funding stage — the further along a company is in its funding journey, the bigger its dot. This lets you spot late-stage anchors in a cluster at a glance.

Size legend, largest first: Growth, Series B, Series A, Seed, Pre-seed, Unknown.
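A size mapping of this kind could be sketched as follows. The pixel values, the name node_radius, and the choice to treat Unknown (and any unrecognised stage) as the smallest dot are assumptions for illustration; only the stage names come from the legend.

```python
# Stages ordered from earliest to latest; Unknown is assumed to be smallest.
STAGE_ORDER = ["Unknown", "Pre-seed", "Seed", "Series A", "Series B", "Growth"]

def node_radius(stage, base=4.0, step=2.0):
    """Return a dot radius that grows with funding stage (hypothetical units)."""
    rank = STAGE_ORDER.index(stage) if stage in STAGE_ORDER else 0
    return base + step * rank
```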

Edge types

COMPETITOR_OF: Named competitor of this company. Materialised from c.competitors + c.competitive_landscape prose. If the peer isn't in the graph yet, a stub Company node is created so the link survives future enrichment.

INVESTED_IN: Confirmed investment from investor into company. Materialised from notable_exits prose on investor nodes; every edge carries source_field + evidence for audit.

CO_INVESTS_WITH: Two investors named as co-investors. Materialised from co_investors prose on investor nodes. Undirected.

POTENTIAL_MATCH: Investor mandate aligns with company category. Investor focus_areas overlap with company category. Signal only, not an actual investment.

FOUNDED / WORKS_AT: Person founded / works at a company. Person enricher extracts founders + C-level from public sources; cited URLs stored on the Person node.

CO_INVESTED_WITH / SECTOR_PEER: Inferred co-investment / sector-peer relationships. Edge inference, derived automatically after each ingest batch.

Every materialised edge carries provenance (source_field, evidence, materialized_at) so we can explain where any link came from. Legacy category-only COMPETITOR_OF edges from the initial import are filtered out of the UI — they had no content-level support.
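To make the provenance fields concrete, here is a minimal sketch of a materialised edge and of stub creation for a missing peer. The Edge class, the in-memory graph dict, and materialise_competitor are hypothetical stand-ins; only the field names source_field, evidence, and materialized_at come from this page.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone

@dataclass
class Edge:
    kind: str
    source: str
    target: str
    source_field: str   # which field the link was parsed from
    evidence: str       # the prose snippet that supports the link
    materialized_at: datetime = field(
        default_factory=lambda: datetime.now(timezone.utc)
    )

def materialise_competitor(graph, company, peer, source_field, evidence):
    """Add a COMPETITOR_OF edge, creating a stub Company node if needed."""
    if peer not in graph["nodes"]:
        # Stub node: keeps the link alive until the peer is enriched.
        graph["nodes"][peer] = {"name": peer, "stub": True}
    edge = Edge("COMPETITOR_OF", company, peer, source_field, evidence)
    graph["edges"].append(edge)
    return edge
```

Because every edge carries its own evidence, answering "where did this link come from?" becomes a field lookup and needs no rerun of the pipeline.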

Company page anatomy — 47 slots

Every company page exposes the same 47-slot canvas so you can see at a glance what we know and what's still pending. Grouped into seven sections:

Classification (#1–5): name, short description, tags, category, stage. Source: seed import, refined by enrichment.

Management (#6 C-level + #27 team). Source: Person enricher, LinkedIn + web sources.

Business Model Canvas (#12–20): nine BMC blocks. Source: enrichment engine with cited URLs.

Strategic Analysis (#21–26): competitive landscape, market pains, etc. Source: enrichment with explicit citations.

Finance (#30–37): business model, funding, valuation, ARR, MRR. Source: press releases, PitchBook-style signals.

Compliance + Signals (#38–47): certifications, licenses, patents, customers, competitors, investors, hiring, GitHub, ratings. Source: mix of enrichment + materialisers + public data.

Empty slots render as "pending enrichment" rather than hiding — so you always see the full canvas. Every card also has an ↑ Enrich Now button to bump that company to the front of the queue.
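The "always show the full canvas" behaviour can be sketched like this. It assumes slots are stored in a dict keyed by slot number 1 to 47 and that None, an empty string, or an empty list counts as empty; both the storage shape and the helper names are illustrative, not the page's actual implementation.

```python
TOTAL_SLOTS = 47
PENDING = "pending enrichment"

def is_filled(value):
    """Treat None, empty string, and empty list as an empty slot (assumption)."""
    return value not in (None, "", [])

def render_slot(value):
    """Show the value, or the pending label instead of hiding the slot."""
    return str(value) if is_filled(value) else PENDING

def completion(slots):
    """Fraction of the 47 slots that currently hold a value."""
    filled = sum(1 for n in range(1, TOTAL_SLOTS + 1) if is_filled(slots.get(n)))
    return filled / TOTAL_SLOTS
```

A completion score of this kind is also one plausible input for deciding which company the enricher should pick up next.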

Web-verified rule

Every fact on every page is backed by a URL retrieved during enrichment. If the pipeline can't cite a source for a field, the field stays empty — it never invents a plausible value from model training data. The "web verified" label in the Sources & References block at the bottom of each page is the baseline, not a feature.

When something looks wrong, open the Sources block and you'll see exactly where it came from. If you catch a hallucination, flag it — it's a bug in the prompt, not expected behaviour.
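The rule amounts to a gate between the model's output and the stored field. The sketch below is one way to express it, assuming each extracted value arrives with a list of candidate source URLs; accept_field is a hypothetical helper, not the pipeline's real function.

```python
def accept_field(value, source_urls):
    """Keep a value only if at least one http(s) URL backs it.

    Returns (value, cited_urls) when the value is cited, else (None, []),
    so an uncited field stays empty rather than holding a guess.
    """
    cited = [u for u in source_urls if u.startswith(("http://", "https://"))]
    if value and cited:
        return value, cited
    return None, []
```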

Engine widget (top bar)

The widget at the top of /galaxy and /admin shows the three pipeline engines at a glance.

Scout +N: Candidates discovered in the most recent Scout run. Controls: loop interval (e.g. 6h); the ▶ / ⏸ toggle starts/stops the loop.

Ingest N: Candidates in Postgres awaiting promotion to graph nodes. Controls: loop interval. Runs dedup + classifier before creating nodes.

Enrich N+M: N pending + M stuck (in-flight > timeout). Controls: stale cutoff (e.g. 2w) + continuous mode. Fills 47 slots from cited web sources. Priority = lowest completion first.

All / Infra on the right pauses or resumes every engine in one click. Live numbers come from GET /stats.
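The Enrich counters and the "lowest completion first" priority can be sketched as below. The job dict shape (name, completion, started_at) and the 30-minute timeout are assumptions made for the example; the real payload returned by GET /stats is not documented on this page.

```python
from datetime import datetime, timedelta

def widget_counts(jobs, now, timeout=timedelta(minutes=30)):
    """Return (pending, stuck): not yet started vs. in-flight longer than timeout."""
    pending = sum(1 for j in jobs if j["started_at"] is None)
    stuck = sum(
        1 for j in jobs
        if j["started_at"] is not None and now - j["started_at"] > timeout
    )
    return pending, stuck

def next_to_enrich(jobs):
    """Pick the waiting company with the lowest completion score."""
    waiting = [j for j in jobs if j["started_at"] is None]
    return min(waiting, key=lambda j: j["completion"])["name"] if waiting else None
```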

Pipeline control (Galaxy panel)

The right-hand panel in the Galaxy view controls the 4-layer data pipeline (Scout, Ingest, Enrich, Import). Each tab has two run buttons, ▶ Once and ⟳ Loop; ■ Stop ends a running loop:

▶ Once: Fires a single run immediately, non-repeating. Doesn't start or stop the loop.

⟳ Loop: Starts the auto-repeating loop at the selected interval. Runs continuously until ■ Stop.

■ Stop: Signals the loop to quit after the current run finishes. The in-flight batch always completes; this is by design.

▶ Resume / ⏸ Pause: Starts or stops all three engines (Scout + Ingest + Enrich) at once. Located top-right of the Galaxy header.
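The Stop semantics (signal, then let the in-flight run finish) match a common cooperative-shutdown pattern. Here is a minimal sketch using a threading.Event; run_loop and its parameters are hypothetical and only illustrate the behaviour described above.

```python
import threading

def run_loop(run_once, stop_event, interval_s):
    """Repeat run_once until stop_event is set; returns the number of runs.

    The stop flag is only checked between runs, so a run that is already
    in flight always completes before the loop exits.
    """
    runs = 0
    while not stop_event.is_set():
        run_once()
        runs += 1
        # Sleep for the interval, but wake immediately if Stop is pressed.
        stop_event.wait(interval_s)
    return runs
```

Pressing Stop corresponds to calling stop_event.set() from another thread; a single ▶ Once run would simply call run_once() directly without touching the event.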

🔭 Scout: Runs 4 SearXNG queries for DACH startup terms, extracts names via Ollama, queues as candidates. Tip: a 1h loop is a good default.

⬆ Ingest: Promotes pending/confirmed candidates → Memgraph nodes. Deduplicates by exact name + 88% fuzzy match. Tip: run after Scout; 30m loop.

⚙ Enrich: Fills empty fields (description, tags, stage, BMC) via Ollama. Optionally enriches founders/C-level (toggle). Tip: disable "Enrich persons" to focus on company data first.

📥 Import: Bulk import from registries (Handelsregister, Zefix, OpenCorporates) and run edge inference. Tip: run Edge Inference after bulk imports to connect nodes.
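Ingest's "exact name + 88% fuzzy match" check could look roughly like this. The sketch uses difflib.SequenceMatcher from the Python standard library as the similarity measure and lowercases/collapses whitespace before comparing; both choices are assumptions, since this page does not say which matcher or normalisation the pipeline uses.

```python
from difflib import SequenceMatcher

def normalise(name):
    """Lowercase and collapse whitespace (assumed normalisation)."""
    return " ".join(name.lower().split())

def is_duplicate(candidate, existing_names, threshold=0.88):
    """True if candidate matches a known name exactly or at >= threshold similarity."""
    cand = normalise(candidate)
    for name in existing_names:
        known = normalise(name)
        if cand == known:
            return True
        if SequenceMatcher(None, cand, known).ratio() >= threshold:
            return True
    return False
```

A different similarity metric would shift which pairs cross the 88% line, so the threshold is only meaningful together with the matcher it was tuned for.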

Data sources

Seed data — imported from the 32dots AI tools catalog and DACH company lists (~6,700 companies, ~3,500 VCs)
Discovery pipeline — Scout (SearXNG) → Ingest → Enrich (Ollama qwen3.6) → Edge Inference running continuously
Registries — Handelsregister (DE daily feed), Zefix (CH), OpenCorporates (AT)
VC portfolio scraping — portfolio pages scraped per investor to build INVESTED_IN edges

Data reflects public information. No proprietary sources. Funding stages and categories are inferred, not verified.

Navigation

Companies: Searchable list of all companies with filters by country, category, stage

Investors: VC firms and angels with focus area, stage preference, and match count

Galaxy ✦: Full-ecosystem force-directed graph; pan, zoom, click nodes to open detail

Browse: Data browser, a tabular view of all companies, investors, and the discovery queue (with bulk approve/reject)

Help: This page (edge types, pipeline docs, FAQ)

How Aura runs this graph — the agent team

startup-kg is maintained by a nine-agent team. Each agent owns one rubric and one rubric only — they don't step on each other's territory. Aura (CTO) routes work; the specialists execute.

CTO Aura

Opus 4.7

Pre-push code + infra gates. Routes every runtime concern to the right specialist. Unrouted-issue inbox.

CQO Maya

Sonnet 4.6

Page-level data quality — what a visitor actually sees. Runs a 60s quality loop, auto-enqueues fixes, reports HAS / TRIED-NOT-FOUND / QUEUED.

Memory Mira

Haiku 4.5

Retrieves any saved fact — agent roles, past decisions, project state. Never improvises; always cites source file + line.

Security Felix

Sonnet 4.6

Secrets, auth bypass, CVEs, spend deltas, GDPR / PII. Blocks any P1 before merge. Weekly compliance sweep.

Uptime Iris

Haiku 4.5

Post-deploy health probes, Kuma + SLO checks, GlitchTip error-rate spikes, container restart loops.

UX Canary Pia

Sonnet 4.6

Walks the L3 human happy path on prod with a real browser. Catches "users locked out" before users notice.

Sources Lukas

Haiku 4.5

Daily liveness probes against every upstream data source (Wikidata, GLEIF, SearXNG, Brave …). Blocks new importer scoping against dead sources.

Pipeline Otto

Haiku 4.5

Bronze → silver → gold throughput, queue depth, enrichment lag. Catches "container up but queue growing 10×" that Iris misses.

Roadmap Rudi

Sonnet 4.6

Picks what to build next, kills speculative work, runs the 4-question gate before any feature starts.

Ticket lifecycle

Work items flow through a coordinator ticket system. Each ticket moves through the states listed below. In the lifecycle diagram, the dashed branch from diagnosis represents tickets that get parked as blocked or question_for_user before a fix can proceed.

open: Anyone files a coordinator ticket (sk-NNN). Exit: Aura picks it up.

diagnosis: Aura reads the ticket, routes to specialist or handles inline. Exit: root cause identified.

blocked: Ticket parked, missing input from user or upstream. Exit: blocker resolved → back to diagnosis.

fix: Aura or the named specialist writes the code / config change. Exit: PR opened + self-reviewed.

resolved: Felix runs security gate; PR merged to master. Exit: Dokploy auto-deploys.

deployed: Iris runs post-deploy health probe (/health, /version, GlitchTip). Exit: all probes green.

tested: Pia walks the affected L3 user path on prod. Exit: path passes end-to-end.

done: Ticket closed; EPISODIC entry written if non-trivial.
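The lifecycle above behaves like a small state machine. The following sketch encodes the allowed transitions; the TRANSITIONS table and advance function are illustrative, and the return path from question_for_user back to diagnosis is an assumption modelled on the documented blocked → diagnosis path.

```python
# Allowed next states per ticket state. The question_for_user return path
# is assumed to mirror "blocked"; the page only documents the latter.
TRANSITIONS = {
    "open": {"diagnosis"},
    "diagnosis": {"fix", "blocked", "question_for_user"},
    "blocked": {"diagnosis"},
    "question_for_user": {"diagnosis"},
    "fix": {"resolved"},
    "resolved": {"deployed"},
    "deployed": {"tested"},
    "tested": {"done"},
    "done": set(),
}

def advance(state, new_state):
    """Return new_state if the move is allowed, else raise ValueError."""
    if new_state not in TRANSITIONS.get(state, set()):
        raise ValueError(f"illegal transition: {state} -> {new_state}")
    return new_state
```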

FAQ

Why are some companies clustered together?

Clusters form around companies connected by COMPETITOR_OF edges. The physics simulation pulls linked nodes toward each other. A tight cluster means many companies in that sector are catalogued and cross-linked. Lone dots have no edges yet — either a niche with few peers in the dataset, or newly discovered companies not yet linked.

Why does the graph look different every time I load it?

The force simulation starts from random positions and runs until it cools. We stop the simulation early for performance, so exact positions vary slightly between loads. The clustering pattern (which nodes end up near each other) stays the same, because it is driven by the edges rather than by the starting positions.

How current is the data?

The discovery pipeline runs every 6 hours and the enrichment pipeline runs daily. New companies are added continuously. Funding stage and insolvency status lag reality by days to weeks — we rely on public signals, not real-time filings.

How are competitors determined?

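A COMPETITOR_OF edge is created when a company's enrichment prose (the competitors and competitive_landscape fields) names another company as a competitor. Only named competitors count, not category heuristics. Each edge records the source_field and evidence it came from, but the underlying prose is extracted from public web text, not verified competitive intelligence, so false positives are possible. Legacy category-only edges from the initial import are filtered out of the UI.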

What does a POTENTIAL_MATCH edge mean?

It means an investor's declared focus areas overlap with a company's category. It is not an actual investment or expression of interest — just a signal that the investor's mandate aligns with the company's space.
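As a rough illustration, the overlap test can be read as "the company's category appears among the investor's focus areas". That reading, the case-insensitive comparison, and the dict shapes below are assumptions; the page only states that focus_areas overlap with the company category.

```python
def potential_matches(investors, companies):
    """Return (investor, 'POTENTIAL_MATCH', company) triples for overlapping focus."""
    edges = []
    for inv in investors:
        focus = {area.lower() for area in inv["focus_areas"]}
        for co in companies:
            if co["category"].lower() in focus:
                edges.append((inv["name"], "POTENTIAL_MATCH", co["name"]))
    return edges
```

Nothing in this check looks at actual deals, which is why the edge is a signal only.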

Why is my company missing?

Coverage focuses on DACH (Germany, Austria, Switzerland) startups. The gap-analysis pipeline finds companies in sparse categories, but smaller or B2B-only companies may not appear in public sources. If you want a company added, contact us.

Why does the galaxy look sparse / why are there so many lone dots?

Edges are created in two ways: (1) COMPETITOR_OF edges are materialised from enrichment prose (named competitors), so nodes added by the Scout pipeline have none until the enrichment engine has filled in their fields. (2) CO_INVESTED_WITH and SECTOR_PEER edges are created by Edge Inference, which runs automatically after each ingest batch but can also be triggered manually from the Import tab. Run Enrich first to fill the fields, then trigger Edge Inference to connect the dots.

What does "Enrich persons" do and when should I disable it?

When enabled, the enricher runs a second Ollama pass per company to identify founders and C-level people, then creates Person nodes and FOUNDED edges. This doubles enrichment time. Disable it from the Enrich tab toggle when you want to prioritise filling company fields (description, tags, stage, BMC) first.

The Galaxy is slow on my machine — what can I do?

Reduce the node count using the country or category filters in the top bar. Filtering to a single country (e.g. CH) loads ~300–600 nodes instead of ~2000 and runs much faster.
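Filtering helps because the simulation cost grows with the number of nodes and edges it has to process. A minimal sketch of such a filter, assuming nodes are dicts with id, country, and category keys and edges are (source, target) pairs; filter_graph is a hypothetical helper, not the app's actual code:

```python
def filter_graph(nodes, edges, country=None, category=None):
    """Keep nodes matching the filters, and only edges between kept nodes."""
    keep = {
        n["id"] for n in nodes
        if (country is None or n.get("country") == country)
        and (category is None or n.get("category") == category)
    }
    kept_nodes = [n for n in nodes if n["id"] in keep]
    kept_edges = [(a, b) for a, b in edges if a in keep and b in keep]
    return kept_nodes, kept_edges
```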

What is the NUC / Ollama mentioned in the pipeline?

We run a local AMD Ryzen AI box (NUC) with Ollama serving qwen3.6 — a 6B-parameter open-weight model — for structured company extraction from raw web text. It avoids sending raw scraped content to external APIs.