It sounds simple: clone every repo, parse the files, build a graph. Here’s why each ecosystem fights back, and what it actually takes to map cross-repo dependencies automatically.
In the last post, I wrote about the infrastructure dependency visibility gap: the fact that most platform teams have no way to answer “if I change this, what else breaks?” across their repos. The community response confirmed what I’d seen at every client: people are building brittle grep scripts, maintaining stale spreadsheets, or just relying on whoever has been around the longest.
The obvious next question is: why doesn’t anyone just parse the repos and build a graph?
The answer is that people do. Multiple engineers I’ve spoken with have built their own versions: a nightly cron job, some shell scripts, a SQLite database. And those solutions work, for a while, for one org, for the file types they remembered to handle. Then they hit an edge case, go stale, or the person who built it moves on.
The core approach is right: scan every repo, parse the files that declare dependencies, resolve them to actual repos, build a directed graph. But the devil is in the details, and each ecosystem has its own set of devils. This post walks through what it actually takes to auto-discover cross-repo dependencies across Terraform, Docker, CI pipelines, Python, Go, npm, Ansible, Helm, and Kubernetes, and why the cross-ecosystem problem is harder than any individual one.
The approach: parse what’s already there
The principle behind auto-discovery is simple: the dependencies are already declared in the source files. A Terraform module has a source attribute pointing at a git URL. A Dockerfile has a FROM statement naming a base image. A GitLab CI config has include: directives referencing templates in other repos.
You don’t need humans to fill in a catalog. You don’t need a YAML manifest per repo. The dependency information exists. It’s just scattered across a dozen file formats in hundreds of repos, with no unified view.
So the pipeline looks like this:
- Enumerate — list every repo in the GitLab group or GitHub org via API
- Clone — shallow-clone each repo (depth 1, just the default branch)
- Parse — walk the file tree, dispatch each file to the right parser based on filename and path
- Detect artifacts — identify what each repo produces (a Terraform module, a Docker image, a Python package, a Helm chart)
- Resolve — match parsed dependency references to known repos or artifacts in the org
- Store — persist the graph as queryable relationships
Steps 1 and 2 are straightforward. Steps 3 through 5 are where every ecosystem has opinions about how to make your life difficult.
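The dispatch step (3) can be sketched as a filename-to-parser registry. This is a hypothetical skeleton, not the actual implementation: the registry keys, the `scan_repo` name, and the mini Dockerfile parser are all illustrative, and most entries are left as placeholders.

```python
from pathlib import Path

# Hypothetical parser registry: filename pattern -> callable returning the
# raw dependency references found in that file's text.
PARSERS = {
    "Dockerfile": lambda text: [
        line.split()[1] for line in text.splitlines() if line.startswith("FROM ")
    ],
    "*.tf": lambda text: [],            # placeholder: Terraform source extraction
    ".gitlab-ci.yml": lambda text: [],  # placeholder: include/trigger extraction
}

def scan_repo(repo_root: str) -> list[str]:
    """Step 3 of the pipeline: walk the tree, dispatch each file by name."""
    refs = []
    for path in sorted(Path(repo_root).rglob("*")):
        if not path.is_file() or ".git" in path.parts:
            continue
        for pattern, parser in PARSERS.items():
            if path.match(pattern):
                refs.extend(parser(path.read_text(errors="ignore")))
    return refs
```

The real work hides behind those placeholder lambdas, which is what the rest of this post is about.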
Terraform: where version refs hide in query strings
Terraform is usually the first ecosystem people think of for cross-repo dependencies, and on the surface it looks easy. A module block has a source attribute:
module "vpc" {
  source = "git::https://gitlab.com/infra/modules/vpc.git?ref=v2.1.0"
}
You parse the source string, extract the git URL and the ref, normalize the path to infra/modules/vpc, match it against known repos in the org — done.
Except it’s never that clean. Here’s what you actually encounter in the wild:
Multiple URL formats. The same module might be sourced as git::https://..., git@gitlab.com:... (SSH), a bare HTTPS URL without the git:: prefix, or a Terraform registry address like app.terraform.io/org/module/provider. Each format needs different parsing logic to extract the same canonical repo path.
Subdirectory references. Terraform supports the double-slash convention: git::https://gitlab.com/infra/modules.git//networking/vpc?ref=v1.0. The repo is infra/modules, but the module is in a subdirectory. This means one repo can produce multiple distinct modules, and your parser needs to handle that relationship.
Variable interpolation. You’ll find modules like:
module "service" {
  source = "git::https://gitlab.com/${var.infra_group}/modules/service.git"
}
You can’t resolve ${var.infra_group} without running Terraform, and the whole point of static analysis is that you don’t run Terraform. The practical choice is to flag these as lower-confidence dependencies and extract what you can from the static portion of the string.
Public registry vs. internal modules. A source like hashicorp/consul/aws points to the public Terraform Registry — it’s not an internal dependency and should be skipped. But app.terraform.io/your-org/vpc/aws is internal. The parser needs to distinguish between public and private registries.
The modules.json trap. Some people suggest parsing .terraform/modules/modules.json, which contains the resolved module tree. The problem: this file only exists if someone has run terraform init, it’s usually in .gitignore, and it reflects one person’s local state — not the repo’s declared dependencies. It’s not a reliable source for org-wide discovery.
The meta-lesson from Terraform: even within a single ecosystem, the same logical relationship (“repo A depends on repo B”) can be expressed in half a dozen syntactically different ways, and your parser needs to normalize all of them to the same canonical form.
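A normalisation sketch for the git-based forms might look like this. It is a best-effort illustration, not a complete parser: registry addresses and interpolated sources are deliberately returned as unresolvable, and the function name is mine.

```python
import re
from urllib.parse import urlparse, parse_qs

def parse_terraform_source(source: str):
    """Normalize a Terraform module source to (repo_path, subdir, ref).

    Handles git::https, SSH, and // subdirectory forms. Returns None for
    registry-style sources (hashicorp/consul/aws) and interpolated strings.
    """
    if "${" in source:
        return None  # variable interpolation: handle as lower-confidence elsewhere
    src = source.removeprefix("git::")
    # SSH form: git@gitlab.com:group/repo.git//subdir?ref=v1.0
    if m := re.match(r"git@([^:]+):(.+)", src):
        src = f"https://{m.group(1)}/{m.group(2)}"
    if not src.startswith(("http://", "https://")):
        return None  # registry address or local path
    parsed = urlparse(src)
    ref = parse_qs(parsed.query).get("ref", [None])[0]
    path, _, subdir = parsed.path.partition("//")
    repo = path.strip("/").removesuffix(".git")
    return repo, subdir or None, ref
```

Both the double-slash convention and the ref query string fall out of ordinary URL parsing once the SSH form has been rewritten to HTTPS.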
Docker: the base image puzzle
Dockerfiles look deceptively simple:
FROM node:18-alpine
That’s a public image — skip it. But:
FROM registry.company.com/platform/base-image:v2.1
That’s an internal base image, and it means this repo depends on whatever repo builds and publishes platform/base-image. Tracking these relationships across an org is one of the highest-value things a dependency graph can do. Docker base image updates are a constant source of surprise breakage.
Here’s where it gets complicated:
Build arguments and variable substitution. Real-world Dockerfiles frequently use ARG to parameterize the base image:
ARG REGISTRY=registry.company.com
ARG BASE_VERSION=latest
FROM ${REGISTRY}/platform/base-image:${BASE_VERSION}
You can resolve ARG defaults by parsing the Dockerfile top-to-bottom, substituting the default values into the FROM statement. But if the ARG is overridden at build time via --build-arg, the static default might be wrong. Again: lower confidence, but still useful signal.
Multi-stage builds. A modern Dockerfile might have several FROM statements:
FROM node:18 AS builder
RUN npm ci && npm run build
FROM registry.company.com/platform/nginx:1.25
COPY --from=builder /app/dist /usr/share/nginx/html
The first FROM is a public image. The second is an internal dependency. The COPY --from=builder is a reference to an earlier stage, not an external image — the parser needs to track named stages and skip internal references. If you naively treat every FROM and COPY --from as a dependency, you’ll generate false edges in the graph.
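The ARG substitution and stage tracking described above can be combined in one pass. A minimal sketch, assuming ARGs declared before a FROM apply to it (real Dockerfile scoping rules are subtler) and ignoring flags like --platform:

```python
import re

def parse_dockerfile_bases(text: str):
    """Extract base-image references from a Dockerfile.

    Substitutes ARG defaults into FROM lines and skips references to
    earlier named build stages. Returns (image, confident) pairs;
    confident is False when an ARG had no default to substitute.
    """
    args: dict[str, str] = {}
    stages: set[str] = set()
    bases: list[tuple[str, bool]] = []
    for line in text.splitlines():
        line = line.strip()
        if m := re.match(r"ARG\s+(\w+)(?:=(\S+))?", line, re.I):
            args[m.group(1)] = m.group(2) or ""
        elif m := re.match(r"FROM\s+(\S+)(?:\s+AS\s+(\w[\w.-]*))?", line, re.I):
            image, confident = m.group(1), True
            for name, default in args.items():
                if f"${{{name}}}" in image:
                    image = image.replace(f"${{{name}}}", default)
                    confident = confident and bool(default)
            if image in stages:
                continue  # FROM <earlier stage>, not an external image
            if m.group(2):
                stages.add(m.group(2))
            bases.append((image, confident))
    return bases
```

Filtering public images from the result (node:18 and friends) is a separate step, driven by which registries you consider internal.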
Docker Compose adds another layer. A docker-compose.yml might reference images directly:
services:
  api:
    image: registry.company.com/backend/api-service:latest
This is a dependency declaration in a completely different file format (YAML vs. Dockerfile syntax), but it represents the same kind of relationship.
The consumer-side problem. Knowing that repo X uses a Docker image is only half the story. You also need to know which repo builds that image. This isn’t declared in the Dockerfile. It’s usually in the CI pipeline config (docker build -t and docker push commands). Connecting the consumer side (“this repo uses image Y”) to the producer side (“this repo builds image Y”) requires cross-referencing information from different file types within the same repo.
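The producer side can only be recovered heuristically, for example by scanning CI script lines for docker push commands. A sketch of that idea (the function name is mine, and real pipelines bury the push behind variables and helper scripts often enough that this stays best-effort):

```python
import re

# Matches "docker push <image>" in CI script lines.
PUSH_RE = re.compile(r"docker\s+push\s+(\S+)")

def images_published_by(ci_script_lines: list[str]) -> set[str]:
    """Best-effort: which image paths does this repo's CI push?

    Tags are stripped so the path can be matched against consumer-side
    FROM references regardless of version.
    """
    images = set()
    for line in ci_script_lines:
        for m in PUSH_RE.finditer(line):
            image = m.group(1)
            # Drop the tag, but keep a registry port (registry:5000/img).
            name, sep, tag = image.rpartition(":")
            if sep and "/" not in tag:
                image = name
            images.add(image)
    return images
```

With a producer index like this per repo, the resolver can connect “repo X uses image Y” to “repo Z builds image Y.”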
CI templates: the invisible dependency layer
CI pipeline configs are arguably the most important dependency surface to track, and the most neglected.
GitLab CI
GitLab CI supports several forms of cross-repo inclusion:
include:
  - project: 'platform/ci-templates'
    ref: 'v2.0'
    file: '/terraform-plan.yml'
  - remote: 'https://gitlab.com/platform/ci-templates/-/raw/main/deploy.yml'
The project: form gives you a clean repo reference and an optional version ref. The remote: form gives you a URL that needs to be parsed back into a repo path.
There’s also trigger:, where one pipeline triggers another project’s pipeline, and image:, where a job specifies a Docker image. Both are additional dependency surfaces hiding in the CI config.
A subtle gotcha: GitLab CI supports !reference tags for reusing configuration fragments. These are valid YAML tags but not standard YAML. A naive YAML parser will choke on them. Your parser needs to handle this gracefully, either by pre-processing the file or by configuring the YAML loader to ignore unknown tags.
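With PyYAML (an implementation choice, not a requirement), the second option looks like this: register a fallback constructor so unknown tags load as plain values instead of raising. The loader class and extractor function are illustrative names.

```python
import yaml

class PermissiveLoader(yaml.SafeLoader):
    """SafeLoader variant that tolerates GitLab-specific tags like !reference."""

def _ignore_unknown(loader, node):
    # Load any unknown-tagged node as its plain scalar/sequence/mapping value.
    if isinstance(node, yaml.ScalarNode):
        return loader.construct_scalar(node)
    if isinstance(node, yaml.SequenceNode):
        return loader.construct_sequence(node)
    return loader.construct_mapping(node)

# None registers a catch-all constructor for tags with no explicit handler.
PermissiveLoader.add_constructor(None, _ignore_unknown)

def extract_includes(ci_yaml: str):
    """Return (project, ref, file) tuples from a .gitlab-ci.yml include: block."""
    doc = yaml.load(ci_yaml, Loader=PermissiveLoader) or {}
    includes = doc.get("include", [])
    if isinstance(includes, dict):
        includes = [includes]
    return [
        (inc.get("project"), inc.get("ref"), inc.get("file"))
        for inc in includes
        if isinstance(inc, dict) and "project" in inc
    ]
```

The same permissive loading helps elsewhere too: Helm templates, Ansible vault markers, and CRD manifests all carry tags or syntax a strict YAML parser rejects.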
GitHub Actions
GitHub Actions reusable workflows have their own syntax:
jobs:
  deploy:
    uses: org/shared-workflows/.github/workflows/deploy.yml@v2.0
And composite or JavaScript actions:
steps:
  - uses: org/custom-action@main
The uses: string encodes the org, repo, path, and ref in a single string. You need to parse it, separate the org/repo from the workflow path, handle the @ref suffix, and skip public actions (anything under actions/*).
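The splitting is mechanical once you know the shape of the string. A sketch (function name mine; local `./...` and `docker://` references are also excluded, since neither is a cross-repo dependency):

```python
def parse_uses(ref: str):
    """Split a GitHub Actions uses: string into (org, repo, path, ref).

    Returns None for public actions under actions/*, local paths, and
    docker:// references.
    """
    if ref.startswith(("./", "docker://")):
        return None
    target, _, version = ref.partition("@")
    parts = target.split("/")
    if len(parts) < 2 or parts[0] == "actions":
        return None
    org, repo = parts[0], parts[1]
    path = "/".join(parts[2:]) or None
    return org, repo, path, version or None
```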
The real challenge with CI templates is that they’re often pinned to a branch rather than a tag — @main instead of @v2.0. This means version tracking is inherently fuzzy. You can tell that repo A depends on repo B’s CI template, but “which version” is just “whatever’s on main right now.” This is exactly the kind of implicit, hard-to-track dependency that causes surprise breakage.
Python, Go, and npm: package ecosystems with their own quirks
These three share a common pattern — they have declared dependency manifests — but each has its own flavour of complexity.
Python
Python dependencies can be declared in at least four places: requirements.txt, pyproject.toml, setup.cfg, and setup.py. Each has a different syntax. And only some of those declarations point at internal packages — most are public PyPI packages that you should skip.
The interesting ones for cross-repo tracking are editable git installs:
-e git+https://gitlab.com/org/internal-utils.git@v1.2#egg=internal-utils
And pyproject.toml dependencies pointing at internal packages published to a private PyPI registry. Matching “package name in a requirements file” to “the repo that builds that package” requires knowing what each repo produces — which is why the artifact detection step (identifying that a given repo is the source of a Python package, based on its pyproject.toml or setup.py) is essential.
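The editable git install form is regular enough to pick apart with one pattern. A sketch, assuming the common pip VCS syntax (the function name is mine):

```python
import re

# Matches pip VCS requirements like:
#   -e git+https://gitlab.com/org/repo.git@v1.2#egg=name
VCS_RE = re.compile(
    r"(?:-e\s+)?git\+(?P<url>[^@#\s]+)(?:@(?P<ref>[^#\s]+))?(?:#egg=(?P<egg>\S+))?"
)

def parse_requirement_line(line: str):
    """Extract (repo_url, ref, package_name) from a requirements.txt line,
    or None for ordinary PyPI requirements."""
    m = VCS_RE.search(line)
    if not m:
        return None
    return m.group("url"), m.group("ref"), m.group("egg")
```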
Go
Go modules are cleaner than most. The go.mod file is authoritative:
require (
    gitlab.com/org/shared-lib v1.3.0
    github.com/org/internal-sdk v0.9.2
)
The module path is the repo path (more or less). The version is explicit. The main challenge is filtering: most require entries are public modules (github.com/stretchr/testify, golang.org/x/net). You need a way to identify which entries point to repos within your org and skip the rest.
replace directives add a wrinkle — they can redirect a module path to a local directory or a different remote path, which changes the effective dependency.
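The require filtering can be sketched without a full go.mod parser. This handles the common single-line and block forms but deliberately ignores replace directives; the internal_prefixes argument is an assumption about how your org's module paths look:

```python
def internal_requires(gomod: str, internal_prefixes: tuple[str, ...]):
    """Pick require entries from a go.mod that point inside the org.

    internal_prefixes might be ("gitlab.com/org/", "github.com/org/").
    Handles both single-line require directives and require ( ... ) blocks.
    """
    deps = []
    in_block = False
    for raw in gomod.splitlines():
        line = raw.split("//")[0].strip()  # drop comments
        if line.startswith("require ("):
            in_block = True
            continue
        if in_block and line == ")":
            in_block = False
            continue
        if line.startswith("require "):
            line = line.removeprefix("require ")
        elif not in_block:
            continue
        fields = line.split()
        if len(fields) >= 2 and fields[0].startswith(internal_prefixes):
            deps.append((fields[0], fields[1]))
    return deps
```

A production version would honour replace directives, since they change which repo the dependency actually resolves to.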
npm
package.json is the source of truth. For internal cross-repo dependencies, you’re looking for scoped packages or git URL references:
{
  "dependencies": {
    "@company/ui-components": "^2.1.0",
    "@company/shared-utils": "git+https://github.com/org/shared-utils.git#v1.0"
  }
}
Scoped packages (@company/...) from a private registry need to be matched to the repo that publishes them. Git URL references can be parsed directly. Public npm packages are filtered out.
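The filtering logic is a few lines once package.json is parsed. A sketch, where the @company scope is an assumption about your org's naming convention:

```python
import json

def internal_npm_deps(package_json: str, scope: str = "@company"):
    """Pick dependencies that are org-internal: packages under the given
    scope, or git URL specifiers that can be resolved to a repo directly."""
    doc = json.loads(package_json)
    deps = {}
    for section in ("dependencies", "devDependencies"):
        for name, spec in doc.get(section, {}).items():
            if name.startswith(scope + "/") or spec.startswith("git+"):
                deps[name] = spec
    return deps
```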
One thing all three ecosystems share: the link between “this repo consumes package X” and “this repo produces package X” is not always obvious from a single file. You need to first discover what each repo publishes (by reading its pyproject.toml, go.mod module declaration, or package.json name field), then match consumers to producers across the org.
Ansible and Helm: infrastructure-specific dependency patterns
Ansible
Ansible dependencies appear in requirements.yml (roles and collections), galaxy.yml (collection metadata and dependencies), and meta/main.yml (role dependencies).
# requirements.yml
roles:
  - src: git+https://gitlab.com/org/ansible-roles/nginx.git
    version: v2.0
collections:
  - name: company.shared_collection
    version: ">=1.0"
The complexity here is similar to Python: matching a Galaxy-style name (company.shared_collection) to the repo that publishes it requires artifact detection. Git URL sources are more straightforward but still need normalisation.
Helm
Helm chart dependencies are declared in Chart.yaml:
dependencies:
  - name: redis
    version: "17.x"
    repository: "https://charts.bitnami.com/bitnami"
  - name: auth-service
    version: "1.2.0"
    repository: "https://helm.internal.company.com"
Public chart repositories (Bitnami, stable, etc.) are filtered out. Internal repository references need to be matched to the repos that build those charts. And file:// references to local charts in the same repo are internal — not cross-repo dependencies.
Kubernetes and Kustomize: the deployment layer
Kubernetes manifests and Kustomize configurations add another dependency surface — one that’s often overlooked because it’s at the “deployment” end of the pipeline rather than the “build” end.
Kustomize’s kustomization.yaml can reference resources and bases from other repos:
resources:
  - https://github.com/org/k8s-base//manifests/monitoring?ref=v1.0
bases:
  - github.com/org/shared-platform//overlays/production
These are cross-repo references with the same double-slash subdirectory convention as Terraform. They need the same kind of URL normalisation and repo matching.
Kubernetes manifests themselves reference Docker images in container specs, Helm charts in HelmRelease custom resources, and ConfigMaps or Secrets by name. The image references connect back to the Docker dependency surface — another example of how the graph crosses ecosystem boundaries.
The real challenge: resolution
Parsing is the visible work. But the step that makes or breaks the dependency graph is resolution: taking a parsed reference and matching it to an actual repo or artifact in your org.
A Dockerfile says FROM registry.company.com/platform/base-image:v2. Which repo builds that image? The registry path might not match the repo path. The image might be built by a CI pipeline in a repo named docker-base-images, pushed to a registry path of platform/base-image. Connecting those requires understanding the producing side, not just the consuming side.
This is why artifact detection matters so much. Before you can resolve “this repo uses artifact X,” you need to know “that repo produces artifact X.” For Docker images, this means scanning CI configs for docker push commands. For Python packages, it means reading pyproject.toml to find the package name. For Helm charts, it means reading Chart.yaml. Each ecosystem has a different way of declaring what a repo produces, and the resolver needs to cross-reference all of it.
Resolution also has to deal with ambiguity. A Docker image reference like base-image:v2, without a full registry prefix, could match multiple repos. A Python package name might be normalised differently (my_package vs. my-package). Terraform module paths might use SSH vs. HTTPS URLs for the same repo. The resolver needs normalisation rules, fuzzy matching strategies, and a confidence model — because some matches are certain and others are best-effort.
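One way to sketch the normalisation-plus-confidence idea: collapse names the way PEP 503 does for Python packages, then look up a producer index, falling back to fuzzier matching at lower confidence. The function names and the "high"/"medium" labels are illustrative, not Riftmap's actual model:

```python
import re

def normalize_name(name: str) -> str:
    """PEP 503-style normalisation: my_package, My.Package and my-package
    all collapse to the same key."""
    return re.sub(r"[-_.]+", "-", name).lower()

def resolve(reference: str, producers: dict[str, str]):
    """Match a parsed reference against a producer index
    (normalised name -> repo), returning (repo, confidence)."""
    key = normalize_name(reference)
    if key in producers:
        return producers[key], "high"
    # Fuzzy fallback, e.g. "base-image" against "platform/base-image".
    candidates = [repo for name, repo in producers.items() if name.endswith(key)]
    if len(candidates) == 1:
        return candidates[0], "medium"
    return None, "unresolved"
```

The important design choice is that an ambiguous match returns unresolved rather than guessing: a wrong edge costs more trust than a missing one.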
Getting this right is the difference between a dependency graph that people trust and one they abandon after finding three false edges.
Why cross-ecosystem matters
Any individual ecosystem’s parsing problem is tractable. The Python community could build a Python dependency tracker. The Terraform community could build a module graph tool. And some have.
But the actual dependency graph in a real organisation doesn’t respect ecosystem boundaries. A Terraform module produces infrastructure that a Docker image is built on, which a CI pipeline deploys, which references a Helm chart, which pulls a shared Ansible role.
If you only see the Terraform slice, you miss the Docker dependency that’s about to break your deployment pipeline. If you only see Docker, you miss the Terraform module change that will change the infrastructure your image runs on.
The value of cross-ecosystem discovery isn’t additive — it’s multiplicative. Each new ecosystem you add doesn’t just give you more nodes in the graph. It reveals connections between ecosystems that were previously invisible. Those connections are exactly where surprise breakage lives.
What I learned building this
I’ve been building Riftmap to solve this problem — auto-discovering cross-repo dependencies across all the ecosystems described above and presenting them as a queryable, visual graph with blast radius analysis.
A few things I’ve learned along the way:
The parser is the easy part. Extracting source = "..." from a Terraform file is straightforward. The hard parts are resolution (matching references to actual repos), freshness (keeping the graph current as repos change), and staleness detection (knowing when a previously-discovered dependency no longer exists because a repo was renamed or archived).
Confidence matters. Not all discovered dependencies are equally certain. A Terraform source with a full git URL and a pinned ref is high confidence. A Docker FROM with variable substitution and no default value is low confidence. Exposing this confidence to users — rather than pretending everything is equally certain — is critical for trust.
The graph is the product, not the report. Every DIY solution I’ve seen generates a static output — a CSV, a SQLite dump, a rendered image. The real value comes when the graph is interactive, queryable, and always current. “Show me every repo affected if I change this module” should be a click, not a pipeline run.
If you’re building infrastructure for a platform team and this problem resonates, I’d love to hear how it shows up in your stack. The edge cases are different for every org, and understanding them is how the tooling gets better.
You can see more at riftmap.dev, or reach me at [email protected].