Methodology

How this dataset is collected, evaluated, and maintained. Read this before citing.

1. What this dataset is

This is an open-source, version-controlled dataset tracking AI-relevant datacenter infrastructure across Africa. Each facility is a YAML file in a public git repository. Every substantive data point carries a source citation with a retrieval date, a trust tier indicating the source's reliability, and a confidence level indicating how well the source supports the specific claim. The dataset is designed to be cited directly by journalists, policy analysts, and researchers who need to reference specific facts with attribution to a primary source.

The provenance chain — from a data point to the field, to the source ID, to the source URL and trust tier — is the primary product. A fact without a source is not in this dataset.
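The chain can be followed programmatically. The sketch below is a minimal illustration in Python, assuming a record has already been parsed from YAML into plain dictionaries. Apart from sourced_fields, capacity.it_load_mw, and the tier and confidence names, every key and value here (sources, source_ids, url, trust_tier, retrieved, the facility ID, the example.com URL) is a hypothetical stand-in, not the dataset's actual schema.

```python
# Hypothetical record, already parsed from YAML into plain dicts.
record = {
    "id": "example-facility",
    "capacity": {"it_load_mw": 20},
    "sourced_fields": {
        "capacity.it_load_mw": {"confidence": "reported", "source_ids": ["src-001"]},
    },
    "sources": {
        "src-001": {
            "url": "https://example.com/press-release",
            "trust_tier": "company_official",
            "retrieved": "2026-01-15",
        },
    },
}


def provenance(record, field_path):
    """Return the confidence and the full source entries backing one field."""
    entry = record["sourced_fields"].get(field_path)
    if entry is None:
        # A fact without a source is not in the dataset.
        raise KeyError(f"{field_path} has no source citation")
    sources = [record["sources"][sid] for sid in entry["source_ids"]]
    return {"confidence": entry["confidence"], "sources": sources}


chain = provenance(record, "capacity.it_load_mw")
print(chain["confidence"], chain["sources"][0]["trust_tier"])
```

Asking for a field with no entry in sourced_fields raises an error rather than returning an unsourced value, which mirrors the rule stated above.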

2. What this dataset is not

This is not a comprehensive commercial database. It is a single-maintainer research project that prioritises sourcing rigour over completeness. Gaps are documented openly. A facility that is not in this dataset may exist but lack sufficient publicly available primary sources to meet the sourcing standard — its absence is not a statement about its importance.

This is not a real-time feed. Records are updated manually when new information is found. The last_verified field on each record shows the date of the most recent review. Records whose last_verified date is more than six months old are flagged as potentially stale on the detail page.

This is not an analytical publication. The analysis block on some records contains the maintainer's interpretation, which is explicitly labelled as editorial and should be cited as such. The primary structured fields are sourced facts.

3. Scope criteria

The following are in scope:

Facilities in South Africa, Kenya, and Nigeria (v1 scope; expanding over time)
Hyperscaler cloud regions (AWS, Azure, Google Cloud, and equivalents)
Large carrier-neutral colocation campuses that serve as hyperscaler on-ramps
Purpose-built AI compute and GPU cluster facilities
Facilities announced, under construction, or operationally expanded within the last three years

The following are out of scope:

Retail colocation (small tenants, shared racks without hyperscaler significance)
Edge nodes and CDN points of presence
Enterprise on-premise installations
Facilities with no recent activity and no hyperscaler or AI relevance

The "AI-relevant" standard is applied conservatively. A facility must be designed for, or publicly described as serving, significant cloud or AI compute workloads. The workload_profile field records what a facility does according to public statements; the infrastructure.cooling_adequacy_for_ai field records whether the physical infrastructure is capable of supporting AI workloads.

4. Source trust tiers

Every source in the dataset is assigned one of seven trust tiers, ranked from most to least reliable. The tier does not determine whether a fact is included — it determines what confidence level is appropriate and how strongly a single source can support a claim.

regulatory: Government filings, planning applications, environmental impact assessments, and official gazette notices. These are primary legal records whose filers face perjury or regulatory liability for false statements; they are the most reliable source in this dataset. Example: an EIA submission to the South African Department of Forestry, Fisheries and the Environment listing the facility's proposed power draw.
company_official: Press releases, investor relations filings, annual reports, and product or marketing pages published by the operating or owning company. Self-reported but directly attributable to the company; generally reliable for existence and basic facts, less reliable for forward-looking capacity claims. Example: a Digital Realty investor relations press release confirming the completion of its acquisition of a majority stake in Teraco.
trade_press: Specialist industry outlets with dedicated datacenter and infrastructure reporters, editorial standards, and correction policies. Includes DatacenterDynamics, Data Center Knowledge, Capacity Media, and equivalent publications. Example: a DatacenterDynamics article reporting on a new facility announcement with named company spokespeople.
satellite: Commercial satellite imagery (Planet Labs, Maxar, Google Earth Engine) analysed for construction activity, facility footprint, or physical evidence of a claimed development. Used to verify or contradict reported construction status. Example: a Planet Labs image showing foundation work at a site described as "under construction."
local_journalism: General-purpose local news outlets covering the region. Editorial standards vary widely. Used as a research lead, for corroboration, or for facts that no higher-tier source has confirmed. A single local journalism source is not sufficient for a confirmed rating. Example: a Business Day or TechCabal article reporting on an announcement based on a company statement.
industry_report: Research and analyst reports from firms such as CBRE, JLL, Synergy Research, IDC, or Omdia. Methodology is often not fully disclosed; primary data sourcing may not be transparent. Often paywalled, which limits verification by third parties. Example: a Synergy Research market sizing report listing South African colocation capacity.
social_media: LinkedIn posts, Twitter/X threads, job listings, and forum discussions. Lowest reliability; used only as a research lead to identify facilities for further investigation. A social media source is never used alone for a structured data field. Example: a LinkedIn post by a construction manager referencing work on a new facility, used to identify a potential record to research.
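
The ordering and the two single-source rules stated in the list above can be expressed as a short Python sketch. The tier names are taken from the list; the function names are hypothetical, and treating social media sources as leads that never count toward a confirmed rating is an assumption drawn from the wording above, not a documented rule.

```python
# Tier names in the order given above, most reliable first.
TRUST_TIERS = [
    "regulatory", "company_official", "trade_press", "satellite",
    "local_journalism", "industry_report", "social_media",
]


def tier_rank(tier):
    """Lower rank means more reliable (0 is regulatory)."""
    return TRUST_TIERS.index(tier)


def can_support_field(source_tiers):
    """A social media source is never used alone for a structured field."""
    return any(t != "social_media" for t in source_tiers)


def can_be_confirmed(source_tiers):
    """A single local journalism source is not enough for 'confirmed'.

    Assumption: social media sources are research leads only, so they are
    ignored when counting support.
    """
    substantive = [t for t in source_tiers if t != "social_media"]
    if not substantive:
        return False
    return substantive != ["local_journalism"]


print(can_be_confirmed(["local_journalism"]))  # False
print(can_be_confirmed(["regulatory"]))        # True
```

These checks cover only the tier rules; whether a value is actually rated confirmed also depends on what the source says, as described in the next section.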

5. Confidence levels

Every entry in sourced_fields carries one of four confidence levels. The level reflects how well the available sources support the specific value, not just the existence of a source.

confirmed: The value is directly stated in at least one primary source with no ambiguity. The source explicitly names the number, date, or entity in question. Suitable for direct citation.
reported: The value is stated in a source but has not been independently cross-checked against a second primary source. Suitable for citation with the caveat that the figure has not been independently verified.
estimated: The value was calculated or inferred rather than directly stated in a source. The methodology for the estimate is documented in the sourced_fields[field].note field. Common for derived figures such as grid carbon intensity (inferred from national grid data) or annual water use (estimated from IT load and cooling type).
disputed: Conflicting sources exist and give materially different values, or a source makes a claim that a named party has actively denied. Both positions are documented. See Section 8 for how disputed values are handled.
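
A consumer of the data might check these rules before relying on a value. The following Python sketch assumes each sourced_fields entry is a dictionary with confidence and note keys; the function name and the example entries are hypothetical, and the dataset's own validation may differ.

```python
CONFIDENCE_LEVELS = {"confirmed", "reported", "estimated", "disputed"}


def check_sourced_field(path, entry):
    """Return a list of problems found in one sourced_fields entry."""
    problems = []
    confidence = entry.get("confidence")
    if confidence not in CONFIDENCE_LEVELS:
        problems.append(f"{path}: unknown confidence {confidence!r}")
    # An estimate must document how it was derived.
    if confidence == "estimated" and not entry.get("note"):
        problems.append(f"{path}: estimated value needs a methodology note")
    return problems


ok = check_sourced_field(
    "capacity.it_load_mw", {"confidence": "reported"}
)
missing_note = check_sourced_field(
    "example.annual_water_use", {"confidence": "estimated"}
)
print(ok, missing_note)
```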

6. The claims model

Some assertions about a facility — whether a specific hyperscaler is an anchor tenant, whether a government contract exists, whether a foreign investor is involved — are contested, unconfirmed, or evolving. Placing these directly in structured fields (such as tenants or governance.foreign_capital_origin) would imply a certainty that does not exist. The v2 schema introduces a separate claims array for exactly this kind of assertion.

A claim carries its own status (confirmed, reported, disputed, denied, withdrawn), the source IDs supporting it, and a separate denial_status field tracking whether either or both parties have explicitly denied the claim. A disputed claim is visually distinct from confirmed facts on the detail page.

Journalists should treat a claim with status reported or disputed differently from a fact in a sourced field with confidence confirmed. The distinction is intentional. Do not cite a disputed claim as a fact; cite it as a claim, note its status, and mention that parties have or have not responded.
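
As an illustration of that distinction, the Python sketch below models a claim and words it as a claim rather than a fact. The status values and the entity, source_ids, and denial_status names come from this section and the field-path examples; the assertion field, the denial_status values, and the describe function are hypothetical.

```python
from dataclasses import dataclass, field

CLAIM_STATUSES = {"confirmed", "reported", "disputed", "denied", "withdrawn"}


@dataclass
class Claim:
    entity: str          # e.g. a hyperscaler named as anchor tenant
    assertion: str       # hypothetical free-text field
    status: str
    source_ids: list = field(default_factory=list)
    denial_status: str = "none"  # hypothetical values: none, one_party, both_parties

    def __post_init__(self):
        if self.status not in CLAIM_STATUSES:
            raise ValueError(f"unknown claim status: {self.status}")


def describe(claim):
    """Word a claim as a claim, with its status and denial status attached."""
    return (
        f"Claim ({claim.status}): {claim.entity} {claim.assertion}. "
        f"Denial status: {claim.denial_status}."
    )


claim = Claim(
    entity="Example Cloud",
    assertion="is an anchor tenant",
    status="disputed",
    source_ids=["src-002"],
    denial_status="one_party",
)
print(describe(claim))
```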

Migration of speculation from earlier records into the claims model is an ongoing process. Records that have not yet been migrated may contain speculation in the notes or workload_profile.notes fields rather than as formal claims. The migration TODO report at dist/migration-todos.md lists records pending that conversion.

7. The analysis block

Some facility records contain an analysis block. Everything in that block — strategic significance level, rationale, and the "why this matters" bullet list — is the maintainer's interpretation, not a sourced fact. The block carries a mandatory is_editorial: true flag and records who assessed it and when.

When citing this dataset, treat sourced fields and editorial analysis differently:

Sourced fields (e.g. capacity.it_load_mw with confidence reported) can be cited as: "According to [source], as reported by the Africa AI Datacenter Tracker…"
Editorial analysis should be cited as: "The Africa AI Datacenter Tracker assessed this facility as strategically significant because…" — making clear the assessment is the maintainer's, not the source's.

The significance summary bullets at the top of each facility page are drawn from the analysis block. They are provided as a shorthand for journalists and policymakers who need a quick "why this matters" hook. They carry the same editorial caveat.

8. How we handle conflicting sources

When two sources give materially different values for the same field, the field is assigned confidence: disputed. Both source IDs are listed, and the note field explains the discrepancy. The value stored in the primary field is the one with stronger source support; the alternative is documented in the note.

Contested claims about entities (anchor tenants, investors, government contracts) are moved into the claims model rather than stored as disputed values in sourced fields. This keeps the distinction clear: a disputed fact is a case where sources disagree about a number or date; a disputed claim is a case where an assertion about an entity's involvement has not been confirmed or has been denied.

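The maintainer does not adjudicate which of two conflicting sources is correct. Both positions are documented, and the choice of primary value reflects source support, not a judgement about which source is right. If a field would require a judgement call to populate, it is left absent or marked unknown.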
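
The handling of a disputed value in a sourced field can be sketched in Python as follows. The entry layout (confidence, source_ids, note) follows the description in this section, but the function name, the source IDs, and the figures are hypothetical, and the decision about which value has stronger source support is deliberately left to the caller.

```python
def mark_disputed(stronger, weaker):
    """Build the sourced_fields entry for two conflicting values.

    Each argument is a (value, source_id) pair. Deciding which pair has
    stronger source support is left to the caller.
    """
    value, source_id = stronger
    alt_value, alt_source_id = weaker
    entry = {
        "confidence": "disputed",
        "source_ids": [source_id, alt_source_id],
        # The note keeps the alternative value visible to readers.
        "note": f"{alt_source_id} gives {alt_value}; {source_id} gives {value}.",
    }
    return value, entry


value, entry = mark_disputed((20, "src-001"), (35, "src-002"))
print(value, entry["note"])
```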

9. Update cadence

This dataset is currently maintained by a single author. Update cadence is opportunistic, not scheduled: records are updated when new information is found, when a source is published that materially changes a field, or when a reader flags an error.

The target cadence for re-verification of individual records is every six months. Records not verified within six months are flagged as stale on the detail page. A re-verification means reviewing the primary sources, checking whether the facility's status has changed, and updating last_verified only when the review is complete.
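
A reader working from the raw records could reproduce the staleness flag with a check like the one below. This is a sketch in Python that assumes last_verified is an ISO 8601 date string and approximates "six months" as 183 days; the site's own rule may be calculated differently.

```python
from datetime import date, timedelta

# Assumption: "six months" is approximated as 183 days.
STALE_AFTER = timedelta(days=183)


def is_stale(last_verified, today):
    """True if the record was last verified more than about six months ago."""
    return today - date.fromisoformat(last_verified) > STALE_AFTER


print(is_stale("2026-06-29", date(2026, 12, 1)))  # False
print(is_stale("2026-01-01", date(2026, 12, 1)))  # True
```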

Breaking developments (a facility opening, a cancelled project, a public denial of a reported claim) are prioritised and updated as quickly as possible. The git commit history provides a full audit trail of every change.

10. Known limitations

Naming the gaps is part of the sourcing standard. The following are known limitations of this dataset at the time of writing:

Geographic coverage is incomplete. The current dataset covers South Africa, Kenya, and Nigeria. Egypt, Morocco, Ghana, Ethiopia, Rwanda, and Côte d'Ivoire all have active datacenter development that is not yet tracked. Coverage will expand as sourcing allows.
Some source URLs are placeholders. A small number of records have example.com placeholder URLs where the original article was found but the URL was not recorded. These are flagged in the record's notes and in the maintenance report. They do not affect the validity of the structured data fields, which are sourced from the actual article content.
Archived URLs are missing on most records. Archiving via the Wayback Machine is currently done by hand in a browser and is not automated in this project. Most source URLs have not yet been archived, meaning source content could disappear. This is an ongoing maintenance task.
Capacity figures are often announced, not verified. Most capacity.it_load_mw values are reported from company announcements, not from regulatory filings or satellite verification. They represent planned capacity at full buildout, which may never be reached.
Tenant information is frequently incomplete. Hyperscalers do not always disclose individual facility relationships. Tenant records are based on company product pages and press releases, which may be incomplete.
No satellite verification yet. The schema includes fields for satellite review status, but no facility has been independently verified against satellite imagery. Construction status claims are based entirely on press reporting.
Speculation is being migrated progressively. Earlier records may contain unconfirmed speculation in free-text fields that has not yet been formally converted to the claims model. The migration TODO report identifies these records.

11. External dataset compatibility

The dataset is designed to be joinable with external infrastructure databases. Each record can carry an external_ids object mapping recognised external dataset identifiers to their values. Currently recognised keys are:

epoch_ai — Epoch AI large-scale AI training dataset ID
wikidata — Wikidata QID (preferred for stable cross-referencing)
dgtl_infra — DGTL Infra datacenter database
datacentermap — datacentermap.com facility ID
baxtel — Baxtel datacenter directory ID

An Epoch AI-compatible CSV export is available at dist/exports/epoch-compatible.csv and is generated by running npm run export:epoch. The export follows Epoch's data_centers.csv column format and includes only facilities with at least one capacity figure (it_load_mw or h100_equivalent_gpus). Facilities without capacity data are omitted from the export with a logged reason.
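
The inclusion rule can be illustrated with a short Python sketch. It is not the export script itself, which runs through npm; the function name, the placement of h100_equivalent_gpus under capacity, and the output keys are assumptions, and the sketch does not attempt to reproduce Epoch's column names.

```python
def split_for_export(records):
    """Split records into exportable rows and omissions with a reason."""
    rows, skipped = [], []
    for rec in records:
        capacity = rec.get("capacity", {})
        mw = capacity.get("it_load_mw")
        gpus = capacity.get("h100_equivalent_gpus")
        if mw is None and gpus is None:
            # Omitted facilities are recorded with the reason for omission.
            skipped.append((rec["id"], "no capacity figure"))
            continue
        rows.append({"id": rec["id"], "it_load_mw": mw, "h100_equivalent_gpus": gpus})
    return rows, skipped


rows, skipped = split_for_export([
    {"id": "facility-a", "capacity": {"it_load_mw": 20}},
    {"id": "facility-b", "capacity": {}},
])
print(rows, skipped)
```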

Field paths in sourced_fields use dot notation for nested fields (e.g. capacity.it_load_mw) and bracket notation for array elements (e.g. operators[0], claims[0].entity).
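
A path in this notation can be resolved against a parsed record with a small helper. The Python sketch below is one possible approach, not the project's own resolver; the function name and the example record are hypothetical, and malformed paths are not validated.

```python
import re

# Matches either a key name or a bracketed array index such as [0].
_TOKEN = re.compile(r"([^.\[\]]+)|\[(\d+)\]")


def resolve(record, path):
    """Follow a path such as 'claims[0].entity' through nested dicts and lists."""
    current = record
    for key, index in _TOKEN.findall(path):
        current = current[int(index)] if index else current[key]
    return current


example = {
    "capacity": {"it_load_mw": 20},
    "operators": ["Example Operator"],
    "claims": [{"entity": "Example Cloud"}],
}
print(resolve(example, "capacity.it_load_mw"))
print(resolve(example, "operators[0]"))
print(resolve(example, "claims[0].entity"))
```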

12. How to cite

When citing a specific facility record, include:

The facility name and ID (the ID is the stable URL slug)
The last_verified date of the record you consulted
The dataset URL and version (the git commit hash is the version)

Suggested citation format:

Africa AI Datacenter Tracker. "[Facility Name]" (ID: [facility-id]). Record last verified [last_verified date]. https://africa-ai-datacenters.example/facilities/[facility-id]. Retrieved [your retrieval date].

Example for a facility record last verified 2026-06-29:

Africa AI Datacenter Tracker. "Cosmas Data City Cape Town Campus" (ID: cavaleros-cdc-cape-town-campus). Record last verified 2026-06-29. https://africa-ai-datacenters.example/facilities/cavaleros-cdc-cape-town-campus. Retrieved 2026-06-29.
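
For anyone generating citations in bulk, the suggested format can be filled in with a small helper. The Python sketch below reproduces the example above; the function name is hypothetical and the base URL is the one shown in the suggested format.

```python
BASE_URL = "https://africa-ai-datacenters.example/facilities/"


def format_citation(name, facility_id, last_verified, retrieved):
    """Fill in the suggested citation format for one facility record."""
    return (
        f'Africa AI Datacenter Tracker. "{name}" (ID: {facility_id}). '
        f"Record last verified {last_verified}. "
        f"{BASE_URL}{facility_id}. Retrieved {retrieved}."
    )


citation = format_citation(
    "Cosmas Data City Cape Town Campus",
    "cavaleros-cdc-cape-town-campus",
    "2026-06-29",
    "2026-06-29",
)
print(citation)
```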

For the dataset as a whole (rather than a specific record), cite the git repository with the commit hash at the time of access. The dataset is published under CC-BY 4.0; attribution to "Africa AI Datacenter Tracker" is required.
