Methodology
How this dataset is collected, evaluated, and maintained. Read this before citing.
1. What this dataset is
This is an open-source, version-controlled dataset tracking AI-relevant datacenter infrastructure across Africa. Each facility is a YAML file in a public git repository. Every substantive data point carries a source citation with a retrieval date, a trust tier indicating the source's reliability, and a confidence level indicating how well the source supports the specific claim. The dataset is designed to be cited directly by journalists, policy analysts, and researchers who need to reference specific facts with attribution to a primary source.
The provenance chain — from a data point to the field, to the source ID, to the source URL and trust tier — is the primary product. A fact without a source is not in this dataset.
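The provenance chain can be sketched as a lookup from a field path to its source entries. This is a minimal illustration, not the project's actual code: the record layout, the Source class, the provenance helper, the facility ID, and the placeholder URL are all assumptions made for the example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Source:
    url: str
    trust_tier: str
    retrieved: str  # ISO date on which the source was retrieved


# Hypothetical record shape; the real YAML layout may differ.
record = {
    "id": "example-facility",
    "capacity": {"it_load_mw": 20},
    "sourced_fields": {
        "capacity.it_load_mw": {"confidence": "reported", "sources": ["src-001"]},
    },
    "sources": {
        # Placeholder URL, used for illustration only.
        "src-001": Source("https://example.org/press-release", "company_official", "2026-01-15"),
    },
}


def provenance(record: dict, field_path: str) -> list[Source]:
    """Follow field -> source IDs -> source entries; fail loudly if unsourced."""
    entry = record["sourced_fields"].get(field_path)
    if entry is None or not entry["sources"]:
        raise LookupError(f"{field_path} has no source citation")
    return [record["sources"][source_id] for source_id in entry["sources"]]
```

Raising an error for an unsourced field mirrors the rule that a fact without a source is not in the dataset.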
2. What this dataset is not
This is not a comprehensive commercial database. It is a single-maintainer research project that prioritises sourcing rigour over completeness. Gaps are documented openly. A facility that is not in this dataset may exist but lack sufficient publicly available primary sources to meet the sourcing standard — its absence is not a statement about its importance.
This is not a real-time feed. Records are updated manually when new information is found. The last_verified field on each record shows the date of the most recent review. Records older than six months are flagged as potentially stale on the detail page.
This is not an analytical publication. The analysis block on some records contains the maintainer's interpretation, which is explicitly labelled as editorial and should be cited as such. The primary structured fields are sourced facts.
3. Scope criteria
The following are in scope:
- Facilities in South Africa, Kenya, and Nigeria (v1 scope; expanding over time)
- Hyperscaler cloud regions (AWS, Azure, Google Cloud, and equivalents)
- Large carrier-neutral colocation campuses that serve as hyperscaler on-ramps
- Purpose-built AI compute and GPU cluster facilities
- Facilities announced, under construction, or operationally expanded within the last three years
The following are out of scope:
- Retail colocation (small tenants, shared racks without hyperscaler significance)
- Edge nodes and CDN points of presence
- Enterprise on-premise installations
- Facilities with no recent activity and no hyperscaler or AI relevance
The "AI-relevant" standard is applied conservatively. A facility must be designed for, or publicly described as serving, significant cloud or AI compute workloads. The workload_profile field records what a facility actually does based on public statements; the infrastructure.cooling_adequacy_for_ai field records whether the physical infrastructure is capable of supporting AI workloads.
4. Source trust tiers
Every source in the dataset is assigned one of seven trust tiers, ranked from most to least reliable. The tier does not determine whether a fact is included — it determines what confidence level is appropriate and how strongly a single source can support a claim.
- regulatory
- Government filings, planning applications, environmental impact assessments, and official gazette notices. These are primary legal records subject to perjury or regulatory liability; they are the most reliable source in this dataset. Example: a South African Department of Environment EIA submission listing the facility's proposed power draw.
- company_official
- Press releases, investor relations filings, annual reports, and product or marketing pages published by the operating or owning company. Self-reported but directly attributable to the company; generally reliable for existence and basic facts, less reliable for forward-looking capacity claims. Example: an Equinix investor relations press release confirming the completion of its Teraco acquisition.
- trade_press
- Specialist industry outlets with dedicated datacenter and infrastructure reporters, editorial standards, and correction policies. Includes DatacenterDynamics, DataCenter Knowledge, Capacity Media, and equivalent publications. Example: a DatacenterDynamics article reporting on a new facility announcement with named company spokespeople.
- satellite
- Commercial satellite imagery (Planet Labs, Maxar, Google Earth Engine) analysed for construction activity, facility footprint, or physical evidence of a claimed development. Used to verify or contradict reported construction status. Example: a Planet Labs image showing foundation work at a site described as "under construction."
- local_journalism
- General-purpose local news outlets covering the region. Editorial standards vary widely. Used as a research lead, for corroboration, or for facts that no higher-tier source has confirmed. A single local journalism source is not sufficient for a confirmed rating. Example: a Business Day or TechCabal article reporting on an announcement based on a company statement.
- industry_report
- Research and analyst reports from firms such as CBRE, JLL, Synergy Research, IDC, or Omdia. Methodology is often not fully disclosed; primary data sourcing may not be transparent. Often paywalled, which limits verification by third parties. Example: a Synergy Research market sizing report listing South African colocation capacity.
- social_media
- LinkedIn posts, Twitter/X threads, job listings, and forum discussions. Lowest reliability; used only as a research lead to identify facilities for further investigation. A social media source is never used alone for a structured data field. Example: a LinkedIn post by a construction manager referencing work on a new facility, used to identify a potential record to research.
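The ranking above can be modelled as an ordered enumeration. The sketch below is an assumption about how one might encode it, not the project's code: the TrustTier class and both helper functions are hypothetical names, and only the tier order and the social media rule come from the text.

```python
from enum import IntEnum


class TrustTier(IntEnum):
    """Lower value = more reliable, following the order listed above."""
    REGULATORY = 1
    COMPANY_OFFICIAL = 2
    TRADE_PRESS = 3
    SATELLITE = 4
    LOCAL_JOURNALISM = 5
    INDUSTRY_REPORT = 6
    SOCIAL_MEDIA = 7


def strongest(tiers: list[TrustTier]) -> TrustTier:
    """Return the most reliable tier among the sources cited for a value."""
    return min(tiers)


def usable_for_structured_field(tiers: list[TrustTier]) -> bool:
    """social_media is a research lead only: it never supports a field alone."""
    return any(tier is not TrustTier.SOCIAL_MEDIA for tier in tiers)
```

A tier string from a record can be converted with TrustTier["trade_press".upper()], since the member names match the tier identifiers.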
5. Confidence levels
Every entry in sourced_fields carries one of four confidence levels. The level reflects how well the available sources support the specific value, not just the existence of a source.
- confirmed
- The value is directly stated in at least one primary source with no ambiguity. The source explicitly names the number, date, or entity in question. Suitable for direct citation.
- reported
- The value is stated in a source but has not been independently cross-checked against a second primary source. Suitable for citation with the caveat that the figure has not been independently verified.
- estimated
- The value was calculated or inferred rather than directly stated in a source. The methodology for the estimate is documented in the sourced_fields[field].note field. Common for derived figures such as grid carbon intensity (inferred from national grid data) or annual water use (estimated from IT load and cooling type).
- disputed
- Conflicting sources exist and give materially different values, or a source makes a claim that a named party has actively denied. Both positions are documented. See Section 8 for how disputed values are handled.
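Two of the rules stated in this document lend themselves to a mechanical check: a single local journalism source cannot support a confirmed rating, and an estimated value needs its methodology in the note. The following is a hedged sketch of such a check; the function name and its arguments are hypothetical, and the dataset's real validation may work differently.

```python
CONFIDENCE_LEVELS = ("confirmed", "reported", "estimated", "disputed")


def confidence_problems(confidence: str, source_tiers: list[str], note: str = "") -> list[str]:
    """Return human-readable problems; an empty list means no rule was violated."""
    problems = []
    if confidence not in CONFIDENCE_LEVELS:
        problems.append(f"unknown confidence level: {confidence!r}")
    if not source_tiers:
        problems.append("no source cited")
    # A lone local_journalism source is not enough for 'confirmed'.
    if confidence == "confirmed" and source_tiers == ["local_journalism"]:
        problems.append("a single local_journalism source cannot support 'confirmed'")
    # Estimates must document how the value was derived.
    if confidence == "estimated" and not note.strip():
        problems.append("'estimated' needs the methodology in the note field")
    return problems
```

Returning a list rather than raising on the first failure lets a maintenance report show every problem with a field at once.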
6. The claims model
Some assertions about a facility (whether a specific hyperscaler is an anchor tenant, whether a government contract exists, whether a foreign investor is involved) are contested, unconfirmed, or evolving. Placing these directly in structured fields (such as tenants or governance.foreign_capital_origin) would imply a certainty that does not exist. The v2 schema introduces a separate claims array for exactly this kind of assertion.
A claim carries its own status (confirmed, reported, disputed, denied, withdrawn), the source IDs supporting it, and a separate denial_status field tracking whether either or both parties have explicitly denied the claim. A disputed claim is visually distinct from confirmed facts on the detail page.
Journalists should treat a claim with status reported or disputed differently from a fact in a sourced field with confidence confirmed. The distinction is intentional. Do not cite a disputed claim as a fact; cite it as a claim, note its status, and mention that parties have or have not responded.
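As an illustration of the citing guidance above, a claim can be modelled as a small record whose citation text always carries its status. This is a sketch under assumptions: the Claim class, the assertion field, the cite_claim helper, and the three denial_status values ("none", "one_party", "both_parties") are hypothetical; only the five status values and the existence of denial_status come from the text.

```python
from dataclasses import dataclass, field

CLAIM_STATUSES = {"confirmed", "reported", "disputed", "denied", "withdrawn"}

# Assumed denial_status values mapped to citation wording.
DENIAL_WORDING = {
    "none": "no party has publicly denied it",
    "one_party": "one party has denied it",
    "both_parties": "both parties have denied it",
}


@dataclass
class Claim:
    entity: str
    assertion: str
    status: str
    source_ids: list[str] = field(default_factory=list)
    denial_status: str = "none"

    def __post_init__(self) -> None:
        if self.status not in CLAIM_STATUSES:
            raise ValueError(f"unknown claim status: {self.status!r}")
        if self.denial_status not in DENIAL_WORDING:
            raise ValueError(f"unknown denial_status: {self.denial_status!r}")


def cite_claim(claim: Claim) -> str:
    """Phrase a claim as a claim: name its status and whether it was denied."""
    return (
        f"The tracker records a claim (status: {claim.status}) that "
        f"{claim.entity} {claim.assertion}; {DENIAL_WORDING[claim.denial_status]}."
    )
```

For example, cite_claim(Claim("ExampleCloud", "is an anchor tenant", "disputed", ["src-002"], "one_party")) yields a sentence that names the status and the denial instead of stating the tenancy as fact.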
Migration of speculation from earlier records into the claims model is an ongoing process. Records that have not yet been migrated may contain speculation in the notes or workload_profile.notes fields rather than as formal claims. The migration TODO report at dist/migration-todos.md lists records pending that conversion.
7. The analysis block
Some facility records contain an analysis block. Everything in that block (strategic significance level, rationale, and the "why this matters" bullet list) is the maintainer's interpretation, not a sourced fact. The block carries a mandatory is_editorial: true flag and records who assessed it and when.
When citing this dataset, treat sourced fields and editorial analysis differently:
- Sourced fields (e.g. capacity.it_load_mw with confidence reported) can be cited as: "According to [source], as reported by the Africa AI Datacenter Tracker…"
- Editorial analysis should be cited as: "The Africa AI Datacenter Tracker assessed this facility as strategically significant because…", making clear the assessment is the maintainer's, not the source's.
The significance summary bullets at the top of each facility page are drawn from the analysis block. They are provided as a shorthand for journalists and policymakers who need a quick "why this matters" hook. They carry the same editorial caveat.
8. How we handle conflicting sources
When two sources give materially different values for the same field, the field is assigned confidence: disputed. Both source IDs are listed, and the note field explains the discrepancy. The value stored in the primary field is the one with stronger source support; the alternative is documented in the note.
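A disputed entry, as described above, keeps one stored value, both source IDs, and a note recording the alternative. The sketch below shows one possible shape. The mark_disputed helper and the dictionary keys other than confidence and note are assumptions, and the caller decides which value has the stronger support.

```python
def mark_disputed(
    primary_value: float,
    primary_source: str,
    alternative_value: float,
    alternative_source: str,
    reason: str,
) -> dict:
    """Build a sourced_fields entry for a value on which two sources disagree."""
    return {
        "value": primary_value,  # the value with stronger source support
        "confidence": "disputed",
        "sources": [primary_source, alternative_source],  # both IDs are listed
        "note": f"{alternative_source} gives {alternative_value}; {reason}",
    }


# Illustrative values only; not taken from any real record.
entry = mark_disputed(20, "src-001", 32, "src-007", "src-007 may describe full buildout")
```

Keeping the alternative inside the note means a reader of the entry alone can see that a competing figure exists and where it came from.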
Contested claims about entities (anchor tenants, investors, government contracts) are moved into the claims model rather than stored as disputed values in sourced fields. This keeps the distinction clear: a disputed fact is a case where sources disagree about a number or date; a disputed claim is a case where an assertion about an entity's involvement has not been confirmed or has been denied.
The maintainer does not adjudicate between conflicting sources. Both positions are documented. If a field would require a judgement call to populate, it is left absent or marked unknown.
9. Update cadence
This dataset is currently maintained by a single author. Update cadence is opportunistic, not scheduled: records are updated when new information is found, when a source is published that materially changes a field, or when a reader flags an error.
The target cadence for re-verification of individual records is every six months. Records not verified within six months are flagged as stale on the detail page. A re-verification means reviewing the primary sources, checking whether the facility's status has changed, and updating last_verified only when the review is complete.
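The six-month staleness rule can be expressed as a date comparison. This is a minimal sketch that assumes "six months" means six calendar months and that last_verified is an ISO date; the site's actual flagging logic is not documented here and may differ. Both function names are hypothetical.

```python
import calendar
from datetime import date


def six_months_before(today: date) -> date:
    """Step back six calendar months, clamping the day to the month's length."""
    month_index = today.year * 12 + (today.month - 1) - 6
    year, month_zero_based = divmod(month_index, 12)
    month = month_zero_based + 1
    last_day = calendar.monthrange(year, month)[1]
    return date(year, month, min(today.day, last_day))


def is_stale(last_verified: date, today: date) -> bool:
    """A record is stale when its last review is older than six months."""
    return last_verified < six_months_before(today)
```

Under these assumptions, a record with last_verified 2026-06-29 is still current on 2026-12-29 and becomes stale the following day.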
Breaking developments (a facility opening, a cancelled project, a public denial of a reported claim) are prioritised and updated as quickly as possible. The git commit history provides a full audit trail of every change.
10. Known limitations
Naming the gaps is part of the sourcing standard. The following are known limitations of this dataset at the time of writing:
- Geographic coverage is incomplete. The current dataset covers South Africa, Kenya, and Nigeria. Egypt, Morocco, Ghana, Ethiopia, Rwanda, and Côte d'Ivoire all have active datacenter development that is not yet tracked. Coverage will expand as sourcing allows.
- Some source URLs are placeholders. A small number of records have example.com placeholder URLs where the original article was found but the URL was not recorded. These are flagged in the record's notes and in the maintenance report. They do not affect the validity of the structured data fields, which are sourced from the actual article content.
- Archived URLs are missing on most records. Archiving via the Wayback Machine must be done from a human browser; it cannot be automated. Most source URLs have not yet been archived, meaning source content could disappear. This is an ongoing maintenance task.
- Capacity figures are often announced, not verified. Most capacity.it_load_mw values are reported from company announcements, not from regulatory filings or satellite verification. They represent planned capacity at full buildout, which may never be reached.
- Tenant information is frequently incomplete. Hyperscalers do not always disclose individual facility relationships. Tenant records are based on company product pages and press releases, which may be incomplete.
- No satellite verification yet. The schema includes fields for satellite review status, but no facility has been independently verified against satellite imagery. Construction status claims are based entirely on press reporting.
- Speculation is being migrated progressively. Earlier records may contain unconfirmed speculation in free-text fields that has not yet been formally converted to the claims model. The migration TODO report identifies these records.
11. External dataset compatibility
The dataset is designed to be joinable with external infrastructure databases. Each record can carry an external_ids object mapping recognised external dataset identifiers to their values. Currently recognised keys are:
- epoch_ai: Epoch AI large-scale AI training dataset ID
- wikidata: Wikidata QID (preferred for stable cross-referencing)
- dgtl_infra: DGTL Infra datacenter database
- datacentermap: datacentermap.com facility ID
- baxtel: Baxtel datacenter directory ID
An Epoch AI-compatible CSV export is available at dist/exports/epoch-compatible.csv and is generated by running npm run export:epoch. The export follows Epoch's data_centers.csv column format and includes only facilities with at least one capacity figure (it_load_mw or h100_equivalent_gpus). Facilities without capacity data are omitted from the export with a logged reason.
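The inclusion rule for the export can be sketched as a simple partition. This is not the npm export script, only an illustration of the filter it describes. It assumes that both it_load_mw and h100_equivalent_gpus live under a capacity mapping; split_for_export and the sample IDs are hypothetical.

```python
def split_for_export(facilities: list[dict]) -> tuple[list[dict], list[tuple[str, str]]]:
    """Separate exportable facilities from those skipped, with a reason for each skip."""
    exported, skipped = [], []
    for facility in facilities:
        capacity = facility.get("capacity") or {}
        has_figure = (
            capacity.get("it_load_mw") is not None
            or capacity.get("h100_equivalent_gpus") is not None
        )
        if has_figure:
            exported.append(facility)
        else:
            skipped.append((facility["id"], "no capacity figure"))
    return exported, skipped


# Illustrative records only.
sample = [
    {"id": "facility-a", "capacity": {"it_load_mw": 20}},
    {"id": "facility-b", "capacity": {"h100_equivalent_gpus": 512}},
    {"id": "facility-c", "capacity": {}},
]
exported, skipped = split_for_export(sample)
```

Comparing against None rather than testing truthiness keeps a facility whose recorded figure is 0 in the export.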
Field paths in sourced_fields use dot notation for nested fields (e.g. capacity.it_load_mw) and bracket notation for array elements (e.g. operators[0], claims[0].entity).
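A path in this notation can be resolved against a loaded record with a small tokenizer. The sketch below is one possible approach, not the project's resolver: the resolve function and the sample record are hypothetical, and the pattern silently ignores characters it does not recognise, so a production version would want stricter validation.

```python
import re

# Matches either a field name or a bracketed array index such as [0].
_TOKEN = re.compile(r"([A-Za-z_][A-Za-z0-9_]*)|\[(\d+)\]")


def resolve(record: dict, path: str):
    """Walk a path such as 'claims[0].entity' through nested dicts and lists."""
    current = record
    for name, index in _TOKEN.findall(path):
        # Exactly one of the two groups is non-empty for each token.
        current = current[name] if name else current[int(index)]
    return current


# Illustrative record only.
sample_record = {
    "capacity": {"it_load_mw": 20},
    "operators": ["Example Operator"],
    "claims": [{"entity": "ExampleCloud"}],
}
```

With this record, resolve(sample_record, "claims[0].entity") walks the claims list, takes element 0, and reads its entity key; a missing key or index raises KeyError or IndexError.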
12. How to cite
When citing a specific facility record, include:
- The facility name and ID (the ID is the stable URL slug)
- The last_verified date of the record you consulted
- The dataset URL and version (the git commit hash is the version)
Suggested citation format:
Example for a facility record last verified 2026-06-29:
For the dataset as a whole (rather than a specific record), cite the git repository with the commit hash at the time of access. The dataset is published under CC-BY 4.0; attribution to "Africa AI Datacenter Tracker" is required.