The problem is always the same. An enterprise has been running Microsoft 365 for years. SharePoint sites have proliferated — some owned, some orphaned, some shared with external partners for deals that closed three years ago. No one knows exactly what data is where, who can see it, or what classification it carries.

The compliance team asks: how many documents contain PII? The IT team asks: who are the 1,429 guest users and what can they access? The legal team asks: are any of our M&A documents shared with former counterparties? The answers are sitting in the Microsoft Graph API. They're just not assembled into anything useful.

The SH∀DW platform — SharePoint and OneDrive data governance — is the assembled answer.

[Architecture diagram: SharePoint sites and drives, OneDrive personal drives, Entra ID guests and members, and FSx / AWS file systems feed the crawl workers (sharepoint.py and onedrive.py, with a DynamoDB state machine). Results pass through the sharing exposure engine (4-phase attribution, 1,429 guests mapped to companies) and the PII classifier (column name patterns, restricted entries redacted), then land in S3 + Athena (11 partitioned datasets) and QuickSight (5 exec dashboard sheets).]

The visibility gap

Most enterprise data governance starts from the wrong end. Tools are deployed to classify documents, alert on policy violations, or generate compliance reports — before anyone has a clear inventory of what they're governing.

The first thing SH∀DW does is generate that inventory. A crawl worker iterates every SharePoint site collection and every OneDrive in the tenant via the Microsoft Graph API. For each site, it records the site ID, display name, item count, total size, and crawl status. The result is a complete, queryable registry of everything that exists, including sites that have never been looked at.
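As a rough illustration of the inventory crawl, the sketch below pages through a Graph collection by following `@odata.nextLink` and builds one record per site. It is not the platform's actual worker: `iter_pages`, `build_inventory`, the `fetch` callable, and the record keys are hypothetical names, and `/sites?search=*` is only one documented way to list sites. Item count and total size would need further per-drive calls that are left out here.

```python
from __future__ import annotations

from typing import Callable, Iterator

# One documented Graph call for listing sites; the real crawler may use another.
SITES_URL = "https://graph.microsoft.com/v1.0/sites?search=*"


def iter_pages(fetch: Callable[[str], dict], url: str) -> Iterator[dict]:
    """Yield every item of a paged Graph collection, following @odata.nextLink."""
    next_url = url
    while next_url:
        page = fetch(next_url)
        yield from page.get("value", [])
        # Graph omits @odata.nextLink on the last page, which ends the loop.
        next_url = page.get("@odata.nextLink")


def build_inventory(fetch: Callable[[str], dict], url: str = SITES_URL) -> list[dict]:
    """Build one inventory record per site (record keys are hypothetical)."""
    return [
        {
            "site_id": site["id"],
            "display_name": site.get("displayName", ""),
            "crawl_status": "discovered",
        }
        for site in iter_pages(fetch, url)
    ]
```

Here `fetch` stands in for an authenticated HTTP GET that returns the decoded JSON body. Injecting it keeps the paging logic testable without a tenant.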

This sounds basic. In practice, it's transformative. When the crawl finishes and the inventory table appears in Athena, teams see their data estate clearly for the first time. Sites created for projects that ended years ago. Drives belonging to employees who left. Guest users who still have access. The visibility itself produces action.

PII classification before it reaches the lake

The crawl produces an inventory — but some of that inventory is sensitive. A SharePoint site containing HR documents or student records shouldn't be treated the same as a site containing marketing collateral.

SH∀DW classifies inventory entries at crawl time, using field name pattern matching against a classification rules engine. An inventory entry containing fields matching PII patterns (names, emails, national IDs, student numbers) gets classified as restricted or highly_restricted.
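A minimal sketch of field name pattern matching might look like the following. The patterns, the mapping of each pattern to `restricted` or `highly_restricted`, and the `unclassified` default are all assumptions made for illustration; the actual rules engine is not shown in this article.

```python
import re

# Hypothetical rules: which patterns map to which level is an assumption.
HIGHLY_RESTRICTED = re.compile(
    r"national[_ ]?id|passport|student[_ ]?(number|no|id)", re.IGNORECASE
)
RESTRICTED = re.compile(r"(first|last|full)[_ ]?name|e-?mail", re.IGNORECASE)


def classify_fields(field_names):
    """Return (classification, matched_pii_fields) for a list of field names."""
    level = "unclassified"
    pii_fields = []
    for name in field_names:
        if HIGHLY_RESTRICTED.search(name):
            pii_fields.append(name)
            level = "highly_restricted"
        elif RESTRICTED.search(name):
            pii_fields.append(name)
            # Never downgrade an entry that already hit the stricter level.
            if level != "highly_restricted":
                level = "restricted"
    return level, pii_fields
```

For example, `classify_fields(["Title", "Email"])` yields `("restricted", ["Email"])`, while a single student number column is enough to lift the whole entry to `highly_restricted`.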

The critical design decision: classification happens before the data reaches any analytics layer. The S3 export writes two tables — a redacted version where pii_fields and schema_snapshot are stripped for restricted entries, and a full version protected by AWS Lake Formation policies. Standard QuickSight users see the redacted table. Only operators with elevated permissions can query the full PII inventory.
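The two-table split can be sketched as a pure function applied before anything is written to S3. This assumes each entry carries a `classification` key (a hypothetical name) and that "restricted entries" covers both `restricted` and `highly_restricted`; only `pii_fields` and `schema_snapshot` come from the description above.

```python
RESTRICTED_LEVELS = {"restricted", "highly_restricted"}  # assumed scope
SENSITIVE_KEYS = ("pii_fields", "schema_snapshot")


def redact_entry(entry: dict) -> dict:
    """Return a copy of the entry, stripping sensitive keys if it is restricted."""
    if entry.get("classification") not in RESTRICTED_LEVELS:
        return dict(entry)
    return {k: v for k, v in entry.items() if k not in SENSITIVE_KEYS}


def split_for_export(entries):
    """Return (redacted_rows, full_rows) destined for the two tables."""
    full = [dict(e) for e in entries]
    redacted = [redact_entry(e) for e in entries]
    return redacted, full
```

Because redaction happens while the rows are being produced, the redacted table never contains the sensitive columns for restricted entries, whatever a dashboard later queries.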

The Lake Formation line: Enforcing data classification at the S3/Athena boundary — not at the BI layer — means there's no path for a misconfigured dashboard to accidentally expose PII. The analytics tool can't return what the data layer doesn't provide. This is the right model for regulated data at any scale.

The sharing exposure problem

Knowing what exists is half the problem. Knowing who can see it is the other half. In a multi-company M365 tenant — where several subsidiary companies share a single Microsoft tenant — sharing attribution is genuinely difficult.

The SH∀DW sharing exposure engine runs as a 4-phase state machine:

| Phase | What it does | Output |
| --- | --- | --- |
| 1: Tenant settings | Reads the SharePoint tenant's external sharing configuration | Is external sharing enabled? What policy? |
| 2: Activity reports | Downloads SP/OD activity CSVs via Graph API | Which files have been accessed by which users? |
| 3: Guest enumeration | Lists all Entra guest users, extracts email domain | 1,429 guests attributed to external companies |
| 4: Site membership | Cross-references guests with site member records | Which sites does each guest company have access to? |
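The four phases run in a fixed order, so one way to picture the engine is as a linear, resumable state machine. The sketch below is an assumption about shape only: the `Phase` enum, the `run` function, and the in-memory `state` dict are hypothetical, and how the real engine persists its state is not covered here.

```python
from enum import Enum


class Phase(Enum):
    TENANT_SETTINGS = 1
    ACTIVITY_REPORTS = 2
    GUEST_ENUMERATION = 3
    SITE_MEMBERSHIP = 4


def new_state():
    """Fresh run state: nothing completed, no phase outputs yet."""
    return {"completed": [], "results": {}}


def run(handlers, state):
    """Run phases in order, skipping any already recorded as completed.

    `handlers` maps each Phase to a callable that receives the results of
    earlier phases and returns this phase's output.
    """
    for phase in Phase:  # Enum iteration follows definition order
        if phase.name in state["completed"]:
            continue
        state["results"][phase.name] = handlers[phase](state["results"])
        state["completed"].append(phase.name)
    return state
```

Recording each phase as it completes means a failed run can be restarted from the saved state without repeating the earlier phases.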

The attribution problem is subtle. In M365, guest users are tenant-scoped — a guest invited to any site in the parent tenant appears in the parent's Entra directory. But subsidiary companies run their own SharePoint site collections within the same tenant. An employee from one subsidiary appearing as a guest on another subsidiary's site is expected — but a former M&A counterparty appearing as a guest six years after a deal closed is not.

The UPN parsing handles the B2B guest format: user_domain.com#EXT#@tenant.onmicrosoft.com — extracting the real external domain even from the obfuscated Entra representation. This is the mechanism that makes guest attribution accurate across 1,429 external users spread across dozens of external organisations.
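A minimal sketch of that parsing step, assuming the guest UPN follows the `user_domain.com#EXT#@tenant.onmicrosoft.com` shape shown above. The function names are hypothetical. The domain is taken as the text after the last underscore before `#EXT#`, since the mailbox name may contain underscores but the domain cannot.

```python
import re
from collections import Counter
from typing import Optional

# Capture the segment between the last underscore and the #EXT# marker.
GUEST_UPN = re.compile(r"_(?P<domain>[^_#@]+)#EXT#@", re.IGNORECASE)


def guest_domain(upn: str) -> Optional[str]:
    """Return the external email domain of a B2B guest UPN, or None."""
    match = GUEST_UPN.search(upn)
    return match.group("domain").lower() if match else None


def guests_per_domain(upns) -> Counter:
    """Count guests per external domain, ignoring non-guest UPNs."""
    domains = (guest_domain(u) for u in upns)
    return Counter(d for d in domains if d is not None)
```

For instance, `guest_domain("john_doe_contoso.com#EXT#@fabrikam.onmicrosoft.com")` returns `"contoso.com"`, and an ordinary member UPN such as `bob@fabrikam.com` returns `None`.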

The analytics layer

Once the crawl completes, a Lambda export job writes 11 partitioned datasets to S3 in NDJSON format, registers the Glue partitions, and makes the data queryable via Athena. QuickSight connects to Athena and surfaces 5 dashboard sheets:

| Sheet | Audience | Key questions answered |
| --- | --- | --- |
| Executive summary | CISO, CDO | Total sites, total risk score, classification breakdown |
| Exposure detail | Security team | Risk matrix by site × classification, SP vs OD exposure bar |
| External access intelligence | IT + Legal | Guest domains, activity by external party, tenant sharing capability |
| Company site profile | IT per subsidiary | Sites per company, classification donut, source breakdown |
| Site inventory | IT operations | Full searchable flat inventory with crawl status filter |
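To make the export step that feeds these sheets more concrete, the sketch below serialises records as NDJSON (one JSON object per line) and builds a Hive-style `key=value` object key, the layout Glue and Athena recognise as a partition. The `exports` prefix, the `dt` partition column, the file name, and the dataset name in the usage note are assumptions; the article does not say how the 11 datasets are partitioned.

```python
import json
from datetime import date


def to_ndjson(records) -> str:
    """Serialise records as newline-delimited JSON, one object per line."""
    return "".join(json.dumps(r, sort_keys=True) + "\n" for r in records)


def partition_key(dataset: str, run_date: date, prefix: str = "exports") -> str:
    """Build an S3 object key with a Hive-style dt=YYYY-MM-DD partition."""
    return f"{prefix}/{dataset}/dt={run_date.isoformat()}/part-0000.ndjson"
```

A call such as `partition_key("site_inventory", date(2024, 1, 31))` gives `exports/site_inventory/dt=2024-01-31/part-0000.ndjson`. Uploading the body and registering the partition in Glue are left out of this sketch.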

The risk score formula — guest_user_count + sp_exposure + od_exposure + (pii_count × 10) — is deliberately simple. The 10× multiplier on PII exposure reflects that the consequence of a PII breach is categorically different from a document policy violation. Simple formulas are more defensible to auditors than complex ML models.
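Written as code, the formula is a single expression. The function name and the `pii_weight` parameter are hypothetical; the terms and the weight of 10 come from the formula above.

```python
def risk_score(guest_user_count: int, sp_exposure: int, od_exposure: int,
               pii_count: int, pii_weight: int = 10) -> int:
    """guest_user_count + sp_exposure + od_exposure + (pii_count x 10)."""
    return guest_user_count + sp_exposure + od_exposure + pii_count * pii_weight
```

With illustrative inputs of 3 guests, an SP exposure of 5, an OD exposure of 2, and 4 PII items, the score is 3 + 5 + 2 + 40 = 50, so the PII term alone contributes 80% of the total.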

What this enables that wasn't possible before

Before SH∀DW, answering "which external party has access to which documents" required a multi-week manual exercise by an IT team. After SH∀DW, it's a QuickSight filter. The same query that would have taken weeks can be answered in under 30 seconds — and it can be answered continuously, not just at audit time.

The architecture generalises beyond M365. The same crawl-classify-expose pattern applies to any unstructured data estate: Salesforce document libraries, Google Drive, Box, legacy file shares. The pipeline topology is the same. The classification rules and the API adapters are the only things that change.

For any organisation that has grown through acquisition — and most large organisations have — this kind of visibility layer is not a nice-to-have. It's the prerequisite for everything else: GDPR compliance, M&A due diligence, zero-trust network policy, data residency enforcement. You cannot govern what you cannot see.
