Design for failure, detect fast, isolate impact, and self-heal. That’s the playbook.
- Architecture at a glance
- Error taxonomy (where it breaks)
- A–Z of production-grade error handling
- Patterns: quarantine, retries, runbooks
- Code snippets: CE plugin & F&O validation
- Observability & alerting blueprint
- Governance, SLAs, and change control
- Checklists & go-live guardrails
Architecture at a glance
Dual-write gives near real-time sync between CE (Dataverse) and F&O via mapped entity pairs, triggers, and runtime handlers on both sides. Treat it like a distributed system: network hops, auth tokens, schema drift, referential integrity, and throughput limits are all real. Build guardrails where data enters, not only where it fails.
Error taxonomy (where it breaks)
- Auth & Connectivity: expired tokens, revoked permissions, broken connection references, network timeouts.
- Schema & Mapping: missing fields, data type mismatches, option set ↔ enum misalignment, length overflows.
- Data Quality: nulls where required, invalid references, company/legal entity mismatch, number sequence rules.
- Business Rules: CE allows, F&O rejects (credit hold, status transitions, posting requirements).
- Throughput & Ordering: out-of-order updates, bursts, throttling, long chains of dependent entities.
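To make the taxonomy actionable, failures can be tagged with a class as soon as they are caught. The sketch below is a minimal illustration, not a dual-write API: `ErrorClass`, `DualWriteErrorClassifier`, and the keyword list are all hypothetical, and the keywords should be swapped for the error signatures you actually see in your environment.

```csharp
using System;

// Hypothetical enum mirroring the five classes in the taxonomy above.
public enum ErrorClass { AuthConnectivity, SchemaMapping, DataQuality, BusinessRule, ThroughputOrdering, Unknown }

public static class DualWriteErrorClassifier
{
    // Classifies by HTTP status first, then by keywords in the message.
    // The status codes are standard HTTP; the keywords are illustrative assumptions.
    public static ErrorClass Classify(int httpStatus, string message)
    {
        string text = (message ?? string.Empty).ToLowerInvariant();

        if (httpStatus == 401 || httpStatus == 403) return ErrorClass.AuthConnectivity;
        if (httpStatus == 408 || httpStatus == 502 || httpStatus == 503 || httpStatus == 504)
            return ErrorClass.AuthConnectivity; // timeouts and unreachable endpoints
        if (httpStatus == 429) return ErrorClass.ThroughputOrdering;

        if (text.Contains("length") || text.Contains("data type")) return ErrorClass.SchemaMapping;
        if (text.Contains("required") || text.Contains("not found")) return ErrorClass.DataQuality;
        if (text.Contains("credit hold") || text.Contains("posting")) return ErrorClass.BusinessRule;

        return ErrorClass.Unknown;
    }

    // Only throttling and server-side errors are worth retrying automatically.
    public static bool IsRetryable(int httpStatus)
    {
        return httpStatus == 429 || (httpStatus >= 500 && httpStatus <= 599);
    }
}
```

Keeping `Unknown` as an explicit class is deliberate: anything that lands there is a candidate for a new signature.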
A–Z of production-grade error handling
A. Document entity pairings, directionality, and prerequisites. Keep a living data contract.
B. Use service principals with least privilege; rotate secrets; health-probe connections after each deployment.
C. Validate legal entity/company early; block mismatches before sync.
D. Mirror required fields and constraints in CE via business rules/plug-ins.
E. Keep maps small and cohesive. Version them. Avoid “mega maps.”
F. Pre-validate lengths, formats, and enums in CE; never let bad data leave.
G. Define RACI, response times, and escalation paths for sync failures.
H. Review daily: map status, retry queues, last-success timestamps, and error-rate thresholds.
I. Ensure reprocessing the same record doesn’t double-post or corrupt state.
J. Write compact logs with correlation IDs across CE↔F&O to reconstruct timelines.
K. Add a circuit breaker to pause affected maps without nuking the environment.
L. Respect API limits; batch initial loads; stagger bursts; back off on 429/5xx.
M. Map by codes, not labels; centralize enum dictionaries; test round-trips.
N. Guard F&O-owned identifiers; avoid CE generating values F&O expects to own.
O. Order parent→child; enforce dependencies; delay children until parents exist.
P. Divert toxic records into a holding table with a reason and fix hints.
Q. Run data-quality (DQ) checks in Power Query/CE; reject early; surface friendly errors to users.
R. Offer one-click requeue per record or batch; track attempts and outcomes.
S. Use Key Vault, managed identities where possible, and audited access.
T. Push CE plug-in traces and F&O logs to a single telemetry sink (e.g., App Insights).
U. Test maps after solution/PU updates; detect schema drift before prod.
V. Mirror F&O business validation in CE plug-ins to prevent futile trips.
W. Keep pre-baked runbook steps for auth, mapping, DQ, and throughput incidents.
X. Use standard error codes, human-readable messages, and remediation tips.
Y. F&O period close adds rules: expect stricter validations and tighter timing windows.
Z. Use blue-green map rollout, feature toggles, and rollback plans.
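The circuit-breaker item in the list above is easy to state and easy to get wrong, so here is a minimal sketch of the decision logic only. `MapCircuitBreaker` is a hypothetical class, the threshold and cool-down are placeholders, and how a map is actually paused in your environment is outside the sketch. It is also simplified: after the cool-down it closes fully rather than entering a half-open probe state, and it is not thread-safe.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical per-map breaker: trips after N consecutive failures on one map
// and stays open for a cool-down period. Other maps are unaffected.
public class MapCircuitBreaker
{
    private readonly int _failureThreshold;
    private readonly TimeSpan _coolDown;
    private readonly Dictionary<string, int> _consecutiveFailures = new Dictionary<string, int>();
    private readonly Dictionary<string, DateTime> _openedAtUtc = new Dictionary<string, DateTime>();

    public MapCircuitBreaker(int failureThreshold, TimeSpan coolDown)
    {
        _failureThreshold = failureThreshold;
        _coolDown = coolDown;
    }

    // True while the map should stay paused.
    public bool IsOpen(string mapName, DateTime nowUtc)
    {
        DateTime opened;
        if (!_openedAtUtc.TryGetValue(mapName, out opened)) return false;
        if (nowUtc - opened < _coolDown) return true;

        // Cool-down elapsed: close the breaker and reset the failure counter.
        _openedAtUtc.Remove(mapName);
        _consecutiveFailures[mapName] = 0;
        return false;
    }

    public void RecordSuccess(string mapName)
    {
        _consecutiveFailures[mapName] = 0;
    }

    public void RecordFailure(string mapName, DateTime nowUtc)
    {
        int count;
        _consecutiveFailures.TryGetValue(mapName, out count); // count stays 0 if absent
        count++;
        _consecutiveFailures[mapName] = count;
        if (count >= _failureThreshold) _openedAtUtc[mapName] = nowUtc;
    }
}
```

Passing the clock in as `nowUtc` keeps the breaker testable without waiting for real time to pass.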
Patterns: quarantine, retries, and runbooks
Quarantine (“Parking Lot”)
Create a custom CE entity (e.g., Dual-write Error) capturing record reference, map name, error class, reason, suggested fix, and retry count. Divert failures here via plug-ins or post-operation handlers. Add a dashboard for Ops to triage.
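As a rough shape for that quarantine entity, the sketch below models one row and an in-memory store standing in for the custom Dataverse table. Every name here (`QuarantineRecord`, `QuarantineStore`, the property names) is hypothetical; in a real solution these would be columns on your custom table, written from the plug-in.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical row layout: one property per field listed in the paragraph above,
// plus a correlation ID for tracing.
public class QuarantineRecord
{
    public Guid RecordId { get; set; }
    public string MapName { get; set; }
    public string ErrorClass { get; set; }
    public string Reason { get; set; }
    public string SuggestedFix { get; set; }
    public int RetryCount { get; set; }
    public string CorrelationId { get; set; }
    public DateTime LastParkedUtc { get; set; }
}

// In-memory stand-in for the custom table.
public class QuarantineStore
{
    private readonly Dictionary<string, QuarantineRecord> _rows = new Dictionary<string, QuarantineRecord>();

    // Parking the same record for the same map again updates the existing row
    // and bumps RetryCount instead of creating a duplicate.
    public QuarantineRecord Park(string mapName, Guid recordId, string errorClass,
                                 string reason, string suggestedFix, string correlationId)
    {
        string key = mapName + "|" + recordId;
        QuarantineRecord row;
        if (_rows.TryGetValue(key, out row))
        {
            row.RetryCount++;
        }
        else
        {
            row = new QuarantineRecord { MapName = mapName, RecordId = recordId };
            _rows[key] = row;
        }
        row.ErrorClass = errorClass;
        row.Reason = reason;
        row.SuggestedFix = suggestedFix;
        row.CorrelationId = correlationId;
        row.LastParkedUtc = DateTime.UtcNow;
        return row;
    }

    // Triage view for one map, most-retried records first.
    public List<QuarantineRecord> Backlog(string mapName)
    {
        return _rows.Values.Where(r => r.MapName == mapName)
                           .OrderByDescending(r => r.RetryCount)
                           .ToList();
    }
}
```

Keying on map name plus record ID keeps the dashboard to one row per broken record, which is what Ops needs when triaging.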
Deterministic Retries
Use exponential backoff for transient 429/5xx. Hard-fail immediately on schema or DQ errors with actionable messages. Cap attempts, then park.
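The retry rule above fits in a few lines. This is a sketch under assumptions: `RetryPolicy`, `SyncOutcome`, and the delegate-based `send`/`park` hooks are hypothetical names, and `send` is assumed to return an HTTP-style status code. Jitter is deliberately left out so the delays stay deterministic.

```csharp
using System;
using System.Threading;

public enum SyncOutcome { Succeeded, Parked }

public static class RetryPolicy
{
    // Delay before retry n (1-based): baseDelay * 2^(n-1), capped at maxDelay.
    public static TimeSpan DelayFor(int retryNumber, TimeSpan baseDelay, TimeSpan maxDelay)
    {
        double ms = baseDelay.TotalMilliseconds * Math.Pow(2, retryNumber - 1);
        return TimeSpan.FromMilliseconds(Math.Min(ms, maxDelay.TotalMilliseconds));
    }

    public static bool IsTransient(int httpStatus)
    {
        return httpStatus == 429 || (httpStatus >= 500 && httpStatus <= 599);
    }

    // send: performs one attempt and returns an HTTP-style status code.
    // park: called once with the last status when the record is given up on.
    // sleep: injectable for tests; defaults to Thread.Sleep.
    public static SyncOutcome Execute(Func<int> send, Action<int> park, int maxAttempts,
                                      TimeSpan baseDelay, TimeSpan maxDelay,
                                      Action<TimeSpan> sleep = null)
    {
        if (maxAttempts < 1) throw new ArgumentOutOfRangeException("maxAttempts");
        if (sleep == null) sleep = d => Thread.Sleep(d);

        for (int attempt = 1; attempt <= maxAttempts; attempt++)
        {
            int status = send();
            if (status >= 200 && status < 300) return SyncOutcome.Succeeded;

            // Permanent errors are parked at once; transient ones only after the cap.
            if (!IsTransient(status) || attempt == maxAttempts)
            {
                park(status);
                return SyncOutcome.Parked;
            }
            sleep(DelayFor(attempt, baseDelay, maxDelay));
        }
        throw new InvalidOperationException("Unreachable: the loop always returns.");
    }
}
```

With a 1-second base and a 30-second cap, the waits run 1 s, 2 s, 4 s, 8 s, 16 s, then 30 s for every retry after that.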
Selective Resubmission
After fix (e.g., missing parent), reprocess only the impacted records. Avoid mass “select all and pray.” Maintain idempotent upserts.
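A minimal sketch of that targeted reprocess, assuming parked items record which parent they were waiting on. `ParkedItem`, `SelectiveResubmit`, and the dictionary used as the target are hypothetical stand-ins for your quarantine rows and the destination table.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical parked item: its own natural key, the parent it was blocked on, and its payload.
public class ParkedItem
{
    public string Key { get; set; }
    public string BlockedByParentKey { get; set; }
    public string Payload { get; set; }
}

public static class SelectiveResubmit
{
    // Picks only the items whose missing parent now exists in the target.
    public static List<ParkedItem> SelectReady(IEnumerable<ParkedItem> parked, ISet<string> existingKeys)
    {
        return parked
            .Where(p => p.BlockedByParentKey != null && existingKeys.Contains(p.BlockedByParentKey))
            .ToList();
    }

    // Idempotent upsert keyed on the natural key: replaying the same item
    // overwrites the same row rather than adding a second one.
    public static void Upsert(IDictionary<string, string> target, ParkedItem item)
    {
        target[item.Key] = item.Payload;
    }
}
```

Because the upsert is keyed, an operator who requeues the same batch twice ends up with the same target state as one who requeues it once.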
War-room Runbook (Skeleton)
- Identify: Which map, entity, company?
- Classify: Transient vs permanent.
- Contain: Pause the map (kill switch) if blast radius is growing.
- Fix: Data correction / mapping / permission / env health.
- Reprocess: Targeted retries with logging & confirmation.
- Review: Root cause, action items, and change ticket.
Code snippets: CE plug-in & F&O validation
    // CE (Dataverse) C# plug-in: PreOperation validation for a dual-write-bound entity
    using System;
    using Microsoft.Xrm.Sdk;

    public class DualWritePreValidate : IPlugin
    {
        public void Execute(IServiceProvider serviceProvider)
        {
            var ctx = (IPluginExecutionContext)serviceProvider.GetService(typeof(IPluginExecutionContext));
            var factory = (IOrganizationServiceFactory)serviceProvider.GetService(typeof(IOrganizationServiceFactory));
            // svc is only needed if you write to the custom log entity mentioned below
            var svc = factory.CreateOrganizationService(ctx.UserId);

            // Skip messages that carry no Entity target (e.g., Delete passes an EntityReference)
            if (!ctx.InputParameters.Contains("Target") || !(ctx.InputParameters["Target"] is Entity))
            {
                return;
            }
            var target = (Entity)ctx.InputParameters["Target"];

            try
            {
                // Example guards: length, required combos, enum codes, company.
                // Dataverse logical names are always lowercase.
                GuardLength(target, "bn_name", 60);
                RequireIf(target, "bn_credithold", true, requiredField: "bn_creditholdreason");
                ValidateEnumCode(target, "bn_customertypecode", new[] { 100000000, 100000001 });
                // Optional: push normalized error to custom log entity instead of letting Dual-write fail downstream
            }
            catch (ValidationException vex)
            {
                // Make it human. Include a code, cause, and remediation hint.
                throw new InvalidPluginExecutionException($"DW-E1001: {vex.Message} Fix the data and save again.");
            }
            catch (Exception ex)
            {
                // Unknowns should surface but not leak internals
                throw new InvalidPluginExecutionException($"DW-E1999: Unexpected error. Contact support with CorrelationId={ctx.CorrelationId}.", ex);
            }
        }

        // Helpers omitted: GuardLength, RequireIf, ValidateEnumCode...
    }

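For orientation, here is one way the omitted helpers could look. This is a sketch, not the author's implementation: a plain `IDictionary<string, object>` stands in for the target's attribute bag so the code runs without the Dataverse SDK, `ValidationException` is defined locally as a hypothetical custom exception, and choice columns are treated as plain `int` codes (in a real plug-in they usually arrive as `OptionSetValue`).

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical exception type, matching the name caught by the plug-in.
public class ValidationException : Exception
{
    public ValidationException(string message) : base(message) { }
}

public static class DualWriteGuards
{
    // Rejects string values longer than maxLength. Absent or non-string values pass.
    public static void GuardLength(IDictionary<string, object> attrs, string field, int maxLength)
    {
        object value;
        if (attrs.TryGetValue(field, out value) && value is string s && s.Length > maxLength)
            throw new ValidationException($"'{field}' is {s.Length} characters; the limit is {maxLength}.");
    }

    // When field equals expected, requiredField must be present and non-blank.
    public static void RequireIf(IDictionary<string, object> attrs, string field, object expected, string requiredField)
    {
        object value;
        if (!attrs.TryGetValue(field, out value) || !object.Equals(value, expected)) return;

        object required;
        bool missing = !attrs.TryGetValue(requiredField, out required)
                       || required == null
                       || (required is string text && string.IsNullOrWhiteSpace(text));
        if (missing)
            throw new ValidationException($"'{requiredField}' is required when '{field}' is {expected}.");
    }

    // Rejects codes outside the allowed set. Absent or null values pass.
    public static void ValidateEnumCode(IDictionary<string, object> attrs, string field, int[] allowedCodes)
    {
        object value;
        if (!attrs.TryGetValue(field, out value) || value == null) return;
        if (!(value is int code) || !allowedCodes.Contains(code))
            throw new ValidationException($"'{field}' has value {value}, which is not an allowed code.");
    }
}
```

Each guard lets absent attributes pass on purpose: on Update the target carries only the changed columns, so "not present" does not mean "empty".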
    // F&O (X++): validateWrite override that mirrors CE rules and throws friendly messages
    public boolean validateWrite()
    {
        boolean isValid = super();

        if (!isValid) return false;

        // Example: enforce legal entity & required combo
        if (this.DataAreaId == '' || this.CustomerGroup == '')
        {
            error("DW-F1002: Company and Customer group are required for dual-write.");
            return false;
        }

        if (this.CreditMax < 0)
        {
            error("DW-F1010: Credit limit cannot be negative (check CE 'Credit Limit').");
            return false;
        }

        return true;
    }
Observability & alerting blueprint
- Single pane: CE dashboards (quarantine queue, retry counts), F&O Dual-write workspace, and an aggregated telemetry view.
- Correlation IDs: Include in CE plug-in trace, quarantine record, and any F&O log entry.
- Alerts: Error rate > X% over 5 min, consecutive failures on a map, or no successful sync for Y minutes.
- KPIs: Mean time to detect (MTTD), mean time to repair (MTTR), parked records backlog, top 5 error signatures.
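The three alert conditions above can be expressed as small predicates over a stream of sync results. The sketch below is illustrative only: `SyncEvent` and `SyncAlerts` are hypothetical names, and the thresholds (the X and Y above) stay as parameters for you to set. In practice the same rules would more likely live as queries in your telemetry sink.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical event: one sync attempt with its outcome.
public class SyncEvent
{
    public DateTime TimestampUtc { get; set; }
    public bool Succeeded { get; set; }
}

public static class SyncAlerts
{
    // Fires when the share of failed events inside the window exceeds maxErrorRate (0..1).
    public static bool ErrorRateExceeded(IEnumerable<SyncEvent> events, DateTime nowUtc,
                                         TimeSpan window, double maxErrorRate)
    {
        var recent = events.Where(x => x.TimestampUtc > nowUtc - window && x.TimestampUtc <= nowUtc).ToList();
        if (recent.Count == 0) return false; // no traffic is handled by NoRecentSuccess
        double rate = recent.Count(x => !x.Succeeded) / (double)recent.Count;
        return rate > maxErrorRate;
    }

    // Fires when no successful sync has been seen for longer than maxSilence.
    public static bool NoRecentSuccess(IEnumerable<SyncEvent> events, DateTime nowUtc, TimeSpan maxSilence)
    {
        var successes = events.Where(x => x.Succeeded && x.TimestampUtc <= nowUtc)
                              .Select(x => x.TimestampUtc)
                              .ToList();
        if (successes.Count == 0) return true;
        return nowUtc - successes.Max() > maxSilence;
    }

    // Fires when the most recent events, in time order, are `threshold` failures in a row.
    public static bool ConsecutiveFailures(IEnumerable<SyncEvent> events, int threshold)
    {
        int run = 0;
        foreach (var evt in events.OrderBy(x => x.TimestampUtc))
        {
            run = evt.Succeeded ? 0 : run + 1;
        }
        return run >= threshold;
    }
}
```

Run these per map rather than across the whole integration, so one noisy map does not mask a silent one.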
Governance, SLAs, and change control
- SLA tiers: P1 (financial posting blocked), 1-hour response; P2, 4-hour response; P3, next business day. Define them and enforce them.
- Change windows: no map changes during period close or payroll; put risky changes behind feature flags.
- Post-mortems: blameless, timestamped, with concrete fixes (tests, monitors, docs updates).
Checklists & go-live guardrails
- ✅ CE plug-ins replicate F&O required combos & lengths; blocking errors are human-friendly.
- ✅ Enum/code dictionaries are centralized and tested in both directions.
- ✅ Quarantine entity, dashboard, and Power Automate reprocess action exist.
- ✅ Kill switch per map + documented rollback path.
- ✅ Alerts wired for error rate, no-success window, and backlog growth.
- ✅ Runbooks for auth failure, mapping drift, data burst, and period close.
- ✅ Load tests for peak create/update throughput; backoff verified.
- ✅ Change control for schema updates on either side (contracts reviewed).
Dual-write doesn’t fail in prod—it fails in design. If CE lets bad data exist, F&O will loudly refuse it. Validate earlier, log smarter, and make retries boring.

