Advanced Troubleshooting Workflows Using SentryOne Plan Explorer
SentryOne Plan Explorer is a powerful, free tool for SQL Server professionals who need deeper insight into execution plans, faster troubleshooting, and clearer guidance for query tuning. This article walks through advanced troubleshooting workflows using Plan Explorer, showing how to detect performance issues, prioritize fixes, and validate improvements. Examples assume familiarity with SQL Server Management Studio (SSMS), execution plans, and basic query tuning concepts.
Why use Plan Explorer for advanced troubleshooting
Plan Explorer extends SSMS’s execution plan UI with richer visualizations, clearer operator details, and specialized features designed to surface common performance problems quickly. Key advantages:
- Enhanced graphical plan view for easier spotting of expensive operators and data movement.
- Plan comparison to quickly spot changes between plan versions.
- Operator warnings and suggestions that pinpoint potential issues (missing indexes, memory spills, high CPU).
- Integrated statistics and runtime metrics when available to correlate plan shape with actual execution behavior.
Preparing for troubleshooting: capture the right plans and metrics
Before diagnosing, gather the necessary artifacts:
- Saved actual execution plans (.sqlplan) from production workloads or captured by Extended Events/SQL Server Profiler.
- Estimated plans when actuals are not available.
- Query text and variable input values used during problematic runs.
- Wait statistics and server-level metrics (CPU, memory, I/O) for the timeframe.
- Relevant statistics, index definitions, and schema for the objects involved.
Tip: if you can reproduce the issue on a test system, capture both estimated and actual plans along with SET STATISTICS IO/TIME output for side-by-side analysis.
Workflow 1 — Rapid hotspot identification
Goal: Quickly find which queries or operators are causing the biggest impact.
Steps:
- Open the captured plan in Plan Explorer. Use the Summary and Plan Cost bars to identify the top-cost queries/operators.
- Expand the plan tree and open the Top Operations tab to focus on the operators contributing the largest costs.
- Look for these red flags:
  - Index scans on large tables where seeks would be expected.
  - Hash operations with very large build inputs or high memory grants.
  - Sorts flagged as spilling to disk.
  - Nested loops with large outer/input rowcounts and repeated lookups.
- Use the Properties pane to inspect Actual vs Estimated Rows to find cardinality estimation issues.
Example: If a table scan shows ActualRows 1,000,000 but EstimatedRows 1,000, the plan likely suffers from stale or missing statistics or parameter sniffing.
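The estimated-versus-actual check can also be scripted against a saved .sqlplan file, which is plain showplan XML. The following is a minimal sketch, not part of Plan Explorer: the sample plan is a hand-written stand-in, and the function name and the 10x threshold are assumptions chosen for illustration.

```python
import xml.etree.ElementTree as ET

URI = "http://schemas.microsoft.com/sqlserver/2004/07/showplan"
NS = {"sp": URI}

# Hand-written, heavily trimmed stand-in for a real actual plan (hypothetical data).
SAMPLE_PLAN = """<ShowPlanXML xmlns="http://schemas.microsoft.com/sqlserver/2004/07/showplan">
  <RelOp NodeId="0" PhysicalOp="Hash Match" EstimateRows="1000">
    <RunTimeInformation>
      <RunTimeCountersPerThread Thread="0" ActualRows="1200" />
    </RunTimeInformation>
    <Hash>
      <RelOp NodeId="1" PhysicalOp="Table Scan" EstimateRows="1000">
        <RunTimeInformation>
          <RunTimeCountersPerThread Thread="1" ActualRows="600000" />
          <RunTimeCountersPerThread Thread="2" ActualRows="400000" />
        </RunTimeInformation>
      </RelOp>
    </Hash>
  </RelOp>
</ShowPlanXML>"""


def misestimates(plan_xml, min_ratio=10.0):
    """Return operators whose actual rows differ from the estimate by min_ratio or more."""
    root = ET.fromstring(plan_xml)
    findings = []
    for relop in root.iter(f"{{{URI}}}RelOp"):
        # Direct-child path, so counters of nested operators are not mixed in.
        counters = relop.findall("sp:RunTimeInformation/sp:RunTimeCountersPerThread", NS)
        if not counters:
            continue  # estimated plan: no runtime counters to compare against
        actual = sum(int(c.get("ActualRows", "0")) for c in counters)
        estimated = float(relop.get("EstimateRows", "0"))
        # Caveat: EstimateRows is per execution; this sketch ignores operators
        # that run many times on the inner side of a nested loops join.
        ratio = max(actual, 1) / max(estimated, 1.0)
        if ratio >= min_ratio or ratio <= 1 / min_ratio:
            findings.append((int(relop.get("NodeId")), relop.get("PhysicalOp"),
                             estimated, actual, ratio))
    # Worst misestimate first, whether it is an over- or underestimate.
    return sorted(findings, key=lambda f: max(f[4], 1 / f[4]), reverse=True)


for node_id, op, est, act, ratio in misestimates(SAMPLE_PLAN):
    print(f"node {node_id} {op}: estimated {est:.0f}, actual {act}, ratio {ratio:.0f}x")
```

On the sample plan only the Table Scan is reported, since the Hash Match estimate is within 20% of its actual row count.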
Workflow 2 — Deep dive into cardinality and statistics problems
Goal: Diagnose incorrect row estimates and root-cause them to stats, parameter sniffing, or complex predicates.
Steps:
- In Plan Explorer, click any operator and inspect Estimated Rows vs Actual Rows, and the Estimated vs Actual Row Size.
- If estimates are off by orders of magnitude, check:
  - Are statistics present and up-to-date on the columns used in predicates and joins?
  - Are there filtered statistics that would help?
  - Is parameter sniffing causing atypical parameter values at compile time?
- Reproduce with literal values: run the query with OPTION (RECOMPILE) or hard-coded literals to see if the estimated plan changes.
- Consider targeted fixes:
  - Update statistics WITH FULLSCAN or create filtered statistics.
  - Add appropriate indexes to support seeks and better joins.
  - For parameter-sensitive queries, use hints such as OPTION (RECOMPILE) or OPTIMIZE FOR UNKNOWN.
  - Rewrite predicates (e.g., avoid implicit conversions or non-SARGable expressions).
Concrete example: An equality predicate that compares a varchar column to an nvarchar parameter forces an implicit conversion of the column, which can block index seeks and distort estimates; matching the parameter data type to the column, or casting the parameter explicitly, fixes it.
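When an implicit conversion affects the plan, SQL Server records a PlanAffectingConvert warning in the showplan XML, so the data type mismatch described above can be detected in saved plans. Here is a small sketch under that assumption; the sample XML, the column and parameter names, and the function name are all hypothetical.

```python
import xml.etree.ElementTree as ET

URI = "http://schemas.microsoft.com/sqlserver/2004/07/showplan"

# Trimmed, hand-written plan fragment (hypothetical table and parameter names).
SAMPLE_PLAN = """<ShowPlanXML xmlns="http://schemas.microsoft.com/sqlserver/2004/07/showplan">
  <QueryPlan>
    <Warnings>
      <PlanAffectingConvert ConvertIssue="Seek Plan"
          Expression="CONVERT_IMPLICIT(nvarchar(20),[c].[AccountCode],0)=[@code]" />
    </Warnings>
  </QueryPlan>
</ShowPlanXML>"""


def convert_warnings(plan_xml):
    """Return (issue, expression) pairs for every implicit-conversion warning."""
    root = ET.fromstring(plan_xml)
    tag = f"{{{URI}}}PlanAffectingConvert"
    return [(w.get("ConvertIssue"), w.get("Expression")) for w in root.iter(tag)]


for issue, expression in convert_warnings(SAMPLE_PLAN):
    print(f"{issue}: {expression}")
```

A warning whose expression wraps a column (rather than a parameter) in CONVERT_IMPLICIT is the case worth fixing first, because the conversion is applied to every row the predicate touches.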
Workflow 3 — Memory and spill analysis
Goal: Find operators that cause memory pressure or spill to tempdb and remove the cause.
Steps:
- Check Plan Explorer’s operator warnings for tempdb spills, insufficient memory, or unusually large memory grants.
- Inspect Hash Match and Sort operators: check Build/Input sizes and Granted Memory fields.
- Correlate with server memory metrics and concurrent query workload. Large memory grants from multiple simultaneous queries can exhaust available memory.
- Remedial actions:
  - Create better supporting indexes to reduce large sorts/hash builds.
  - Encourage streaming operations (nested loops) where appropriate by improving seek predicates.
  - Reduce parallelism or adjust resource governor/workload isolation.
  - If spills are unavoidable, ensure tempdb is well-provisioned (multiple files, fast storage).
Example: A large hash join with BuildRows = 10M and BuildSize = 1GB that spills to tempdb suggests the need for an index that avoids full-table joins, or for staging smaller result sets first.
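Spill warnings and the memory grant are also recorded in actual plans as SpillToTempDb and MemoryGrantInfo elements, which makes it possible to scan a folder of saved plans for spills. The sketch below assumes that layout; the sample plan and the function name are illustrative only.

```python
import xml.etree.ElementTree as ET

URI = "http://schemas.microsoft.com/sqlserver/2004/07/showplan"
NS = {"sp": URI}

# Trimmed, hand-written actual plan (hypothetical numbers; memory values are in KB).
SAMPLE_PLAN = """<ShowPlanXML xmlns="http://schemas.microsoft.com/sqlserver/2004/07/showplan">
  <QueryPlan>
    <MemoryGrantInfo SerialRequiredMemory="1024" GrantedMemory="1048576" MaxUsedMemory="1048576" />
    <RelOp NodeId="0" PhysicalOp="Hash Match" EstimateRows="1000000">
      <Warnings>
        <SpillToTempDb SpillLevel="2" SpilledThreadCount="1" />
      </Warnings>
      <Hash>
        <RelOp NodeId="1" PhysicalOp="Table Scan" EstimateRows="10000000" />
      </Hash>
    </RelOp>
  </QueryPlan>
</ShowPlanXML>"""


def spill_report(plan_xml):
    """Return ([(node_id, operator, spill_level), ...], granted_memory_kb or None)."""
    root = ET.fromstring(plan_xml)
    spills = []
    for relop in root.iter(f"{{{URI}}}RelOp"):
        # Warnings is a direct child of the operator that raised it.
        for spill in relop.findall("sp:Warnings/sp:SpillToTempDb", NS):
            spills.append((int(relop.get("NodeId")), relop.get("PhysicalOp"),
                           int(spill.get("SpillLevel", "0"))))
    grant = root.find(".//sp:MemoryGrantInfo", NS)
    granted = grant.get("GrantedMemory") if grant is not None else None
    return spills, (int(granted) if granted is not None else None)


spills, granted_kb = spill_report(SAMPLE_PLAN)
print(spills, granted_kb)
```

A plan that spills even though its granted memory is fully used points toward an underestimated input, which ties this check back to the cardinality workflow.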
Workflow 4 — I/O, missing indexes, and seek vs scan tradeoffs
Goal: Minimize physical I/O by enabling seeks and covering indexes.
Steps:
- Use Plan Explorer’s Missing Index suggestions; inspect the suggested key and include columns to evaluate practicality.
- Compare IO statistics (logical/physical reads) reported alongside the plan or from SET STATISTICS IO.
- Consider index design tradeoffs:
  - Narrow nonclustered indexes for seeks on selective predicates.
  - Include columns to make indexes covering for heavy queries, avoiding lookups.
  - Avoid overindexing: balance write overhead vs read benefit.
- Validate the improvement by generating a new estimated plan after adding the index on a test system (or evaluate hypothetical indexes with the Database Engine Tuning Advisor).
Concrete rule: If a plan shows repeated RID/Key lookups on large rowcounts, a covering index can often convert nested loops + lookups into a single seek.
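To evaluate a missing-index suggestion outside the UI, the MissingIndexes section of the plan XML can be turned into a candidate CREATE INDEX statement for review. This is a sketch, assuming the usual EQUALITY, INEQUALITY, and INCLUDE column groups; the table, the columns, and the index naming scheme are hypothetical, and the output is a starting point to review, not a recommendation to apply blindly.

```python
import xml.etree.ElementTree as ET

URI = "http://schemas.microsoft.com/sqlserver/2004/07/showplan"
NS = {"sp": URI}

# Trimmed, hand-written plan fragment (hypothetical table and columns).
SAMPLE_PLAN = """<ShowPlanXML xmlns="http://schemas.microsoft.com/sqlserver/2004/07/showplan">
  <MissingIndexes>
    <MissingIndexGroup Impact="87.5">
      <MissingIndex Database="[Sales]" Schema="[dbo]" Table="[Orders]">
        <ColumnGroup Usage="EQUALITY"><Column Name="[CustomerID]" /></ColumnGroup>
        <ColumnGroup Usage="INEQUALITY"><Column Name="[OrderDate]" /></ColumnGroup>
        <ColumnGroup Usage="INCLUDE"><Column Name="[TotalDue]" /></ColumnGroup>
      </MissingIndex>
    </MissingIndexGroup>
  </MissingIndexes>
</ShowPlanXML>"""


def missing_index_ddl(plan_xml):
    """Return (impact, CREATE INDEX statement) pairs, highest impact first."""
    root = ET.fromstring(plan_xml)
    statements = []
    for group in root.iter(f"{{{URI}}}MissingIndexGroup"):
        impact = float(group.get("Impact", "0"))
        for index in group.findall("sp:MissingIndex", NS):
            cols = {}
            for column_group in index.findall("sp:ColumnGroup", NS):
                names = [c.get("Name") for c in column_group.findall("sp:Column", NS)]
                cols.setdefault(column_group.get("Usage"), []).extend(names)
            # Equality columns lead the key, inequality columns follow.
            keys = cols.get("EQUALITY", []) + cols.get("INEQUALITY", [])
            includes = cols.get("INCLUDE", [])
            name = "IX_" + "_".join(k.strip("[]") for k in keys)  # hypothetical naming scheme
            target = index.get("Schema") + "." + index.get("Table")
            ddl = "CREATE NONCLUSTERED INDEX [" + name + "] ON " + target
            ddl += " (" + ", ".join(keys) + ")"
            if includes:
                ddl += " INCLUDE (" + ", ".join(includes) + ")"
            statements.append((impact, ddl + ";"))
    return sorted(statements, reverse=True)


for impact, ddl in missing_index_ddl(SAMPLE_PLAN):
    print(f"-- estimated impact {impact}%")
    print(ddl)
```

Seeing the full statement makes it easier to judge practicality, for example whether a long INCLUDE list would make the index nearly as wide as the table.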
Workflow 5 — Plan comparison and regression detection
Goal: Determine what changed between a known-good plan and a problematic one.
Steps:
- Load both plans into Plan Explorer and use the Plan Comparison feature.
- Focus on differences in:
  - Join order and join types (nested loops vs hash vs merge).
  - Index usage (seek vs scan).
  - Estimated vs Actual Row mismatches and operator costs.
  - Warnings or missing index hints present only in one plan.
- Investigate possible causes for plan regression:
  - Statistics updates or lack thereof.
  - New/changed indexes or schema changes.
  - Parameter value changes leading to different cardinality estimates.
  - Changes in server load, memory, or parallelism settings.
- Rollback or recompile options:
  - Use plan forcing (Query Store or plan guides) carefully, and only after the root cause is confirmed.
  - Test forced plans for different parameter values to ensure stability.
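A coarse version of this comparison can be automated for regression checks by counting physical operators in two saved plans and reporting what changed. The sketch below is far cruder than Plan Explorer's comparison (it ignores join order and costs), and both sample plans are hand-written stand-ins.

```python
import xml.etree.ElementTree as ET
from collections import Counter

URI = "http://schemas.microsoft.com/sqlserver/2004/07/showplan"

# Hand-written stand-ins for a known-good and a regressed plan (hypothetical).
GOOD_PLAN = """<ShowPlanXML xmlns="http://schemas.microsoft.com/sqlserver/2004/07/showplan">
  <RelOp NodeId="0" PhysicalOp="Nested Loops">
    <NestedLoops>
      <RelOp NodeId="1" PhysicalOp="Index Seek" />
      <RelOp NodeId="2" PhysicalOp="Index Seek" />
    </NestedLoops>
  </RelOp>
</ShowPlanXML>"""

BAD_PLAN = """<ShowPlanXML xmlns="http://schemas.microsoft.com/sqlserver/2004/07/showplan">
  <RelOp NodeId="0" PhysicalOp="Hash Match">
    <Hash>
      <RelOp NodeId="1" PhysicalOp="Table Scan" />
      <RelOp NodeId="2" PhysicalOp="Index Seek" />
    </Hash>
  </RelOp>
</ShowPlanXML>"""


def operator_counts(plan_xml):
    """Count how often each physical operator appears in a plan."""
    root = ET.fromstring(plan_xml)
    return Counter(r.get("PhysicalOp") for r in root.iter(f"{{{URI}}}RelOp"))


def operator_diff(good_xml, bad_xml):
    """Return {operator: (count_in_good, count_in_bad)} for operators whose count changed."""
    good, bad = operator_counts(good_xml), operator_counts(bad_xml)
    return {op: (good[op], bad[op])
            for op in sorted(set(good) | set(bad))
            if good[op] != bad[op]}


for op, (before, after) in operator_diff(GOOD_PLAN, BAD_PLAN).items():
    print(f"{op}: {before} -> {after}")
```

A non-empty diff is only a trigger for a closer look in Plan Explorer; an empty diff does not prove the plans are equivalent, because the same operators can appear in a different order or against different indexes.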
Workflow 6 — Correlating wait stats and runtime behavior
Goal: Connect plan-level problems to observed waits and server symptoms.
Steps:
- Collect wait stats during the problematic period (sys.dm_os_wait_stats or monitoring tool data).
- Map common waits to plan causes:
  - SOS_SCHEDULER_YIELD or CXPACKET → CPU contention or parallelism issues.
  - PAGEIOLATCH_* → physical I/O problems, likely due to scans or missing indexes.
  - PAGELATCH_* or LCK_M_* → contention on in-memory pages or locks, possibly from heavy tempdb use or hot pages.
- Use Plan Explorer to find operators that align with these waits (large scans for PAGEIOLATCH_*, heavy tempdb activity such as sorts and spills for PAGELATCH_*).
- Remediate at the appropriate layer (query rewrite, indexing, server/IO tuning, resource governor).
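The wait-to-cause mapping above can be captured as a small lookup that annotates the top waits from a sys.dm_os_wait_stats snapshot. This is a minimal sketch: the hint text paraphrases the list above, the sample numbers are invented, and real use would work from the difference between two snapshots rather than cumulative totals.

```python
# Prefix rules paraphrasing the mapping in this section; extend as needed.
WAIT_RULES = [
    ("PAGEIOLATCH_", "physical I/O: look for large scans or missing indexes"),
    ("PAGELATCH_", "in-memory page contention: look for hot pages or heavy tempdb use"),
    ("LCK_M_", "lock contention: look for blocking and long transactions"),
    ("CXPACKET", "parallelism: review skewed or expensive parallel operators"),
    ("SOS_SCHEDULER_YIELD", "CPU pressure: look for CPU-heavy operators"),
]


def classify(wait_type):
    """Map a wait type name to a troubleshooting hint, or 'unclassified'."""
    for prefix, hint in WAIT_RULES:
        if wait_type.startswith(prefix):
            return hint
    return "unclassified"


def top_waits(wait_ms, n=3):
    """Return the n largest waits as (wait_type, wait_time_ms, hint) tuples."""
    ranked = sorted(wait_ms.items(), key=lambda item: item[1], reverse=True)
    return [(name, ms, classify(name)) for name, ms in ranked[:n]]


# Invented sample: wait_type -> wait_time_ms accumulated during the incident window.
SAMPLE_WAITS = {
    "PAGEIOLATCH_SH": 90000,
    "CXPACKET": 40000,
    "LCK_M_X": 12000,
    "WRITELOG": 5000,
}

for name, ms, hint in top_waits(SAMPLE_WAITS):
    print(f"{name:<16} {ms:>8} ms  {hint}")
```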
Validating fixes and establishing repeatable testing
- After applying a fix (index, stats update, hint), capture new actual plans and compare to prior plans in Plan Explorer.
- Use realistic parameter sets and concurrency to validate under load.
- Automate capture of baseline plans for key queries and schedule periodic comparison to detect regressions early.
Best practices and operational tips
- Keep statistics current: schedule regular UPDATE STATISTICS jobs WITH FULLSCAN for critical tables, or rely on sampled updates and asynchronous auto-update (AUTO_UPDATE_STATISTICS_ASYNC) with care.
- Use Query Store for historical plan capture and forcing stable plans when necessary.
- Use repeatable test harnesses that replay production-like workloads to validate fixes before production deployment.
- Train the team to read Plan Explorer’s warnings; many common issues are highlighted directly in the UI.
- When forcing plans, monitor for edge-case regressions—forced plans can hurt other parameter sets.
Example case study (concise)
Problem: A reporting query suddenly slowed from 5s to 90s after a data refresh.
Diagnosis with Plan Explorer:
- Comparison showed a switch from Index Seek + Nested Loops to Table Scan + Hash Join.
- ActualRows for a key join were 10x higher than estimated in the slow plan.
- Operator warnings indicated spills to tempdb for the Hash Join.
Fixes applied:
- Updated statistics WITH FULLSCAN.
- Added a selective nonclustered index covering the join and filter columns.
Result: Query time recovered to 6s; the new plan reverted to seeks and nested loops with no spills.
Conclusion
SentryOne Plan Explorer accelerates advanced troubleshooting by making execution plan differences, cardinality issues, memory pressure, and missing-index opportunities easy to spot and act on. Use it as part of a disciplined workflow: capture accurate artifacts, prioritize hotspots, root-cause with statistics and waits, apply targeted fixes, and validate under realistic conditions. With these workflows you’ll reduce mean time to resolution for complex SQL Server performance problems.