Monitor SQL Server Efficiently
Remus Rusanu
#sqlsaturday565
24 Sept 2016, Bucharest
24 Sept 2017
SQLSaturday 565 Bucharest
whoami
- Worked on the SQL Server team at Microsoft, starting in 2001
- Now founder of DBHistory.com
- Spent many hours investigating performance
- On-call performance engineer for Azure SQL DB
  - Troubleshooting based on telemetry alone
Why do we monitor?
- Troubleshooting
  - We need to investigate incidents as they occur, and we need to measure right now
  - On demand, high resolution, short duration, discardable
  - Requires access to the system being monitored
- Post mortem analysis
  - After an incident we need to look back to understand why it occurred
  - Contiguous, medium resolution, discardable after a grace period
  - Missing data feedback loop
- Baselining
  - Periodically collect the same data we would collect in troubleshooting, so we can compare an incident with normal activity
  - On schedule, high resolution, medium duration, long retention
- Trending
  - Understand long-term direction, estimate capacity needs, justify spending requests
  - Contiguous, low resolution, long retention
- Alerting
  - Detect incidents, notify the on-call team and trigger an investigation
  - Alerting coupled with automation: mitigation bots for known issues
What do we monitor?
- Activity
  - What is running
- Capacity
  - Free disk, space used
  - CPU utilization
  - IO utilization
  - Memory use, paging
  - Network use, bandwidth, latency
- Availability
  - Uptime, SLA
  - Errors
- Recoverability
  - Backups
  - Availability Groups
- Specific features
  - Replication
How do we monitor?
- Performance Counters
  - The gold standard when it comes to measurement
  - Easy to collect, low impact, rich toolset, cheap to store
- DMVs
  - Difficult to collect, many require snapshot-store-and-compare
  - Some have significant impact
- XEvents
  - Abundant information
  - Easy to filter at source
  - Difficult to collect
- ETW
- Logs
- Event Notifications
- Query Store
USE methodology
http://www.brendangregg.com/usemethod.html
- Utilization, Saturation and Errors
- Identify the resources in the system
- For each resource, identify metrics that represent utilization (in use vs. idle) and saturation (queueing, blocking, waiting). Identify error indicators (events, logs, etc.)
- When investigating, iterate through the resources:
  - Look at error indicators
  - Look for saturation indicators
  - Look for a high utilization percentage
- Generic methodology: the trick is identifying the resources and collecting/finding the associated metrics
- Can be applied at host level (CPU, IO, network) but also at the level of SQL Server internals
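The USE iteration above can be sketched as a checklist walk over resources. This is a minimal Python sketch, not part of the talk: the ResourceMetrics fields, the resource names in the demo and the default thresholds are hypothetical, and in practice each field would be fed from whichever counters or DMVs represent that resource.

```python
from dataclasses import dataclass


@dataclass
class ResourceMetrics:
    # Hypothetical per-resource sample; field names are illustrative only.
    name: str
    utilization_pct: float  # busy vs. idle over the interval, 0-100
    saturation: float       # queueing/blocking/waiting indicator, e.g. queue length
    errors: int             # error indicators seen in the interval


def use_checklist(resources, util_threshold=80.0, saturation_threshold=0.0):
    """Walk every resource in USE order: errors, then saturation, then utilization.

    The thresholds are assumptions; 'good' vs. 'bad' values depend on the system.
    Returns a list of (resource, check, value) findings.
    """
    findings = []
    for r in resources:
        if r.errors > 0:
            findings.append((r.name, "errors", r.errors))
        if r.saturation > saturation_threshold:
            findings.append((r.name, "saturation", r.saturation))
        if r.utilization_pct >= util_threshold:
            findings.append((r.name, "utilization", r.utilization_pct))
    return findings


if __name__ == "__main__":
    sample = [
        ResourceMetrics("CPU", utilization_pct=95.0, saturation=4.0, errors=0),
        ResourceMetrics("Disk D:", utilization_pct=40.0, saturation=0.0, errors=2),
    ]
    for finding in use_checklist(sample):
        print(finding)
```

Checking errors and saturation before utilization follows the order listed on the slide: a disk at 40% utilization can still be the culprit if it is reporting errors or queueing.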
SQL Server Query Execution
http://rusanu.com/2013/08/01/understanding-how-sql-server-executes-a-query
Performance Counters
- Extremely cheap for a process to produce performance counters
  - Increment a memory location in a shared memory area
- Extremely cheap for monitoring to read a value
  - Read the value via shared memory
- Low impact
- Rich toolset
  - SDK: .NET, PDH native
  - OS service for collecting them (Data Collector Sets)
  - perfmon.exe, logman.exe, typeperf.exe
  - Can write directly to SQL (and this is supported by Data Collector Sets)
  - PowerShell supports direct counter querying with Get-Counter
    - Data Collector Sets must go through the COM object Pla.DataCollectorSet
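The "increment a memory location in shared memory" mechanism can be illustrated with a rough Python analogy. This is not how PDH or the .NET SDK are implemented: the SharedCounter class is hypothetical, uses multiprocessing.shared_memory as a stand-in for the Windows shared memory section, and assumes a single writer, since its read-modify-write is not atomic.

```python
import struct
from multiprocessing import shared_memory

_FMT = "<q"                    # one signed 64-bit little-endian integer
_SIZE = struct.calcsize(_FMT)  # 8 bytes


class SharedCounter:
    """Hypothetical analogy of a performance counter: one integer in shared memory.

    The producer bumps the value in place; a monitor attaches to the same block
    by name and reads the bytes directly, without calling into the producer.
    Single writer assumed: increment() is not atomic.
    """

    def __init__(self, name=None, create=True):
        self._shm = shared_memory.SharedMemory(
            name=name, create=create, size=_SIZE if create else 0
        )
        if create:
            struct.pack_into(_FMT, self._shm.buf, 0, 0)

    @property
    def name(self):
        return self._shm.name

    def increment(self, by=1):
        # Producer side: read-modify-write of a single memory location.
        value = struct.unpack_from(_FMT, self._shm.buf, 0)[0]
        struct.pack_into(_FMT, self._shm.buf, 0, value + by)

    def read(self):
        # Monitor side: just read the current bytes, no locks, no calls.
        return struct.unpack_from(_FMT, self._shm.buf, 0)[0]

    def close(self, unlink=False):
        self._shm.close()
        if unlink:
            self._shm.unlink()
```

In the pattern this models, the producing process creates the counter and a separate monitoring process attaches with SharedCounter(name=..., create=False); a read costs a memory access rather than a query, which is why collection has such low impact.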
Performance Counters tools
- perfmon.exe
  - Interactive GUI for Data Collector Sets management
  - Interactive GUI for on-demand counter collection
  - Graphic visualization for both real-time and historic logs
- logman.exe
  - CLI for Data Collector Sets (counters, ETW, alerts)
- Countless 3rd party tools
- DMV sys.dm_os_performance_counters
  - I advise against its use: expensive to query, difficult to read correctly
  - Has the advantage of being available over TDS
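Part of why sys.dm_os_performance_counters is difficult to read correctly is that cntr_value means different things depending on cntr_type: some rows are point-in-time values, "/sec" rows are cumulative totals that must be diffed between two snapshots, and ratio and average rows must be combined with a separate base row. A minimal Python sketch of that decoding, assuming the caller has already taken two snapshots and paired each ratio or average counter with its base row (base row names are not perfectly uniform, so the pairing is left to the caller); the function name is hypothetical:

```python
# cntr_type values reported by sys.dm_os_performance_counters
PERF_COUNTER_LARGE_RAWCOUNT = 65792   # point-in-time value, use as-is
PERF_COUNTER_BULK_COUNT = 272696576   # cumulative count behind "/sec" counters
PERF_LARGE_RAW_FRACTION = 537003264   # ratio numerator, needs its base row
PERF_AVERAGE_BULK = 1073874176        # cumulative average numerator, needs its base row
PERF_LARGE_RAW_BASE = 1073939712      # denominator row for the two above


def counter_value(cntr_type, prev, curr, elapsed_s, prev_base=None, curr_base=None):
    """Turn raw cntr_value readings from two snapshots into a displayable value.

    prev/curr: cntr_value at the first and second snapshot
    elapsed_s: seconds between the two snapshots
    prev_base/curr_base: cntr_value of the matching base row, when one is needed
    Returns None when the value is undefined for the interval (zero denominator).
    """
    if cntr_type == PERF_COUNTER_LARGE_RAWCOUNT:
        return curr
    if cntr_type == PERF_COUNTER_BULK_COUNT:
        # Cumulative since startup: only the difference over time is meaningful.
        return (curr - prev) / elapsed_s if elapsed_s > 0 else None
    if cntr_type == PERF_LARGE_RAW_FRACTION:
        # e.g. Buffer cache hit ratio: value / base, shown as a percentage.
        return 100.0 * curr / curr_base if curr_base else None
    if cntr_type == PERF_AVERAGE_BULK:
        # e.g. Average Wait Time (ms): both rows are cumulative, so diff both.
        delta_base = curr_base - prev_base
        return (curr - prev) / delta_base if delta_base else None
    raise ValueError(f"unsupported cntr_type {cntr_type}")
```

Note that only the "/sec" and average types need two snapshots; reading a single snapshot of them returns totals since startup, which is the most common way this DMV gets misread.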
Performance Counters SQL logging option
- Read and write counter values into an ODBC-defined destination
  - Define an ODBC 64-bit System DSN, specify the database name explicitly
- A feature of PDH, available to all PDH consumers
  - SDK: PdhOpenLog (…, PDH_LOG_TYPE_SQL, …)
  - perfmon, logman, typeperf
- On first connect it will deploy the SQL tables
  - Schema is documented at https://msdn.microsoft.com/en-us/library/windows/desktop/aa373198(v=vs.85).aspx
  - The CounterData table is a perfect columnstore candidate
SQL Logging table schema
Interpreting the SQL logged data
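Reading the logged data back means joining the sample rows in CounterData with the counter metadata in CounterDetails on CounterID. A minimal Python sketch using sqlite3 as a stand-in for SQL Server: only a subset of the documented columns is recreated, the rows in the hypothetical build_demo_db helper are made up, and the timestamp format used to parse CounterDateTime (which is stored as text, not as a datetime) is an assumption to verify against real logged data.

```python
import sqlite3
from datetime import datetime

# Assumed text format of CounterDateTime; verify against your own logged data.
_TS_FORMAT = "%Y-%m-%d %H:%M:%S.%f"

_QUERY = """
    SELECT d.CounterDateTime, d.CounterValue
    FROM CounterData AS d
    JOIN CounterDetails AS c ON c.CounterID = d.CounterID
    WHERE c.ObjectName = ? AND c.CounterName = ?
      AND (c.InstanceName = ? OR (? IS NULL AND c.InstanceName IS NULL))
    ORDER BY d.CounterDateTime
"""


def counter_series(conn, object_name, counter_name, instance_name=None):
    """Return [(datetime, value), ...] for one counter, oldest sample first."""
    params = (object_name, counter_name, instance_name, instance_name)
    return [
        (datetime.strptime(ts.strip()[:23], _TS_FORMAT), value)
        for ts, value in conn.execute(_QUERY, params)
    ]


def build_demo_db():
    """Hypothetical toy database: a subset of the real columns, made-up rows."""
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE CounterDetails (
            CounterID INTEGER PRIMARY KEY, ObjectName TEXT,
            CounterName TEXT, InstanceName TEXT);
        CREATE TABLE CounterData (
            CounterID INTEGER, CounterDateTime TEXT, CounterValue REAL);
        INSERT INTO CounterDetails VALUES (1, 'Processor', '% Processor Time', '_Total');
        INSERT INTO CounterDetails VALUES (2, 'Memory', 'Available Bytes', NULL);
        INSERT INTO CounterData VALUES (1, '2016-09-24 10:00:15.000', 20.0);
        INSERT INTO CounterData VALUES (1, '2016-09-24 10:00:00.000', 10.0);
        INSERT INTO CounterData VALUES (2, '2016-09-24 10:00:00.000', 4096.0);
    """)
    return conn
```

Against a real logging database the same join runs unchanged through any SQL Server client; only the connection and the parameter placeholder style differ.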
Reading SQL logged counters using Perfmon
What to collect: CPU
- Processor(_Total)\% Processor Time
  - Processor(*) (one instance per CPU) seldom justifies the overhead
- Processor(_Total)\% Privileged Time
- Process(*)\% Processor Time
  - Can help track down when other processes starve the CPU
  - Expensive to collect: each process is a separate instance
  - Collecting Process(sqlservr)\% Processor Time and Process(_Total)\% Processor Time can at least shift the blame, but not pinpoint the culprit
- Buffer Manager\Page lookups/sec
  - Indicative of scans; it helps explain high CPU
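Shifting the blame comes down to a little arithmetic. A minimal sketch, assuming both inputs are sampled over the same interval and that Process(sqlservr)\% Processor Time is scaled so that 100 means one fully busy logical CPU (which is why it can exceed 100 on multi-core machines). This variant takes the machine-wide figure from Processor(_Total) rather than Process(_Total), because Process(_Total) also includes the Idle process; the function name is hypothetical.

```python
def cpu_blame(processor_total_pct, sqlservr_process_pct, logical_cpus):
    r"""Split machine-wide CPU into SQL Server vs. everything else.

    processor_total_pct:  Processor(_Total)\% Processor Time, 0-100 for the whole machine
    sqlservr_process_pct: Process(sqlservr)\% Processor Time, 100 per busy logical CPU
    logical_cpus:         number of logical processors
    Returns (sql_server_pct, other_pct), both on the 0-100 machine-wide scale.
    """
    sql_pct = sqlservr_process_pct / logical_cpus
    # Clamp at zero: the two counters are sampled separately and can disagree slightly.
    other_pct = max(processor_total_pct - sql_pct, 0.0)
    return sql_pct, other_pct
```

For example, Processor(_Total) at 75% with sqlservr at 400% on 8 logical CPUs gives 50% for SQL Server and 25% for everything else: enough to shift the blame, while finding the actual culprit still needs Process(*).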
What to collect: memory (OS)
- Memory\Page Reads/sec
  - These are hard page faults (reads from disk); Memory\Page Faults/sec counts both soft and hard faults
- Memory\% Committed Bytes In Use
- Memory\Available Bytes
- Memory\Commit Limit
- Process(sqlservr)\Private Bytes
What to collect: memory (SQL)
- Memory Manager\*
  - Really: collect every counter if you can afford it
  - No magic formula for ‘good’ vs. ‘bad’ values, but they can be compared with the baseline
  - Memory Grants Outstanding
  - Memory Grants Pending
- Buffer Manager\Buffer cache hit ratio
- Buffer Node(*)\Page Life Expectancy
What to collect: IO (OS)
- Process(sqlservr)\
  - IO Read Operations/sec
  - IO Read Bytes/sec
  - IO Write Operations/sec
  - IO Write Bytes/sec
  - These counters measure all IO (disk, network, devices)
- PhysicalDisk\
  - What to capture and how to interpret it is a black art
  - Windows Performance Monitor Disk Counters Explained
    https://blogs.technet.microsoft.com/askcore/2012/03/16/windows-performance-monitor-disk-counters-explained/
  - ‘Good’ vs. ‘bad’ values are highly dependent on hardware
- Memory\Pages/sec
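Dividing the byte rates by the operation rates gives the average IO size, which helps read the Process(sqlservr) IO counters: large average reads tend to point to scans and read-ahead rather than single-page lookups, keeping in mind that these counters also include network and device IO. A minimal sketch; the function name is hypothetical.

```python
def average_io_sizes_kb(read_bytes_per_sec, read_ops_per_sec,
                        write_bytes_per_sec, write_ops_per_sec):
    r"""Average read and write size in KB over one sample interval.

    Inputs are Process(sqlservr)\IO Read Bytes/sec, IO Read Operations/sec,
    IO Write Bytes/sec and IO Write Operations/sec from the same sample.
    A side is None when there were no operations in the interval.
    """
    def avg_kb(bytes_per_sec, ops_per_sec):
        return bytes_per_sec / ops_per_sec / 1024 if ops_per_sec > 0 else None

    return (avg_kb(read_bytes_per_sec, read_ops_per_sec),
            avg_kb(write_bytes_per_sec, write_ops_per_sec))
```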
What to collect: IO (SQL)
- Buffer Manager\
  - Page reads/sec
  - Readahead pages/sec
  - Page writes/sec
  - Background writer pages/sec
  - Checkpoint pages/sec
  - Extension page reads/sec
  - Extension page writes/sec
  - Lazy writes/sec
- Databases(*)\
  - Log Bytes Flushed/sec
  - Backup/Restore Throughput/sec
Understanding how SQL Server executes a query
http://rusanu.com/2013/08/01/understanding-how-sql-server-executes-a-query/
- TL/DR: [figure: a request's elapsed time alternates between CPU and Wait segments along the Time axis]
What to collect: blocking (the aspiration slide)
- sys.dm_os_wait_stats
- sys.dm_os_latch_stats
- sys.dm_os_spinlock_stats
- TODO: fill in when decent monitoring is possible
  - Ever-increasing values
  - Require snapshot-store-and-compare
  - Can be reset, which is difficult to detect in processing logic
  - No differentiation between idle waits and busy waits, resulting in tribal knowledge of ‘benign waits’
- However, still important to collect… somehow
- sys.dm_exec_session_wait_stats
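Snapshot-store-and-compare for sys.dm_os_wait_stats boils down to diffing two snapshots keyed by wait_type. A minimal Python sketch, assuming each snapshot has already been read into a dict of (waiting_tasks_count, wait_time_ms, signal_wait_time_ms) tuples. Treating a value that went down as a reset (for example after DBCC SQLPERF('sys.dm_os_wait_stats', CLEAR)) is a heuristic: whatever accrued before the reset within that interval is lost. The ignore set stands in for a ‘benign waits’ list and its contents are up to the caller; the function name is hypothetical.

```python
def wait_stats_delta(previous, current, ignore=frozenset()):
    """Diff two sys.dm_os_wait_stats snapshots.

    previous/current: {wait_type: (waiting_tasks_count, wait_time_ms, signal_wait_time_ms)}
    ignore: wait types to skip (e.g. a caller-maintained 'benign waits' list)
    Returns {wait_type: delta tuple} for wait types that accrued anything.
    """
    delta = {}
    for wait_type, now in current.items():
        if wait_type in ignore:
            continue
        before = previous.get(wait_type, (0, 0, 0))
        if any(n < b for n, b in zip(now, before)):
            # Cumulative values went backwards: assume a reset, keep what accrued since.
            diff = now
        else:
            diff = tuple(n - b for n, b in zip(now, before))
        if any(diff):
            delta[wait_type] = diff
    return delta
```

Storing only these per-interval deltas, rather than the raw cumulative values, is what makes the data usable later for baselining and trending.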
What to collect: blocking (the pragmatic slide)
- Wait Statistics(*)\Average wait time (ms)
  - A performance counter, easy to collect and analyze
  - There is an instance per wait type
    - but only a few selected wait types are represented
- Latches\Average Latch Wait Time (ms)
- Latches\Total Latch Wait Time (ms)
- Locks\Average Wait Time (ms)
- Locks\Number of Deadlocks/sec
- General Statistics\Processes blocked
What to collect: workload
- SQL Statistics\Batch Requests/sec
- SQL Statistics\SQL Attention rate
  - ‘Attention’ is the TDS jargon for a client cancel, typically a command timeout
- Transactions\Transactions
- General Statistics\User Connections
- SQL Errors(_Total)\Errors/sec
What to collect: miscellaneous
- LogicalDisk(*)\% Free Space
- Databases(*)\Data File(s) Size (KB)
- Databases(*)\Log Growths
- Plan Cache(*)\Cache Objects Count
- General Statistics\Temp tables creation rate
- Access Methods\
  - Full scans/sec, Probe Scans/sec, Range Scans/sec, Index Searches/sec
  - Page Splits/sec
  - Skipped Ghosted Records/sec, Forwarded Records/sec
Measure workload response distribution
- Batch Resp Statistics(*)\*
  - “one latency distribution plot is worth a thousand throughput measurements”
- A typical SQL Server has a heterogeneous workload
  - X batches complete in 0.01 s, Y in 1 s and Z in 100 s
  - Is it one query with a wide latency distribution? Or is it 3 very different queries?
- Latency for specific queries is better measured in the app
  - It is trivial to expose new counters from apps
    http://rusanu.com/2009/04/11/using-xslt-to-generate-performance-counters-code/
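A response time distribution read from bucketed counters can be summarized as percentiles. A minimal sketch, assuming the bucket counts have already been turned into per-interval deltas and are given as (lower_ms, upper_ms, count); the bucket edges in the demo are illustrative, not the exact edges used by the Batch Resp Statistics counters, and the function name is hypothetical. Returning the upper edge of the bucket is a conservative estimate, since bucketed data cannot tell where inside the bucket a batch fell.

```python
import math


def bucket_percentile(buckets, p):
    """Upper edge (ms) of the bucket that contains the p-th percentile.

    buckets: [(lower_ms, upper_ms, count), ...] sorted by lower_ms;
             the last upper_ms may be math.inf
    p:       percentile between 0 and 100
    Returns None when there are no samples.
    """
    total = sum(count for _, _, count in buckets)
    if total == 0:
        return None
    target = p * total / 100.0
    running = 0
    for _, upper, count in buckets:
        running += count
        if count and running >= target:
            return upper
    return buckets[-1][1]


if __name__ == "__main__":
    # Illustrative buckets: 900 batches under 10 ms, 90 under 1 s, 10 slower.
    buckets = [(0, 10, 900), (10, 1000, 90), (1000, math.inf, 10)]
    for p in (50, 95, 99.5):
        print(p, bucket_percentile(buckets, p))
```

With these illustrative buckets the median is under 10 ms while the slowest 1% took over a second, which is the kind of heterogeneity a single throughput figure hides.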
Collecting via querying
- Collecting data by periodically querying catalog views/DMVs
  - Expensive: some DMVs have serious performance overhead
  - Snapshot-store-and-compare
  - Many DMVs arbitrarily reset internally, which makes them unreliable for monitoring
  - Still, some information is hard to discover any other way:
    - sys.dm_io_virtual_file_stats
    - sys.dm_db_index_usage_stats
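sys.dm_io_virtual_file_stats is a typical snapshot-store-and-compare source: its num_of_reads, io_stall_read_ms, num_of_writes and io_stall_write_ms columns are cumulative per (database_id, file_id), so average latency over an interval comes from the deltas. A minimal Python sketch, assuming both snapshots were already read into dicts; a negative delta (the values start over, for instance after an instance restart) yields None instead of a bogus number. The function name is hypothetical.

```python
def file_latency_ms(previous, current):
    """Average read/write latency per file between two snapshots.

    previous/current: {(database_id, file_id):
                       (num_of_reads, io_stall_read_ms, num_of_writes, io_stall_write_ms)}
    Returns {(database_id, file_id): (avg_read_ms, avg_write_ms)}; a side is None
    when there was no IO in the interval or the values went backwards.
    """
    result = {}
    for key, (reads, read_stall, writes, write_stall) in current.items():
        p_reads, p_read_stall, p_writes, p_write_stall = previous.get(key, (0, 0, 0, 0))
        d_reads, d_writes = reads - p_reads, writes - p_writes
        d_read_stall, d_write_stall = read_stall - p_read_stall, write_stall - p_write_stall
        avg_read = d_read_stall / d_reads if d_reads > 0 and d_read_stall >= 0 else None
        avg_write = d_write_stall / d_writes if d_writes > 0 and d_write_stall >= 0 else None
        result[key] = (avg_read, avg_write)
    return result
```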
Collecting query execution stats
- Don’t. Use Query Store instead. But if you must:
  - sys.dm_exec_query_stats
  - sys.dm_exec_procedure_stats
  - Avoid the join with sys.dm_exec_sql_text and sys.dm_exec_query_plan
    - Very expensive
  - Honor query_hash and query_plan_hash
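Honoring query_hash means aggregating on it instead of treating every cached entry separately: ad hoc statements that differ only in literal values get separate cache entries but share a query_hash, and query_plan_hash shows whether they also share a plan shape. A minimal Python sketch over rows already read from sys.dm_exec_query_stats, using its query_hash, query_plan_hash, execution_count and total_worker_time columns (total_worker_time is reported in microseconds); the function name is hypothetical.

```python
from collections import defaultdict


def aggregate_by_query_hash(rows):
    """Group sys.dm_exec_query_stats rows by query_hash.

    rows: iterable of dicts with query_hash, query_plan_hash,
          execution_count and total_worker_time (microseconds)
    Returns {query_hash: {"execution_count", "total_worker_time", "plan_count"}}.
    """
    totals = defaultdict(lambda: {"execution_count": 0, "total_worker_time": 0, "plans": set()})
    for row in rows:
        agg = totals[row["query_hash"]]
        agg["execution_count"] += row["execution_count"]
        agg["total_worker_time"] += row["total_worker_time"]
        agg["plans"].add(row["query_plan_hash"])
    return {
        query_hash: {
            "execution_count": agg["execution_count"],
            "total_worker_time": agg["total_worker_time"],
            "plan_count": len(agg["plans"]),
        }
        for query_hash, agg in totals.items()
    }
```

A query_hash with a plan_count above one is worth a look: the same logical query is being compiled into different plan shapes.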
Using Event Notifications
create queue notifications;
go
create service notifications
on queue notifications
([http://schemas.microsoft.com/SQL/Notifications/PostEventNotification]);
go
create event notification [sqlsaturday]
on server
for ddl_events
to service N'notifications', N'current database';
go
waitfor(receive cast(message_body as xml) from notifications);
DDL_EVENTS
- Captures any DDL, in any database, for any object type
  - CREATE/ALTER/DROP
  - GRANT/DENY/REVOKE
  - sp_rename, sp_tableoption, etc.
- Captures any configuration change
  - sp_configure
  - Database-scoped options
- It can be made more granular if desired
  - Capture only a specific database
  - Capture only specific events:
    select * from sys.event_notification_event_types
Event Notifications for Profiler events
- Bridges Profiler events as Event Notification messages
  - select * from sys.trace_events
- It can do everything an administrative trace can do, plus:
  + Deliver the message remotely
  + Trigger an activated procedure
Collect all warnings? File Growth? Deadlocks? Logins?
create event notification [trace_events]
on server
for Hash_Warning, Execution_Warnings,
Sort_Warnings, Bitmap_Warning,
Log_File_Auto_Grow, Deadlock_graph,
Audit_Login_Failed, Audit_Login
to service …
Collecting Event Notifications
- Events can be delivered remotely
- All events look the same: an XML payload
  - Must shred the XML and discover object types, object names and event types
  - Some events apply to multiple objects, e.g. CREATE INDEX
- Coupled with activation, can trigger warnings or mitigation
- Reliable delivery, can survive disconnects
  - This also means that events always refer to the past: automated mitigation must check the current state before proceeding
- No toolset whatsoever
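Shredding the payload is mostly a matter of pulling a few well-known elements out of the EVENT_INSTANCE document. A minimal client-side Python sketch, assuming the message body has already been received and decoded to an XML string using the element names of EVENTDATA() (EventType, PostTime, DatabaseName, SchemaName, ObjectName, ObjectType, plus TargetObjectName and TargetObjectType for events such as CREATE INDEX that touch a second object). Elements absent from a given event type come back as None; the function name is hypothetical.

```python
import xml.etree.ElementTree as ET

# Output key -> EVENT_INSTANCE child element name
_FIELDS = {
    "event_type": "EventType",
    "post_time": "PostTime",
    "database": "DatabaseName",
    "schema": "SchemaName",
    "object": "ObjectName",
    "object_type": "ObjectType",
    "target_object": "TargetObjectName",
    "target_object_type": "TargetObjectType",
}


def shred_event(message_body):
    """Flatten one EVENT_INSTANCE payload into a dict; absent elements become None."""
    root = ET.fromstring(message_body)
    return {key: root.findtext(tag) for key, tag in _FIELDS.items()}
```

For a CREATE INDEX event, object holds the index name while target_object holds the table, which is the "multiple objects" case mentioned above.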
Query Data Store
- Absolutely the best option for monitoring query performance
  - Optimized, hooked deep into execution
  - Cannot be simulated with DMV snapshotting
- Excellent troubleshooting tool
  - Runtime stats, compilation stats
Query Store missing features
- Collection of wait info
- Centralized aggregation of multiple sources
- Collect info from readable secondaries
XEvents
- Overwhelmingly rich information
- Cheap to produce
- Built-in analytical capabilities
  - Counter, Histogram, Pair Matching
- Difficult to collect
  - Requires ETL through a SQL Server
- Poor toolset
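The Pair Matching target keeps only the "begin" events that never saw their matching "end" event, which is how it surfaces, for example, locks acquired but not yet released. A minimal Python sketch of the same idea over events already collected, assuming each event is a dict with a name and the matching columns; the function name is hypothetical, and matching repeated keys first-in, first-out is a simplification of this sketch, not a statement about how the target itself behaves.

```python
from collections import defaultdict, deque


def unmatched_begins(events, begin_name, end_name, key_fields):
    """Return begin events that have no matching end event, in arrival order.

    events: iterable of dicts in arrival order, each with a "name" entry
    key_fields: names of the columns that must be equal for a begin/end pair
    """
    open_begins = defaultdict(deque)
    for seq, event in enumerate(events):
        key = tuple(event[field] for field in key_fields)
        if event["name"] == begin_name:
            open_begins[key].append((seq, event))
        elif event["name"] == end_name and open_begins[key]:
            open_begins[key].popleft()  # pair with the oldest open begin
    pending = [item for queue in open_begins.values() for item in queue]
    return [event for _, event in sorted(pending, key=lambda item: item[0])]
```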
ETW, Windows Performance Analyzer
- Very powerful for analyzing the entire stack
- A must when the issue is not SQL Server
- Quick Start Guide: WPA Basics
  https://msdn.microsoft.com/en-us/library/ff190975.aspx
- Bruce Dawson's blog: https://randomascii.wordpress.com/
Sampling Profiling
- Requires the Visual Studio toolset (vsperf.exe)
- Captures execution stacks on all threads, about 10k/sec
- Only as an on-demand troubleshooting option
  - Difficult to set up
  - Can have impact
- Incredibly rich insight into what the server is doing
  - Provided you manage to resolve the symbols…
  - Requires code understanding
  - Educated guess of execution role from function names
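Making sense of tens of thousands of stack samples starts with counting how often each function shows up: exclusive counts (the function was on top of the stack) show where CPU is burnt, inclusive counts (the function was anywhere on the stack) show which code path is responsible. A minimal Python sketch, assuming the stacks have already been symbol-resolved into tuples of function names, outermost frame first; the function names and the frame names in the test are hypothetical placeholders, not real sqlservr symbols.

```python
from collections import Counter


def profile_counts(samples):
    """Inclusive and exclusive sample counts per function.

    samples: iterable of stacks, each a tuple of function names, outermost first
    A recursive function is counted once per sample in the inclusive counts.
    """
    inclusive, exclusive = Counter(), Counter()
    for stack in samples:
        if not stack:
            continue
        inclusive.update(set(stack))
        exclusive[stack[-1]] += 1
    return inclusive, exclusive


def top_functions(counts, total_samples, n=5):
    """The n most frequent functions with their share of all samples, in percent."""
    return [(name, 100.0 * count / total_samples) for name, count in counts.most_common(n)]
```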
Sqlservr sample profile
Thanks, Q&A and please review
http://speakerscore.com/ZFJ9