How to monitor and tune Microsoft SQL server 2008 performance


Jephe Wu - http://linuxtechres.blogspot.com 

Environment: MS SQL server 2008 R2, Windows server 2008
Objective: SQL server performance monitoring and tuning, database backup and restore procedures


Steps:


Configuration Changes:
According to - How to determine proper SQL Server configuration settings http://support.microsoft.com/kb/319942

SQL Server can obtain a very high level of performance with relatively little configuration tuning. You can obtain high levels of performance by using good application and database design, and not by extensive configuration tuning. See the "References" section of this article for information about how to troubleshoot various SQL Server performance issues. 

When you address a performance problem, the degree of improvement that is available from configuration tuning is typically modest unless you do not currently have the system properly configured. In SQL Server version 7.0 and later, SQL Server uses automatic configuration tuning and it is extremely rare that configuration settings (especially advanced settings) need any changes. Generally, do not make a SQL Server configuration change without overwhelming reason and not without careful methodical testing to verify the need for the configuration change. You must establish a baseline before the configuration change so that you can measure the benefit after the change. 

If you do not have SQL Server properly configured, some settings might de-stabilize the server or might make SQL Server behave erratically. Years of support experience with many different environments indicate that non-default configuration settings might have results that range from neutral to highly negative. 

If you do make a configuration change, you must perform rigorous methodical performance testing both before and after the change to assess the degree of improvement.

Based on actual support scenarios, SQL Server version 7.0 and later can achieve an extremely high level of performance without any manual configuration tuning.

In SQL Server version 7.0 and later, do not make any configuration changes to user connectionslocks, and open objects because, by default, SQL Server dynamically tunes these settings.


Part I: Performance monitoring:

Performance Counter threshold value:
http://www.grumpyolddba.co.uk/monitoring/Performance%20Counter%20Guidance%20-%20SQL%20Server.htm

http://www.grumpyolddba.co.uk/monitoring/Performance%20Counter%20Guidance%20-%20Windows%20Server.htm

http://www.grumpyolddba.co.uk/sql2000/KB702_SQL%20Performance%20Tuning%20using%20Waits%20and%20Queues.mht

http://msdn.microsoft.com/en-us/library/ms189628.aspx - performance objects explanation

1. overview
check CPU and memory first, then disk I/O, lastly network.
because memory problem can trigger disk I/O problem.

then check database internal performance counters, server minimum/maximum memory configuration, data/log files, tempdb data/log files, paging files.

2. CPU bottleneck check
a. firstly check task manager to see if any other application is using most of cpu, rather than sql server. If it's sqlserver, then check further
b. run 'perfmon', add performance counter

  • processor information - % Processor time the percentage of elapsed time that all of process threads used the processor to execution instructions  should <80%. Process is the object created when you run a program; thread is object that spawned by progress to be used to execute instruction.
  • Processor - % user time  should be > 70%
  • Processor - % privileges time -  The operating system switches application threads to privileged mode to access operating system services. should be < 20% 
SQL Server runs in User mode. Privileged mode, is designed for operating system components and allows direct access to hardware and all memory.
  • System - Processor Queue Length Number of threads waiting to be scheduled for CPU time
System:Processor Queue Length is the processor queue for all CPUs on a Windows system. If System: Processor Queue Length is greater than two for each CPU, it may indicates a CPU bottleneck


Some caused for high processor queue length:
1. Unnecessary compilation and recompilation.

2. memory pressure
3. lack of proper indexing

  • System - Context Switches/sec - if it's greater than 20,000 per second, that's a lot.

sys.dm_os_wait_stats for Signal waits and Total waits, if Signal waits > 25% of total waits indicates cpu bottleneck

Note: to counter the number of cpu in sql server:
select cpu_count from sys.dm_os_sys_info

·         SQL Server: SQL Statistics: Batch Requests/sec  the number of batch requests that SQL Server receives per second, indicates how busy you sql server are. This is relative number, depending your CPU power and how busy your database is.
   a 100M ethernet can roughly handle 1000 batch request/sec
·         SQL Server: SQL Statistics: SQL Compilations/sec  (should be as lower as possible < 100 maybe)
·         SQL Server: SQL Statistics: SQL Recompilations/sec  (better to be nearly 0)
Measurement:
 sql recompilation/sec : sql compilation/sec should be < 10%.
 sql compilation/sec : batch request/sec  should be < 10%



3. Memory bottleneck check
The default setting for min server memory is 0, and the default setting for max server memory is 2147483647 MB in MS SQL 2008 


run 'perfmon', check the following:

  • SQL server: Buffer Manager - Average Page Life Expectancy Counter - should not less than 300 (5mins) 

How long can  a page stay in the memory before being flushed out to disk. too lower indicates lack of memory or missing index.(full table scan)

  • SQL server: Memory Manager - memory grant pending  - task waiting for RAM to execute.
  • Memory: - paging file % usage - should be as lower as possible, < 70% based on KB 889654
  • Memory: - paging file % usage peak - should be as lower as possible, < 70% based on KB 889654
  • Memory: - Available MBytes - should not be less than 200
  • Memory: - Page fault /s 
Page fault /s includes both hard faults (those that require disk access) and soft faults (where the faulted page is found elsewhere in physical memory.) Most processors can handle large numbers of soft faults without significant consequence. However, hard faults, which require disk access, can cause significant delays 

  • Memory: - pages/sec - indicator of the amount of paging that Windows is performing
  • Memory: - Page Reads/sec (should <=5)
 This counter is probably the best indicator of a memory shortage because it indicates how often the system is reading from disk because of hard page faults. The system is always using the pagefile even if there is enough RAM to support all of the applications. Thus, some number of page reads will always be encountered. However, a sustained value over 5 Page Reads/sec is often a strong indicator of a memory shortage

If Memory: Pages/sec is greater than zero or Memory: Page Reads/sec is greater than five, Windows is using disk to resolve memory references (hard page fault). This costs disk I/O + CPU resources. 

Note: 
a. If you have less memory, the disk I/O will increase, they are related, you should consider both sides if you suspect your system has performance issue.  

4. Disk I/O bottleneck
We should concentrate on disk queuing instead of on the actual I/O for each disk as actual disk I/O speed cannot be adjusted.
In perfmon, 

  •  PhysicalDisk - avg. Disk Queue Length  (if exceeds twice the number of spindles, then you are likely developing a bottleneck, you have to know how many disk behind the system, if it's SAN system, it might be difficult to get that information, even SAN administrator might not know exactly)

Note: Disk queuing might be caused by short of memory, check pages/sec value also.

  • Physical disk - avg. disk seconds/read - indicates the average time, in seconds, of a read of data from the disk.  (4-8ms is ideal, if > 15ms might indicate disk  bottlenecks

refer to http://technet.microsoft.com/en-US/library/cc917690 for IO bottleneck part



Less than 10 ms - very good
Between 10 - 20 ms - okay
Between 20 - 50 ms - slow, needs attention
Greater than 50 ms – Serious I/O bottleneck

According to http://download.microsoft.com/download/F/1/0/F10BC023-9396-4D67-BAC9-7C43AE954BF4/SQLServer2000_SQL_Performance_Tuning_using_Waits_and_Queues.pdf

If disk sec/read > normal read time (ask vendor for normal read time) you can consider the following options:
1. Resolve IO bottleneck by adding more drives; spreading IO across new drives if possible e.g. move files such as database, transaction log, other application files that are being written to or read from.
2. Check for memory pressure – see memory component.
3. Check for proper indexing of SQL tables. Proper indexing can save IO. Check SQL query plans looking for scans and sorts, etc. Showplan identifies sorting steps.
4. Run SQL Profiler to identify TSQL statements doing scans. In Profiler, select the scans event class & scan stopped event. Go to the data column tab and add object Id. Run the trace. Save the profiler trace to a trace table, and then search for the scans event. Alternately, you can search for high duration, reads, and writes.




  • Physical disk - avg. disk seconds/write (< 8ms (non cached)  and <=1ms for highend SAN) The throughput for high volume OLTP applications is dependent on fast sequential transaction log writes. 


Non cached Writes
Excellent < 08 Msec ( .008 seconds )
Good < 12 Msec ( .012 seconds )
Fair < 20 Msec ( .020 seconds )
Poor > 20 Msec ( .020 seconds )

Cached Writes Only
Excellent < 01 Msec ( .001 seconds )
Good < 02 Msec ( .002 seconds )
Fair < 04 Msec ( .004 seconds )
Poor > 04 Msec ( .004 seconds





  • Physical disk: % disk time - should be < 50%percentage of elapsed time that the selected disk drive was busy servicing read or write requests.




  • Physical disk - avg. disk reads/s  and Physical disk - avg. disk write/s 

Note: this 2 values indicate how busy is your database server, just for measuring purpose to know how busy your system is , compare to 3 month or a year ago


Should maintain above 2 values < 85% disk capacity


Example:
When you use these counters, you may need to adjust the values for RAID configurations using the following formulas:
·         Raid 0 -- I/Os per disk = (reads + writes) / number of disks
·         Raid 1 -- I/Os per disk = [reads + (2 * writes)] / 2
·         Raid 5 -- I/Os per disk = [reads + (4 * writes)] / number of disks
·         Raid 10 -- I/Os per disk = [reads + (2 * writes)] / number of disks
For example, you might have a RAID-1 system with two physical disks with the following values of the counters.

Disk Reads/sec            80
Disk Writes/sec           70
Avg. Disk Queue Length    5



Then to maintain a raid configuration below, you need:
Raid 0 - need 75 IO per disk
Raid 1 - need 110 IO per disk
Raid 5 - need 180 IO per disk
Raid 10 - need 110 IO per disk

The method of watching out for disk queuing associated with SQL Server log files is different from SQL Server database files. For log file, we use 


SQL Server: database: log flush wait time 
SQL server: database: log flush waits/sec

c. raid level - use raid 10 if possible

Each SQL Server write to the mirrorset results in two disk I/O operations

When dynamiclly adding new disk into raid array, The RAID controller will move some existing SQL Server data to these new drives so data is evenly distributed across all drives in the RAID array

d. sequential I/O
sequentional I/O is faster than non-sequential one, most of hard disk perform 2 times better for sequential than non-sequential I/O operations

Note: Logs generally are not a major concern because transaction log data is always written sequentially to the log file in sizes ranging up to 32 KB.

5. SQL Server

See SQL Server, Buffer Manager Object -  http://msdn.microsoft.com/en-us/library/ms189628.aspx for various objects explanation

  • SQL Server: Buffer Manager: Buffer Cache Hit Ratio, should be at least 90%

Note: a major part of any database server environment is the management of memory buffer cache. less than 90% may indicate memory pressure or missing index.

See SQL Profiler: Stored Procedure: CacheHit, CacheMiss, and CacheInsert to see what stored procedure query plans are already in cache (Hit), vs. those not in cache (Miss,Insert)

  • SQL server: cache manager: Cache Hit Ratio - Percentage of time the procedure plan pages are already in cache e.g. procedure cache hits. I.e. how often a compiled procedure is found in the procedure cache (thus avoiding the need to recompile).


  • SQL server: Buffer Manager: Page Life expectancy -

Sudden big drop in page life expectancy - DW applications (e.g. big transactions) could experience big drops in page life expectancy. This is due to a cache flush from a big read
or
average page life expectancy less than 300 which is 5 minutes - OLTP database - indicate memory pressure, missing indexes, or a cache flush.



  • SQL server: access method : Forwarded Records/sec


Number of records fetched through forwarded record pointers.
Tables with NO clustered index. If you start out with a short row, and update the row creating a wider row, the row may no longer fit on the data page. A pointer will be put in its place and the row will be forwarded to another page.

Look at code to determine where the short row is inserted followed by an update.
Can be avoided by:
1. Using Default values (so that an update will not result in a longer row that is the root cause of forwarded records).
2. Using Char instead of varchar (fixes length so that an update will not result in a longer row

  • SQL Server: access method: Full Scan/sec The number of unrestricted full scans. These can either be base table or full index scans.  -  should be as lower as possible > 1 might indicate problem

  SQL Profiler can be used to identify which TSQL statements do scan. Select the scans event class & events scan:started and scan:completed. Include the object Id data column. Save the profiler trace to a trace table, and then search for the scans event.
The scan:completed event will provide associated IO so you can also search for high reads, writes, and duration.

  • MS SQL Server: Access method: index search/s  (ideally index search/s : full scan /s should > 1000)
Index searches are used to start range scans, single index record fetches, and to reposition within an index. Compare to Full Scan/sec. You want to see high values for index searches.

  • MS SQL Server: Access method: Page Splits/secNumber of page splits occurring as the result of index pages overflowing. Normally associated with leaf pages of clustered indexes and non-clustered indexes.

Page splits are extra IO overhead that results from random inserts.
When there is no room on a data page, and the row must be inserted on the page (due to index order), SQL will split the page moving half the rows to a new page, and then insert the new row.
Correlate to Disk: page sec/write. If this is very high, you may reorg the index(es) on the table(s) causing the page splits, to reduce page splits temporarily. Fillfactor will leave a certain amount of space available for inserts. see http://msdn.microsoft.com/en-us/library/aa933139(v=sql.80).aspx for what's fillfactor.

  • MS SQL Server: Buffer Manager: Checkpoint pages/secPages written to disk during the checkpoint process, freeing up SQL cache
Indicates the number of pages flushed to disk per second by a checkpoint or other operation that require all dirty pages to be flushed.

Memory pressure is indicated if this counters is high along with high lazy writes/sec and low page life expectancy (<300 seconds)

  • MS SQL Server: Buffer Manager: Lazy writes/sec Pages written to disk by the lazywriter, freeing up SQL cache
Indicates the number of buffers written per second by the buffer manager's lazy writer. The lazy writer is a system process that flushes out batches of dirty, aged buffers (buffers that contain changes that must be written back to disk before the buffer can be reused for a different page) and makes them available to user processes. The lazy writer eliminates the need to perform frequent checkpoints in order to create available buffers.


Memory pressure is indicated if above this counters is high along with high lazy writes/sec and low page life expectancy (<300 seconds)

  • MS SQL Server: Buffer Manager: Readahead pages/sec -  indicates the number of pages read per second in anticipation of use.

If memory shortages, cold cache, or low hit rates, SQL may use worker threads to readahead (bring in pages ahead of time) to raise hit rates. By itself readahead is not a problem unless users are flushing each other’s pages consistently.
Check for proper indexing and bad query plans (scans in profiler)


  • SQL Server: database: log flush wait time (waiting for transaction log write request to complete <ms>)  and SQL server: database: log flush waits/sec (Tranlog writes per second) and
and  SQL server: database: log growths (Windows will automatically grow transaction log to accommodate insert, update, and delete activity.)

In general, growths of the transaction log will temporarily freeze writes to the transaction log while Windows grows the transaction log file. Check to see that the growth increment is large enough. If not, performance will suffer as log growths will occur more often.
  • SQL Server: log file size and used log file size
  • SQL Server: database file size
  • SQL server: tempdb file size
  • SQL server: database: Transactions /sec SQL Server transactions per second

  • MS SQL server: Transactions - Longest Transaction Running time.


check long transaction records view
sys.dm_tran_database_transactions - columns of particular interest include the time of the first log record (database_transaction_begin_time), the current state of the transaction (database_transaction_state), and the log sequence number (LSN) of the begin record in the transaction log (database_transaction_begin_lsn).
In perfmon, check

  • General statistics: login/s (number of logins per second), logout/s 
  • General statistics:  User Connections : not very accurate since some application uses connection pool, there are many users behind connection pool
  • Latches : Average Latch Wait Time(ms)
Latches are short term light weight synchronization object. Latches are not held for the duration of a transaction. Typical latching operations during row transfers to memory, controlling modifications to row offset table, etc

  • Latches: Latch Waits/sec
  • Lock - Average Wait Time(ms)  Transactions should be as short as possible to limit the blocking of other users.
  • Lock - Lock Waits/sec
  • Lock - number of deadlocks/sec 



sys.dm_os_wait_stats - check top wait statistics
sys.dm_db_index_operational_stats - check lock and latch wait

8. Network bottlenecks
in perfmon, choose the following:
a. Network Interface - Current Bandwidth.
b. packets/sec 

Note: actual bandwidth is computed as packets/sec * 1500 * 8 /1000000 Mbps.

9. tempdb

Also check paging file, separate data file and log file drive, tempdb data/log should be at separate drive. 


Monitor the following Performance Monitor counters for any unusual increase in the temporary objects allocation/deal location activity:
·         SQL Server:Access Methods\Workfiles Created /Sec
·         SQL Server:Access Methods\Worktables Created /Sec
·         SQL Server:Access Methods\Mixed Page Allocations /Sec
·         SQL Server:General Statistics\Temp Tables Created /Sec
·         SQL Server:General Statistics\Temp Tables for destruction

Solution:  Increase the tempdb data files by an equal amount to distribute the workload across all of the disks and files. Ideally, you want to have as many files as there are CPUs (taking into account the affinity).

Move tempdb data and logfile to new location:

alter database tempdb modify file 
     (name='tempdev',filename= 'e:\mssql7\tempnew_location.mDF')
alter database tempdb modify file 
     (name='templog',filename= 'c:\temp\tempnew_loglocation.mDF')


Part II - performance tuning
1. CPU/Memory bottlenecks - upgrade server or adding more RAM

Note:
AWE: check it on 32-bit servers with >4GB of memory, and unchecked the rest of the time.

2. reduce disk I/O 

  • buy faster disk (15k rpm)
  • using database partition to reduce I/O
  • separate data and log file disk
  • increase temp tablespace initial size and put to separate disk other than data/log files
  • configuring suitable maximum server memory in sql server to leave some for OS
  • suitable raid level:raid 0 or raid 0+1(raid 1/0 or raid10) , mirrored stripes (ms) , stripe first, mirror later, raid 0 first. Raid 10 is the bestfiles.


3. MS SQL server

  • create and maintain good index

  recently used index is in sys.dm_db_index_usage_stats, any defined index not included in this DMV has not been used since the last re-start of SQL Server.

  •   Too many indexes need frequent updates or unused indexes that need frequently maintenance

(insert/select/update) will create overhead, but without benefiting any users.

  •   create appropriate indexing to avoid excessive sorting operations
  •   Big IOs such as table and range scans due to missing indexes


Reduce table join for frequent queries. join overuse will result in long run query and waste resources.


  • Partition large data sets and indexes



  • Tune applications and queries.

  It is very important for application developers to understand SQL Server architectural basics and how to take full advantage of SQL Server indexes to minimize I/O.


  • Lack of useful statistics:

  http://msdn.microsoft.com/en-us/library/ms187348(v=sql.105).aspx
  http://msdn.microsoft.com/en-us/library/ms190397(v=sql.105).aspx#UpdateStatistics
the following command calls sp_updatestats to update all statistics for the database.

EXEC sp_updatestats

other usage:
UPDATE STATISTICS Sales.SalesOrderDetail;
DBCC SHOW_STATISTICS

AUTO_CREATE_STATISTICS
AUTO_UPDATE_STATISTICS
AUTO_UPDATE_STATISTICS_ASYNC


Master database files:

The master database, msdb, and model databases are not used much during production compared to user databases, so it is typically not necessary to consider them in I/O performance tuning considerations. The master database is usually used only for adding new logins, databases, devices, and other system objects.


Part III - References:
1. http://sqlcat.com/sqlcat/b/top10lists/archive/2007/11/21/top-sql-server-2005-performance-issues-for-oltp-applications.aspx
2. SQL server performance tuning guide for dataware house -  http://technet.microsoft.com/library/Cc966420
3. http://msdn.microsoft.com/en-us/library/ms178067.aspx
4. http://sqlcat.com/sqlcat/b/top10lists/archive/2007/11/21/top-10-sql-server-2005-performance-issues-for-data-warehouse-and-reporting-applications.aspx
5. sql long transaction view - http://msdn.microsoft.com/en-us/library/ms186957(v=sql.105).aspx
6. update statistics - http://msdn.microsoft.com/en-us/library/ms187348(v=sql.105).aspx
7. Performance Monitor Counters -  http://technet.microsoft.com/en-us/library/cc768048.aspx
8. performance and tuning howto - http://msdn.microsoft.com/en-us/library/ms191511(v=sql.105).aspx
9. http://technet.microsoft.com/en-us/library/ms190619(v=sql.105)
10. establishing a performance baseline - http://technet.microsoft.com/en-us/library/cc781394(v=WS.10).aspx
11. storage best practise - http://technet.microsoft.com/library/Cc966534

12. Troubleshooting performance problem in sql server 2008 - http://msdn.microsoft.com/en-us/library/dd672789(v=SQL.100).aspx#phrss   - excellent doc


13. best practice for sql sever maintenance - http://technet.microsoft.com/en-US/library/cc966447
14. SAN storage best practise for sql server - http://www.brentozar.com/sql/sql-server-san-best-practices/
15. sql server 2008 I/O performance - http://blogs.technet.com/b/josebda/archive/2009/03/31/sql-server-2008-i-o-performance.aspx
16. diagnosing and resolving Latch Content for SQL server 2008 R2 - http://www.microsoft.com/en-us/download/details.aspx?id=26665

17. How to: Configure SQL Server to Use Soft-NUMA- http://msdn.microsoft.com/en-us/library/ms178144(v=sql.105).aspx

Part IV - useful DMVs

1. DMV list: http://msdn.microsoft.com/en-us/library/ms179984.aspx
2. commands:
select * from sys.dm_os_wait_stats order by wait_time_ms
select * from sys.dm_os_latch_stats;


Select  wait_type,
        waiting_tasks_count,
        wait_time_ms
from  sys.dm_os_wait_stats 
where wait_type like 'PAGEIOLATCH%' 
order by wait_type

I/O wait at this moment

select
    database_id,
    file_id,
    io_stall,
    io_pending_ms_ticks,
    scheduler_address
from  sys.dm_io_virtual_file_stats(NULL, NULL)t1,
        sys.dm_io_pending_io_requests as t2
where t1.file_handle = t2.io_handle



top 10 wait type

select top 10 *
from sys.dm_os_wait_stats
order by wait_time_ms desc



Part V - best practise
1. Isolate log from data at the physical disk level - 

  •     separate database data and log file disk,  Pre-size data and log files.

          According to http://technet.microsoft.com/library/Cc966534It is recommended to have .25 to 1 data files (per filegroup) for each CPU on the host server. 

  •     tempdb data/logfile should on different disk spindle

       create one data file for tempdb per cpu  (Dual core counts as 2 CPUs; logical procs (hyperthreading) do not)
       depending on tempdb usage, use raid10 for storage if possible

  • backup should go to separate disk
  • If you have real time anti-virus monitoring, it is recommended that you exclude the SQL Server database files (including data files, transaction log files, tempdb and other system database files) from real time monitoring
  •  Update statistics on the largest tables weekly or monthly
  • Rebuild or defrag the most important indexes

2.  storage disk for data file should not be shared with other application etc, dedicated for data file with enough spindle
3.  vcpu and memory amount should be same as physical machine baseline if running in virtual environment.
4. If raid 1/0 or raid 1 for logfile disk if possible.  RAID 1+0 provides better write performance than any other RAID level providing data protection, including RAID 5.


Part VI - testing tools
1. sql nexus
2. sql diag
3. sqlIO , alternatively , use IOMETER
download sqlio at http://www.microsoft.com/en-us/download/details.aspx?id=20163
http://www.brentozar.com/archive/2008/09/finding-your-san-bottlenecks-with-sqlio/

SAN performance tuning with sqlio - http://sqlserverpedia.com/wiki/SAN_Performance_Tuning_with_SQLIO

4. perfmon and sql profiler
http://www.simple-talk.com/sql/database-administration/correlating-sql-server-profiler-with-performance-monitor/  good article talking about how to corelate profiler and perfmon data

5. latch wait


For a non-production environment only, clear the sys.dm_os_wait_stats DMV with
the following command:
dbcc SQLPERF ('sys.dm_os_wait_stats', 'CLEAR')
A similar command can be run to clear the sys.dm_os_latch_stats DMV:
dbcc SQLPERF ('sys.dm_os_latch_stats', 'CLEAR')


Part VII - backup and restore


1. restore master datbase in single user mode -  http://technet.microsoft.com/en-us/library/ms190679.aspx
2. Performing a Complete Database Restore (Full Recovery Model)
http://msdn.microsoft.com/en-us/library/ms187495(v=sql.105).aspx

http://msdn.microsoft.com/en-us/library/ms175199(v=sql.105).aspx

Part VIII - FAQ
1. if performance counter doesn't load or corrupted, you can run this to fix:


cd \windows\system32
lodctr /R

No comments:

Post a Comment