Wednesday, March 22, 2017

How to troubleshoot Linux performance bottlenecks

Here is a relevant quote from Tuning Red Hat Enterprise Linux on IBM eServer xSeries Servers (2005):

Identifying bottlenecks
The following steps are used as our quick tuning strategy:
  1. Know your system.
  2. Back up the system.
  3. Monitor and analyze the system’s performance.
  4. Narrow down the bottleneck and find its cause.
  5. Fix the bottleneck cause by trying only one single change at a time.
  6. Go back to step 3 until you are satisfied with the performance of the system.
4.1.1 Gathering information
Most likely, the only first-hand information you will have access to will be statements such as "There is a problem with the server." It is crucial to use probing questions to clarify and document the problem. Here is a list of questions you should ask to help you get a better picture of the system.
Can you give me a complete description of the server in question?
  • – Model
  • – Age
  • – Configuration
  • – Peripheral equipment
  • – Operating system version and update level
Can you tell me exactly what the problem is?
  • What are the symptoms?
  • Describe any error messages.
Some people will have problems answering this question, but any extra information the customer can give you might enable you to find the problem. For example, the customer might say "It is really slow when I copy large files to the server." This might indicate a network problem or a disk subsystem problem.
  • Who is experiencing the problem? Is one person, one particular group of people, or the entire organization experiencing the problem? This helps determine whether the problem exists in one particular part of the network, whether it is application-dependent, and so on. If only one user experiences the problem, then the problem might be with the user's PC (or their imagination).
The perception clients have of the server is usually a key factor. From this point of view, performance problems may not be directly related to the server: the network path between the server and the clients can easily be the cause of the problem. This path includes network devices as well as services provided by other servers, such as domain controllers.
  • Can the problem be reproduced? All reproducible problems can be solved. If you have sufficient knowledge of the system, you should be able to narrow the problem to its root and decide which actions should be taken.
Tip: You should document each step, especially the changes you make and their effect on performance.
The fact that the problem can be reproduced enables you to see and understand it better.
Document the sequence of actions that are necessary to reproduce the problem:
  • – What are the steps to reproduce the problem? Knowing the steps may help you reproduce the same problem on a different machine under the same conditions. If this works, it gives you the opportunity to use a machine in a test environment and removes the chance of crashing the production server.
  • – Is it an intermittent problem? If the problem is intermittent, the first thing to do is to gather information and find a path to move the problem in the reproducible category. The goal here is to have a scenario to make the problem happen on command.
  • – Does it occur at certain times of the day or certain days of the week? This might help you determine what is causing the problem. It may occur when everyone arrives for work or returns from lunch. Look for ways to change the timing (that is, make it happen less or more often); if there are ways to do so, the problem becomes a reproducible one.
  • – Is it unusual? If the problem falls into the non-reproducible category, you may conclude that it is the result of extraordinary conditions and classify it as fixed. In real life, there is a high probability that it will happen again. A good procedure to troubleshoot a hard-to-reproduce problem is to perform general maintenance on the server: reboot, or bring the machine up to date on drivers and patches.
  • When did the problem start? Was it gradual or did it occur very quickly? If the performance issue appeared gradually, then it is likely to be a sizing issue; if it appeared overnight, then the problem could be caused by a change made to the server or peripherals.
  • Have any changes been made to the server (minor or major) or are there any changes in the way clients are using the server?
  • Did the customer alter something on the server or peripherals to cause the problem?
  • Is there a log of all network changes available? Demands could change based on business changes, which could affect demands on servers and network systems.
  • Are there any other servers or hardware components involved? Are any logs available?
  • What is the priority of the problem? When does it have to be fixed?
    • – Does it have to be fixed in the next few minutes, or in days? You may have some time to fix it; or it may already be time to operate in panic mode.
    • – How massive is the problem?
    • – What is the related cost of that problem?
4.1.2 Analyzing the server’s performance
At this point, you should begin monitoring the server. The simplest way is to run monitoring tools from the server that is being analyzed. (See Chapter 2, “Monitoring tools” on page 15, for information.)
A performance log of the server should be created during its peak time of operation (for example, 9:00 a.m. to 5:00 p.m.); it will depend on what services are being provided and on who is using these services. When creating the log, if available, the following objects should be included:
  • Processor
  • System
  • Server work queues
  • Memory
  • Page file
  • Physical disk
  • Redirector
  • Network interface
Before you begin, remember that a methodical approach to performance tuning is important.
Our recommended process, which you can use for your xSeries server performance tuning process, is as follows:
1. Understand the factors affecting server performance. This Redpaper and the redbook Tuning IBM eServer xSeries Servers for Performance, SG24-5287, can help.
2. Measure the current performance to create a performance baseline to compare with your future measurements and to identify system bottlenecks.
3. Use the monitoring tools to identify a performance bottleneck. By following the instructions in the next sections, you should be able to narrow down the bottleneck to the subsystem level.
4. Work with the component that is causing the bottleneck by performing some actions to improve server performance in response to demands.
5. Measure the new performance. This helps you compare performance before and after the tuning steps.
When attempting to fix a performance problem, remember the following:
• Take measurements before you upgrade or modify anything so that you can tell whether the change had any effect (that is, take baseline measurements).
• Examine the options that involve reconfiguring existing hardware, not just those that involve adding new hardware.
Important: Before taking any troubleshooting actions, back up all data and the configuration information to prevent a partial or complete loss.
Note: It is important to understand that the greatest gains are obtained by upgrading a component that has a bottleneck when the other components in the server have ample “power” left to sustain an elevated level of performance.
4.2.1 Finding CPU bottlenecks
Determining bottlenecks with the CPU can be accomplished in several ways. As discussed in
Chapter 2, “Monitoring tools” on page 15, Linux has a variety of tools to help determine this;
the question is: which tools to use?
One such tool is uptime. By analyzing the output from uptime, we can get a rough idea of what has been happening in the system for the past 15 minutes. For a more detailed explanation of this tool, see 2.2, “uptime” on page 16.
Example 4-1 uptime output from a CPU strapped system
18:03:16 up 1 day, 2:46, 6 users, load average: 182.53, 92.02, 37.95
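The load averages shown are meaningful only relative to the number of online processors; as a rough rule of thumb (an assumption on our part, not a figure from the Redpaper), a 1-minute load average that stays well above the processor count suggests CPU saturation. A quick way to get that count:

nproc                                # number of online processors (coreutils)
grep -c ^processor /proc/cpuinfo     # equivalent check on older systems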
Using KDE System Guard and the CPU sensors lets you view the current CPU workload.
Using top, you can see both CPU utilization and what processes are the biggest contributors
to the problem (Example 2-3 on page 18). If you have set up sar, you are collecting a lot of
information, some of which is CPU utilization, over a period of time. Analyzing this information
can be difficult, so use isag, which can use sar output to plot a graph. Otherwise, you may
wish to parse the information through a script and use a spreadsheet to plot it to see any
trends in CPU utilization. You can also use sar from the command line by issuing sar -u or
sar -U processornumber. To gain a broader perspective of the system and current utilization
of more than just the CPU subsystem, a good tool is vmstat (2.6, “vmstat” on page 21).
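For example, a minimal sar invocation to sample CPU utilization might look like the following, assuming the sysstat package is installed (on newer sysstat releases the per-processor flag is -P rather than -U):

sar -u 5 12       # overall CPU utilization every 5 seconds for one minute
sar -P 0 5 12     # utilization of processor 0 only (newer sysstat syntax)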
4.2.2 SMP
SMP-based systems can present their own set of interesting problems that can be difficult to
detect. In an SMP environment, there is the concept of CPU affinity, which implies that you
bind a process to a CPU.
The main reason this is useful is CPU cache optimization, which is achieved by keeping the
same process on one CPU rather than moving between processors. When a process moves
between CPUs, the cache of the new CPU must be flushed. Therefore, a process that moves
between processors causes many cache flushes to occur, which means that an individual
process will take longer to finish. This scenario is very hard to detect because, when monitoring it, the CPU load will appear to be very balanced and not necessarily peaking on any CPU. Affinity is also useful in NUMA-based systems such as the xSeries 445 and xSeries 455, where it is important to keep memory, cache, and CPU access local to one another.
Note: There is a common misconception that the CPU is the most important part of the server. This is not always the case, and servers are often overconfigured with CPU and underconfigured with disks, memory, and network subsystems. Only specific applications that are truly CPU-intensive can take advantage of today's high-end processors.
Tip: Be careful not to add to CPU problems by running too many tools at one time; you may find that running several monitoring tools at once is itself contributing to the high CPU load.
4.2.3 Performance tuning options
The first step is to ensure that the system performance problem is being caused by the CPU
and not one of the other subsystems. If the processor is the server bottleneck, then a number
of steps can be taken to improve performance. These include:
• Ensure that no unnecessary programs are running in the background by using ps -ef. If you find such programs, stop them and use cron to schedule them to run at off-peak hours.
• Identify non-critical, CPU-intensive processes by using top and modify their priority using renice (see the sketch after this list).
• In an SMP-based machine, try using taskset to bind processes to CPUs to make sure that processes are not hopping between processors, causing cache flushes.
• Based on the running application, it may be better to scale up (bigger CPUs) than scale out (more CPUs). This depends on whether your application was designed to take advantage of more processors effectively. For example, a single-threaded application would scale better with a faster CPU, not with more CPUs.
• General options include making sure you are using the latest drivers and firmware, as this may affect the load they place on the CPU.
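A minimal sketch of the renice and taskset steps above; PID 1234 is a hypothetical process identified with top:

renice +10 -p 1234     # lower the priority of a non-critical, CPU-intensive process
taskset -cp 0 1234     # bind that process to CPU 0 so it stops migrating between processors
taskset -p 1234        # verify the new affinity mask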
4.3 Memory bottlenecks
On a Linux system, many programs run at the same time; these programs support multiple users, and some processes are used more than others. Some of these programs use a portion of memory while the rest are "sleeping." When an application accesses cache, performance increases because an in-memory access retrieves the data, thereby eliminating the need to access slower disks.
The OS uses an algorithm to control which programs will use physical memory and which are
paged out. This is transparent to user programs. Page space is a file created by the OS on a
disk partition to store user programs that are not currently in use. Typically, page sizes are
4 KB or 8 KB. In Linux, the page size is defined by using the variable EXEC_PAGESIZE in the
include/asm-<architecture>/param.h kernel header file. The process used to page a process
out to disk is called pageout.
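To confirm the page size on a running system, getconf offers a quick check (getconf is part of glibc and is not mentioned in the Redpaper):

getconf PAGESIZE     # typically prints 4096 (4 KB) on x86 systems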
4.3.1 Finding memory bottlenecks
Start your analysis by listing the applications that are running on the server. Determine how
much physical memory and swap each application needs to run. Figure 4-1 on page 75
shows KDE System Guard monitoring memory usage.
Figure 4-1 KDE System Guard memory monitoring
The indicators in Table 4-1 can also help you define a problem with memory.

Table 4-1 Indicators for memory analysis
• Memory available: Indicates how much physical memory is available for use. If, after you start your application, this value has decreased significantly, you may have a memory leak. Check the application that is causing it and make the necessary adjustments. Use free -l -t -o for additional information.
• Page faults: There are two types of page faults: soft page faults, when the page is found in memory, and hard page faults, when the page is not found in memory and must be fetched from disk. Accessing the disk will slow your application considerably. The sar -B command can provide useful information for analyzing page faults, specifically the pgpgin/s and pgpgout/s columns.
• File system cache: The common memory space used by the file system cache. Use the free -l -t -o command for additional information.
• Private memory for process: The memory used by each process running on the server. You can use the pmap command to see how much memory is allocated to a specific process.

Paging and swapping indicators
In Linux, as with all UNIX-based operating systems, there are differences between paging and swapping. Paging moves individual pages to swap space on the disk; swapping is a bigger operation that moves the entire address space of a process to swap space in one operation.
Swapping can have one of two causes:
• A process enters sleep mode. This usually happens because the process depends on interactive action, as editors, shells, and data entry applications spend most of their time waiting for user input. During this time, they are inactive.
• A process behaves poorly. Paging can be a serious performance problem when the amount of free memory pages falls below the minimum amount specified, because the paging mechanism is not able to handle the requests for physical memory pages and the swap mechanism is called to free more pages. This significantly increases I/O to disk and will quickly degrade a server's performance.
If your server is always paging to disk (a high page-out rate), consider adding more memory.
However, for systems with a low page-out rate, it may not affect performance.
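A minimal command-line pass over the indicators in Table 4-1 might look like this (PID 1234 is a hypothetical process; note that newer procps releases have dropped the -o flag from free):

free -l -t -o     # available memory, low/high memory, and totals
sar -B 5 5        # paging statistics, including pgpgin/s and pgpgout/s
pmap 1234         # memory allocated to one specific process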
4.3.2 Performance tuning options
If you believe there is a memory bottleneck, consider performing one or more of these actions:
• Tune the swap space using bigpages, hugetlb, or shared memory (see the sketch after this list).
• Increase or decrease the size of pages.
• Improve the handling of active and inactive memory.
• Adjust the page-out rate.
• Limit the resources used by each user on the server.
• Stop the services that are not needed, as discussed in 3.3, "Daemons" on page 38.
• Add memory.
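On a 2.6 kernel, several of these options are exposed under /proc/sys/vm; the values below are placeholders to illustrate the mechanism, not recommendations:

sysctl -w vm.swappiness=40       # how aggressively anonymous memory is swapped out (0-100)
sysctl -w vm.nr_hugepages=128    # reserve huge pages (hugetlb) for applications that can use them
                                 # add the same settings to /etc/sysctl.conf to make them persistent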
4.4 Disk bottlenecks
The disk subsystem is often the most important aspect of server performance and is usually
the most common bottleneck. However, problems can be hidden by other factors, such as
lack of memory. Applications are considered to be I/O-bound when CPU cycles are wasted
simply waiting for I/O tasks to finish.
The most common disk bottleneck is having too few disks. Most disk configurations are based
on capacity requirements, not performance. The least expensive solution is to purchase the
smallest number of the largest-capacity disks possible. However, this places more user data
on each disk, causing greater I/O rates to the physical disk and allowing disk bottlenecks to
occur.
The second most common problem is having too many logical disks on the same array. This
increases seek time and greatly lowers performance.
The disk subsystem is discussed in 3.12, “Tuning the file system” on page 52.
A recommendation is to apply the diskstats-2.4.patch to fix problems with disk statistics
counters, which can occasionally report negative values.
4.4.1 Finding disk bottlenecks
A server exhibiting the following symptoms may be suffering from a disk bottleneck (or a
hidden memory problem):
• Slow disks will result in:
  – Memory buffers filling with write data (or waiting for read data), which will delay all requests because free memory buffers are unavailable for write requests (or the response is waiting for read data in the disk queue)
  – Insufficient memory, as in the case of not enough memory buffers for network requests, which will cause synchronous disk I/O
• Disk utilization, controller utilization, or both will typically be very high.
• Most LAN transfers will happen only after disk I/O has completed, causing very long response times and low network utilization.
• Disk I/O can take a relatively long time, and disk queues will become full, so the CPUs will be idle or have low utilization because they wait long periods of time before processing the next request.
The disk subsystem is perhaps the most challenging subsystem to properly configure.
Besides looking at raw disk interface speed and disk capacity, it is key to also understand the
workload: Is disk access random or sequential? Is there large I/O or small I/O? Answering
these questions provides the necessary information to make sure the disk subsystem is
adequately tuned.
Disk manufacturers tend to showcase the upper limits of their drive technology’s throughput.
However, taking the time to understand the throughput of your workload will help you set realistic expectations for your underlying disk subsystem. Table 4-2, later in this section, works through this exercise, showing true throughput for 8 KB I/Os at different drive speeds.
Random read/write workloads usually require several disks to scale. The bus bandwidths of
SCSI or Fibre Channel are of lesser concern. Larger databases with random access
workload will benefit from having more disks. Larger SMP servers will scale better with more
disks. Given the I/O profile of 70% reads and 30% writes of the average commercial
workload, a RAID-10 implementation will perform 50% to 60% better than a RAID-5.
Sequential workloads tend to stress the bus bandwidth of disk subsystems. Pay special
attention to the number of SCSI buses and Fibre Channel controllers when maximum
throughput is desired. Given the same number of drives in an array, RAID-10, RAID-0, and
RAID-5 all have similar streaming read and write throughput.
There are two ways to approach disk bottleneck analysis: real-time monitoring and tracing.
Real-time monitoring must be done while the problem is occurring. This may not be
practical in cases where system workload is dynamic and the problem is not repeatable.
However, if the problem is repeatable, this method is flexible because of the ability to add
objects and counters as the problem becomes well understood.
Tracing is the collecting of performance data over time to diagnose a problem. This is a
good way to perform remote performance analysis. Some of the drawbacks include the
potential for having to analyze large files when performance problems are not repeatable,
and the potential for not having all key objects and parameters in the trace and having to
wait for the next time the problem occurs for the additional data.
Table 4-2 Exercise showing true throughput for 8 KB I/Os for different drive speeds

Disk speed    Latency   Seek time   Total random access time(a)   I/Os per second per disk(b)   Throughput given 8 KB I/O
15 000 RPM    2.0 ms    3.8 ms      6.8 ms                         147                           1.15 MBps
10 000 RPM    3.0 ms    4.9 ms      8.9 ms                         112                           900 KBps
7 200 RPM     4.2 ms    9.0 ms      13.2 ms                        75                            600 KBps

a. Assuming that the handling of the command + data transfer < 1 ms: total random access time = latency + seek time + 1 ms.
b. Calculated as 1 / (total random access time).
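The figures in Table 4-2 follow directly from the footnote formulas; as a quick check of the 15 000 RPM row:

# 1 / 6.8 ms gives about 147 I/Os per second; 147 x 8 KB is about 1.15 MBps
awk 'BEGIN { t = 0.0068; iops = 1/t; printf "%.0f IOPS, %.2f MBps\n", iops, iops*8/1024 }'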
vmstat command
One way to track disk usage on a Linux system is by using the vmstat tool. The columns of
interest in vmstat with respect to I/O are the bi and bo fields. These fields monitor the
movement of blocks in and out of the disk subsystem. Having a baseline is key to being able
to identify any changes over time.
Example 4-2 vmstat output
[root@x232 root]# vmstat 2
r b swpd free buff cache si so bi bo in cs us sy id wa
2 1 0 9004 47196 1141672 0 0 0 950 149 74 87 13 0 0
0 2 0 9672 47224 1140924 0 0 12 42392 189 65 88 10 0 1
0 2 0 9276 47224 1141308 0 0 448 0 144 28 0 0 0 100
0 2 0 9160 47224 1141424 0 0 448 1764 149 66 0 1 0 99
0 2 0 9272 47224 1141280 0 0 448 60 155 46 0 1 0 99
0 2 0 9180 47228 1141360 0 0 6208 10730 425 413 0 3 0 97
1 0 0 9200 47228 1141340 0 0 11200 6 631 737 0 6 0 94
1 0 0 9756 47228 1140784 0 0 12224 3632 684 763 0 11 0 89
0 2 0 9448 47228 1141092 0 0 5824 25328 403 373 0 3 0 97
0 2 0 9740 47228 1140832 0 0 640 0 159 31 0 0 0 100
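To establish the baseline mentioned above, the same output can simply be saved to a dated file for later comparison (the file name is an arbitrary choice):

vmstat 2 30 > vmstat-baseline-$(date +%Y%m%d).txt     # 30 samples at 2-second intervals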
iostat command
Performance problems can be encountered when too many files are opened, read from and written to, and then closed repeatedly. This could become apparent as seek times (the time it
takes to move to the exact track where the data is stored) start to increase. Using the iostat
tool, you can monitor the I/O device loading in real time. Different options enable you to drill
down even farther to gather the necessary data.
Example 4-3 shows a potential I/O bottleneck on the device /dev/sdb1. This output shows average wait times (await) of about 2.7 seconds and service times (svctm) of about 270 ms.
Example 4-3 Sample of an I/O bottleneck as shown with iostat 2 -x /dev/sdb1
[root@x232 root]# iostat 2 -x /dev/sdb1
avg-cpu: %user %nice %sys %idle
11.50 0.00 2.00 86.50
Device: rrqm/s wrqm/s r/s w/s rsec/s wsec/s rkB/s wkB/s avgrq-sz
avgqu-sz await svctm %util
/dev/sdb1 441.00 3030.00 7.00 30.50 3584.00 24480.00 1792.00 12240.00 748.37
101.70 2717.33 266.67 100.00
avg-cpu: %user %nice %sys %idle
10.50 0.00 1.00 88.50
Device: rrqm/s wrqm/s r/s w/s rsec/s wsec/s rkB/s wkB/s avgrq-sz
avgqu-sz await svctm %util
/dev/sdb1 441.00 3030.00 7.00 30.00 3584.00 24480.00 1792.00 12240.00 758.49
101.65 2739.19 270.27 100.00
avg-cpu: %user %nice %sys %idle
10.95 0.00 1.00 88.06
Device: rrqm/s wrqm/s r/s w/s rsec/s wsec/s rkB/s wkB/s avgrq-sz
avgqu-sz await svctm %util
/dev/sdb1 438.81 3165.67 6.97 30.35 3566.17 25576.12 1783.08 12788.06 781.01
101.69 2728.00 268.00 100.00
The iostat -x (for extended statistics) command provides low-level detail of the disk
subsystem. Some things to point out:
%util Percentage of CPU consumed by I/O requests
svctm Average time required to complete a request, in milliseconds
await Average amount of time an I/O waited to be served, in milliseconds
avgqu-sz Average queue length
avgrq-sz Average size of request
rrqm/s Number of read requests merged per second that were issued to the device
wrqm/s Number of write requests merged per second that were issued to the device
For a more detailed explanation of the fields, see the man page for iostat(1).
Changes made to the elevator algorithm as described in “Tune the elevator algorithm in kernel
2.4” on page 55 will be seen in avgrq-sz (average size of request) and avgqu-sz (average
queue length). As the latencies are lowered by manipulating the elevator settings, avgrq-sz
will decrease. You can also monitor rrqm/s and wrqm/s to see the effect on the number of merged read and write requests.
4.4.2 Performance tuning options
After verifying that the disk subsystem is a system bottleneck, several solutions are possible, including the following:
• If the workload is sequential in nature and is stressing the controller bandwidth, add a faster disk controller. If the workload is more random in nature, the bottleneck is likely to involve the disk drives, and adding more drives will improve performance.
• Add more disk drives in a RAID environment. This spreads the data across multiple physical disks and improves performance for both reads and writes; it also increases the number of I/Os per second. Also, use hardware RAID instead of the software implementation provided by Linux; with hardware RAID, the RAID level is hidden from the OS.
• Offload processing to another system in the network (users, applications, or services).
• Add more RAM. Adding memory increases the system memory disk cache, which in effect improves disk response times.
4.5 Network bottlenecks
A performance problem in the network subsystem can be the cause of many problems, such as a kernel panic. To analyze these anomalies and detect network bottlenecks, each Linux distribution includes traffic analyzers.
4.5.1 Finding network bottlenecks
We recommend KDE System Guard because of its graphical interface and ease of use. The
tool, which is available on the distribution CDs, is discussed in detail in 2.10, “KDE System
Guard” on page 24. Figure 4-2 on page 80 shows it in action.
Figure 4-2 KDE System Guard network monitoring
It is important to remember that there are many possible reasons for these performance
problems and that sometimes problems occur simultaneously, making it even more difficult to
pinpoint the origin. The indicators in Table 4-3 can help you determine the problem with your
network.
Table 4-3 Indicators for network analysis
• Packets received, packets sent: Shows the number of packets that are coming in and going out of the specified network interface. Check both internal and external interfaces.
• Collision packets: Collisions occur when there are many systems on the same domain. The use of a hub may be the cause of many collisions.
• Dropped packets: Packets may be dropped for a variety of reasons, but the result may affect performance. For example, the server network interface may be configured to run at 100 Mbps full duplex while the network switch is configured to run at 10 Mbps, or a router may have an ACL filter that drops these packets, for example:
  iptables -t filter -A FORWARD -p all -i eth2 -o eth1 -s 172.18.0.0/24 -j DROP
• Errors: Errors occur if the communications lines (for instance, the phone line) are of poor quality. In these situations, corrupted packets must be resent, thereby decreasing network throughput.
• Faulty adapters: Network slowdowns often result from faulty network adapters. When this kind of hardware fails, it may begin to broadcast junk packets on the network.
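If KDE System Guard is not available, the same counters can be read from the command line; a minimal sketch (ethtool is assumed to be installed):

cat /proc/net/dev     # per-interface packet, error, and drop counters
ethtool eth0          # negotiated speed and duplex of eth0, to catch mismatches like the one above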
4.5.2 Performance tuning options
These steps illustrate what you should do to solve problems related to network bottlenecks:
• Ensure that the network card configuration matches router and switch configurations (for example, frame size).
• Modify how your subnets are organized.
• Use faster network cards.
• Tune the appropriate IPv4 TCP kernel parameters. (See Chapter 3, "Tuning the operating system" on page 35, and the sketch after this list.) Some security-related parameters can also improve performance, as described in that chapter.
• If possible, change network cards and recheck performance.
• Add network cards and bind them together to form an adapter team, if possible.
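As one illustration of the IPv4 TCP parameter tuning mentioned above (see Chapter 3 for the recommended values; the numbers here are only placeholders):

sysctl -w net.core.rmem_max=262144           # maximum socket receive buffer, in bytes
sysctl -w net.core.wmem_max=262144           # maximum socket send buffer, in bytes
sysctl -w net.ipv4.tcp_window_scaling=1      # enable TCP window scaling for high-latency links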
