Wednesday, March 22, 2017

How to troubleshoot Linux performance bottlenecks

Here is a relevant quote from Tuning Red Hat Enterprise Linux on IBM eServer xSeries Servers (2005):

Identifying bottlenecks
The following steps are used as our quick tuning strategy:
  1. Know your system.
  2. Back up the system.
  3. Monitor and analyze the system’s performance.
  4. Narrow down the bottleneck and find its cause.
  5. Fix the bottleneck cause by trying only one single change at a time.
  6. Go back to step 3 until you are satisfied with the performance of the system.
4.1.1 Gathering information
Most likely, the only first-hand information you will have access to will be statements such as "There is a problem with the server." It is crucial to use probing questions to clarify and document the problem. Here is a list of questions you should ask to help you get a better picture of the system.
Can you give me a complete description of the server in question?
  • – Model
  • – Age
  • – Configuration
  • – Peripheral equipment
  • – Operating system version and update level
Can you tell me exactly what the problem is?
  • What are the symptoms?
  • Describe any error messages.
Some people will have problems answering this question, but any extra information the customer can give you might enable you to find the problem. For example, the customer might say "It is really slow when I copy large files to the server." This might indicate a network problem or a disk subsystem problem.
  • Who is experiencing the problem? Is one person, one particular group of people, or the entire organization experiencing the problem? This helps determine whether the problem exists in one particular part of the network, whether it is application-dependent, and so on. If only one user experiences the problem, then the problem might be with the user's PC (or their imagination).
The perception clients have of the server is usually a key factor. From this point of view, performance problems may not be directly related to the server: the network path between the server and the clients can easily be the cause of the problem. This path includes network devices as well as services provided by other servers, such as domain controllers.
  • Can the problem be reproduced? All reproducible problems can be solved. If you have sufficient knowledge of the system, you should be able to narrow the problem to its root and decide which actions should be taken.
Tip: You should document each step, especially the changes you make and their effect on performance.
The fact that the problem can be reproduced enables you to see and understand it better.
Document the sequence of actions that are necessary to reproduce the problem:
  • – What are the steps to reproduce the problem? Knowing the steps may help you reproduce the same problem on a different machine under the same conditions. If this works, it gives you the opportunity to use a machine in a test environment and removes the chance of crashing the production server.
  • – Is it an intermittent problem? If the problem is intermittent, the first thing to do is to gather information and find a path to move the problem in the reproducible category. The goal here is to have a scenario to make the problem happen on command.
  • – Does it occur at certain times of the day or certain days of the week? This might help you determine what is causing the problem. It may occur when everyone arrives for work or returns from lunch. Look for ways to change the timing (that is, make it happen less or more often); if there are ways to do so, the problem becomes a reproducible one.
  • – Is it unusual? If the problem falls into the non-reproducible category, you may conclude that it is the result of extraordinary conditions and classify it as fixed. In real life, there is a high probability that it will happen again. A good procedure to troubleshoot a hard-to-reproduce problem is to perform general maintenance on the server: reboot, or bring the machine up to date on drivers and patches.
  • When did the problem start? Was it gradual or did it occur very quickly? If the performance issue appeared gradually, then it is likely to be a sizing issue; if it appeared overnight, then the problem could be caused by a change made to the server or peripherals.
  • Have any changes been made to the server (minor or major) or are there any changes in the way clients are using the server?
  • Did the customer alter something on the server or peripherals to cause the problem?
  • Is there a log of all network changes available? Demands could change based on business changes, which could affect demands on servers and network systems.
  • Are there any other servers or hardware components involved? Are any logs available?
  • What is the priority of the problem? When does it have to be fixed?
    • – Does it have to be fixed in the next few minutes, or in days? You may have some time to fix it; or it may already be time to operate in panic mode.
    • – How massive is the problem?
    • – What is the related cost of that problem?
4.1.2 Analyzing the server’s performance
At this point, you should begin monitoring the server. The simplest way is to run monitoring tools from the server that is being analyzed. (See Chapter 2, “Monitoring tools” on page 15, for information.)
A performance log of the server should be created during its peak time of operation (for example, 9:00 a.m. to 5:00 p.m.); it will depend on what services are being provided and on who is using these services. When creating the log, if available, the following objects should be included:
  • Processor
  • System
  • Server work queues
  • Memory
  • Page file
  • Physical disk
  • Redirector
  • Network interface
Before you begin, remember that a methodical approach to performance tuning is important.
Our recommended process, which you can use for your xSeries server performance tuning process, is as follows:
1. Understand the factors affecting server performance. This Redpaper and the redbook Tuning IBM eServer xSeries Servers for Performance, SG24-5287, can help.
2. Measure the current performance to create a performance baseline to compare with your future measurements and to identify system bottlenecks.
3. Use the monitoring tools to identify a performance bottleneck. By following the instructions in the next sections, you should be able to narrow down the bottleneck to the subsystem level.
4. Work with the component that is causing the bottleneck by performing some actions to improve server performance in response to demands.
5. Measure the new performance. This helps you compare performance before and after the tuning steps.
When attempting to fix a performance problem, remember the following:
• Take measurements before you upgrade or modify anything so that you can tell whether the change had any effect (that is, take baseline measurements).
• Examine the options that involve reconfiguring existing hardware, not just those that involve adding new hardware.
Important: Before taking any troubleshooting actions, back up all data and the configuration information to prevent a partial or complete loss.
Note: It is important to understand that the greatest gains are obtained by upgrading a component that has a bottleneck when the other components in the server have ample “power” left to sustain an elevated level of performance.
4.2.1 Finding CPU bottlenecks
Determining bottlenecks with the CPU can be accomplished in several ways. As discussed in
Chapter 2, “Monitoring tools” on page 15, Linux has a variety of tools to help determine this;
the question is: which tools to use?
One such tool is uptime. By analyzing the output from uptime, we can get a rough idea of what has been happening in the system for the past 15 minutes. For a more detailed explanation of this tool, see 2.2, “uptime” on page 16.
Example 4-1 uptime output from a CPU strapped system
18:03:16 up 1 day, 2:46, 6 users, load average: 182.53, 92.02, 37.95
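The load averages shown are meaningful only relative to the number of online processors; as a rough rule of thumb (an assumption on our part, not a figure from the Redpaper), a 1-minute load average that stays well above the processor count suggests CPU saturation. A quick way to get that count:

nproc                                # number of online processors (coreutils)
grep -c ^processor /proc/cpuinfo     # equivalent check on older systems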
Using KDE System Guard and the CPU sensors lets you view the current CPU workload.
Using top, you can see both CPU utilization and what processes are the biggest contributors
to the problem (Example 2-3 on page 18). If you have set up sar, you are collecting a lot of
information, some of which is CPU utilization, over a period of time. Analyzing this information
can be difficult, so use isag, which can use sar output to plot a graph. Otherwise, you may
wish to parse the information through a script and use a spreadsheet to plot it to see any
trends in CPU utilization. You can also use sar from the command line by issuing sar -u or
sar -U processornumber. To gain a broader perspective of the system and current utilization
of more than just the CPU subsystem, a good tool is vmstat (2.6, “vmstat” on page 21).
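For example, a minimal sar invocation to sample CPU utilization might look like the following, assuming the sysstat package is installed (on newer sysstat releases the per-processor flag is -P rather than -U):

sar -u 5 12       # overall CPU utilization every 5 seconds for one minute
sar -P 0 5 12     # utilization of processor 0 only (newer sysstat syntax)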
4.2.2 SMP
SMP-based systems can present their own set of interesting problems that can be difficult to
detect. In an SMP environment, there is the concept of CPU affinity, which implies that you
bind a process to a CPU.
The main reason this is useful is CPU cache optimization, which is achieved by keeping the
same process on one CPU rather than moving between processors. When a process moves
between CPUs, the cache of the new CPU must be flushed. Therefore, a process that moves
between processors causes many cache flushes to occur, which means that an individual
process will take longer to finish. This scenario is very hard to detect because, when monitoring it, the CPU load will appear to be very balanced and not necessarily peaking on any CPU. Affinity is also useful in NUMA-based systems such as the xSeries 445 and xSeries 455, where it is important to keep memory, cache, and CPU access local to one another.
Note: There is a common misconception that the CPU is the most important part of the server. This is not always the case, and servers are often overconfigured with CPU and underconfigured with disks, memory, and network subsystems. Only specific applications that are truly CPU-intensive can take advantage of today's high-end processors.
Tip: Be careful not to add to CPU problems by running too many tools at one time; you may find that running several monitoring tools at once is itself contributing to the high CPU load.
4.2.3 Performance tuning options
The first step is to ensure that the system performance problem is being caused by the CPU
and not one of the other subsystems. If the processor is the server bottleneck, then a number
of steps can be taken to improve performance. These include:
• Ensure that no unnecessary programs are running in the background by using ps -ef. If you find such programs, stop them and use cron to schedule them to run at off-peak hours.
• Identify non-critical, CPU-intensive processes by using top and modify their priority using renice (see the sketch after this list).
• In an SMP-based machine, try using taskset to bind processes to CPUs to make sure that processes are not hopping between processors, causing cache flushes.
• Based on the running application, it may be better to scale up (bigger CPUs) than scale out (more CPUs). This depends on whether your application was designed to take advantage of more processors effectively. For example, a single-threaded application would scale better with a faster CPU, not with more CPUs.
• General options include making sure you are using the latest drivers and firmware, as this may affect the load they place on the CPU.
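A minimal sketch of the renice and taskset steps above; PID 1234 is a hypothetical process identified with top:

renice +10 -p 1234     # lower the priority of a non-critical, CPU-intensive process
taskset -cp 0 1234     # bind that process to CPU 0 so it stops migrating between processors
taskset -p 1234        # verify the new affinity mask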
4.3 Memory bottlenecks
On a Linux system, many programs run at the same time; these programs support multiple users, and some processes are used more than others. Some of these programs use a portion of memory while the rest are "sleeping." When an application accesses cache, performance increases because an in-memory access retrieves the data, thereby eliminating the need to access slower disks.
The OS uses an algorithm to control which programs will use physical memory and which are
paged out. This is transparent to user programs. Page space is a file created by the OS on a
disk partition to store user programs that are not currently in use. Typically, page sizes are
4 KB or 8 KB. In Linux, the page size is defined by using the variable EXEC_PAGESIZE in the
include/asm-<architecture>/param.h kernel header file. The process used to page a process
out to disk is called pageout.
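To confirm the page size on a running system, getconf offers a quick check (getconf is part of glibc and is not mentioned in the Redpaper):

getconf PAGESIZE     # typically prints 4096 (4 KB) on x86 systems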
4.3.1 Finding memory bottlenecks
Start your analysis by listing the applications that are running on the server. Determine how
much physical memory and swap each application needs to run. Figure 4-1 on page 75
shows KDE System Guard monitoring memory usage.
Figure 4-1 KDE System Guard memory monitoring
The indicators in Table 4-1 can also help you define a problem with memory.

Table 4-1 Indicators for memory analysis
• Memory available: Indicates how much physical memory is available for use. If, after you start your application, this value has decreased significantly, you may have a memory leak. Check the application that is causing it and make the necessary adjustments. Use free -l -t -o for additional information.
• Page faults: There are two types of page faults: soft page faults, when the page is found in memory, and hard page faults, when the page is not found in memory and must be fetched from disk. Accessing the disk will slow your application considerably. The sar -B command can provide useful information for analyzing page faults, specifically the pgpgin/s and pgpgout/s columns.
• File system cache: The common memory space used by the file system cache. Use the free -l -t -o command for additional information.
• Private memory for process: The memory used by each process running on the server. You can use the pmap command to see how much memory is allocated to a specific process.

Paging and swapping indicators
In Linux, as with all UNIX-based operating systems, there are differences between paging and swapping. Paging moves individual pages to swap space on the disk; swapping is a bigger operation that moves the entire address space of a process to swap space in one operation.
Swapping can have one of two causes:
• A process enters sleep mode. This usually happens because the process depends on interactive action, as editors, shells, and data entry applications spend most of their time waiting for user input. During this time, they are inactive.
• A process behaves poorly. Paging can be a serious performance problem when the amount of free memory pages falls below the minimum amount specified, because the paging mechanism is not able to handle the requests for physical memory pages and the swap mechanism is called to free more pages. This significantly increases I/O to disk and will quickly degrade a server's performance.
If your server is always paging to disk (a high page-out rate), consider adding more memory.
However, for systems with a low page-out rate, it may not affect performance.
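A minimal command-line pass over the indicators in Table 4-1 might look like this (PID 1234 is a hypothetical process; note that newer procps releases have dropped the -o flag from free):

free -l -t -o     # available memory, low/high memory, and totals
sar -B 5 5        # paging statistics, including pgpgin/s and pgpgout/s
pmap 1234         # memory allocated to one specific process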
4.3.2 Performance tuning options
If you believe there is a memory bottleneck, consider performing one or more of these actions:
• Tune the swap space using bigpages, hugetlb, or shared memory (see the sketch after this list).
• Increase or decrease the size of pages.
• Improve the handling of active and inactive memory.
• Adjust the page-out rate.
• Limit the resources used by each user on the server.
• Stop the services that are not needed, as discussed in 3.3, "Daemons" on page 38.
• Add memory.
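On a 2.6 kernel, several of these options are exposed under /proc/sys/vm; the values below are placeholders to illustrate the mechanism, not recommendations:

sysctl -w vm.swappiness=40       # how aggressively anonymous memory is swapped out (0-100)
sysctl -w vm.nr_hugepages=128    # reserve huge pages (hugetlb) for applications that can use them
                                 # add the same settings to /etc/sysctl.conf to make them persistent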
4.4 Disk bottlenecks
The disk subsystem is often the most important aspect of server performance and is usually
the most common bottleneck. However, problems can be hidden by other factors, such as
lack of memory. Applications are considered to be I/O-bound when CPU cycles are wasted
simply waiting for I/O tasks to finish.
The most common disk bottleneck is having too few disks. Most disk configurations are based
on capacity requirements, not performance. The least expensive solution is to purchase the
smallest number of the largest-capacity disks possible. However, this places more user data
on each disk, causing greater I/O rates to the physical disk and allowing disk bottlenecks to
occur.
The second most common problem is having too many logical disks on the same array. This
increases seek time and greatly lowers performance.
The disk subsystem is discussed in 3.12, “Tuning the file system” on page 52.
A recommendation is to apply the diskstats-2.4.patch to fix problems with disk statistics
counters, which can occasionally report negative values.
4.4.1 Finding disk bottlenecks
A server exhibiting the following symptoms may be suffering from a disk bottleneck (or a
hidden memory problem):
• Slow disks will result in:
  – Memory buffers filling with write data (or waiting for read data), which will delay all requests because free memory buffers are unavailable for write requests (or the response is waiting for read data in the disk queue)
  – Insufficient memory, as in the case of not enough memory buffers for network requests, which will cause synchronous disk I/O
• Disk utilization, controller utilization, or both will typically be very high.
• Most LAN transfers will happen only after disk I/O has completed, causing very long response times and low network utilization.
• Disk I/O can take a relatively long time, and disk queues will become full, so the CPUs will be idle or have low utilization because they wait long periods of time before processing the next request.
The disk subsystem is perhaps the most challenging subsystem to properly configure.
Besides looking at raw disk interface speed and disk capacity, it is key to also understand the
workload: Is disk access random or sequential? Is there large I/O or small I/O? Answering
these questions provides the necessary information to make sure the disk subsystem is
adequately tuned.
Disk manufacturers tend to showcase the upper limits of their drive technology’s throughput.
However, taking the time to understand the throughput of your workload will help you set realistic expectations for your underlying disk subsystem. Table 4-2, later in this section, works through this exercise, showing true throughput for 8 KB I/Os at different drive speeds.
Random read/write workloads usually require several disks to scale. The bus bandwidths of
SCSI or Fibre Channel are of lesser concern. Larger databases with random access
workload will benefit from having more disks. Larger SMP servers will scale better with more
disks. Given the I/O profile of 70% reads and 30% writes of the average commercial
workload, a RAID-10 implementation will perform 50% to 60% better than a RAID-5.
Sequential workloads tend to stress the bus bandwidth of disk subsystems. Pay special
attention to the number of SCSI buses and Fibre Channel controllers when maximum
throughput is desired. Given the same number of drives in an array, RAID-10, RAID-0, and
RAID-5 all have similar streaming read and write throughput.
There are two ways to approach disk bottleneck analysis: real-time monitoring and tracing.
Real-time monitoring must be done while the problem is occurring. This may not be
practical in cases where system workload is dynamic and the problem is not repeatable.
However, if the problem is repeatable, this method is flexible because of the ability to add
objects and counters as the problem becomes well understood.
Tracing is the collecting of performance data over time to diagnose a problem. This is a
good way to perform remote performance analysis. Some of the drawbacks include the
potential for having to analyze large files when performance problems are not repeatable,
and the potential for not having all key objects and parameters in the trace and having to
wait for the next time the problem occurs for the additional data.
Table 4-2 Exercise showing true throughput for 8 KB I/Os for different drive speeds

Disk speed    Latency   Seek time   Total random access time(a)   I/Os per second per disk(b)   Throughput given 8 KB I/O
15 000 RPM    2.0 ms    3.8 ms      6.8 ms                         147                           1.15 MBps
10 000 RPM    3.0 ms    4.9 ms      8.9 ms                         112                           900 KBps
7 200 RPM     4.2 ms    9.0 ms      13.2 ms                        75                            600 KBps

a. Assuming that the handling of the command + data transfer < 1 ms: total random access time = latency + seek time + 1 ms.
b. Calculated as 1 / (total random access time).
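The figures in Table 4-2 follow directly from the footnote formulas; as a quick check of the 15 000 RPM row:

# 1 / 6.8 ms gives about 147 I/Os per second; 147 x 8 KB is about 1.15 MBps
awk 'BEGIN { t = 0.0068; iops = 1/t; printf "%.0f IOPS, %.2f MBps\n", iops, iops*8/1024 }'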
vmstat command
One way to track disk usage on a Linux system is by using the vmstat tool. The columns of
interest in vmstat with respect to I/O are the bi and bo fields. These fields monitor the
movement of blocks in and out of the disk subsystem. Having a baseline is key to being able
to identify any changes over time.
Example 4-2 vmstat output
[root@x232 root]# vmstat 2
r b swpd free buff cache si so bi bo in cs us sy id wa
2 1 0 9004 47196 1141672 0 0 0 950 149 74 87 13 0 0
0 2 0 9672 47224 1140924 0 0 12 42392 189 65 88 10 0 1
0 2 0 9276 47224 1141308 0 0 448 0 144 28 0 0 0 100
0 2 0 9160 47224 1141424 0 0 448 1764 149 66 0 1 0 99
0 2 0 9272 47224 1141280 0 0 448 60 155 46 0 1 0 99
0 2 0 9180 47228 1141360 0 0 6208 10730 425 413 0 3 0 97
1 0 0 9200 47228 1141340 0 0 11200 6 631 737 0 6 0 94
1 0 0 9756 47228 1140784 0 0 12224 3632 684 763 0 11 0 89
0 2 0 9448 47228 1141092 0 0 5824 25328 403 373 0 3 0 97
0 2 0 9740 47228 1140832 0 0 640 0 159 31 0 0 0 100
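To establish the baseline mentioned above, the same output can simply be saved to a dated file for later comparison (the file name is an arbitrary choice):

vmstat 2 30 > vmstat-baseline-$(date +%Y%m%d).txt     # 30 samples at 2-second intervals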
iostat command
Performance problems can be encountered when too many files are opened, read from and written to, and then closed repeatedly. This could become apparent as seek times (the time it
takes to move to the exact track where the data is stored) start to increase. Using the iostat
tool, you can monitor the I/O device loading in real time. Different options enable you to drill
down even farther to gather the necessary data.
Example 4-3 shows a potential I/O bottleneck on the device /dev/sdb1. This output shows average wait times (await) of about 2.7 seconds and service times (svctm) of about 270 ms.
Example 4-3 Sample of an I/O bottleneck as shown with iostat 2 -x /dev/sdb1
[root@x232 root]# iostat 2 -x /dev/sdb1
avg-cpu: %user %nice %sys %idle
11.50 0.00 2.00 86.50
Device: rrqm/s wrqm/s r/s w/s rsec/s wsec/s rkB/s wkB/s avgrq-sz
avgqu-sz await svctm %util
/dev/sdb1 441.00 3030.00 7.00 30.50 3584.00 24480.00 1792.00 12240.00 748.37
101.70 2717.33 266.67 100.00
avg-cpu: %user %nice %sys %idle
10.50 0.00 1.00 88.50
Device: rrqm/s wrqm/s r/s w/s rsec/s wsec/s rkB/s wkB/s avgrq-sz
avgqu-sz await svctm %util
/dev/sdb1 441.00 3030.00 7.00 30.00 3584.00 24480.00 1792.00 12240.00 758.49
101.65 2739.19 270.27 100.00
avg-cpu: %user %nice %sys %idle
10.95 0.00 1.00 88.06
Device: rrqm/s wrqm/s r/s w/s rsec/s wsec/s rkB/s wkB/s avgrq-sz
avgqu-sz await svctm %util
/dev/sdb1 438.81 3165.67 6.97 30.35 3566.17 25576.12 1783.08 12788.06 781.01
101.69 2728.00 268.00 100.00
The iostat -x (for extended statistics) command provides low-level detail of the disk
subsystem. Some things to point out:
%util Percentage of CPU consumed by I/O requests
svctm Average time required to complete a request, in milliseconds
await Average amount of time an I/O waited to be served, in milliseconds
avgqu-sz Average queue length
avgrq-sz Average size of request
rrqm/s Number of read requests merged per second that were issued to the device
wrqm/s Number of write requests merged per second that were issued to the device
For a more detailed explanation of the fields, see the man page for iostat(1).
Changes made to the elevator algorithm as described in “Tune the elevator algorithm in kernel
2.4” on page 55 will be seen in avgrq-sz (average size of request) and avgqu-sz (average
queue length). As the latencies are lowered by manipulating the elevator settings, avgrq-sz
will decrease. You can also monitor rrqm/s and wrqm/s to see the effect on the number of merged read and write requests.
4.4.2 Performance tuning options
After verifying that the disk subsystem is a system bottleneck, several solutions are possible, including the following:
• If the workload is sequential in nature and is stressing the controller bandwidth, add a faster disk controller. If the workload is more random in nature, the bottleneck is likely to involve the disk drives, and adding more drives will improve performance.
• Add more disk drives in a RAID environment. This spreads the data across multiple physical disks and improves performance for both reads and writes; it also increases the number of I/Os per second. Also, use hardware RAID instead of the software implementation provided by Linux; with hardware RAID, the RAID level is hidden from the OS.
• Offload processing to another system in the network (users, applications, or services).
• Add more RAM. Adding memory increases the system memory disk cache, which in effect improves disk response times.
4.5 Network bottlenecks
A performance problem in the network subsystem can be the cause of many problems, such as a kernel panic. To analyze these anomalies and detect network bottlenecks, each Linux distribution includes traffic analyzers.
4.5.1 Finding network bottlenecks
We recommend KDE System Guard because of its graphical interface and ease of use. The
tool, which is available on the distribution CDs, is discussed in detail in 2.10, “KDE System
Guard” on page 24. Figure 4-2 on page 80 shows it in action.
Figure 4-2 KDE System Guard network monitoring
It is important to remember that there are many possible reasons for these performance
problems and that sometimes problems occur simultaneously, making it even more difficult to
pinpoint the origin. The indicators in Table 4-3 can help you determine the problem with your
network.
Table 4-3 Indicators for network analysis
• Packets received, packets sent: Shows the number of packets that are coming in and going out of the specified network interface. Check both internal and external interfaces.
• Collision packets: Collisions occur when there are many systems on the same domain. The use of a hub may be the cause of many collisions.
• Dropped packets: Packets may be dropped for a variety of reasons, but the result may affect performance. For example, the server network interface may be configured to run at 100 Mbps full duplex while the network switch is configured to run at 10 Mbps, or a router may have an ACL filter that drops these packets, for example:
  iptables -t filter -A FORWARD -p all -i eth2 -o eth1 -s 172.18.0.0/24 -j DROP
• Errors: Errors occur if the communications lines (for instance, the phone line) are of poor quality. In these situations, corrupted packets must be resent, thereby decreasing network throughput.
• Faulty adapters: Network slowdowns often result from faulty network adapters. When this kind of hardware fails, it may begin to broadcast junk packets on the network.
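If KDE System Guard is not available, the same counters can be read from the command line; a minimal sketch (ethtool is assumed to be installed):

cat /proc/net/dev     # per-interface packet, error, and drop counters
ethtool eth0          # negotiated speed and duplex of eth0, to catch mismatches like the one above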
4.5.2 Performance tuning options
These steps illustrate what you should do to solve problems related to network bottlenecks:
• Ensure that the network card configuration matches router and switch configurations (for example, frame size).
• Modify how your subnets are organized.
• Use faster network cards.
• Tune the appropriate IPv4 TCP kernel parameters. (See Chapter 3, "Tuning the operating system" on page 35, and the sketch after this list.) Some security-related parameters can also improve performance, as described in that chapter.
• If possible, change network cards and recheck performance.
• Add network cards and bind them together to form an adapter team, if possible.
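As one illustration of the IPv4 TCP parameter tuning mentioned above (see Chapter 3 for the recommended values; the numbers here are only placeholders):

sysctl -w net.core.rmem_max=262144           # maximum socket receive buffer, in bytes
sysctl -w net.core.wmem_max=262144           # maximum socket send buffer, in bytes
sysctl -w net.ipv4.tcp_window_scaling=1      # enable TCP window scaling for high-latency links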
