Skip to main content

3 posts tagged with "Kernel"

View All Tags

· 2 min read


想必大家一定都有使用過 CPU Limit 的經驗,透過這個機制能夠確保每個 Container 使用的 CPU 資源量,也可以保證每個節點上面會有足夠 CPU 供 Kubernetes 原生服務 (kubelet) 使用。 然而本篇文章就要來跟大家分享一個設定 CPU Limit 反而造成效能更差的故事,故事中當 CPU 設定為 800ms 的時候,卻發現實際運行的 Container 最高大概就只有 200ms 左右,這一切的一切都是因為 Liniux Kernel 的臭蟲導致! 一個直接的做法就是針對那些本來就沒有過高 CPU 使用量服務取消其 CPU Limit,作者於文章中也探討了一些機制要如何保護與應對這些被移除 CPU 限制的服務。 這個臭蟲於 Linux Kernel 4.19 後已經修復,但是要注意你使用的發行版本是否有有包含這個修復,作者列出一些已知的發行版本修復狀況 Debian: The latest version buster has the fix, it looks quite recent (august 2020). Some previous version might have get patched. Ubuntu: The latest version Ubuntu Focal Fosa 20.04 has the fix. EKS has the fix since December 2019, Upgrade your AMI if necessary. kops: Since June 2020, kops 1.18+ will start using Ubuntu 20.04 as the default host image. GKE: THe kernel fix was merged in January 2020. But it does looks like throttling are still happening.

· 7 min read

blktrace is a block layer IO tracing mechanism which provide detailed information about request queue operations up to user space.

blkparse will combine streams of events for various devices on various CPUs, and produce a formatted output the the event information. It take the output of above tool blktrace and convert those information into fency readable form.

In the following, We will use those tools blktrace and blkparse to help us to observe sector numbers which has been written by fio requests. We will use the fil to generate two diffenrt IO pattern requests, sequence write and random write.


OS: Ubuntu 14.04 Storage: NVME FIO: fio-2.19-12-gb94d blktrace: 2.0.0 blkparse: 1.1.0

you can use following commands to install blktrace and blkparse

apt-get install -y blktrace



In order to make the output of blkparse more easily to read, we set the numjobs to 1. Following is my fio config





After we setup the fio config, use the fio to generate the IO request. In this example, we ask the fio to generate the IO via sequence write pattern.

fio ${path_of_config} section=sw

During the experiment, you can use the tool iostat to monitor the I/O information about the device we want to observe.


Open other terminal and use blktrace to collection the data, there are two parameter we need to use, First one is -d, which indicate what target device blktrace will monitor to. Second, is -w, we use it to limit the time (seconds) how long blktrace will run. So, our final command looks like below.

blktrace -d /dev/nvme1n1 -w 60

In the end of blktrace, you can discover some new files has created by blktrace and its prefix name is nvme1n1.blktrac.xx The number of files is depends how may CPUs in your system.

-rw-r--r--  1 root     root         821152 Jun  2 10:39 nvme1n1.blktrace.0
-rw-r--r-- 1 root root 21044368 Jun 2 10:39 nvme1n1.blktrace.1
-rw-r--r-- 1 root root 462864 Jun 2 10:39 nvme1n1.blktrace.10
-rw-r--r-- 1 root root 737960 Jun 2 10:39 nvme1n1.blktrace.11
-rw-r--r-- 1 root root 865872 Jun 2 10:39 nvme1n1.blktrace.12
-rw-r--r-- 1 root root 755248 Jun 2 10:39 nvme1n1.blktrace.13
-rw-r--r-- 1 root root 4675176 Jun 2 10:39 nvme1n1.blktrace.14
-rw-r--r-- 1 root root 4471480 Jun 2 10:39 nvme1n1.blktrace.15
-rw-r--r-- 1 root root 5070264 Jun 2 10:39 nvme1n1.blktrace.16
-rw-r--r-- 1 root root 5075040 Jun 2 10:39 nvme1n1.blktrace.17
-rw-r--r-- 1 root root 5062104 Jun 2 10:39 nvme1n1.blktrace.18
-rw-r--r-- 1 root root 5586936 Jun 2 10:39 nvme1n1.blktrace.19
-rw-r--r-- 1 root root 3718848 Jun 2 10:39 nvme1n1.blktrace.2


Now, we can use the blkparse to regenerate human-readable output form the output we get via blktrace before.

We need to indicate source files, you can just use the device name without .blktrace.xx, for example, nvmen1, it will search all files which match the pattern nvmen1.blktrace.xx and put together to analyze. Then, the -f option used to foramt the output data, you can find more about it via man blkparse

The output from blkparse can be tailored for specific use -- in particular, to ease parsing of output, and/or limit output fields to those the user wants to see. The data for fields which can be output include:

a Action, a (small) string (1 or 2 characters) -- see table below for more details

c CPU id

C Command

d RWBS field, a (small) string (1-3 characters) -- see section below for more details

D 7-character string containing the major and minor numbers of the event's device (separated by a comma).

e Error value

m Minor number of event's device.

M Major number of event's device.

n Number of blocks

N Number of bytes

p Process ID

P Display packet data -- series of hexadecimal values

s Sequence numbers

S Sector number

t Time stamp (nanoseconds)

T Time stamp (seconds)

u Elapsed value in microseconds (-t command line option)

U Payload unsigned integer

For our observation, we use %5T.%9t, %p, %C, %a, %S\n to format our result containing timestamp, command, process ID, action and sequence number.

Since the data I/O contains many action, such as complete, queued, inserted..ect. we can use option -a to filter actions, you can find more info via man blktrace. In this case, we use the write to filter the actions.

In the end, use the -o options to indicate the output file name.

barrier: barrier attribute
complete: completed by driver
fs: requests
issue: issued to driver
pc: packet command events
queue: queue operations
read: read traces
requeue: requeue operations
sync: synchronous attribute
write: write traces
notify: trace messages
drv_data: additional driver specific trace

The command will look like below and it will output the result to file output.txt.

blkparse nvme1n1 -f "%5T.%9t, %p, %C, %a, %S\n"  -a write -o output.txt

open the file, the result looks like

    0.000000000, 22890, fio, Q, 1720960
0.000001857, 22890, fio, G, 1720960
0.000005803, 22890, fio, I, 1720960
0.000009234, 22890, fio, D, 1720960
0.000036821, 0, swapper/0, C, 1996928
0.000067519, 22890, fio, Q, 1721088
0.000068538, 22890, fio, G, 1721088
0.000071531, 22890, fio, I, 1721088
0.000073102, 22890, fio, D, 1721088
0.000093464, 0, swapper/0, C, 1994624
0.000123806, 0, swapper/0, C, 1785472
0.000147436, 22892, fio, C, 1784576
0.000159977, 22891, fio, C, 1997312
0.000166653, 22891, fio, Q, 2006912
0.000167632, 22891, fio, G, 2006912
0.000169422, 22891, fio, I, 2006912
0.000171178, 22891, fio, D, 2006912
0.000188830, 22892, fio, Q, 1817728
0.000189783, 22892, fio, G, 1817728
0.000191405, 22892, fio, I, 1817728
0.000192830, 22892, fio, D, 1817728
0.000202367, 22891, fio, Q, 2007040
0.000203160, 22891, fio, G, 2007040
0.000205969, 22891, fio, I, 2007040
0.000207524, 22891, fio, D, 2007040
0.000227655, 22892, fio, Q, 1817856
0.000228457, 22892, fio, G, 1817856
0.000231936, 22892, fio, I, 1817856

Since the fio will fork to two process to handle the process, we use the grep to focus on one specific process (pid=22892).

grep "22892, fio" output.txt | more

Now, the result seems good, we can discover the sequence number (fifth column) is increasing. One thing we need to care about is the row which action is "C", which means the completed, since we don't know how NVME handle those request and reply to upper layer. we only need to focus on other action. such as "Q (queued This notes intent to queue i/o at the given location. No real requests exists yet.)" or "I (inserted A request is being sent to the i/o scheduler for addition to the internal queue and later service by the driver. The request is fully formed at this time)".

    0.000147436, 22892, fio, C, 1784576
0.000188830, 22892, fio, Q, 1817728
0.000189783, 22892, fio, G, 1817728
0.000191405, 22892, fio, I, 1817728
0.000192830, 22892, fio, D, 1817728
0.000227655, 22892, fio, Q, 1817856
0.000228457, 22892, fio, G, 1817856
0.000231936, 22892, fio, I, 1817856
0.000233530, 22892, fio, D, 1817856
0.000360361, 22892, fio, Q, 1817984
0.000361310, 22892, fio, G, 1817984
0.000364163, 22892, fio, I, 1817984
0.000366696, 22892, fio, D, 1817984
0.000536731, 22892, fio, Q, 1818112
0.000537758, 22892, fio, G, 1818112
0.000539371, 22892, fio, I, 1818112
0.000541407, 22892, fio, D, 1818112
0.000670209, 22892, fio, Q, 1818240
0.000671345, 22892, fio, G, 1818240
0.000673383, 22892, fio, I, 1818240
0.000676260, 22892, fio, D, 1818240
0.001885543, 22892, fio, Q, 1818368
0.001887444, 22892, fio, G, 1818368
0.001891353, 22892, fio, I, 1818368
0.001895917, 22892, fio, D, 1818368
0.001934546, 22892, fio, Q, 1818496
0.001935468, 22892, fio, G, 1818496
0.001936891, 22892, fio, I, 1818496
0.001938742, 22892, fio, D, 1818496
0.001965818, 22892, fio, Q, 1818624

Now, we can do all above command again and change the section to rw for fio using the randon write pattern. The blkparse result will show the random sequence number.


In this article, we try to use tools blktrace and blkparse to analysiz the block level I/O for fio request. We observe the filed sequence number to make sure thhat the fio can generate the sequence or random according to its config.


· 5 min read

本文主要嘗試分析 drbd(9.0) 於 kernel運行時的效能分析,希望藉由 perf 這個 tool 來分析整個程式運行的狀況,藉此觀察其運行時各 function 的比例。

Testing Environment

為了進行效能上的分析,首要條件就是先將 DRBD 給衝到效能瓶頸才有機會去觀察,所以本文採用下列的環境與工具來進行這項作業


CPU: Intel(R) Xeon(R) CPU E5-2695 v3 @ 2.30GHz Storage: Non-Volatile memory controller(NVME) Tool: fio OS: Ubuntu 16.04 with linux 4.4.0-78-generic


為了更方便觀察 drbd 的運行,我們將 drbd 創造的 kernel thread 都分別綁在不同的 cpu 上,這樣可以讓每隻 kernel thread 盡可能去使用cpu。

  1. 透過 ps or htop 取得 kernel thread 的 pid,這邊可以關注的有
    • drbd_s_r0 (sender)
    • drbd_r_r0 (receiver)
    • drbd_as_r0 (ack sender)
    • drbd_a_r0 (ack receiver)
    • drbd_w_r0 (worker)
  2. 透過 taskset 這個指令將上述程式分別綁到不同的 cpu 上
taskset -p 0x100 18888


本文使用 fio 來進行資料的讀取,下方提供一個簡易的 fio 設定檔,可依照自行的環境變換修改。








我們 fio 採用 client/server 的架構,要是可支援多台 client 同時一起進行資料讀取,提供更高的壓力測試。

假設該設定檔名稱為 fio.cfg,並且放置於 /tmp/fio.cfg 則首先在 node-1 上面執行下列指令以再背景跑一個 fio server

fio -S &

接下來要運行的時候,執行下列指令來運行 fio,其中若想要改變測試的類型,可透過 --secion進行切換。

/fio --client=node-1 /tmp/fio.cfg --section=rw

這時候可以透過 htop 以及 iostat 的資訊去觀察,如下圖 當前透過 iostat 觀察到的確對 drbd0 有大量的讀寫動作 同時由 htop (記得開啟 kernel thread觀察功能),可以看到 drbd_s_r0 以及 drbd_a_r0 都各自吃掉一個 cpu,大概都快接近 100% 的使用率。


有了上述的環境後,我們就可以準備來分析 drbd 程式碼運行狀況。


這邊使用 perf 這套程式來分析,基本上 kernel 新一點的版本都已經內建此功能了,比較舊的 kernel 則需要自己重新開啟該 kernel config然後重新 build kernel,所以這邊使用 Ubuntu 16.04 with linux 4.4.0-78-generic 相對起來非常簡單。

直接執行 perf 這個指令,若系統上有缺少 linux-tools-4.4.0-78 相關 tool 的話會有文字提示你,如下所示,按照提示使用 apt-get 將相關的套件安裝完畢後,就可以使用 perf 了。

WARNING: perf not found for kernel

You may need to install the following packages for this specific kernel:

perf 的功能非常強大,可以參考 wiki, 這邊我們使用 perf top 的方式來觀察結果。 為了可以順便觀看 call graph 的過程,再執行perf的時候要多下-g這個參數

指令為 perf top -g -p $PID,如 perf top -g -p 18888

在這邊我嘗試觀察 drbd_a 這隻,結果如下列


這邊可以觀察到三隻吃比較兇的 function 都吃很兇,分別是 native_queue_spin_lock_slowpathtr_release 以及 idr_get_next

這邊比較麻煩的是你看這個只能知道就是卡在 spin_locK,系統上是不是 multithread,然後有太多的資料要搬移導致 spin_lock ? 這些搬移的資料是誰放進去的?,這些資料是什麼?


這部份等之後能夠完整理解整個 drbd 的 write-path 或是 read-path 時再重新來看一次這張圖,到時候應該會有完全不一樣的思維。