Monday, August 24, 2020

Palo Alto firewall - Troubleshooting High MP CPU



    • How to Interpret: show system resources

    The command show system resources gives a snapshot of Management Plane (MP) resource utilization including memory and CPU. This is similar to the ‘top’ command in Linux.

    show system resources provides information about used and available memory and whether the MP is using swap. If swap usage remains consistently high, it implies that processes are either failing to release memory or legitimately require more memory than is available.

    To verify whether the MP has been swapping for an extended period of time, refer to the kswapd process shown in the sample output below. If this process has accumulated hours of run time, it could indicate consistent swapping, and you may need to take steps to reduce your memory footprint.

    The values of interest are Swap, Mem, CPU I/O wait time (%wa), virtual memory usage (VIRT), and CPU usage (%CPU). These values are helpful when investigating high MP CPU and/or slow MP response. If the CPU wait time (%wa) is high, it indicates the MP is spending time waiting for I/O (typically disk) to complete, which slows processes down.
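    As an illustration of how these header values relate (a minimal Python sketch for offline analysis, not a Palo Alto Networks tool), the snippet below computes memory and swap utilization percentages from the "KiB Mem" and "KiB Swap" lines of the output:

    import re

    def usage_from_top_header(line):
        """Return (used_kib, total_kib, percent_used) from a 'KiB Mem :' or 'KiB Swap:' line."""
        total = int(re.search(r"(\d+)\s+total", line).group(1))
        used = int(re.search(r"(\d+)\s+used", line).group(1))
        return used, total, 100.0 * used / total

    # Header lines copied from the sample output further below.
    mem_line = "KiB Mem :  4119516 total,   515792 free,  1325328 used,  2278396 buff/cache"
    swap_line = "KiB Swap:  6104084 total,  4836884 free,  1267200 used.  1112294 avail Mem"

    for label, line in (("Mem", mem_line), ("Swap", swap_line)):
        used, total, pct = usage_from_top_header(line)
        # Swap that stays high (or keeps growing), combined with kswapd accumulating
        # run time, suggests sustained swapping rather than a transient spike.
        print(f"{label}: {used} / {total} KiB used ({pct:.1f}%)")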

    To view real-time memory and CPU usage, run the command: show system resources follow

    Most Palo Alto Networks platforms have multi-core CPUs. Some platforms have dedicated processors for the MP and DP, while others use a single processor shared between the MP and DP.

    For example,

     Platform   MP Processor                          MP Cores
     PA-200     Single shared dual-core processor     1
     PA-220     Single shared quad-core processor     2
     PA-800     Single shared 8-core processor        3
     PA-3000    Dedicated dual-core processor         2
     PA-3200    Dedicated quad-core processor         4
     PA-5000    Dedicated quad-core processor         4
     PA-5200    Dedicated multi-core processor        16-24

    The MP processor can be a multi-core processor (some cores may also be used for the DP, depending on the platform), and by default the output lists usage for all cores combined.

    You can press "1" after running "show system resources follow" to toggle view to show separate states:

    Sample output of the command is provided below:

    show system resources
    admin@PA-850 (active)> show system resources

    top - 15:15:55 up 15 days,  6:53,  1 user,  load average: 7.50, 6.82, 6.57
    Tasks: 177 total,   8 running, 168 sleeping,   0 stopped,   1 zombie
    %Cpu(s): 71.1 us,  2.7 sy,  0.3 ni, 25.7 id,  0.3 wa,  0.0 hi,  0.0 si,  0.0 st
    KiB Mem :  4119516 total,   515792 free,  1325328 used,  2278396 buff/cache
    KiB Swap:  6104084 total,  4836884 free,  1267200 used.  1112294 avail Mem

      PID USER      PR  NI    VIRT    RES    SHR S  %CPU %MEM     TIME+ COMMAND
     4532       20   0   99752   4036   3308 R  94.4  0.1  22001:32 pan_task
     4533       20   0   74712   3808   3252 R  94.4  0.1  22001:56 pan_task
     4537       20   0   74732   3872   3272 R  94.4  0.1  22001:01 pan_task
     4538       20   0   74712   3788   3188 R  94.4  0.1  22002:18 pan_task
     4534       20   0   74712   3724   3132 R  88.9  0.1  22002:01 pan_task
    22131       20   0    4012   1620   1244 R  66.7  0.0   0:37.80 genindex.sh
    30475       20   0   16912   7020   1904 R   5.6  0.2   0:00.04 top
        1       20   0    2320    252    252 S   0.0  0.0   0:10.94 init
        2       20   0       0      0      0 S   0.0  0.0   0:03.84 kthreadd
        3       20   0       0      0      0 S   0.0  0.0   1:13.88 ksoftirqd/0
        5        0 -20       0      0      0 S   0.0  0.0   0:00.00 kworker/0:+
    ...snipped...
       46        0 -20       0      0      0 S   0.0  0.0   0:00.00 khelper
       47        0 -20       0      0      0 S   0.0  0.0   0:00.00 netns
      235        0 -20       0      0      0 S   0.0  0.0   0:00.00 writeback
      238        0 -20       0      0      0 S   0.0  0.0   0:00.00 bioset
      239        0 -20       0      0      0 S   0.0  0.0   0:00.00 kblockd
      245        0 -20       0      0      0 S   0.0  0.0   0:00.00 ata_sff
      255       20   0       0      0      0 S   0.0  0.0   0:00.00 khubd
      262        0 -20       0      0      0 S   0.0  0.0   0:00.00 edac-poller
      280        0 -20       0      0      0 S   0.0  0.0   0:00.00 rpciod
      327       20   0       0      0      0 S   0.0  0.0  40:24.01 kswapd0
      328       20   0       0      0      0 S   0.0  0.0   0:00.00 fsnotify_m+
      329        1 -19       0      0      0 S   0.0  0.0   0:00.00 nfsiod
      335        0 -20       0      0      0 S   0.0  0.0   0:00.00 kthrotld
      904       20   0       0      0      0 S   0.0  0.0   0:00.00 scsi_eh_0
      907       20   0       0      0      0 S   0.0  0.0   0:00.00 scsi_eh_1
      921       20   0       0      0      0 S   0.0  0.0   0:00.00 spi32766
      968       20   0       0      0      0 S   0.0  0.0   0:43.90 kworker/5:1
      ...snipped...
     5047 nobody    20   0  303308    812    812 S   0.0  0.0   2:55.11 nginx
    ...snipped...
    30466 admin     20   0    2728    704    580 S   0.0  0.0   0:00.01 less
    30471       20   0   17248   7100   2204 S   0.0  0.2   0:00.02 sh
    30476       20   0   16724   6204   1560 S   0.0  0.2   0:00.02 sed


    You can press "1" after running "show system resources follow" to toggle view to show separate states:
    Example from a PA-850 firewall:

    show system resources
    admin@PA-850 (active)> show system resources
    top - 14:21:05 up 11 days, 5:58, 1 user, load average: 6.45, 6.44, 6.45
    Tasks: 173 total, 8 running, 165 sleeping, 0 stopped, 0 zombie
    %Cpu0 : 35.0 us, 7.3 sy, 0.0 ni, 57.8 id, 0.0 wa, 0.0 hi, 0.0 si, 0.0 st
    %Cpu1 :100.0 us, 0.0 sy, 0.0 ni, 0.0 id, 0.0 wa, 0.0 hi, 0.0 si, 0.0 st
    %Cpu2 :100.0 us, 0.0 sy, 0.0 ni, 0.0 id, 0.0 wa, 0.0 hi, 0.0 si, 0.0 st
    %Cpu3 :100.0 us, 0.0 sy, 0.0 ni, 0.0 id, 0.0 wa, 0.0 hi, 0.0 si, 0.0 st
    %Cpu4 :100.0 us, 0.0 sy, 0.0 ni, 0.0 id, 0.0 wa, 0.0 hi, 0.0 si, 0.0 st
    %Cpu5 :100.0 us, 0.0 sy, 0.0 ni, 0.0 id, 0.0 wa, 0.0 hi, 0.0 si, 0.0 st
    %Cpu6 : 20.5 us, 3.0 sy, 0.0 ni, 76.6 id, 0.0 wa, 0.0 hi, 0.0 si, 0.0 st
    %Cpu7 : 28.1 us, 7.3 sy, 0.0 ni, 64.4 id, 0.3 wa, 0.0 hi, 0.0 si, 0.0 st
    KiB Mem : 4119516 total, 456576 free, 1381004 used, 2281936 buff/cache
    KiB Swap: 6104084 total, 5046640 free, 1057444 used. 1434722 avail Mem

    PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
    4532 root 20 0 99752 5796 4784 R 100.0 0.1 16189:00 pan_task
    4533 root 20 0 74712 5540 4920 R 100.0 0.1 16189:29 pan_task
    4534 root 20 0 74712 5460 4860 R 100.0 0.1 16189:29 pan_task
    4537 root 20 0 74732 5360 4756 R 100.0 0.1 16188:40 pan_task
    4538 root 20 0 74712 5564 4932 R 100.0 0.1 16189:40 pan_task
    21848 root 20 0 4012 1664 1256 R 96.0 0.0 5:27.98 genindex.sh
    3330 root 20 0 48976 4520 2560 S 1.0 0.1 69:47.02 ehmon
    4488 root 20 0 639500 139056 117412 S 0.3 3.4 48:58.79 useridd
    4490 root 20 0 364060 5096 3720 S 0.3 0.1 98:37.45 distributord
    4523 root 20 0 58676 2892 2576 S 0.3 0.1 21:11.89 tund
    4761 root 20 0 131032 14696 3064 S 0.3 0.4 75:23.93 identityclient
    4783 nobody 20 0 288456 8620 3936 S 0.3 0.2 4:48.71 appweb3
    4785 root 20 0 1656580 195980 7320 S 0.3 4.8 32:45.77 logrcvr
    25172 nobody 20 0 2186392 159940 7900 S 0.3 3.9 2:41.97 httpd
    1 root 20 0 2320 508 504 S 0.0 0.0 0:08.10 init


    To toggle Irix mode off, press Shift+I:

    - You can see above that this box has 8 CPUs, 5 of which (CPUs 1-5) are used for the DP. The DP process is pan_task, which is designed to run at all times, so those cores show no idle CPU cycles.
    - In the output above, each individual pan_task (the DP process) shows 100% CPU. This is expected behavior for the top command in Irix mode, where the %CPU column can display values that total more than 100%.
    - Solaris mode divides the %CPU for each process by the number of CPUs in the system so that the total stays at or below 100%, as illustrated in the sketch below.
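
    As a minimal worked example of the difference (Python, using the figures from the PA-850 output above purely for illustration):

    # Irix mode: %CPU is measured against a single CPU, so busy processes can sum to > 100%.
    # Solaris mode: %CPU is divided by the number of CPUs, so the total stays at or below 100%.
    num_cpus = 8                       # the PA-850 example above has 8 CPUs
    irix_pct = 100.0                   # one fully busy pan_task as reported in Irix mode
    solaris_pct = irix_pct / num_cpus  # the same process as reported in Solaris mode
    print(f"Irix: {irix_pct:.1f}%  ->  Solaris: {solaris_pct:.1f}%")  # 100.0% -> 12.5%

    # Five pan_task processes at ~100% each therefore account for roughly
    # 5 * 12.5% = 62.5% of total system CPU when viewed in Solaris mode.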


    top - 15:37:56 up 15 days, 7:15, 1 user, load average: 6.59, 6.65, 6.66
    Tasks: 176 total, 7 running, 167 sleeping, 0 stopped, 2 zombie
    %Cpu0 : 28.0 us, 6.4 sy, 0.5 ni, 64.2 id, 0.0 wa, 0.0 hi, 0.9 si, 0.0 st
    %Cpu1 :100.0 us, 0.0 sy, 0.0 ni, 0.0 id, 0.0 wa, 0.0 hi, 0.0 si, 0.0 st
    %Cpu2 :100.0 us, 0.0 sy, 0.0 ni, 0.0 id, 0.0 wa, 0.0 hi, 0.0 si, 0.0 st
    %Cpu3 :100.0 us, 0.0 sy, 0.0 ni, 0.0 id, 0.0 wa, 0.0 hi, 0.0 si, 0.0 st
    %Cpu4 :100.0 us, 0.0 sy, 0.0 ni, 0.0 id, 0.0 wa, 0.0 hi, 0.0 si, 0.0 st
    %Cpu5 :100.0 us, 0.0 sy, 0.0 ni, 0.0 id, 0.0 wa, 0.0 hi, 0.0 si, 0.0 st
    %Cpu6 : 17.9 us, 6.0 sy, 0.0 ni, 75.7 id, 0.5 wa, 0.0 hi, 0.0 si, 0.0 st
    %Cpu7 : 43.6 us, 7.3 sy, 0.0 ni, 48.6 id, 0.0 wa, 0.0 hi, 0.5 si, 0.0 st
    KiB Mem : 4119516 total, 292220 free, 1224748 used, 2602548 buff/cache
    KiB Swap: 6104084 total, 4752288 free, 1351796 used. 848512 avail Mem

    Irix mode On
    PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
    4532 root 20 0 99752 3508 2832 R 100.0 0.1 22023:31 pan_task
    4534 root 20 0 74712 3460 2868 R 100.0 0.1 22024:01 pan_task
    4537 root 20 0 74732 3560 2956 R 100.0 0.1 22023:01 pan_task
    4538 root 20 0 74712 3372 2776 R 100.0 0.1 22024:18 pan_task
    4533 root 20 0 74712 3320 2776 R 99.5 0.1 22023:55 pan_task
    23368 root 20 0 4012 1620 1212 R 95.4 0.0 7:05.25 genindex.sh
    3104 root 0 -20 206236 19628 2372 S 5.5 0.5 98:36.40 masterd_apps
    4488 root 20 0 639500 131072 114584 S 4.6 3.2 68:00.68 useridd
    3124 root 15 -5 133252 8904 1240 S 1.8 0.2 202:33.41 sysd
    3307 root 30 10 202592 7432 1928 S 1.4 0.2 85:45.42 python
    41 root 20 0 0 0 0 S 0.5 0.0 17:19.33 rcuc/7
    3686 admin 20 0 4208 1616 1128 R 0.5 0.0 0:00.23 top
    4761 root 20 0 131032 13484 2124 S 0.5 0.3 103:03.36 identityclient
    4804 root 20 0 1204484 3332 1440 S 0.5 0.1 18:27.52 authd
    7675 root 20 0 0 0 0 S 0.5 0.0 0:04.91 kworker/0:2


    Additionally, note that the average CPU usage reported for this box also includes the DP cores, so your monitoring tools will show higher MP usage than the MP cores alone account for. The sketch below shows one way to derive an MP-only average from the per-CPU output.
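
    A minimal Python sketch, assuming (as in the PA-850 example above) that CPUs 1-5 are the DP cores and taking the idle percentages from the "%CpuN" lines:

    # Per-CPU idle percentages taken from the '%CpuN' lines in the PA-850 output above.
    idle_by_cpu = {0: 64.2, 1: 0.0, 2: 0.0, 3: 0.0, 4: 0.0, 5: 0.0, 6: 75.7, 7: 48.6}
    dp_cores = {1, 2, 3, 4, 5}  # assumption: the cores running pan_task on this platform

    all_busy = [100.0 - idle for idle in idle_by_cpu.values()]
    mp_busy = [100.0 - idle for cpu, idle in idle_by_cpu.items() if cpu not in dp_cores]

    print(f"Average across all cores (what many monitors report): {sum(all_busy) / len(all_busy):.1f}%")
    print(f"Average across MP cores only (0, 6, 7):               {sum(mp_busy) / len(mp_busy):.1f}%")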

    Example from a PA-5060 device:
    show system resources follow

     top - 21:23:26 up 4 days,  1:57,  1 user,  load average: 0.00, 0.00, 0.00
    Tasks: 131 total,   1 running, 130 sleeping,   0 stopped,   0 zombie
    Cpu0  :  0.3%us,  0.0%sy,  0.0%ni, 99.7%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
    Cpu1  :  0.3%us,  0.0%sy,  0.0%ni, 99.7%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
    Cpu2  :  0.0%us,  0.0%sy,  0.0%ni,100.0%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
    Cpu3  :  0.0%us,  0.0%sy,  0.0%ni,100.0%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
    Mem:   4053832k total,  2953796k used,  1100036k free,   163596k buffers
    Swap:  2007992k total,     1080k used,  2006912k free,  1298156k cached

    PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
    2402 root 0 -20 52960 16m 4440 S 0.3 0.4 5:31.90 masterd_apps
    2421 root 15 -5 21680 4408 2508 S 0.3 0.1 18:59.06 sysd
    3106 nobody 20 0 156m 47m 10m S 0.3 1.2 21:01.78 appweb3
    1 root 20 0 2084 696 576 S 0.0 0.0 0:01.88 init


    On this platform, all of the cores shown are dedicated to the MP, so you do not see any pan_task processes. The rest of the logic remains the same.


    • MP-Log

    To check the management plane's historical resource usage, use the following command (this information is logged every 15 minutes). A sketch for mining this log programmatically follows the sample output below.

    less mp-log mp-monitor.log
    admin@firewall1(active)> less mp-log mp-monitor.log
    2022-03-16 06:34:33.901 +0000  --- processes
    Total num processes: 39
    Name                   PID      CPU%  FDs Open   Virt Mem     Res+Swap     State
    pppoe                  6655     0     7          446568       27148        S
    ha_agent               6648     0     11         239500       29040        S
    satd                   6649     0     12         562168       39420        S
    l2ctrl                 6646     0     8          519204       32788        S
    dp-console             4234     0     5          21992        4596         S
    varrcvr                6645     0     34         3011300      145616       S
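
    Because mp-monitor.log accumulates a snapshot roughly every 15 minutes, it is useful for identifying which processes were consuming CPU at a particular time in the past. The sketch below (Python, illustration only; it assumes the "--- processes" table format shown in the excerpt above, which may differ between PAN-OS releases) pulls per-process CPU% out of an exported copy of the log:

    import re

    def top_cpu_from_mp_monitor(path, threshold=10):
        """Yield (timestamp, process_name, cpu_pct) for table rows at or above 'threshold' CPU%."""
        timestamp = None
        with open(path) as fh:
            for line in fh:
                ts = re.match(r"(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2})", line)
                if ts and "--- processes" in line:
                    timestamp = ts.group(1)
                    continue
                # Rows look like: Name  PID  CPU%  FDs Open  Virt Mem  Res+Swap  State
                row = re.match(r"(\S+)\s+(\d+)\s+(\d+)\s+\d+\s+\d+\s+\d+\s+\S+\s*$", line)
                if timestamp and row and int(row.group(3)) >= threshold:
                    yield timestamp, row.group(1), int(row.group(3))

    # Example usage against a copy of the log exported from the firewall (path is hypothetical):
    # for ts, name, cpu in top_cpu_from_mp_monitor("mp-monitor.log"):
    #     print(ts, name, cpu)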


    • Genindex.sh causing high MP CPU

    show system resources follow
    top - 14:17:29 up 11 days, 5:55, 1 user, load average: 6.84, 6.46, 6.45
    Tasks: 176 total, 8 running, 168 sleeping, 0 stopped, 0 zombie
    %Cpu(s): 73.0 us, 2.6 sy, 0.1 ni, 24.3 id, 0.0 wa, 0.0 hi, 0.0 si, 0.0 st
    KiB Mem : 4119516 total, 319332 free, 1415160 used, 2385024 buff/cache
    KiB Swap: 6104084 total, 5050440 free, 1053644 used. 1295050 avail Mem

    PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
    4532 root 20 0 99752 5800 4788 R 100.0 0.1 16185:23 pan_task
    4533 root 20 0 74712 5544 4924 R 100.0 0.1 16185:53 pan_task
    4534 root 20 0 74712 5460 4860 R 100.0 0.1 16185:52 pan_task
    4538 root 20 0 74712 5568 4936 R 100.0 0.1 16186:04 pan_task
    4537 root 20 0 74732 5364 4760 R 99.7 0.1 16185:04 pan_task
    21848 root 20 0 4012 1664 1256 R 97.0 0.0 2:03.25 genindex.sh
    3124 root 15 -5 133232 12636 2568 S 1.0 0.3 140:10.34 sysd
    3104 root 0 -20 206236 22044 4280 S 0.7 0.5 70:13.62 masterd_apps
    3307 root 30 10 202592 10124 4512 S 0.7 0.2 62:48.48 python
    4785 root 20 0 1656580 195988 7104 S 0.3 4.8 32:45.55 logrcvr

    ▶ Symptom
    The genindex.sh process is utilizing a large amount of CPU.

    ▶ Environment
    All firewall platforms can be affected after upgrading to PAN-OS 8.1 or later.

    ▶ Cause
    genindex.sh runs after a PAN-OS upgrade to re-index the logs with any new fields that have been added. It runs every 15 minutes and is known to cause CPU spikes on the management plane. Once it has finished indexing all the logs, it stops running.

    This should not cause any issues or be detrimental to the firewall, and it only lasts until all of your logs have been indexed. However, depending on how many logs you have, which varies with your platform and log-retention settings, it can take up to several weeks.

    One way to reduce the amount of time genindex.sh has to run is to make use of log retention: the fewer logs there are to index, the less time it will take for the process to finish.

    ▶ Resolution
    There is no resolution; this is working as expected. Once it has finished indexing all the logs, it will stop running and you should see CPU utilization drop. It is also worth keeping in mind that this process only uses one core, so devices with multiple MP cores are not significantly affected, although the logs and GUI will still show that the management CPU is high.

    In many cases, the impact can be reduced by lowering the amount of logging.


    • Varrcvr process is causing high MP CPU

    ▶ Symptom
    The MP CPU is consistently hitting 100% on this firewall. A recent major change was migration to Panorama; no reports were being run, and the firewall does not generate a large volume of logs.
    When the command "show system resources follow" was executed multiple times, the varrcvr process consistently showed 66% or more CPU in the output.


    show system resources follow


    admin@pa-220> show system resources follow
    top - 17:15:18 up 63 days, 18:28,  1 user,  load average: 3.77, 3.93, 4.10
    Tasks: 139 total,   6 running, 133 sleeping,   0 stopped,   0 zombie
    %Cpu0  : 90.5 us,  8.5 sy,  0.0 ni,  1.0 id,  0.0 wa,  0.0 hi,  0.0 si,  0.0 st
    %Cpu1  :100.0 us,  0.0 sy,  0.0 ni,  0.0 id,  0.0 wa,  0.0 hi,  0.0 si,  0.0 st
    %Cpu2  :100.0 us,  0.0 sy,  0.0 ni,  0.0 id,  0.0 wa,  0.0 hi,  0.0 si,  0.0 st
    %Cpu3  : 38.4 us, 14.1 sy, 47.2 ni,  0.3 id,  0.0 wa,  0.0 hi,  0.0 si,  0.0 st
    KiB Mem :  4119652 total,   254280 free,  2471820 used,  1393552 buff/cache
    KiB Swap:     1972 total,     1972 free,        0 used.  1165668 avail Mem

      PID USER      PR  NI    VIRT    RES    SHR S  %CPU %MEM     TIME+ COMMAND
     3219 root      20   0   71944  32292   4972 R  99.7  0.8  91773:26 pan_task
     3220 root      20   0   46976   7160   4724 R  99.7  0.2  91776:16 pan_task
     4478 root      20   0 1549004  25420   8980 S  66.6  0.6  28695:37 varrcvr
    27267 root      20   0    3876   1536   1260 R   8.2  0.0   0:05.09 genindex.sh
    27252 root      20   0   17312   7524   2280 S   4.9  0.2   0:00.76 logpurger.sh
     2915 root       0 -20  199232  32176   4548 S   2.0  0.8 432:48.61 masterd_apps
    27212 admin     20   0    4184   1576   1128 R   0.7  0.0   0:00.28 top
    28316 root      20   0   15444   3092   1152 R   0.7  0.1   0:00.02 cat
       10 root      20   0       0      0      0 R   0.3  0.0  57:09.10 rcuc/0
       16 root      20   0       0      0      0 S   0.3  0.0  24:36.92 rcuc/2
     1504 nobody    20   0 2316468 161600  11072 S   0.3  3.9   6:58.72 httpd
    ...snipped...


    ▶ Resolution
    Restart the varrcvr (vardata-receiver) process with the following CLI command.

    debug software restart process vardata-receiver
    admin@pa-220> debug software restart process vardata-receiver
    Process varrcvr was restarted by user admin

    vardata-receiver (varrcvr) is the Vardata Receiver server process. It records URL filtering logs and packet captures sent by the dataplane, and it is also involved with WildFire logs.


    • PA-500 High Management CPU and poor performance with high logging

    ▶ Symptom
    The PA-500 runs very slowly, and the management CPU can be observed at 98-100% almost all of the time.

    The MP CPU spikes up very frequently, and "logrcvr", "mgmtsrvr", and "genindex.sh" appear to be the offending processes:

    PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND 
    3265 20 0 1095m 282m 4780 S 35.4 14.2 93995:05 logrcvr 
    14961 20 0 5260 1560 1232 R 16.9 0.1 0:51.34 genindex.sh 

    3265 20 0 1095m 205m 3264 S 76.7 10.4 94000:37 logrcvr 
    3217 20 0 1198m 315m 7280 S 14.7 15.9 38133:24 mgmtsrvr


    High IOWait was seen as well; however, disk space utilization was fine:

    CPU USAGE DISTRIBUTION     Avg     Max      Max Date
    -----------------------------------------------------
    User                       54%     54.3%    01-08-06:15
    System                     13%     13.9%    01-05-12:11
    Nice                        2%      2.6%    01-09-02:36
    Idle                       25%     26.6%    01-04-11:29
    IOWait                      1%      1.1%    01-04-11:29


    Due to poor management plane performance, commits and other tasks may be affected and get stuck; using commit force should help push them through.

    ▶ Environment
    Palo Alto PA-500 Firewall.
    PAN-OS 8.0 and 8.1.
    High rate of logging configured.

    ▶ Cause
    When the average IOWait is greater than 1%, it is likely causing processes to run slower than normal, and it is important to reduce the amount of logging.
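
    As a minimal illustration of that guideline (Python, with hypothetical sample values), the sketch below averages %wa readings collected from repeated "show system resources" snapshots and flags when the 1% threshold is exceeded:

    # Hypothetical %wa samples collected from repeated 'show system resources' snapshots.
    iowait_samples = [0.3, 1.8, 2.4, 0.9, 1.6]

    avg_iowait = sum(iowait_samples) / len(iowait_samples)
    print(f"Average IOWait: {avg_iowait:.1f}%")
    if avg_iowait > 1.0:
        # Sustained IOWait above ~1% suggests disk I/O is the bottleneck;
        # reducing the logging rate is the first step recommended above.
        print("IOWait is above 1% - consider reducing the amount of logging.")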

    ▶ Resolution
    1. Set the maximum number of days to keep logs. In this case, the max days value was configured to '90', which automatically purged all logs older than 90 days.
    2. The device was also upgraded to the latest stable release. Both of these actions improved performance remarkably.



