Saturday, March 5, 2022

Extreme Switch - Troubleshooting High CPU in EXOS


    An EXOS switch offers two ways to look at CPU utilization. The top command shows the real-time utilization of the EXOS processes and refreshes every few seconds (the sample output below shows it running as top -d 3, a 3-second interval). The show cpu-monitoring command, by comparison, breaks utilization down per process over 5-, 10-, and 30-second and 1-, 5-, 30-, and 60-minute intervals, and lists the processes in alphabetical order.


    • How to check CPU Utilization in EXOS


    Option 1. Use 'top' command to check the real-time utilization.


    The CPU line in the output of the top command contains the following fields:

    · usr: user cpu time (or) % CPU time spent in user space
    · sys: system cpu time (or) % CPU time spent in kernel space
    · nic: user nice cpu time (or) % CPU time spent on low priority processes
    · idle: idle cpu time (or) % CPU time spent idle
    · io: io wait time (or) % CPU time spent waiting for I/O to complete
    · irq: hardware irq (or) % CPU time spent servicing/handling hardware interrupts
    · sirq: software irq (or) % CPU time spent servicing/handling software interrupts

    The Load average line reports the system load averaged over the last 1, 5, and 15 minutes.

    top
    Switch# top
    Mem: 391088K used, 589604K free, 716K shrd, 16888K buff, 120060K cached
    CPU: 15.4% usr  2.2% sys  0.0% nic 82.2% idle  0.0% io  0.0% irq  0.0% sirq
    Load average: 4.18 4.23 4.19 2/274 4199
      PID  PPID USER     STAT   RSS %RSS CPU %CPU COMMAND
     1949     1 root     S<   89492  9.0   0 14.8 ./hal
    11681     1 root     S    23672  2.4   0  0.3 ./expy -d -m exos.httpd
     1955     1 root     S     5480  0.5   0  0.3 ./fdb
     1975     1 root     S     4568  0.4   0  0.3 ./dot1ag
     2060     1 root     S     4960  0.5   0  0.3 ./pim
     5804  5802 root     S     8232  0.8   0  0.3 /exos/bin/hiveagent_pr {  "upgradeVersion": "",  "status": 0,  "infor
     1653     1 root     S     5632  0.5   0  0.3 /exos/bin/epm -t 40 -f /exos/config/epmrc -d /exos/config/epmdprc
     4188  3727 root     R     1968  0.2   0  0.3 top -d 3
    21691     2 root     IW       0  0.0   0  0.3 [kworker/0:2]
     1953     1 root     S     6340  0.6   0  0.0 ./vlan
     1939     1 root     S    36640  3.7   0  0.0 ./cliMaster
     5796  2019 root     S    16128  1.6   0  0.0 /exos/bin/expy -m exos.apps.iqagent -v 2
     2007     1 root     S    16000  1.6   0  0.0 ./policy
     1945     1 root     S     6676  0.6   0  0.0 ./aaa -t random
     1993     1 root     S     5256  0.5   0  0.0 ./exsshd
     2056     1 root     S     4376  0.4   0  0.0 ./ospf
     1941     1 root     S     7924  0.8   0  0.0 ./snmpMaster

    Press Ctrl + C or q to exit from the top command's monitoring screen.
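
    When the top output is captured to a text file, the CPU and Load average lines can be read programmatically. The following is a minimal Python sketch, not an EXOS tool: the function names are hypothetical, and it assumes the line layout matches the sample output above, which may differ between software versions.

```python
import re

# Sample lines copied from the top output above.
CPU_LINE = "CPU: 15.4% usr  2.2% sys  0.0% nic 82.2% idle  0.0% io  0.0% irq  0.0% sirq"
LOAD_LINE = "Load average: 4.18 4.23 4.19 2/274 4199"


def parse_cpu_line(line):
    """Return a dict such as {'usr': 15.4, 'idle': 82.2, ...} from the CPU: line."""
    if not line.startswith("CPU:"):
        raise ValueError("not a CPU summary line")
    # Each field is printed as '<number>% <name>'.
    pairs = re.findall(r"([\d.]+)%\s+(\w+)", line)
    return {name: float(value) for value, name in pairs}


def parse_load_line(line):
    """Return the 1-, 5- and 15-minute load averages as a tuple of floats."""
    match = re.match(r"Load average:\s+([\d.]+)\s+([\d.]+)\s+([\d.]+)", line)
    if match is None:
        raise ValueError("not a load average line")
    return tuple(float(value) for value in match.groups())


cpu = parse_cpu_line(CPU_LINE)
busy = round(100.0 - cpu["idle"], 1)  # everything that is not idle time
print(cpu)
print("busy %:", busy)
print("load averages:", parse_load_line(LOAD_LINE))
```

    Subtracting the idle value from 100 gives a single "busy" figure that is easier to track over time than the individual fields.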


    Option 2. Use 'show cpu-monitoring' to check in seconds and minutes intervals.


    By default, CPU monitoring is enabled and occurs every 5 seconds. The default CPU threshold value is 90%.

    Depending on the software version running on your switch or your switch model, additional or different CPU and process information might be displayed.

    The show cpu-monitoring command is helpful for understanding the behavior of a process over an extended period of time. The following information appears in a tabular format:

    · Card: The location (MSM A or MSM B) where the process is running on a modular switch.
    · Process: The name of the process.
    · Range of time (5 seconds, 10 seconds, and so forth): The CPU utilization history of the process or the system. The CPU utilization history goes back only 1 hour.
    · Total User/System CPU Usage: The amount of time, in seconds, that the process has spent occupying CPU resources. The values are cumulative, meaning that they keep accumulating for as long as the system is running. You can use this information for debugging purposes to see where the process spends most of its time: user context or system context.


    show cpu-monitoring
    Switch# show cpu-monitoring

          CPU Utilization Statistics - Monitored every 5 seconds
    -----------------------------------------------------------------------

    Process      5   10   30   1    5    30   1    Max           Total
                secs secs secs min  mins mins hour            User/System
                util util util util util util util util       CPU Usage
                (%)  (%)  (%)  (%)   (%)  (%)  (%)  (%)         (secs)
    -----------------------------------------------------------------------

    System        0.0  0.0  0.0  0.0  0.0  0.0  0.0 84.1 17486.59   19182.42
    aaa           0.1  0.0  0.0  0.2  0.1  0.1  0.1  1.6  1002.78     807.02
    acl           0.0  0.1  0.0  0.1  0.1  0.1  0.1  1.9  4688.14    4188.75
    bfd           0.0  0.0  0.0  0.0  0.0  0.0  0.0  1.0   934.45     766.22
    bgp           0.0  0.0  0.0  0.0  0.0  0.0  0.0  5.6   280.48     112.71
    brm           0.0  0.0  0.0  0.0  0.0  0.0  0.0  1.0   279.33     131.17
    cfgmgr        0.0  0.0  0.0  0.0  0.2  0.1  0.2  4.8  2036.28     994.53
    cli           0.0  0.0  0.0  0.2  0.0  0.0  0.0 33.5   558.08     256.61
    devmgr        0.0  0.0  0.0  0.2  0.0  0.1  0.1  5.6  1844.89     434.18
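
    Saved show cpu-monitoring output can be scanned for processes whose utilization exceeded a chosen value. The following is a minimal Python sketch under stated assumptions: it expects the non-modular layout shown above (no Card column), it treats the last two numbers as user and system seconds in that order, and the function names, column keys, and the 30% threshold are hypothetical choices made for illustration.

```python
# Data rows copied from the show cpu-monitoring output above.
SAMPLE = """\
System        0.0  0.0  0.0  0.0  0.0  0.0  0.0 84.1 17486.59   19182.42
aaa           0.1  0.0  0.0  0.2  0.1  0.1  0.1  1.6  1002.78     807.02
acl           0.0  0.1  0.0  0.1  0.1  0.1  0.1  1.9  4688.14    4188.75
bfd           0.0  0.0  0.0  0.0  0.0  0.0  0.0  1.0   934.45     766.22
bgp           0.0  0.0  0.0  0.0  0.0  0.0  0.0  5.6   280.48     112.71
brm           0.0  0.0  0.0  0.0  0.0  0.0  0.0  1.0   279.33     131.17
cfgmgr        0.0  0.0  0.0  0.0  0.2  0.1  0.2  4.8  2036.28     994.53
cli           0.0  0.0  0.0  0.2  0.0  0.0  0.0 33.5   558.08     256.61
devmgr        0.0  0.0  0.0  0.2  0.0  0.1  0.1  5.6  1844.89     434.18
"""

# Hypothetical short keys for the eight percentage columns.
COLUMNS = ("5s", "10s", "30s", "1m", "5m", "30m", "1h", "max")


def parse_rows(text):
    """Map each process name to its utilization figures."""
    rows = {}
    for line in text.splitlines():
        parts = line.split()
        # A data row is a name followed by 8 percentages and 2 totals.
        if len(parts) != 11:
            continue
        try:
            numbers = [float(part) for part in parts[1:]]
        except ValueError:
            continue  # header or separator line
        row = dict(zip(COLUMNS, numbers[:8]))
        # Assumed order of the two totals: user seconds, then system seconds.
        row["user_secs"], row["system_secs"] = numbers[8], numbers[9]
        rows[parts[0]] = row
    return rows


def over_threshold(rows, column="max", threshold=30.0):
    """Return the names of processes whose value in `column` exceeds `threshold`."""
    return sorted(name for name, row in rows.items() if row[column] > threshold)


rows = parse_rows(SAMPLE)
print(over_threshold(rows))
```

    Checking the Max column finds past spikes, while checking a recent column such as "5s" or "1m" shows what is busy right now.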


    • Understand the processes


    Use the following commands to see a description of each process.

    show process description
    Switch# show process description
    Process Name     Description
    ----------------------------------------------------------------------
    aaa              Authentication, Authorization, and Accounting Server
    acl              Access Control List Manager
    bfd              IETF Bidirectional Forwarding Detection
    ...snipped ...

    Switch# show process description thttpd
    Process Name     Description
    ----------------------------------------------------------------------
    thttpd           HTTP Services


    • High CPU utilization cases


    Here are several high CPU utilization cases and resolutions.

    Case 1. High CPU utilization with hal process after upgrading to EXOS 30.7.


    EXOS 30.6 and earlier versions did not account for the CPU consumption of the hal process, so the reported CPU utilization is higher after the switch is upgraded to EXOS 30.7 or later.

    * Resolution:
    Upgrade to the latest patch of the recommended release. The following link lists the recommended EXOS and Switch Engine releases for each hardware platform.

    ExtremeXOS and Switch Engine Release Recommendations


    Case 2. High CPU utilization with hal process in EXOS 22.x.


    Even with the default configuration, CPU utilization can exceed 20% in EXOS 22.x. On some lower-end switches in particular, it can peak above 90% when a lot of programming is happening on the switch. In the earlier 21.x versions, the segment that handles link scanning and multicast re-programming ran in kernel space. Since 22.x, this segment runs in the hal process.

    * Resolution:
    Upgrade to the latest patch of the recommended release.


    Case 3. High CPU utilization from VSM process.


    * Symptoms:
    High CPU utilization on the Backup node of a SummitStack. The following log message is generated:
    <Warn:EPM.cpu> Slot-2: CPU utilization monitor: process vsm consumes 97 % CPU

    * Cause:
    TCP port 4001 (used for communication between MLAG peers) is open when it should not be.
    A large amount of traffic is sent to TCP port 4001, which leads to high CPU utilization from the vsm process.

    * Resolution:
    Upgrade to a version of code that includes the fix for CR xos0052842.
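
    Warnings like the one quoted in the symptoms above can be picked out of a saved log with a regular expression. The following is a minimal Python sketch: the pattern is inferred only from the two log lines quoted in this article (Case 3 and Case 4), the function name is hypothetical, and other EXOS versions may word the message differently.

```python
import re

# Pattern inferred from the log lines quoted in this article (an assumption,
# not an official message format).
PATTERN = re.compile(
    r"(?P<node>\S+): CPU utilization monitor: "
    r"process (?P<process>\S+) consumes (?P<percent>\d+) % CPU"
)


def parse_cpu_warning(line):
    """Return (node, process, percent) or None if the line is not a CPU warning."""
    match = PATTERN.search(line)
    if match is None:
        return None
    return match.group("node"), match.group("process"), int(match.group("percent"))


LINES = [
    "<Warn:EPM.cpu> Slot-2: CPU utilization monitor: process vsm consumes 97 % CPU",
    "07/25/2014 11:50:24.18 MSM-B: CPU utilization monitor: process nodemgr consumes 79 % CPU",
]

for line in LINES:
    print(parse_cpu_warning(line))
```

    Counting the results per process across a whole log shows quickly which process triggers the monitor most often.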


    Case 4. High CPU Utilization from Nodemgr process.


    * Symptoms:
    - High CPU utilization from Nodemgr process
    - System uptime equal to or greater than 994 days

    A message similar to the following log entry may be seen:
    07/25/2014 11:50:24.18 MSM-B: CPU utilization monitor: process nodemgr consumes 79 % CPU

    * Environment:
    EXOS versions 12.4.x prior to 12.4.4.8-patch1-1
    EXOS versions 12.5.x prior to 12.5.3.7
    EXOS versions 12.3.x

    * Cause:
    The nodemgr process continuously consumes excessive CPU once the system uptime reaches around 994 days.

    * Resolution:
    This issue has been resolved under CR xos0042592 and is fixed in the following EXOS releases:
    12.4.4.8-patch1-1 and later
    12.5.3.7 and later

    A temporary workaround is to reboot the switch.
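
    The Environment and Resolution lists above can be turned into a quick check of whether a given version string falls in the affected range. The following is a minimal Python sketch: the function names are hypothetical, the version format is assumed to be four dotted numbers with an optional "-patchN-M" suffix, and an unpatched build is assumed to sort before any patch of the same build. Branches that the article does not list are reported as not listed, which is not the same as confirmed unaffected.

```python
import re

# First fixed release per branch, taken from the Resolution list above.
# None means the article names no fixed release for that branch.
FIRST_FIXED = {
    (12, 3): None,
    (12, 4): "12.4.4.8-patch1-1",
    (12, 5): "12.5.3.7",
}


def version_key(version):
    """Turn '12.4.4.8-patch1-1' into a sortable tuple (12, 4, 4, 8, 1, 1)."""
    match = re.fullmatch(
        r"(\d+)\.(\d+)\.(\d+)\.(\d+)(?:-patch(\d+)-(\d+))?", version
    )
    if match is None:
        raise ValueError(f"unrecognised version string: {version!r}")
    # Missing patch numbers become 0 so the base build sorts first (assumption).
    return tuple(int(group) if group is not None else 0 for group in match.groups())


def listed_as_affected(version):
    """Return True if the article lists this version as affected."""
    key = version_key(version)
    branch = key[:2]
    if branch not in FIRST_FIXED:
        return False  # branch is not mentioned in the Environment list
    fixed = FIRST_FIXED[branch]
    return fixed is None or key < version_key(fixed)


for candidate in ("12.3.1.2", "12.4.4.8", "12.4.4.8-patch1-1", "12.5.3.7"):
    print(candidate, listed_as_affected(candidate))
```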

    Case 5. High CPU utilization from bcmRx process on an sFlow-enabled switch.


    * Symptoms:
    High CPU utilization from the bcmRx process on an sFlow-enabled switch.

    * Environment:
    EXOS 12.4, BlackDiamond 8810

    * Cause:
    sFlow is sending excessive traffic to the CPU.

    * Resolution:
    Disable sFlow or configure a lower CPU sample limit, for example "configure sflow max-cpu-sample-limit 100".



