7410 hardware update, and analyzing the HyperTransport


23 Sep 2009

Today we have released an update to the Sun Storage 7410,
which upgraded the CPUs from Barcelona to Istanbul:

    7410 (Barcelona)                           7410 (Istanbul)
    Max 4 sockets quad core AMD Opteron CPU    Max 4 sockets of six core AMD Opteron CPU
    Max 128 Gbytes DRAM                        Max 256 Gbytes DRAM
    HyperTransport 1.0                         HyperTransport 3.0

This is per head node, so a 2-way cluster can bring half a Terabyte
of DRAM for filesystem caching.
But what has me most excited is the upgrade of the main system bus,
from AMD’s HyperTransport 1 to HyperTransport 3. In this blog post I’ll
explain why, and post numbers for the new 7410.

The following screenshots show the Maintenance->Hardware screen from the original and the new 7410:



New 7410 Results

The following results were collected from the two 7410s shown above.

    Workload                       7410 (Barcelona)    7410 (Istanbul)    Improvement
    NFSv3 streaming cached read    2.15 Gbytes/sec     2.68 Gbytes/sec    25%
    NFSv3 8k read ops/sec          127,143             223,969            75%

A very impressive improvement from what were already great results.

Both of these results are reading a cached working set over NFS, so the
disks are not involved. The CPUs and HyperTransport were upgraded, and these
cached workloads were chosen to push those components to their limits (not the
disks), to see the effect of the upgrade.

The following screenshots are the source of those results, and were taken
from Analytics on the 7410 - showing what the head node really did. These
tests were performed by Sun’s Open Storage Systems group (OSSG). I was able to
log in to Analytics on their systems and take screenshots from the tests they
performed after the fact (since Analytics archives this data) and check
that these results were consistent with my own - which they are.

Original 7410 (Barcelona)

Streaming cached read:


Notice that we can now reach 2.15 Gbytes/sec for NFSv3 on the original 7410 (60 second average of network throughput, which includes protocol headers.)
When I first blogged about the 7410 after launch, I was reaching 1.90
Gbytes/sec; sometime later that became 2.06 Gbytes/sec. The difference is the
software updates - we are gradually improving our performance release after
release.

8k cached read ops:


As a sanity check, we can multiply the observed NFS read ops/sec by their
known size - 8 Kbytes: 127,143 x 8 Kbytes = 0.97 Gbytes/sec. Our observed
network throughput was 1.01 Gbytes/sec, which is consistent with 127K x 8 Kbyte
read ops/sec (higher as it includes protocol headers.)

New 7410 (Istanbul)

Streaming cached read:


2.68 Gbytes/sec - awesome!

8k cached read ops:


This is 75% faster than the original 7410 - this is no small hardware
upgrade! As a sanity test, this showed 223,969 x 8 Kbytes = 1.71 Gbytes/sec.
On the wire we observed 1.79 Gbytes/sec, which includes protocol headers. This is consistent with the expected throughput.
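
The two sanity checks above can be reproduced with a few lines of shell. This is only a sketch: the function name payload_gbytes is made up for illustration, and it assumes 1 Gbyte = 1024 x 1024 Kbytes, which is the convention that matches the rounded figures quoted in this post.

```shell
#!/bin/sh
# payload_gbytes OPS_PER_SEC IO_KBYTES
# Payload throughput in Gbytes/sec, assuming 1 Gbyte = 1024 * 1024 Kbytes.
# (Hypothetical helper for illustration; not one of the scripts in this post.)
payload_gbytes() {
        awk -v ops="$1" -v kb="$2" \
            'BEGIN { printf "%.2f\n", ops * kb / (1024 * 1024) }'
}

# NFS payload only; the on-the-wire figures are higher due to protocol headers.
echo "Barcelona: $(payload_gbytes 127143 8) Gbytes/sec (1.01 observed on the wire)"
echo "Istanbul:  $(payload_gbytes 223969 8) Gbytes/sec (1.79 observed on the wire)"
```

If the computed payload ever came out higher than the observed network throughput, that would suggest that either the op size or one of the measurements was wrong.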

System

The systems tested above were the Barcelona-based and Istanbul-based 7410, both with max CPU and
DRAM, and both running the latest software (2009.Q3.) The same 41 clients were used to
test both 7410s.

The HyperTransport

The Sun Storage 7410 could support four ports of 10 GbE, with a theoretical
combined maximum throughput of 40 Gbit/sec, or 4.64 Gbytes/sec. However in practice it was
reaching about 2.06 Gbytes/sec when reading cached data over NFS. While
over 2 Gbytes/sec is fantastic (and very competitive), why not over 3 or 4
Gbytes/sec?

First of all, if you keep adding high speed I/O cards to a system, you may run out of system resources to drive them before you run out of slots to plug them into. Just because the system lets you plug them all in doesn’t mean that the CPUs, busses and software can drive them at full speed. So, given that, what specifically stopped the 7410 from going faster?

It wasn’t CPU horsepower: we had four sockets of quad-core Opteron and
the very scalable Solaris kernel. The bottleneck was actually the
HyperTransport.


The HyperTransport is used as the CPU interconnect and the path to the
I/O controllers. Any data transferred with the I/O cards (10 GbE cards, SAS
HBAs, etc), will travel via the HTs. It’s also used by the CPUs so they can
access each other’s memory. In the diagram above, picture CPU0 accessing the memory which
is directly attached to CPU3 - which would require two hops over HT links.

HyperTransport 1

A clue that the HyperTransport (and memory busses) could be the bottleneck
was found with the Cycles Per Instruction (CPI):

    walu# ./amd64cpi-kernel 5
              Cycles     Instructions      CPI     %CPU
        167456476045      14291543652    11.72    95.29
        166957373549      14283854452    11.69    95.02
        168408416935      14344355454    11.74    95.63
        168040533879      14320743811    11.73    95.55
        167681992738      14247371142    11.77    95.26
    [...]
    

amd64cpi-kernel is a simple script I wrote (these scripts are not supported by
Sun), to pull the CPI from the AMD CPU PICs (Performance Instrumentation
Counters.) The higher the CPI, the more
cycles are waiting for memory loads/stores, which are stalling instructions. A CPI of
over 11 is the highest I’ve ever seen - a good indication that we are waiting
a significant time for memory I/O.
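
The CPI column is simply cycles divided by instructions for each interval. As a minimal sketch (the cpi function name is made up for illustration, and the zero guard is an addition), the first row of the output above can be checked by hand:

```shell
#!/bin/sh
# cpi CYCLES INSTRUCTIONS
# Cycles per instruction to two decimal places; prints 0.00 if no
# instructions were retired, to avoid dividing by zero.
# (Hypothetical helper for illustration only.)
cpi() {
        awk -v c="$1" -v i="$2" \
            'BEGIN { printf "%.2f\n", (i > 0) ? c / i : 0 }'
}

cpi 167456476045 14291543652    # first interval from the output above
```

This prints 11.72, matching the CPI column for that interval.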

Also note in the amd64cpi-kernel output that I included %CPU - CPU
utilization. With a CPU utilization of over 95%, how many of you would be
reaching for extra or faster CPU cores to improve the system? This is a problem
for all %CPU measurements - yes, the CPU was processing instructions, but it
wasn’t performing the ‘work’ that you assume - instead those instructions were
stalled waiting for memory I/O. Add faster CPUs, and you stall faster (which doesn’t
help.) Add more cores or sockets, and you could make the situation worse -
spreading the workload over more CPUs can decrease the L1/L2 CPU cache hit
rates, putting even more pressure on memory I/O.

To investigate the high CPI, I wrote more scripts to figure out what the
memory buses and HT buses were doing. My amd64htcpu script shows
the HyperTransport transmit Mbytes/sec, by both CPU and port (notice in the
diagram that each CPU has 3 HT ports):

     walu# ./amd64htcpu 1
         Socket  HT0 TX MB/s  HT1 TX MB/s  HT2 TX MB/s  HT3 TX MB/s
              0      3170.82       595.28      2504.15         0.00
              1      2738.99      2051.82       562.56         0.00
              2      2218.48         0.00      2588.43         0.00
              3      2193.74      1852.61         0.00         0.00
         Socket  HT0 TX MB/s  HT1 TX MB/s  HT2 TX MB/s  HT3 TX MB/s
              0      3165.69       607.65      2475.84         0.00
              1      2753.18      2007.22       570.70         0.00
              2      2216.62         0.00      2577.83         0.00
              3      2208.27      1878.54         0.00         0.00
         Socket  HT0 TX MB/s  HT1 TX MB/s  HT2 TX MB/s  HT3 TX MB/s
              0      3175.89       572.34      2424.18         0.00
              1      2756.40      1988.03       578.05         0.00
              2      2191.69         0.00      2538.86         0.00
              3      2186.87      1848.26         0.00         0.00
    [...]
    

This shows traffic on socket 0, HT0 was over 3 Gbytes/sec. These
HyperTransport 1 links have a theoretical maximum of 4 Gbytes/sec, which we are
approaching. While they may not be at 100% utilization (for a 1 second
interval), we have multiple cores per socket trying to access a resource that
has reasonably high utilization - which will lead to stalling.
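
To pick the hottest link out of output like this, a small awk wrapper is enough. The sketch below is not part of amd64htcpu: the busiest_link name is made up, and the per-link ceiling is an assumption (by default it treats the 4 Gbytes/sec theoretical maximum as 4000 MB/s; pass a different value as the first argument if you count it differently).

```shell
#!/bin/sh
# busiest_link [LINK_MAX_MBYTES]
# Reads amd64htcpu-style output on stdin and reports the busiest HT link.
# LINK_MAX_MBYTES is an assumed per-link ceiling (default 4000 MB/s).
# (Hypothetical helper for illustration only.)
busiest_link() {
        awk -v max="${1:-4000}" '
                $1 == "Socket" { next }         # skip the header lines
                NF == 5 {                       # socket id plus four HT columns
                        for (p = 2; p <= NF; p++)
                                if ($p + 0 > best) {
                                        best = $p + 0; s = $1; port = p - 2
                                }
                }
                END {
                        printf "socket %d HT%d: %.2f MB/s (%.0f%% of %d MB/s)\n",
                            s, port, best, 100 * best / max, max
                }
        '
}

# First interval from the output above.
busiest_link <<'EOF'
     Socket  HT0 TX MB/s  HT1 TX MB/s  HT2 TX MB/s  HT3 TX MB/s
          0      3170.82       595.28      2504.15         0.00
          1      2738.99      2051.82       562.56         0.00
          2      2218.48         0.00      2588.43         0.00
          3      2193.74      1852.61         0.00         0.00
EOF
```

For this interval it reports socket 0, HT0 at 3170.82 MB/s, or 79% of the assumed ceiling. Since this is a one second average, shorter bursts within the interval may have been closer to saturation.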

After identifying memory I/O on HyperTransport 1 as a potential bottleneck, we were able
to improve the situation in a few ways:

  • zero-copy: reducing memory I/O that the kernel needs to do, in this
    case by not copying data if possible as it is passed through the kernel stack -
    instead managing references to it.
  • card-reshuffle: we shuffled the I/O cards so that less traffic needed to go
    via the CPU0 to CPU1 HT interconnect. Originally we had all the network cards
    on one I/O controller, and all the SAS HBA cards on the other - so any disk I/O
    served to the network would travel via that hot HT link.
  • thread-binding: certain threads (such as for nxge, the 10 GbE driver) were bound
    to specific CPUs, to increase the CPU L1/L2 cache hit rate, which would decrease the remote memory
    traffic, which would relieve HT pressure.

With these changes, our performance improved and the CPI was down to about
10. To go further, we needed HyperTransport 3.

HyperTransport 3

HT3 promised to triple the bandwidth, however when I first got a prototype
HT3 system I was disappointed to discover that the max NFSv3 throughput was the
same. It turned out that I had been sent
upgraded CPUs, but on an HT1 system. If anything, this further confirmed what I
had suspected - faster CPUs didn’t help throughput, we needed to upgrade the
HT.

When I did get an HT3 system, the performance was considerably better -
between 25% and 75%. Here is the HT link traffic on that system:

    topknot# ./amd64htcpu 1
         Socket  HT0 TX MB/s  HT1 TX MB/s  HT2 TX MB/s  HT3 TX MB/s
              0      5387.91       950.88      4593.76         0.00
              1      6432.35      5705.65      1189.07         0.00
              2      5612.83      3796.13      6312.00         0.00
              3      4821.45      4703.95      3124.07         0.00
         Socket  HT0 TX MB/s  HT1 TX MB/s  HT2 TX MB/s  HT3 TX MB/s
              0      5408.83       973.48      4611.83         0.00
              1      6443.95      5670.06      1166.64         0.00
              2      5625.43      3804.35      6312.05         0.00
              3      4737.19      4602.82      3060.74         0.00
         Socket  HT0 TX MB/s  HT1 TX MB/s  HT2 TX MB/s  HT3 TX MB/s
              0      5318.46       971.83      4544.80         0.00
              1      6301.69      5558.50      1097.00         0.00
              2      5433.77      3697.98      6110.82         0.00
              3      4581.89      4464.40      2977.55         0.00
    [...]
    

HT3 is sending more than 6 Gbytes/sec over some links. The CPI was down to
6, from 10. The difference to performance numbers was huge:

    Instead of 0.5 Gbytes/sec write to disk, we were now approaching 1.
    Instead of 1 Gbyte/sec NFS read from disk, we were now approaching 2.
    And instead of 2 Gbytes/sec NFS read from DRAM, we were now approaching 3.

Along with the CPU upgrade (which helps IOPS), and DRAM upgrade (helps
caching working sets), the 7410 hardware update was looking to be an incredible upgrade
to what was already a powerful system.

CPU PIC Analysis

If I’ve whetted your appetite for more CPU PIC analysis, on Solaris run “cpustat
-h” and fetch the document it refers to, which will contain a reference for the
CPU PICs for the platform you are on. The scripts I used above are really not
that complicated - they use shell and perl to wrap the output (as the man page
for cpustat even suggests!) Eg, the amd64cpi-kernel tool was:

    #!/usr/bin/sh
    #
    # amd64cpi-kernel - measure kernel CPI and Utilization on AMD64 processors.
    #
    # USAGE: amd64cpi-kernel [interval]
    #   eg,
    #        amd64cpi-kernel 0.1            # for 0.1 second intervals
    #
    # CPI is cycles per instruction, a metric that increases due to activity
    # such as main memory bus lookups.
    
    interval=${1:-1}        # default interval, 1 second
    
    set -- `kstat -p unix:0:system_misc:ncpus`              # assuming no psets,
    cpus=$2                                                 # number of CPUs
    
    pics='BU_cpu_clk_unhalted,sys0'                         # cycles
    pics=$pics,'FR_retired_x86_instr_w_excp_intr,sys1'      # instructions
    
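    /usr/sbin/cpustat -tc $pics $interval | perl -e '
            printf "%16s %16s %8s %8s\n", "Cycles", "Instructions", "CPI", "%CPU";
            while (<>) {
                    next if ++$lines == 1;          # skip the cpustat header line
                    @f = split;                     # explicit array; newer perls no longer split into @_
                    $total += $f[3];                # cycle count column added by -t
                    $cycles += $f[4];               # pic0: unhalted cycles
                    $instructions += $f[5];         # pic1: retired instructions
    
                    if ((($lines - 1) % '$cpus') == 0) {    # one line per CPU seen
                            printf "%16u %16u %8.2f %8.2f\n", $cycles,
                                $instructions,
                                $instructions ? $cycles / $instructions : 0,
                                $total ? 100 * $cycles / $total : 0;
                            $total = 0;
                            $cycles = 0;
                            $instructions = 0;
                    }
            }
    '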

A gotcha for this one is the “sys” modifier on the PIC definitions: it makes these PICs record activity during both user-code and kernel-code, not just user-code.

My 7410 Results

I’ve previously posted many numbers
covering 7410 performance, although I had yet to collect the full set. I
was missing iSCSI, FTP, HTTP and many others. This hardware upgrade changes
everything - all my previous numbers are now out of date. The numbers
for the new 7410 are so far between 25% and 75% better than what I had posted
previously!

Performance testing is like painting the Golden
Gate Bridge: once you reach the end you must immediately begin at the start
again. In our case, there are so many software and hardware upgrades that
once you approach completing perf testing, the earlier numbers are out of date.
The OSSG group (who gathered the numbers at the start of this post)
are starting to help out so that we can test and share numbers more
quickly.

I’ve created a new column of numbers on my summary post,
and I’ll fill out the new numbers as I get them.

Conclusion

For this 7410 upgrade, the extra CPU cores help - but it’s more about the
upgrade to the HyperTransport. HT3 provides 3x the CPU interconnect
bandwidth, and dramatically improves the delivered performance of the 7410:
by 25% to 75%. The 7410 was already a powerful server; it has now raised the
bar even higher.

Source: http://blogs.sun.com/brendan/entry/7410_hardware_update_and_analyzing
