Monday, January 12, 2015

Amazon Web Services C4 Haswell 36 core compute instances are ready

    Back in Nov 2014, the announcement of the new Haswell chips designed and built by Intel specifically for AWS EC2 was very exciting.  Today the 36 thread (18 physical cores) Xeon E5-2666 v3 processor @ 2.9 GHz (an improvement over the 32 core Ivy Bridge predecessor) arrived.  Where else can you test an $18000 computer by renting it for $1.80/hr?

    I put the new EC2 instance, the top of the line c4-8xlarge, through a concurrency benchmark written in Java that is designed to test various levels of multithreading load by computing the Collatz series.  Disk, network and memory throughput were not tested - the benchmark is purely integer computation with no optimization for the new Haswell vector instructions.  As a comparison I ran the same parallel processing benchmark on the g2-2xlarge, r3-8xlarge and older c3-8xlarge instances.
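For readers unfamiliar with the workload, here is a minimal sketch of the kind of integer-only kernel such a benchmark might run. It is not the actual benchmark code: the class name `CollatzSketch` and the methods `steps` and `maxSteps` are hypothetical, chosen only for illustration.

```java
// Hypothetical sketch; names are illustrative and not taken from the real benchmark.
final class CollatzSketch {

    /** Counts the steps n takes to reach 1 (halve if even, 3n+1 if odd). */
    static int steps(long n) {
        if (n < 1) {
            throw new IllegalArgumentException("n must be >= 1");
        }
        int count = 0;
        while (n != 1) {
            // The exact-arithmetic helpers throw instead of silently overflowing a long.
            n = (n & 1) == 0 ? n >> 1 : Math.addExact(Math.multiplyExact(3L, n), 1L);
            count++;
        }
        return count;
    }

    /** Longest chain in [from, to): one "work packet" of pure integer work. */
    static int maxSteps(long from, long to) {
        int best = 0;
        for (long n = from; n < to; n++) {
            best = Math.max(best, steps(n));
        }
        return best;
    }

    public static void main(String[] args) {
        System.out.println("steps(27) = " + steps(27));
        System.out.println("max steps below 1000 = " + maxSteps(1, 1000));
    }
}
```

Nothing here touches disk or network, and memory use is a handful of locals, which is why a test of this shape mostly measures raw core throughput.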

I see around an 11% performance increase using the C4 over the C3.

Time performance for 64 threads running various levels of 2^22 to 1 concurrent work packets - in ms where a lower number is better.
This is the same diagram with the lower concurrency levels of 1 to 2 removed.
As you can see, the blue line for the C4 shows the performance increase we would expect from going from 32 to 36 cores.

Notice that the 3 upper lines are from 2 physical machine processors and the GPU VM - here the 4960 on a MacBook Pro, the 3610 on an Asus ROG and the g2-2xlarge cloud instance.  The 3 varying lines at the bottom of the graph for the r3, c3 and c4 instances are virtual cores that are subject to burst and adjacent loads.

The following is the concurrency map for various levels of threading from 1 to 64 across various levels of concurrent work packets.  As you can see the processor really shines when we feed it over 1k work items distributed over at least 64 threads.
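A sweep like the one behind this map could be sketched as follows. This is an assumption about the general shape of the test, not the author's code: `SweepSketch`, `run`, the 2^20 search limit and the packet counts are all hypothetical and scaled down so the loop finishes quickly.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

// Hypothetical sketch of a threads x work-packets sweep; not the original benchmark.
final class SweepSketch {

    static int steps(long n) {
        int count = 0;
        while (n != 1) {
            n = (n & 1) == 0 ? n >> 1 : 3 * n + 1;
            count++;
        }
        return count;
    }

    /** Splits [1, limit] into `packets` ranges, runs them on `threads` workers, returns the max step count. */
    static int run(long limit, int packets, int threads) {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        try {
            List<Callable<Integer>> tasks = new ArrayList<>();
            long chunk = (limit + packets - 1) / packets; // ceiling division
            for (long start = 1; start <= limit; start += chunk) {
                final long from = start;
                final long to = Math.min(limit, start + chunk - 1);
                tasks.add(() -> {
                    int best = 0;
                    for (long n = from; n <= to; n++) {
                        best = Math.max(best, steps(n));
                    }
                    return best;
                });
            }
            int best = 0;
            for (Future<Integer> f : pool.invokeAll(tasks)) {
                best = Math.max(best, f.get());
            }
            return best;
        } catch (InterruptedException | ExecutionException e) {
            throw new IllegalStateException(e);
        } finally {
            pool.shutdown();
        }
    }

    public static void main(String[] args) {
        long limit = 1L << 20; // assumed size, far smaller than a real run
        for (int threads = 1; threads <= 64; threads *= 2) {
            for (int packets = 1; packets <= 1024; packets *= 4) {
                long t0 = System.nanoTime();
                int best = run(limit, packets, threads);
                long ms = (System.nanoTime() - t0) / 1_000_000;
                System.out.printf("threads=%d packets=%d maxSteps=%d time=%d ms%n",
                        threads, packets, best, ms);
            }
        }
    }
}
```

With a single packet only one worker has anything to do regardless of pool size, which is one plausible reason the low-concurrency end of the map looks flat.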

Note: the bumps in the graph occur when the OS schedules other work, which may affect the test by 2-5%.  More variance comes from the virtualization layer, where burst mode may occur on or beside our processor slice.
The performance increase looks good - even though we only moved from 2.8 to 2.9 GHz, we are using a more efficient processor architecture on top of getting 4 more cores.  The single core performance of the C4 is not very good (just like the R3, C3 and G2 instances), as expected, but it excels in heavy concurrent workloads.
What is particularly exciting is the 3400% processor usage in the "top" command below.

The concurrency test varies the thread pool from 1 to 1024 threads in powers of 2.  The optimum number of threads is usually around 2 times the number of threaded processors - in this case 64.
Cpu(s): 39.2%us, 56.9%sy,  0.0%ni,  4.0%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
Mem:  61847440k total,  6289296k used, 55558144k free,    23156k buffers
Swap:        0k total,        0k used,        0k free,   424720k cached
  PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND
55371 ec2-user  20   0 34.1g 5.0g  10m S 3427.6  8.6 118:44.18 java
- I forgot to convert ms to sec - sorry
- This benchmark is a simple fork-join MapReduce test that may be affected 2-5% by other OS processes running in the background - hey, it has been less than 24 hours since the processor went live.
- No really useful work was done during the actual benchmark tests - no Collatz number was found past 2^60 yet!
- Note the fastest time in this run was 130 sec for the C4, but I have seen a performance spike even on the R3 instances at 105 sec.
- Use spot instances to cut the cost by up to 85%.
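
The fork-join MapReduce structure mentioned in the notes above might look roughly like the sketch below. It is a guess at the general pattern using the standard `ForkJoinPool` and `RecursiveTask` classes, not the benchmark itself; `ForkJoinSketch`, its `THRESHOLD` cutoff and the `maxSteps` helper are hypothetical names.

```java
import java.util.concurrent.ForkJoinPool;
import java.util.concurrent.RecursiveTask;

// Hypothetical fork-join sketch: "map" = scan a small range, "reduce" = take the max.
final class ForkJoinSketch extends RecursiveTask<Integer> {
    private static final long THRESHOLD = 1_000; // assumed cutoff for sequential work
    private final long from; // inclusive
    private final long to;   // exclusive

    ForkJoinSketch(long from, long to) {
        this.from = from;
        this.to = to;
    }

    static int steps(long n) {
        int count = 0;
        while (n != 1) {
            n = (n & 1) == 0 ? n >> 1 : 3 * n + 1;
            count++;
        }
        return count;
    }

    @Override
    protected Integer compute() {
        if (to - from <= THRESHOLD) {
            // Map phase: small enough to scan sequentially.
            int best = 0;
            for (long n = from; n < to; n++) {
                best = Math.max(best, steps(n));
            }
            return best;
        }
        long mid = from + (to - from) / 2;
        ForkJoinSketch left = new ForkJoinSketch(from, mid);
        ForkJoinSketch right = new ForkJoinSketch(mid, to);
        left.fork();                  // run the left half asynchronously
        int r = right.compute();      // work on the right half in this thread
        int l = left.join();          // wait for the left half
        return Math.max(l, r);        // reduce phase
    }

    /** Runs the search over [from, to) on a pool with the given parallelism. */
    static int maxSteps(long from, long to, int parallelism) {
        ForkJoinPool pool = new ForkJoinPool(parallelism);
        try {
            return pool.invoke(new ForkJoinSketch(from, to));
        } finally {
            pool.shutdown();
        }
    }

    public static void main(String[] args) {
        int parallelism = Runtime.getRuntime().availableProcessors();
        System.out.println("max steps below 1,000,000 = " + maxSteps(1, 1_000_000, parallelism));
    }
}
```

Work stealing in the pool keeps idle workers busy with the halves that other workers have forked, so the pool stays loaded even when some ranges take longer than others.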
