Wednesday, November 21, 2012

Java 7 SE Raspberry PI Parallel Processing ARM Cluster of 32 boards

Raspberry PI cluster of 32 boards - Under Construction:

Disclaimer: For serious computational power - build your own i7-5820 and use CUDA on an nVidia GTX-970.  The raspberry PI cluster is more of a "build it - and they will come" exploration exercise.

This article is an ongoing discussion of how to get a cluster of 32 raspberry pi boards up and running.

32 node/board Raspberry PI cluster for parallel processing experimentation

The ARM-based Raspberry PI board is an excellent platform for investigating various parallel processing configurations.  If we were looking for pure performance I would stick with an Intel Core i7 and a CUDA-based NVidia GPU, because a single-core Raspberry PI is about 40 times slower than a single core of a 2nd-gen i7-2600 or 3rd-gen i7-3610 (and about 140 times slower than an 8-thread ForkJoin implementation).  Raw performance, however, is not our goal: we need an efficient and accessible way to run multiple servers, and the PI does this at about $70 per server (board + connectors + 16GB SD) and 4 watts/node.  For example, it would take $9800 of Raspberry PI boards (140 boards with 70G of RAM in total) to equal one 3rd-gen i7 at $1300 with 24G of RAM.  But we can build a cluster of 8 Raspberry PI servers for $560, as opposed to 8 i7 boxes for about $10000.
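The cost comparison above is simple arithmetic, and a minimal Java sketch makes the inputs explicit. The class and method names are hypothetical; the $70 per node and the 140x factor come from the text, and the 0.5GB per board is an assumption based on the 512MB Rev B board.

```java
// Hypothetical helper that reproduces the cost arithmetic from the paragraph above.
class ClusterCost {
    static final int COST_PER_PI_USD = 70;    // board + connectors + 16GB SD (from the text)
    static final double RAM_PER_PI_GB = 0.5;  // assumption: 512MB Rev B board
    static final int BOARDS_PER_I7 = 140;     // one PI vs. an 8-thread ForkJoin i7 run (from the text)

    /** Cost in USD of a cluster of n Raspberry PI servers. */
    static int piCost(int n) {
        return n * COST_PER_PI_USD;
    }

    /** Total RAM in GB across n boards. */
    static double piRam(int n) {
        return n * RAM_PER_PI_GB;
    }

    public static void main(String[] args) {
        System.out.println("Boards to match one i7: " + BOARDS_PER_I7);
        System.out.println("Cost of those boards:   $" + piCost(BOARDS_PER_I7));
        System.out.println("RAM across the boards:  " + piRam(BOARDS_PER_I7) + "G");
        System.out.println("Cost of 8 PI servers:   $" + piCost(8));
    }
}
```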

We need the proper power supplies and network switches for a cluster of 32 Raspberry PI boards.
Raspberry PI boards mount very nicely on standard breadboards using properly bent Arduino headers.

Raspberry PI Cluster
8 board Raspberry PI cluster
In this configuration I am running a research cluster of 8 raspberry PI boards to run distributed Java EE RMI/EJB remote session bean clients of a central Oracle WebLogic 12c server (running on an i7 host)

This tutorial details how to get a networked cluster (a bramble) of Raspberry PI boards (eight for now) running as a single distributed auxiliary processing unit for a controlling Java EE server - ideally using Hadoop.  The primary goal of this exercise is distributed experimentation.  As I acquire and configure more Raspberry PI boards and work out power distribution issues, the cluster will grow.  I currently work with 8 boards and 8 spares.  Work can be distributed across the cluster using a custom RPC API, such as remote stateless session beans on top of RMI, or using a formal MapReduce implementation like Hadoop, or even MPI.

After running the Oracle embedded ARM JVM with no problems on the Rev A Raspberry PI using the distribution from Element14, I was not immediately successful running the JVM on the new Rev B (512MB) version, because the default Debian distribution from Element14 no longer uses soft float.  I get the following missing library error.

pi@raspberrypi ~/java/ejre1.7.0_06 $ java -version
java: error while loading shared libraries: cannot open shared object file: No such file or directory

Download a new OS compatible with the Java 7 JDK here

Choose "Soft-float Debian “wheezy”"

Write it to your SD card

Reinstall Java (currently 1.7.0_10 from Oracle)

You are good to go [Java 7 SE on the Raspberry PI Rev B].  So we can now use Fork-Join, JAXB and JAX-WS webservices (using a single thread however).
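As an illustration of the Fork-Join support mentioned above, here is a minimal sketch that finds the longest Collatz (hailstone) chain below a limit. It is not the code used for the benchmarks in this article; the class name, the threshold of 1024 and the limit of 16384 are assumptions chosen for the example. It uses only Java 7 APIs (no lambdas).

```java
import java.util.concurrent.ForkJoinPool;
import java.util.concurrent.RecursiveTask;

// Hypothetical example: longest Collatz chain in [start, end) using Fork-Join.
class CollatzMax extends RecursiveTask<Long> {
    private static final long THRESHOLD = 1024; // assumption: below this, run sequentially
    private final long start;
    private final long end; // exclusive

    CollatzMax(long start, long end) {
        this.start = start;
        this.end = end;
    }

    /** Number of steps for n to reach 1 using 64-bit integer math. */
    static long steps(long n) {
        long count = 0;
        while (n != 1) {
            n = ((n & 1) == 0) ? n >> 1 : 3 * n + 1;
            count++;
        }
        return count;
    }

    @Override
    protected Long compute() {
        if (end - start <= THRESHOLD) {
            long best = 0;
            for (long n = start; n < end; n++) {
                best = Math.max(best, steps(n));
            }
            return best;
        }
        long mid = start + (end - start) / 2;
        CollatzMax left = new CollatzMax(start, mid);
        CollatzMax right = new CollatzMax(mid, end);
        left.fork();                     // run the left half asynchronously
        long rightBest = right.compute(); // compute the right half in this thread
        return Math.max(left.join(), rightBest);
    }

    /** Longest chain for starting values 1..limit-1 with the given parallelism. */
    static long longestChain(long limit, int parallelism) {
        ForkJoinPool pool = new ForkJoinPool(parallelism);
        try {
            return pool.invoke(new CollatzMax(1, limit));
        } finally {
            pool.shutdown();
        }
    }

    public static void main(String[] args) {
        int cores = Runtime.getRuntime().availableProcessors();
        long startTime = System.currentTimeMillis();
        long best = longestChain(16384, cores);
        long elapsed = System.currentTimeMillis() - startTime;
        System.out.println(elapsed + " ms, longest chain = " + best + " on " + cores + " core(s)");
    }
}
```

On a single-core board the pool has only one worker, so forking adds overhead without any speedup; the same class should scale on a multi-core host.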

DI 1: Powering your Raspberry PI cluster

On some routers the boards will not get a DHCP-assigned address if all the clustered Raspberry PI boards are powered up at the same time.  This only occurs when the supply cannot deliver enough current, in which case you will need to stagger the power-up.

Using a good Agilent power supply, 8 boards draw from 3A (idle) to 3.5A (startup), which is 15-17.5W at 5V.

You can use a powered USB hub for 4 boards, but 8 will require a better power supply, such as a bench supply from Agilent. Bench supplies range up to 40A but are normally rated at 5A, which is good for up to 12 Raspberry PIs running at 100% CPU. A cluster of 32 Raspberry PI boards will need more than that.
A good ATX power supply will suffice to power a cluster of raspberry pi boards.  In this example I have a 450W supply which supplies 30A of 5V power (make sure you put some load on the 5V and 12V rails as well).

Get the ATX adapter and breakout board from SparkFun and make sure you use multiple 24-gauge or thicker wires to distribute the load (1 wire will overheat, 2 wires reach 28 deg C - use at least 4 if you go over 8 boards).

As you can see, I have yet to fully integrate the power supply interface between the ATX supply and the breadboard bus for the 8 PIs - but we are functioning fine and are no longer limited by the bench supplies or individual 5V USB connectors.  (The blue LED boards are Parallax Propeller 8-core microcontrollers, which use a per-core output indicator for now.)
I need some sort of protection fuse in case of a short circuit.  It was very stressful connecting my 8 Raspberry PI boards to the ATX supply after testing on one.  I recommend working with all GPIO pins covered by a flat cable connector (only 1 header is populated instead of 2 on the latest Rev B board).

20130126:  I now have 24 of 32 Raspberry PI boards powered up.  However, running the full peak 15A off an ATX power supply is not practical: it requires some serious wire gauge, and my 3-wire 24-gauge setup is overheating.  Also, if you accidentally short the power supply it will deliver the full 15-40A and burn your wire.  I accidentally shorted the leads of a 5A supply on my metal breadboard and the supply wire started to smell and melt.  This brings us to the recommended way to power a large cluster of Raspberry PI boards - separate bench power supplies.  When I shorted the supply, the bench supply held steady at 5.2A, which is safe enough not to burn your house down before you notice it.


Recommended power supply setup for 32 raspberry pi board cluster

Limiting each 5A power supply to no more than 8 Raspberry PI boards leaves headroom for some peripherals, like an Adafruit display or a Propeller 8-core coprocessor on an SPI bus.
This is kind of expensive, but instead of using a 40A bench supply at around $350 I use 4 separate 5A bench supplies (3 Circuit-Test PSC-520 supplies @ 3 x $225 and 1 Agilent U8002A supply @ $450).
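The power budget behind this recommendation can be sketched in a few lines of Java. The names are hypothetical; the 3.5A startup draw for 8 boards and the limit of 8 boards per 5A supply come from the measurements above, and the per-board figure is simply that draw divided by 8.

```java
// Hypothetical power budget calculator based on the measurements in this section.
class PowerBudget {
    static final double STARTUP_AMPS_FOR_8_BOARDS = 3.5; // measured at 5V (from the text)
    static final int BOARDS_PER_SUPPLY = 8;              // recommended limit per 5A supply

    /** Approximate peak current per board, derived from the 8-board measurement. */
    static double ampsPerBoard() {
        return STARTUP_AMPS_FOR_8_BOARDS / 8;
    }

    /** Number of supplies needed, rounding up so no supply exceeds the per-supply limit. */
    static int suppliesNeeded(int boards, int boardsPerSupply) {
        return (boards + boardsPerSupply - 1) / boardsPerSupply;
    }

    /** Current left over on one supply for peripherals such as displays. */
    static double headroomAmps(double supplyAmps, int boards) {
        return supplyAmps - boards * ampsPerBoard();
    }

    public static void main(String[] args) {
        System.out.println("Amps per board:        " + ampsPerBoard());
        System.out.println("Supplies for 32:       " + suppliesNeeded(32, BOARDS_PER_SUPPLY));
        System.out.println("Headroom on 5A supply: " + headroomAmps(5.0, BOARDS_PER_SUPPLY) + "A");
    }
}
```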

DI 2: Updating your board for 512MB RAM (470MB, up from 224MB)

The Rev 2 board has double the ram but will require updated firmware to enable it.
sudo wget -O /usr/bin/rpi-update && sudo chmod +x /usr/bin/rpi-update
sudo apt-get install git-core
sudo rpi-update
- reboot after firmware update

DI 3: Overclocking

The LAN chip heats up to 52 degrees Celsius from a normal 45 when the Raspberry PI is overclocked from 700 to 800 MHz.

DI 4: Setup Networking

Wireless is kind of unreliable; I recommend wired.
The WiPi module from Element 14 works essentially out of the box.

Wired networking

After duplicating all the 32 SD cards, put one at a time into one of the raspberry pi boards and change the hostname, hosts and static network interfaces settings

sudo nano /etc/hostname

sudo nano /etc/hosts

sudo nano /etc/network/interfaces
iface eth0 inet static

# here we do not rely on our internet provider's DNS servers - we use the Google server as a more reliable DNS server

sudo nano /etc/resolv.conf
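
Editing 32 cards by hand is error prone, so the per-node settings can be generated instead. The following is a minimal sketch, not part of the original setup: the class name, the 192.168.0.x subnet, the starting address of .100 and the gateway at .1 are all assumptions to adapt to your own network. The rpiN hostnames follow the naming visible in the MPI output further down.

```java
// Hypothetical generator for per-node hostname, /etc/hosts and /etc/network/interfaces content.
class NodeConfig {
    static final String SUBNET_PREFIX = "192.168.0."; // assumption: adjust to your LAN
    static final int FIRST_HOST_OCTET = 100;          // assumption: rpi0 gets .100

    /** Hostname for a node, e.g. rpi0, rpi1, ... */
    static String hostname(int node) {
        return "rpi" + node;
    }

    /** Static IPv4 address for a node. */
    static String address(int node) {
        return SUBNET_PREFIX + (FIRST_HOST_OCTET + node);
    }

    /** Contents of /etc/hosts listing every node in the cluster. */
    static String hostsFile(int nodes) {
        StringBuilder sb = new StringBuilder("127.0.0.1\tlocalhost\n");
        for (int i = 0; i < nodes; i++) {
            sb.append(address(i)).append('\t').append(hostname(i)).append('\n');
        }
        return sb.toString();
    }

    /** Static eth0 stanza for /etc/network/interfaces on one node. */
    static String interfacesStanza(int node) {
        return "auto eth0\n"
             + "iface eth0 inet static\n"
             + "    address " + address(node) + "\n"
             + "    netmask 255.255.255.0\n"
             + "    gateway " + SUBNET_PREFIX + "1\n"; // assumption: router at .1
    }

    public static void main(String[] args) {
        int node = args.length > 0 ? Integer.parseInt(args[0]) : 0;
        System.out.println("# /etc/hostname\n" + hostname(node) + "\n");
        System.out.println("# /etc/hosts\n" + hostsFile(32));
        System.out.println("# /etc/network/interfaces\n" + interfacesStanza(node));
    }
}
```

Run it once per card with the node number as the argument and paste each section into the matching file.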

DI 5: Setup Java

Setup Tomcat
Log in to the manager app using "system:raspberry"

Setup Fortran and MPICH

The issue is that performance drops (likely due to network overhead) as I increase the number of nodes (currently 6 PIs).

pi@rpi0 ~ $ mpiexec -f machinefile -n 1 ~/mpich_build/examples/cpi
Process 0 of 1 is on rpi0
pi is approximately 3.1415926544231341, Error is 0.0000000008333410
wall clock time = 0.017286

pi@rpi0 ~ $ mpiexec -f machinefile -n 2 ~/mpich_build/examples/cpi
Process 0 of 2 is on rpi0
Process 1 of 2 is on rpi1
pi is approximately 3.1415926544231318, Error is 0.0000000008333387
wall clock time = 0.020435

pi@rpi0 ~ $ mpiexec -f machinefile -n 4 ~/mpich_build/examples/cpi
Process 1 of 4 is on rpi1
Process 0 of 4 is on rpi0
Process 2 of 4 is on rpi2
Process 3 of 4 is on rpi3
pi is approximately 3.1415926544231239, Error is 0.0000000008333307
wall clock time = 0.037727

pi@rpi0 ~ $ mpiexec -f machinefile -n 6 ~/mpich_build/examples/cpi
Process 2 of 6 is on rpi0
Process 1 of 6 is on rpi1
Process 0 of 6 is on rpi2
Process 3 of 6 is on rpi3
Process 4 of 6 is on rpi4
Process 5 of 6 is on rpi5
pi is approximately 3.1415926544231239, Error is 0.0000000008333307
wall clock time = 0.043331
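
To put numbers on the slowdown, the wall clock times above can be turned into speedup (T1/Tn) and parallel efficiency (speedup/n). This is a minimal sketch with hypothetical names; the timings are copied from the four runs above. A speedup below 1 means the run got slower, which is what one would expect when a job this small is dominated by communication.

```java
// Hypothetical helper: speedup and efficiency for the cpi runs listed above.
class MpiScaling {
    static final int[] PROCESSES = {1, 2, 4, 6};
    static final double[] WALL_CLOCK_SECONDS = {0.017286, 0.020435, 0.037727, 0.043331};

    /** Speedup of an n-process run relative to the 1-process run. */
    static double speedup(double t1, double tn) {
        return t1 / tn;
    }

    /** Parallel efficiency: speedup divided by the number of processes. */
    static double efficiency(double t1, double tn, int n) {
        return speedup(t1, tn) / n;
    }

    public static void main(String[] args) {
        double t1 = WALL_CLOCK_SECONDS[0];
        for (int i = 0; i < PROCESSES.length; i++) {
            double tn = WALL_CLOCK_SECONDS[i];
            System.out.println(String.format("n=%d time=%.6f speedup=%.3f efficiency=%.3f",
                    PROCESSES[i], tn, speedup(t1, tn), efficiency(t1, tn, PROCESSES[i])));
        }
    }
}
```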

20121121: Setup 4 networked PIs
20130127: power up of 24 raspberry pi boards


32 Raspberry PI boards from Element 14 @ $35 = $1120
32 Sandisk Ultra 16GB SD cards @ $10-18 = $320-576
0 micro USB cables = $0
1 HDMI cable from Apple = $20
8 power supply cables @ 10 = $80
4 bench 5A power supplies from Agilent or Circuit-Test @224-450 = $896-1800
4 large breadboards (that fit 8 raspberry pi boards) @ 45 = $180
64 bendable arduino headers from @ $1 = $64
5 Gigabit 8 node network hubs or 2 16 node hubs from Dlink @ 65 = $325
32 belkin flexible network cables from the Apple store @ 15 = $480

Total = $3885.00
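
Because several line items are given as price ranges, the list can also be summed as a low and a high estimate. The sketch below uses hypothetical names, with quantities and unit prices copied from the list above; the quoted total falls between the two results, presumably reflecting the prices actually paid.

```java
// Hypothetical bill-of-materials calculator using the quantities and prices listed above.
class BillOfMaterials {
    // Each row: {quantity, low unit price, high unit price} in USD.
    static final int[][] ITEMS = {
        {32, 35, 35},   // Raspberry PI boards
        {32, 10, 18},   // 16GB SD cards
        {0, 0, 0},      // micro USB cables
        {1, 20, 20},    // HDMI cable
        {8, 10, 10},    // power supply cables
        {4, 224, 450},  // 5A bench power supplies
        {4, 45, 45},    // large breadboards
        {64, 1, 1},     // bendable Arduino headers
        {5, 65, 65},    // 8-port network hubs
        {32, 15, 15},   // network cables
    };

    /** Sum of quantity times unit price, using the high or low end of each range. */
    static int total(boolean useHighPrices) {
        int sum = 0;
        for (int[] item : ITEMS) {
            sum += item[0] * (useHighPrices ? item[2] : item[1]);
        }
        return sum;
    }

    public static void main(String[] args) {
        System.out.println("Low estimate:  $" + total(false));
        System.out.println("High estimate: $" + total(true));
    }
}
```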




st said...


Thanks for posting this !

In the initial lines, you mentioned a Raspberry PI single core vs an i7 core. I would like to know how an i7 is 40 times faster than the PI. Is it based on just clock speed, or did you do the 'flops' calculation?


Michael O'Brien said...

Thank you for the question. I ran my Collatz (hailstone numbers) sequence generation code on the Java 7 JVM on several machines (i7-920, i7-2600, i7-3610) and several overclocked 800MHz PIs. The results were that a 2700MHz single-threaded i7 is 40 times faster than an 800MHz Raspberry PI. Some of this is because the PI runs a 3.4 times slower clock, which would mean the PI is around 12 times slower than a hypothetical 800MHz i7 core. Since the code only uses 64-bit integer math (division by 2 and multiplication by 3 are accomplished by shifting) and runs in memory with minimal console logging, shared resources do not slow down concurrency. When I ran a ForkJoin version of the code with between 2 and 32 parallel forks I saw no performance improvement on the Raspberry PI - as expected, since forking only adds overhead on a single-core machine. However, on the i7 with 8 virtual and 4 physical cores we see just over 4 times the performance, with around 5-10% increased speed due to HT of the 4 virtual cores. This is how I arrived at the roughly 140x performance difference. This would be with a fork of only 4, which puts the i7 at only around 75% capacity - you would need about 128 threads to get close to 98%, or 160x performance. Therefore with a 32-node Raspberry PI Beowulf grid you can ...

see my post on the performance graph

I will also add the raspberry PI results to my distributed page at

Michael O'Brien said...

Actually I forgot about my older post - you can see the baseline numbers - 140x faster using 99% CPU on an i7 3610 means about 32x when divided by 4.4 (4 HT + 4 real cores)

Raspberry PI
4360612 ms for: 16384

i7-3610
31186 ms for: 16384

Unknown said...

I know it's quite late to comment on this post, but why not install a hard-float OS and use OpenJDK? Even if Hadoop doesn't use floating-point calculation, I bet that would make the OS work faster.

Unknown said...

Nice blog. Thanks for sharing this information.

