I followed the directions at https://medium.com/@bossjones/how-i-setup-a-raspberry-pi-3-cluster-using-the-new-docker-swarm-mode-in-29-minutes-aa0e4f3b1768#.ma06iyonf but tweaked them a bit.
First off, I wanted to have my cluster using eth0 to connect to my laptop and then share its WiFi connection. Using this technique means that my WiFi network name and password are not on the cluster. So the cluster should be able to plug into any laptop or server without changes. Follow instructions at https://t.co/2jRbNAOiCU to share your eth0 connection.
Use lsblk to find anything mounted from the SD cards you'll be using, then umount those mount points. See http://affy.blogspot.com/2016/06/how-did-i-prepare-my-picocluster-for.html for a bit of information about lsblk.
Now flash the SD cards using the flash tool from hypriot. Notice that *no* network information is provided.
I used the piX naming convention so that I can easily loop over all five RPIs in the PicoCluster.
flash --hostname pi1 --device /dev/mmcblk0 https://github.com/hypriot/image-builder-rpi/releases/download/v0.8.1/hypriotos-rpi-v0.8.1.img.zip
flash --hostname pi2 --device /dev/mmcblk0 https://github.com/hypriot/image-builder-rpi/releases/download/v0.8.1/hypriotos-rpi-v0.8.1.img.zip
flash --hostname pi3 --device /dev/mmcblk0 https://github.com/hypriot/image-builder-rpi/releases/download/v0.8.1/hypriotos-rpi-v0.8.1.img.zip
flash --hostname pi4 --device /dev/mmcblk0 https://github.com/hypriot/image-builder-rpi/releases/download/v0.8.1/hypriotos-rpi-v0.8.1.img.zip
flash --hostname pi5 --device /dev/mmcblk0 https://github.com/hypriot/image-builder-rpi/releases/download/v0.8.1/hypriotos-rpi-v0.8.1.img.zip
Using this function, you can find the IP addresses for the RPI.
function getip() { (traceroute $1 2>&1 | head -n 1 | cut -d\( -f 2 | cut -d\) -f 1) }
List the IP addresses.
for i in `seq 1 5`; do echo "HOST: pi$i IP: $(getip pi$i.local)"; done
Remove any fingerprints for the RPI.
for i in `seq 1 5`; do ssh-keygen -R pi${i}.local 2>/dev/null; done
Copy your PKI identity to the RPI.
for i in `seq 1 5`; do ssh-copy-id -oStrictHostKeyChecking=no -oCheckHostIP=no pirate@pi${i}.local; done
Download the deb file for Docker v1.12
curl -O https://jenkins.hypriot.com/job/armhf-docker/17/artifact/bundles/latest/build-deb/raspbian-jessie/docker-engine_1.12.0%7Erc4-0%7Ejessie_armhf.deb
Copy the deb file to the RPI
for i in `seq 1 5`; do scp -oStrictHostKeyChecking=no -oCheckHostIP=no docker-engine_1.12.0%7Erc4-0%7Ejessie_armhf.deb pirate@pi$i.local:.; done
Remove older Docker version from the RPI
for i in `seq 1 5`; do ssh -oStrictHostKeyChecking=no -oCheckHostIP=no pirate@pi$i.local sudo apt-get purge -y docker-hypriot; done
Install Docker
for i in `seq 1 5`; do ssh -oStrictHostKeyChecking=no -oCheckHostIP=no pirate@pi$i.local sudo dpkg -i docker-engine_1.12.0%7Erc4-0%7Ejessie_armhf.deb; done
Initialize the Swarm
ssh -oStrictHostKeyChecking=no -oCheckHostIP=no pirate@pi1.local docker swarm init
Join slaves to Swarm - replace the join command below with the specific one displayed by the init command.
for i in `seq 2 5`; do
ssh -oStrictHostKeyChecking=no -oCheckHostIP=no pirate@pi$i.local docker swarm join --secret ceuok9jso0klube8m3ih9gcsv --ca-hash sha256:f0864eb57963e3f9cd1756e691d0b609903e3a0bb48785272ea53155809025ee 10.42.0.49:2377;
done
Exercise the Swarm
ssh -oStrictHostKeyChecking=no -oCheckHostIP=no pirate@pi1.local
docker service create --name ping hypriot/rpi-alpine-scratch ping 8.8.8.8
docker service tasks ping
docker service update --replicas 10 ping
docker service tasks ping
docker service rm ping
I've read several blog posts about people running Apache Spark on a Raspberry PI. It didn't seem too hard, so I thought I'd have a go at it. But the results were disappointing. Bear in mind that I am a Spark novice, so some setting is probably wrong. I ran into two issues - memory and heartbeats.
So, this is what I did.
I based my work on these pages:
* https://darrenjw2.wordpress.com/2015/04/17/installing-apache-spark-on-a-raspberry-pi-2/
* https://darrenjw2.wordpress.com/2015/04/18/setting-up-a-standalone-apache-spark-cluster-of-raspberry-pi-2/
* http://www.openkb.info/2014/11/memory-settings-for-spark-standalone_27.html
I created five SD cards according to my previous blog post (see http://affy.blogspot.com/2016/06/how-did-i-prepare-my-picocluster-for.html).
Installation of Apache Spark
* install Oracle Java and Python
for i in `seq 1 5`; do (ssh -oStrictHostKeyChecking=no -oCheckHostIP=no pirate@pi0${i}.local sudo apt-get install -y oracle-java8-jdk python2.7 &); done
* download Spark
wget http://d3kbcqa49mib13.cloudfront.net/spark-1.6.2-bin-hadoop2.6.tgz
* Copy Spark to all RPI
for i in `seq 1 5`; do (scp -q -oStrictHostKeyChecking=no -oCheckHostIP=no spark-1.6.2-bin-hadoop2.6.tgz pirate@pi0${i}.local:. && echo "Copy complete to pi0${i}" &); done
* Uncompress Spark
for i in `seq 1 5`; do (ssh -oStrictHostKeyChecking=no -oCheckHostIP=no pirate@pi0${i}.local tar xfz spark-1.6.2-bin-hadoop2.6.tgz && echo "Uncompress complete to pi0${i}" &); done
* Remove tgz file
for i in `seq 1 5`; do (ssh -oStrictHostKeyChecking=no -oCheckHostIP=no pirate@pi0${i}.local rm spark-1.6.2-bin-hadoop2.6.tgz); done
* Add the following to your .bashrc file on each RPI. I can't figure out how to put this into a loop, but one possible approach is sketched below.
export SPARK_LOCAL_IP="$(ip route get 1 | awk '{print $NF;exit}')"
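One possible way to script the append (an untested sketch): a quoted heredoc keeps the command substitution from expanding on the laptop, so each RPI receives the literal line.
for i in `seq 1 5`; do
  ssh -oStrictHostKeyChecking=no -oCheckHostIP=no pirate@pi0${i}.local 'cat >> ~/.bashrc' <<'EOF'
export SPARK_LOCAL_IP="$(ip route get 1 | awk '{print $NF;exit}')"
EOF
done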
* Run Standalone Spark Shell
ssh -oStrictHostKeyChecking=no -oCheckHostIP=no pirate@pi01.local
cd spark-1.6.2-bin-hadoop2.6
bin/run-example SparkPi 10
bin/spark-shell --master local[4]
# This takes several minutes to display a prompt.
# While the shell is running, visit http://pi01.local:4040/
scala> sc.textFile("README.md").count
# After the job is complete, visit the monitor page.
scala> exit
* Run PyShark Shell
bin/pyspark --master local[4]
>>> sc.textFile("README.md").count()
>>> exit()
CLUSTER
Now for the clustering...
* Enable password-less SSH between nodes
ssh -oStrictHostKeyChecking=no -oCheckHostIP=no pirate@pi01.local
for i in `seq 1 5`; do avahi-resolve --name pi0${i}.local -4 | awk ' { t = $1; $1 = $2; $2 = t; print; } ' | sudo tee --append /etc/hosts; done
echo "$(ip route get 1 | awk '{print $NF;exit}') $(hostname).local" | sudo tee --append /etc/hosts
ssh-keygen
for i in `seq 1 5`; do ssh-copy-id pirate@pi0${i}.local; done
* Configure Spark for Cluster
cd spark-1.6.2-bin-hadoop2.6/conf
Create a slaves file with the following contents:
pi01.local
pi02.local
pi03.local
pi04.local
pi05.local
cp spark-env.sh.template spark-env.sh
In spark-env.sh:
Set SPARK_MASTER_IP to the result of "ip route get 1 | awk '{print $NF;exit}'"
SPARK_WORKER_MEMORY=512m
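If you prefer to script it, the same two settings can be appended from a shell (a small sketch; the command substitution evaluates on whichever machine runs it, so run this on pi01):
echo "SPARK_MASTER_IP=$(ip route get 1 | awk '{print $NF;exit}')" >> spark-env.sh
echo "SPARK_WORKER_MEMORY=512m" >> spark-env.sh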
* Copy the spark environment script to the other RPI
for i in `seq 2 5`; do scp spark-env.sh pirate@pi0${i}.local:spark-1.6.2-bin-hadoop2.6/conf/; done
* Start the cluster
cd ..
sbin/start-all.sh
* Visit the monitor page
http://192.168.1.8:8080
And everything is working so far! But ...
* Start a Spark Shell
bin/spark-shell --executor-memory 500m --driver-memory 500m --master spark://pi01.local:7077 --conf spark.executor.heartbeatInterval=45s
And this fails...
At the end of this article, I have a working Docker Swarm running on a five-node PicoCluster. Please flash your SD cards according to http://affy.blogspot.com/2016/06/how-did-i-prepare-my-picocluster-for.html. Stop following that article after copying the SSH ids to the RPI.
I am controlling the PicoCluster using my laptop. Therefore, my laptop is the HOST in the steps below.
There is no guarantee these commands are correct. They just seem to work for me. And please don't ever, ever depend on this information for anything non-prototype without doing your own research.
* On the HOST, create the Docker Machine to hold the consul service.
docker-machine create \
-d generic \
--engine-storage-driver=overlay \
--generic-ip-address=$(getip pi01.local) \
--generic-ssh-user "pirate" \
consul-machine
* Connect to the consul-machine Docker Machine
eval $(docker-machine env consul-machine)
* Start Consul.
docker run \
-d \
-p 8500:8500 \
hypriot/rpi-consul \
agent -dev -client 0.0.0.0
* Reset docker environment to talk with host docker.
unset DOCKER_TLS_VERIFY DOCKER_HOST DOCKER_CERT_PATH DOCKER_MACHINE_NAME
* Visit the consul dashboard to prove it is working and accessible.
firefox http://$(getip pi01.local):8500
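If you'd rather check from the command line, Consul's standard HTTP status endpoint should answer on the same port; an empty or error response means the agent isn't reachable:
curl http://$(getip pi01.local):8500/v1/status/leader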
* Create the swarm-master machine. Note that eth0 is being used instead of eth1.
docker-machine create \
-d generic \
--engine-storage-driver=overlay \
--swarm \
--swarm-master \
--swarm-image hypriot/rpi-swarm:latest \
--swarm-discovery="consul://$(docker-machine ip consul-machine):8500" \
--generic-ip-address=$(getip pi02.local) \
--generic-ssh-user "pirate" \
--engine-opt="cluster-store=consul://$(docker-machine ip consul-machine):8500" \
--engine-opt="cluster-advertise=eth0:2376" \
swarm-master
* Create the first slave node.
docker-machine create \
-d generic \
--engine-storage-driver=overlay \
--swarm \
--swarm-image hypriot/rpi-swarm:latest \
--swarm-discovery="consul://$(docker-machine ip consul-machine):8500" \
--generic-ip-address=$(getip pi03.local) \
--generic-ssh-user "pirate" \
--engine-opt="cluster-store=consul://$(docker-machine ip consul-machine):8500" \
--engine-opt="cluster-advertise=eth0:2376" \
swarm-slave01
* List nodes in the swarm. I don't know why, but this command must be run from one of the RPI. Otherwise, I see a "malformed HTTP response" message.
eval $(docker-machine env swarm-master)
docker -H $(docker-machine ip swarm-master):3376 run \
--rm \
hypriot/rpi-swarm:latest \
list consul://$(docker-machine ip consul-machine):8500
* Create the second slave node.
docker-machine create \
-d generic \
--engine-storage-driver=overlay \
--swarm \
--swarm-image hypriot/rpi-swarm:latest \
--swarm-discovery="consul://$(docker-machine ip consul-machine):8500" \
--generic-ip-address=$(getip pi04.local) \
--generic-ssh-user "pirate" \
--engine-opt="cluster-store=consul://$(docker-machine ip consul-machine):8500" \
--engine-opt="cluster-advertise=eth0:2376" \
swarm-slave02
* Create the third slave node.
docker-machine create \
-d generic \
--engine-storage-driver=overlay \
--swarm \
--swarm-image hypriot/rpi-swarm:latest \
--swarm-discovery="consul://$(docker-machine ip consul-machine):8500" \
--generic-ip-address=$(getip pi05.local) \
--generic-ssh-user "pirate" \
--engine-opt="cluster-store=consul://$(docker-machine ip consul-machine):8500" \
--engine-opt="cluster-advertise=eth0:2376" \
swarm-slave03
* Check that docker machine sees all of the nodes
$ docker-machine ls
NAME ACTIVE DRIVER STATE URL SWARM DOCKER ERRORS
consul-machine - generic Running tcp://192.168.1.8:2376 v1.11.1
swarm-master - generic Running tcp://192.168.1.7:2376 swarm-master (master) v1.11.1
swarm-slave01 - generic Running tcp://192.168.1.2:2376 swarm-master v1.11.1
swarm-slave02 - generic Running tcp://192.168.1.5:2376 swarm-master v1.11.1
swarm-slave03 - generic Running tcp://192.168.1.4:2376 swarm-master v1.11.1
* List the swarm nodes in Firefox using Consul.
firefox http://$(docker-machine ip consul-machine):8500/ui/#/dc1/kv/docker/swarm/nodes/
* Is my cluster working? First, switch to the swarm-master environment. Then view its information. You should see the slaves listed. Next, run the hello-world container. And finally, list the containers.
eval $(docker-machine env swarm-master)
docker -H $(docker-machine ip swarm-master):3376 info
docker -H $(docker-machine ip swarm-master):3376 run hypriot/armhf-hello-world
docker -H $(docker-machine ip swarm-master):3376 ps -a
This post tells how I attached a USB Thumb drive to my Raspberry PI and used it to hold Docker's Root Directory.
The first step is to connect to the RPI.
$ ssh -o 'StrictHostKeyChecking=no' -o 'CheckHostIP=no' 'pirate@pi02.local'
Now create a mount point. This is just a directory, nothing fancy. It should be owned by root because Docker runs as root. Don't try to use "pirate" as the owner. I tried that. It failed. Leave the owner as root.
$ sudo mkdir /media/usb
Then look at the attached USB devices.
$ sudo blkid
/dev/mmcblk0: PTTYPE="dos"
/dev/mmcblk0p1: SEC_TYPE="msdos" LABEL="HypriotOS" UUID="D6D9-1D76" TYPE="vfat"
/dev/mmcblk0p2: LABEL="root" UUID="81e5bfc7-0701-4a09-80aa-fe5bc3eecbcf" TYPE="ext4"
/dev/sda1: LABEL="STORE N GO" UUID="F171-FAE6" TYPE="vfat" PARTUUID="f11d6f2b-01"
Note that the USB thumb drive is /dev/sda1. The information above is for the original formatting of the drive. After formatting the drive to use "ext3" the information looks like:
/dev/sda1: LABEL="PI02" UUID="801b666c-ea47-4f6f-ab6b-b88acceff08f" TYPE="ext3" PARTUUID="f11d6f2b-01"
This is the command that I used to format the drive to use ext3. Notice that I named the drive the same as the hostname. I have no particular reason to do this. It just seemed right. Only run this formatting command once.
$ sudo mkfs.ext3 -L "PI02" /dev/sda1
Now it's time to mount the thumb drive. Here we connect the device (/dev/sda1) to the mount point. After this command is run you'll be able to use /media/usb as a normal directory.
$ sudo mount /dev/sda1 /media/usb
Next, we set up the thumb drive to be available whenever the RPI is rebooted. First, find the UUID. It's whatever UUID is associated with sda1.
$ sudo ls -l /dev/disk/by-uuid
total 0
lrwxrwxrwx 1 root root 10 Jul 3 2014 801b666c-ea47-4f6f-ab6b-b88acceff08f -> ../../sda1
lrwxrwxrwx 1 root root 15 Jul 3 2014 81e5bfc7-0701-4a09-80aa-fe5bc3eecbcf -> ../../mmcblk0p2
lrwxrwxrwx 1 root root 15 Jul 3 2014 D6D9-1D76 -> ../../mmcblk0p1
Now add that UUID to the /etc/fstab file so it will be recognized across reboots. If you re-flash your SD card, you'll need to execute this step again.
$ echo "UUID=801b666c-ea47-4f6f-ab6b-b88acceff08f /media/usb nofail 0 0" | sudo tee -a /etc/fstab
Some images are already on the Hypriot SD card. We'll make sure they are still available after we move the Docker Root directory.
$ docker images
REPOSITORY TAG IMAGE ID CREATED SIZE
hypriot/rpi-swarm 1.2.2 f13b7205f2db 5 weeks ago 13.97 MB
hypriot/rpi-consul 0.6.4 879ac05d5353 6 weeks ago 19.71 MB
Stop Docker to ensure that the Docker root directory does not change.
$ sudo systemctl stop docker
Copy files to the new location. Don't bother deleting the original files.
$ sudo cp --no-preserve=mode --recursive /var/lib/docker /media/usb/docker
If you are paranoid, you can compare the two directory trees.
$ sudo diff /var/lib/docker /media/usb/docker
Common subdirectories: /var/lib/docker/containers and /media/usb/docker/containers
Common subdirectories: /var/lib/docker/image and /media/usb/docker/image
Common subdirectories: /var/lib/docker/network and /media/usb/docker/network
Common subdirectories: /var/lib/docker/overlay and /media/usb/docker/overlay
Common subdirectories: /var/lib/docker/tmp and /media/usb/docker/tmp
Common subdirectories: /var/lib/docker/trust and /media/usb/docker/trust
Common subdirectories: /var/lib/docker/volumes and /media/usb/docker/volumes
Edit the docker service file to add --graph "/media/usb/docker" to the end of the ExecStart line.
$ sudo vi /etc/systemd/system/docker.service
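If you'd rather script the edit, a sed one-liner can do the same thing. This is a sketch that assumes the unit file contains a single line beginning with ExecStart=:
$ sudo sed -i '/^ExecStart=/ s|$| --graph "/media/usb/docker"|' /etc/systemd/system/docker.service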
Now reload the systemctl daemon and start docker.
sudo systemctl daemon-reload
sudo systemctl start docker
Confirm that the ExecStart is correct - that is, that it has the graph parameter.
$ sudo systemctl show docker | grep ExecStart
Confirm that the Docker Root Directory has changed.
$ docker info | grep "Root Dir"
And finally, confirm that you can see docker images.
$ docker images
How did I prepare my PicoCluster?
DOCKER VERSION: 1.11.1
HYPRIOT VERSION: 0.8
RASPBERRY PI: 3
From my Linux laptop, I created five SD cards using the flash utility from Hypriot.
As I plugged each SD card into my laptop, I ran 'lsblk'. Then I used 'umount' for anything mounted from the SD card. For example:
$ lsblk
NAME MAJ:MIN RM SIZE RO TYPE MOUNTPOINT
sda 8:0 0 111.8G 0 disk
├─sda1 8:1 0 79.9G 0 part /
├─sda2 8:2 0 1K 0 part
└─sda5 8:5 0 31.9G 0 part [SWAP]
sdb 8:16 0 894.3G 0 disk
└─sdb1 8:17 0 894.3G 0 part /data
sr0 11:0 1 1024M 0 rom
mmcblk0 179:0 0 15G 0 disk
├─mmcblk0p1 179:1 0 64M 0 part /media/medined/3ABE-55E4
└─mmcblk0p2 179:2 0 14.9G 0 part /media/medined/root
umount any mount points for mmcblk0 (or your SD card). For example,
umount /media/medined/3ABE-55E4
umount /media/medined/root
If the SD cards were flashed in the past then you'll need to run
umount /media/medined/HypriotOS
umount /media/medined/root
Here are the five flash commands that I used. Of course, I used my real SSID and PASSWORD. Note that this command leaves your password in your shell history. If this is a concern, please research alternatives.
As you flash the SD cards, use a gold sharpie to indicate the hostname of the SD card. This will make it much easier to make sure they are in the right RPI.
flash --hostname pi01 --ssid NETWORK --password PASSWORD --device /dev/mmcblk0 https://downloads.hypriot.com/hypriotos-rpi-v0.8.0.img.zip
flash --hostname pi02 --ssid NETWORK --password PASSWORD --device /dev/mmcblk0 https://downloads.hypriot.com/hypriotos-rpi-v0.8.0.img.zip
flash --hostname pi03 --ssid NETWORK --password PASSWORD --device /dev/mmcblk0 https://downloads.hypriot.com/hypriotos-rpi-v0.8.0.img.zip
flash --hostname pi04 --ssid NETWORK --password PASSWORD --device /dev/mmcblk0 https://downloads.hypriot.com/hypriotos-rpi-v0.8.0.img.zip
flash --hostname pi05 --ssid NETWORK --password PASSWORD --device /dev/mmcblk0 https://downloads.hypriot.com/hypriotos-rpi-v0.8.0.img.zip
Next, after the SD cards were placed into the PicoCluster, I plugged it into power.
As a sidenote, each time you restart the RPIs, their SSH fingerprint changes. You'll need to remove the old fingerprint. One technique is the following:
for i in `seq 1 5`; do ssh-keygen -R pi0${i}.local 2>/dev/null; done
I dislike questions about server fingerprints when connecting. Therefore, you'll see me using the "StrictHostKeyChecking=no" option with SSH. I take no stance on the security ramifications of this choice. I'm connecting to my local PicoCluster, not some public server. Make your own security decisions.
Ensure that you have an SSH key set. Look for "~/.ssh/id_rsa". If you don't have that file, use ssh-keygen to make one.
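For example, something like this creates a key only when one doesn't already exist (the -N "" means no passphrase; make your own call on that):
[ -f ~/.ssh/id_rsa ] || ssh-keygen -t rsa -b 4096 -N "" -f ~/.ssh/id_rsa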
Now copy your PKI credential to the five RPIs to enable password-less SSH. You will be asked for the password, which should be "hypriot", five times.
for i in `seq 1 5`; do ssh-copy-id -oStrictHostKeyChecking=no -oCheckHostIP=no pirate@pi0${i}.local; done
Next you can check that password-less SSH is working. After each SSH, you'll see a prompt like "HypriotOS/armv7: pirate@pi01 in ~". Just check that the hostname is correct and then type exit to move on to the next RPI.
for i in `seq 1 5`; do ssh -oStrictHostKeyChecking=no -oCheckHostIP=no pirate@pi0${i}.local; done
You can use the following shell function to determine the IP address of an RPI. I also found it handy to log into my router to see the list of attached devices. By the way, if you haven't changed the default password for the admin user of your router, do it. This article will wait...
function getip() { (traceroute $1 2>&1 | head -n 1 | cut -d\( -f 2 | cut -d\) -f 1) }
It's probably a good idea to place that function in your .bashrc file so that you'll always have it handy.
for i in `seq 1 5`; do echo "PI0${i}.local: $(getip pi0${i}.local)"; done
Now comes the fun part, setting up the Docker Swarm. Fair warning. I don't know if these steps are correct.
docker-machine create \
-d generic \
--engine-storage-driver=overlay \
--swarm \
--swarm-master \
--swarm-image hypriot/rpi-swarm:latest \
--generic-ip-address=$(getip pi01.local) \
--generic-ssh-user "pirate" \
--swarm-discovery="token://01" \
swarm
docker-machine create \
-d generic \
--engine-storage-driver=overlay \
--swarm \
--swarm-image hypriot/rpi-swarm:latest \
--generic-ip-address=$(getip pi02.local) \
--generic-ssh-user "pirate" \
--swarm-discovery="token://01" \
swarm-slave01
docker-machine create \
-d generic \
--engine-storage-driver=overlay \
--swarm \
--swarm-image hypriot/rpi-swarm:latest \
--generic-ip-address=$(getip pi03.local) \
--generic-ssh-user "pirate" \
--swarm-discovery="token://01" \
swarm-slave02
docker-machine create \
-d generic \
--engine-storage-driver=overlay \
--swarm \
--swarm-image hypriot/rpi-swarm:latest \
--generic-ip-address=$(getip pi04.local) \
--generic-ssh-user "pirate" \
--swarm-discovery="token://01" \
swarm-slave03
docker-machine create \
-d generic \
--engine-storage-driver=overlay \
--swarm \
--swarm-image hypriot/rpi-swarm:latest \
--generic-ip-address=$(getip pi05.local) \
--generic-ssh-user "pirate" \
--swarm-discovery="token://01" \
swarm-slave04
Now you can list the nodes in the cluster using Docker Machine:
$ docker-machine ls
NAME ACTIVE DRIVER STATE URL SWARM DOCKER ERRORS
swarm - generic Running tcp://192.168.1.12:2376 swarm (master) v1.11.1
swarm-slave01 - generic Running tcp://192.168.1.7:2376 swarm v1.11.1
swarm-slave02 - generic Running tcp://192.168.1.11:2376 swarm v1.11.1
swarm-slave03 - generic Running tcp://192.168.1.23:2376 swarm v1.11.1
swarm-slave04 - generic Running tcp://192.168.1.22:2376 swarm v1.11.1
Notice that a master node is indicated but it is not marked as active. I don't know why.
Before moving on, let's look at what containers are being run. There should be six.
for i in `seq 1 5`; do echo "RPI ${i}"; ssh -oStrictHostKeyChecking=no -oCheckHostIP=no pirate@pi0${i}.local docker ps -a; done
RPI 1
CONTAINER ID IMAGE COMMAND CREATED STATUS PORTS NAMES
ceb4a5255dc2 hypriot/rpi-swarm:latest "/swarm join --advert" About an hour ago Up About an hour 2375/tcp swarm-agent
e9d3bf308284 hypriot/rpi-swarm:latest "/swarm manage --tlsv" About an hour ago Up About an hour 2375/tcp, 0.0.0.0:3376->3376/tcp swarm-agent-master
RPI 2
CONTAINER ID IMAGE COMMAND CREATED STATUS PORTS NAMES
e2dca97c23fe hypriot/rpi-swarm:latest "/swarm join --advert" About an hour ago Up About an hour 2375/tcp swarm-agent
RPI 3
CONTAINER ID IMAGE COMMAND CREATED STATUS PORTS NAMES
07d0b4fc4490 hypriot/rpi-swarm:latest "/swarm join --advert" 11 minutes ago Up 11 minutes 2375/tcp swarm-agent
RPI 4
CONTAINER ID IMAGE COMMAND CREATED STATUS PORTS NAMES
88712d8df693 hypriot/rpi-swarm:latest "/swarm join --advert" 6 minutes ago Up 6 minutes 2375/tcp swarm-agent
RPI 5
CONTAINER ID IMAGE COMMAND CREATED STATUS PORTS NAMES
b7738fb8c4b8 hypriot/rpi-swarm:latest "/swarm join --advert" 2 minutes ago Up 2 minutes 2375/tcp swarm-agent
Currently, when you type "docker ps" you're looking at containers running on your local computer. You can switch so that "docker" connects to one of the "docker machines" using this command:
eval $(docker-machine env swarm)
Now "docker ps" returns information about containers running on pi01.
$ docker ps
CONTAINER ID IMAGE COMMAND CREATED STATUS PORTS NAMES
ceb4a5255dc2 hypriot/rpi-swarm:latest "/swarm join --advert" About an hour ago Up About an hour 2375/tcp swarm-agent
e9d3bf308284 hypriot/rpi-swarm:latest "/swarm manage --tlsv" About an hour ago Up About an hour 2375/tcp, 0.0.0.0:3376->3376/tcp swarm-agent-master
One neat "trick" is to look at the information from the "swarm-agent-master" container. This is done using Docker's -H option. Notice that the results indicate there are six containers running. Count the number of containers found using the "for..loop" earlier. They are the same number.
$ docker -H $(docker-machine ip swarm):3376 info
Containers: 6
Running: 6
Paused: 0
Stopped: 0
Images: 15
Server Version: swarm/1.2.3
Role: primary
Strategy: spread
Filters: health, port, containerslots, dependency, affinity, constraint
Nodes: 5
swarm: 192.168.1.12:2376
└ ID: P4OH:AB7Q:T2T3:P6OK:BW5F:YSIB:NACW:Q2F3:FKU4:IJFD:AUJQ:74CZ
└ Status: Healthy
└ Containers: 2
└ Reserved CPUs: 0 / 4
└ Reserved Memory: 0 B / 971.7 MiB
└ Labels: executiondriver=, kernelversion=4.4.10-hypriotos-v7+, operatingsystem=Raspbian GNU/Linux 8 (jessie), provider=generic, storagedriver=overlay
└ UpdatedAt: 2016-06-22T01:39:56Z
└ ServerVersion: 1.11.1
swarm-slave01: 192.168.1.7:2376
└ ID: GDQI:WYHS:OD2W:EE67:CKMU:A2PW:6K5T:YZSK:B5KL:SPCZ:6GVX:5MCO
└ Status: Healthy
└ Containers: 1
└ Reserved CPUs: 0 / 4
└ Reserved Memory: 0 B / 971.7 MiB
└ Labels: executiondriver=, kernelversion=4.4.10-hypriotos-v7+, operatingsystem=Raspbian GNU/Linux 8 (jessie), provider=generic, storagedriver=overlay
└ UpdatedAt: 2016-06-22T01:39:45Z
└ ServerVersion: 1.11.1
swarm-slave02: 192.168.1.11:2376
└ ID: CA7H:C7UA:5V5N:NY4C:KECT:JK57:HDGN:2DNH:ASXQ:UJFQ:A5A4:US3Y
└ Status: Healthy
└ Containers: 1
└ Reserved CPUs: 0 / 4
└ Reserved Memory: 0 B / 971.7 MiB
└ Labels: executiondriver=, kernelversion=4.4.10-hypriotos-v7+, operatingsystem=Raspbian GNU/Linux 8 (jessie), provider=generic, storagedriver=overlay
└ UpdatedAt: 2016-06-22T01:39:32Z
└ ServerVersion: 1.11.1
swarm-slave03: 192.168.1.23:2376
└ ID: 6H6D:P6EN:PTBL:Q5E3:MP32:T6CI:XU33:PCQV:KT6H:KRJ4:LYSN:76EJ
└ Status: Healthy
└ Containers: 1
└ Reserved CPUs: 0 / 4
└ Reserved Memory: 0 B / 971.7 MiB
└ Labels: executiondriver=, kernelversion=4.4.10-hypriotos-v7+, operatingsystem=Raspbian GNU/Linux 8 (jessie), provider=generic, storagedriver=overlay
└ UpdatedAt: 2016-06-22T01:39:25Z
└ ServerVersion: 1.11.1
swarm-slave04: 192.168.1.22:2376
└ ID: 2ZBK:3DJE:D23C:7QAB:TLFS:L7EO:L4L4:IQ6Y:EC7D:UG7S:3WU6:QJ5D
└ Status: Healthy
└ Containers: 1
└ Reserved CPUs: 0 / 4
└ Reserved Memory: 0 B / 971.7 MiB
└ Labels: executiondriver=, kernelversion=4.4.10-hypriotos-v7+, operatingsystem=Raspbian GNU/Linux 8 (jessie), provider=generic, storagedriver=overlay
└ UpdatedAt: 2016-06-22T01:39:32Z
└ ServerVersion: 1.11.1
Plugins:
Volume:
Network:
Kernel Version: 4.4.10-hypriotos-v7+
Operating System: linux
Architecture: arm
CPUs: 20
Total Memory: 4.745 GiB
Name: e9d3bf308284
Docker Root Dir:
Debug mode (client): false
Debug mode (server): false
WARNING: No kernel memory limit support
And that's as far as I've gotten.
It took me a bit of time to get this simple program working so I'm sharing for other people new to Go.
Yesterday, I showed how to run NodeJS inside a Docker container. Today, I updated my Github project (https://github.com/medined/docker-nodejs) so that the Example server works correctly.
The trick is for the NodeJS code inside the container to find the container's IP address and listen on that address instead of localhost or 127.0.0.1. This is not difficult.
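The actual JavaScript is in the repository above. As a rough shell illustration of the same idea (not the repository's code): Docker writes the container's hostname and IP address into /etc/hosts, so from inside a container you can see the address a server should bind to.
docker run --rm ubuntu bash -c 'hostname -i'
# prints the container's own address, e.g. something in 172.17.0.0/16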
In my continuing quest to run my development tools from within Docker containers, I looked at Node today.
The Github project is at https://github.com/medined/docker-nodejs.
My Dockerfile is fairly simple:
FROM ubuntu:14.04
This is another in my series of very short entries about Docker. I've been trying to avoid installing Maven on my development laptop, but I still want to use spring-boot:run to launch my applications. Here is the Docker command I am using. Notice that server.port is specified on the command line so that I can change it as needed.
docker run \
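  -it \
  --rm \
  -v "$PWD":/usr/src/mymaven \
  -w /usr/src/mymaven \
  -p 8080:8080 \
  maven:3.3-jdk-8 \
  mvn spring-boot:run -Dserver.port=8080
# Everything after "docker run \" above is a hypothetical sketch: it assumes the
# official maven:3.3-jdk-8 image, the project mounted at /usr/src/mymaven, and
# port 8080. Adjust -Dserver.port to taste.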
I recently reinstalled Ubuntu on my zareason laptop. As I was thinking about installing my development tools, I thought about how to integrate Docker into the process. Below I show how simple using the Maven container can be:
* Create an alias to the Maven container.
alias mvn="docker run \
-it \
--rm \
--name my-maven-project \
-v "$PWD":/usr/src/mymaven \
-w /usr/src/mymaven \
maven:3.3-jdk-8 \
mvn"
* Clone my ragnvald Java project.
git clone git@github.com:medined/ragnvald.git
* cd ragnvald
* Package the project.
mvn package
That's it. You're using Maven without installing onto your laptop! The results of the compilation are placed into the target directory.
If you need to specify a Maven settings.xml file that's fairly easy as well. Simply create it alongside the pom.xml file. Then slightly modify your alias:
alias mvn="docker run \
-it \
--rm \
--name my-maven-project \
-v "$PWD":/root/.m2 \
-v "$PWD":/usr/src/mymaven \
-w /usr/src/mymaven \
maven:3.3-jdk-8 \
mvn"
The ragnvald project goes one step farther to use an Artifactory container so that I can use the Artifactory web interface if needed. That's quite convenient!
This entry doesn't reveal any hidden secrets, just the simple steps to start using MySQL on Docker.
* Install docker
* Install docker-compose
* mkdir firstdb
* cd firstdb
* vi docker-compose.yml
mysql:
  image: mysql:latest
  environment:
    MYSQL_DATABASE: sample
    MYSQL_USER: mysql
    MYSQL_PASSWORD: mysql
    MYSQL_ROOT_PASSWORD: supersecret
* docker-compose up
* docker-compose ps
Name Command State Ports
-----------------------------------------------------------------
firstdb_mysql_1 /entrypoint.sh mysqld Up 3306/tcp
* Use a one-shot Docker instance to display environment variables. Notice
the variables that start with MYSQL? Your programs can use these variables
to make the database connection.
docker run --link=firstdb_mysql_1:mysql ubuntu env
PATH=/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin
HOSTNAME=abfc8d50633b
MYSQL_PORT=tcp://172.17.0.23:3306
MYSQL_PORT_3306_TCP=tcp://172.17.0.23:3306
MYSQL_PORT_3306_TCP_ADDR=172.17.0.23
MYSQL_PORT_3306_TCP_PORT=3306
MYSQL_PORT_3306_TCP_PROTO=tcp
MYSQL_NAME=/nostalgic_rosalind/mysqldb
MYSQL_ENV_MYSQL_PASSWORD=mysql
MYSQL_ENV_MYSQL_ROOT_PASSWORD=supersecret
MYSQL_ENV_MYSQL_USER=mysql
MYSQL_ENV_MYSQL_DATABASE=sample
MYSQL_ENV_MYSQL_MAJOR=5.6
MYSQL_ENV_MYSQL_VERSION=5.6.24
HOME=/root
* Use a one-shot Docker instance for a MySQL command-line interface. Once this
is running, you'll be able to use command like 'show databases'.
docker run -it \
--link=firstdb_mysql_1:mysql \
--rm \
mysql/mysql-server:latest \
sh -c 'exec mysql -h"$MYSQL_PORT_3306_TCP_ADDR" -P"$MYSQL_PORT_3306_TCP_PORT" -uroot -p"$MYSQL_ENV_MYSQL_ROOT_PASSWORD"'
That's all it takes to start.
Witness a tale of two Dockerfiles that perform the same task. See the size difference. Imagine how it might change infrastructure costs.
FROM debian:wheezy
RUN apt-get update && apt-get install -y openjdk-7-jre && rm -rf /var/lib/apt/lists/*
ADD target/si-standalone-sample-1.0-SNAPSHOT.jar /
ENV JAVA_HOME /usr/lib/jvm/java-7-openjdk-amd64
ENV CLASSPATH si-standalone-sample-1.0-SNAPSHOT.jar
CMD [ "java", "org.springframework.boot.loader.JarLauncher" ]
FROM debian:wheezy
RUN apt-key adv --keyserver hkp://keyserver.ubuntu.com:80 --recv-keys 0x219BD9C9 && \
    echo "deb http://repos.azulsystems.com/ubuntu precise main" >> /etc/apt/sources.list.d/zulu.list && \
    apt-get -qq update && \
    apt-get -qqy install zulu-7 && \
    rm -rf /var/lib/apt/lists/*
ADD target/si-standalone-sample-1.0-SNAPSHOT.jar /
ENV JAVA_HOME /usr/lib/jvm/zulu-7-amd64
ENV CLASSPATH si-standalone-sample-1.0-SNAPSHOT.jar
CMD [ "java", "org.springframework.boot.loader.JarLauncher" ]
REPOSITORY           TAG       SIZE
spring-integration   openjdk   549.1 MB
spring-integration   azul      261.3 MB
Update from Jan 2015: The Zulu team added formal Debian support last October, I just did not know about it. Look at the version history for Zulu 8.4, 7.7, and 6.6 at http://www.azulsystems.com/zulurelnotes. Also look on DockerHub for their 8.4.x Docker files. They don't use lsb_release -cs in Debian Dockerfiles anymore, and instead allow the Zulu repository to honor 'stable' as release name. 'stable' always pushes the highest level for a Java major version. - I am paraphrasing the comments from Matthew Schuetze below.
I saw the following line in a Dockerfile
RUN echo "deb http://repos.azulsystems.com/ubuntu `lsb_release -cs` main" >> /etc/apt/sources.list.d/zulu.list
The lsb_release command can be installed from either the full lsb package or the much smaller lsb-release package. Compare the container changes after installing each:
$ apt-get update && apt-get install -y lsb
$ docker diff 09 | wc -l
30013
$ apt-get update && apt-get install -y lsb-release
$ docker diff 23 | wc -l
1689
While I dabble in System Administration, I don't have a deep knowledge of how packages are created or maintained. Today, we'll see how to use Docker to increase my understanding of "apt-get update". I was curious about this command because I read that it's good practice to remove the files created during the update process.
I started a small container using
docker run -i -t debian:wheezy /bin/bash
docker diff "45"
C /var
C /var/lib
C /var/lib/apt
C /var/lib/apt/lists
A /var/lib/apt/lists/http.debian.net_debian_dists_wheezy-updates_Release
A /var/lib/apt/lists/http.debian.net_debian_dists_wheezy-updates_Release.gpg
A /var/lib/apt/lists/http.debian.net_debian_dists_wheezy-updates_main_binary-amd64_Packages.gz
A /var/lib/apt/lists/http.debian.net_debian_dists_wheezy_Release
A /var/lib/apt/lists/http.debian.net_debian_dists_wheezy_Release.gpg
A /var/lib/apt/lists/http.debian.net_debian_dists_wheezy_main_binary-amd64_Packages.gz
A /var/lib/apt/lists/lock
C /var/lib/apt/lists/partial
A /var/lib/apt/lists/security.debian.org_dists_wheezy_updates_Release
A /var/lib/apt/lists/security.debian.org_dists_wheezy_updates_Release.gpg
A /var/lib/apt/lists/security.debian.org_dists_wheezy_updates_main_binary-amd64_Packages.gz
# ls -lh /var/lib/apt/lists
total 8.0M
-rw-r--r-- 1 root root 121K Nov 23 02:49 http.debian.net_debian_dists_wheezy-updates_Release
-rw-r--r-- 1 root root  836 Nov 23 02:49 http.debian.net_debian_dists_wheezy-updates_Release.gpg
-rw-r--r-- 1 root root    0 Nov 23 02:37 http.debian.net_debian_dists_wheezy-updates_main_binary-amd64_Packages
-rw-r--r-- 1 root root 165K Oct 18 10:33 http.debian.net_debian_dists_wheezy_Release
-rw-r--r-- 1 root root 1.7K Oct 18 10:44 http.debian.net_debian_dists_wheezy_Release.gpg
-rw-r--r-- 1 root root 7.3M Oct 18 10:07 http.debian.net_debian_dists_wheezy_main_binary-amd64_Packages.gz
-rw-r----- 1 root root    0 Nov 23 04:09 lock
drwxr-xr-x 2 root root 4.0K Nov 23 04:09 partial
-rw-r--r-- 1 root root 100K Nov 20 16:31 security.debian.org_dists_wheezy_updates_Release
-rw-r--r-- 1 root root  836 Nov 20 16:31 security.debian.org_dists_wheezy_updates_Release.gpg
-rw-r--r-- 1 root root 270K Nov 20 16:31 security.debian.org_dists_wheezy_updates_main_binary-amd64_Packages.gz
gzip -d http.debian.net_debian_dists_wheezy_main_binary-amd64_Packages.gz
# more http.debian.net_debian_dists_wheezy_main_binary-amd64_Packages
Package: 0ad
Version: 0~r11863-2
Installed-Size: 8260
Maintainer: Debian Games Team
Architecture: amd64
Depends: 0ad-data (>= 0~r11863), 0ad-data (<= 0~r11863-2), gamin | fam, libboost-signals1.49.0 (>= 1.49.0-1), libc6 (>= 2.11), libcurl3-gnutls (>= 7.16.2), libenet1a, libgamin0 | libfam0, libgcc1 (>= 1:4.1.1), libgl1-mesa-glx | libgl1, libjpeg8 (>= 8c), libmozjs185-1.0 (>= 1.8.5-1.0.0+dfsg), libnvtt2, libopenal1, libpng12-0 (>= 1.2.13-4), libsdl1.2debian (>= 1.2.11), libstdc++6 (>= 4.6), libvorbisfile3 (>= 1.1.2), libwxbase2.8-0 (>= 2.8.12.1), libwxgtk2.8-0 (>= 2.8.12.1), libx11-6, libxcursor1 (>> 1.1.2), libxml2 (>= 2.7.4), zlib1g (>= 1:1.2.0)
Pre-Depends: dpkg (>= 1.15.6~)
Description: Real-time strategy game of ancient warfare
Homepage: http://www.wildfiregames.com/0ad/
Description-md5: d943033bedada21853d2ae54a2578a7b
Tag: game::strategy, implemented-in::c++, interface::x11, role::program, uitoolkit::sdl, uitoolkit::wxwidgets, use::gameplaying, x11::application
Section: games
Priority: optional
Filename: pool/main/0/0ad/0ad_0~r11863-2_amd64.deb
Size: 2260694
MD5sum: cf71a0098c502ec1933dea41610a79eb
SHA1: aa4a1fdc36498f230b9e38ae0116b23be4f6249e
SHA256: e28066103ecc6996e7a0285646cd2eff59288077d7cc0d22ca3489d28d215c0a
...
# grep "Package" http.debian.net_debian_dists_wheezy_main_binary-amd64_Packages | wc -l 36237
rm -rf /var/lib/apt/lists/*
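This is why Dockerfiles usually chain the update, the install, and the cleanup into a single RUN instruction, so the package lists never persist in any image layer. A minimal sketch (curl is just a stand-in package):
RUN apt-get update && apt-get install -y curl && rm -rf /var/lib/apt/lists/*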
Brooklyn is a large project with a lot of dependencies. I wanted to compile it, but I also wanted to remove all traces of the project when I was done experimenting. I used Docker to accomplish this goal.
See the files below at https://github.com/medined/docker-brooklyn.
First, I created a Dockerfile to load java, maven, and clone the repository.
$ cat Dockerfile
FROM ubuntu:14.04
MAINTAINER David Medinets
#
# Install Java
#
RUN apt-get update && \
    apt-get install -y software-properties-common && \
    add-apt-repository -y ppa:webupd8team/java && \
    echo debconf shared/accepted-oracle-license-v1-1 select true | sudo debconf-set-selections && \
    echo debconf shared/accepted-oracle-license-v1-1 seen true | sudo debconf-set-selections && \
    apt-get update && \
    apt-get install -y oracle-java8-installer
ENV JAVA_HOME /usr/lib/jvm/java-8-oracle
#
# Install Maven
#
RUN echo "deb http://ppa.launchpad.net/natecarlson/maven3/ubuntu precise main" >> /etc/apt/sources.list && \
    echo "deb-src http://ppa.launchpad.net/natecarlson/maven3/ubuntu precise main" >> /etc/apt/sources.list && \
    apt-get update && \
    apt-get -y --force-yes install maven3 && \
    rm -f /usr/bin/mvn && \
    ln -s /usr/share/maven3/bin/mvn /usr/bin/mvn
RUN mkdir -p /root/.m2
ADD settings.xml /root/.m2/settings.xml
#
# Clone the brooklyn project
#
RUN apt-get install -y git
RUN git clone https://github.com/apache/incubator-brooklyn.git
WORKDIR /incubator-brooklyn
RUN apt-get clean && rm -rf /var/lib/apt/lists/* /tmp/* /var/tmp/*
$ cat build_image.sh
#!/bin/bash
sudo DOCKER_HOST=$DOCKER_HOST docker build --no-cache --rm=true -t medined/brooklyn.build .
$ cat run_image.sh
#!/bin/bash
#####
# Make sure that Artifactory is running.
#
ARTIFACTORY_COUNT=$(docker ps --filter=status=running | grep artifactory | wc -l)
if [ "${ARTIFACTORY_COUNT}" != "1" ]
then
  echo "Starting Artifactory"
  docker run --name "artifactorydata" -v /opt/artifactory/data -v /opt/artifactory/logs tianon/true
  docker run -d -p 8081:8081 --name "artifactory" --volumes-from artifactorydata codingtony/artifactory
fi
IMAGEID=$(docker ps -a | grep "brooklyn.build" | awk '{print $1}')
if [ "$IMAGEID" != "" ]
then
  echo "Stopping $IMAGEID"
  IMAGEID=$(sudo DOCKER_HOST=$DOCKER_HOST docker stop $IMAGEID | xargs docker rm)
fi
sudo DOCKER_HOST=$DOCKER_HOST \
  docker run \
  --link artifactory:artifactory \
  -i \
  -t medined/brooklyn.build \
  /bin/bash
This document shows how to extract a dataset from an HTML page.
We’ll start by loading two libraries. RCurl is used to read an HTML page. XML is used to parse HTML which can be viewed as a form of XML.
library(RCurl)
## Loading required package: bitops
library(XML)
Let R know where to find the HTML page. Then download and parse it.
theurl <- "http://bulbapedia.bulbagarden.net/wiki/List_of_Pok%C3%A9mon_Trading_Card_Game_expansions"
webpage <- getURL(theurl)
webpage <- readLines(tc <- textConnection(webpage)); close(tc)
doc <- htmlTreeParse(webpage, error=function(...){}, useInternalNodes = TRUE)
Use XPATH to extract all tr (table row) nodes from the HTML page. There is a lot of extraneous information in those tr nodes so we’ll filter the list from 70 elements to 67 elements.
tr <- getNodeSet(doc, "//*/tr")
tr_with_pokemon_sets <- tr[4:length(tr)-1]
Let’s look at one example of the HTML. It holds information about one Pokemon set. The pound signs at the start of the lines are not part of the data, they are just part of the printing.
tr_with_pokemon_sets[1]
[[1]]
<tr><th> 1
</th>
<td> 1
</td>
<td>
</td>
<td> <a href="/wiki/Base_Set_(TCG)" title="Base Set (TCG)">Base Set</a>
</td>
<td> Expansion Pack
</td>
<td> 102
</td>
<td> 102
</td>
<td> January 9, 1999
</td>
<td> October 20, 1996
</td></tr>
In order to make sense of that HTML, we’ll use a custom function to manipulate each element in tr_with_pokemon_sets. Generally speaking, the function removes newlines and HTML syntax. It also provides data types and column names.
xmlToCsv <- function(xml) {
  # Collapse the row node to text, then turn the blank-line cell
  # separators into tabs and clean up stray spaces around them.
  a <- gsub('\n\n','\t', xmlValue(xml))
  b <- gsub('\t\t','\t \t', a)
  d <- gsub('\t\t','\t', b)
  e <- gsub('^ |\t$','', d)
  f <- gsub('\t ','\t', e)
  # Intended column types and names for the nine cells in each row.
  cc <- c("numeric", "numeric", "character", "character", "character", "character", "character", "character", "character")
  cn <- c("EngNumber", "JpNumber", "Icon", "EngSet", "JpSet", "EngCardCount", "JpCardCount", "EngDate", "JpDate")
  # Parse the tab-separated text into a one-row data.frame and keep
  # only the English-language columns.
  g <- read.table(text=f, sep="\t", header=FALSE)
  colnames(g) <- cn
  keeps <- c("EngNumber", "EngSet", "EngCardCount")
  return(g[keeps])
}
Magic happens next. We apply the custom function, convert the results to a data.frame, and remove NA values.
pokemon_set_dataframe <- na.omit(do.call(rbind, lapply(tr_with_pokemon_sets, xmlToCsv)))
The information is displayed so you can see the data so far.
pokemon_set_dataframe
EngNumber EngSet EngCardCount
1 1 Base Set 102
2 2 Jungle 64
3 3 Fossil 62
4 4 Base Set 2 130
5 5 Team Rocket 83*
6 7 Gym Challenge 132
7 8 Neo Genesis 111
8 9 Neo Discovery 75
9 10 Neo Revelation 66*
10 11 Neo Destiny 113*
11 12 Legendary Collection 110
14 13 Expedition Base Set 165
15 14 Aquapolis 186*
16 14 Aquapolis 186*
17 15 Skyridge 182*
18 15 Skyridge 182*
19 16 EX Ruby & Sapphire 109
20 17 EX Sandstorm 100
21 18 EX Dragon 100*
22 19 EX Team Magma vs Team Aqua 97*
23 20 EX Hidden Legends 102*
24 21 EX FireRed & LeafGreen 116*
25 22 EX Team Rocket Returns 111*
26 23 EX Deoxys 108*
27 24 EX Emerald 107*
28 25 EX Unseen Forces 145*
29 26 EX Delta Species 114*
30 27 EX Legend Maker 93*
31 28 EX Holon Phantoms 111*
32 29 EX Crystal Guardians 100
33 30 EX Dragon Frontiers 101
34 31 EX Power Keepers 108
35 32 Diamond & Pearl 130
36 33 Mysterious Treasures 124*
37 34 Secret Wonders 132
38 35 Great Encounters 106
39 36 Majestic Dawn 100
40 37 Legends Awakened 146
41 38 Stormfront 106*
42 40 Rising Rivals 120*
43 41 Supreme Victors 153*
44 42 Arceus 111*
45 43 HeartGold & SoulSilver 124*
46 44 Unleashed 96*
47 45 Undaunted 91*
48 46 Triumphant 103*
49 47 Call of Legends 106
50 48 Black & White 115*
51 49 Emerging Powers 98
52 50 Noble Victories 102*
53 51 Next Destinies 103*
54 52 Dark Explorers 111*
55 53 Dragons Exalted 128*
56 54 Boundaries Crossed 153*
57 55 Plasma Storm 138*
58 56 Plasma Freeze 122*
59 57 Plasma Blast 105*
60 58 Legendary Treasures 138*
61 59 XY 146
62 60 Flashfire 109*
63 61 Furious Fists 113*
64 62 Phantom Forces 122*
65 63 Primal Clash 150+
Notice those extra asterisks and plus signs? The next bit of code removes them.
pokemon_set_dataframe$EngCardCount <- gsub("\\*|\\+", "", pokemon_set_dataframe$EngCardCount)
Here is the final dataset.
pokemon_set_dataframe
EngNumber EngSet EngCardCount
1 1 Base Set 102
2 2 Jungle 64
3 3 Fossil 62
4 4 Base Set 2 130
5 5 Team Rocket 83
6 7 Gym Challenge 132
7 8 Neo Genesis 111
8 9 Neo Discovery 75
9 10 Neo Revelation 66
10 11 Neo Destiny 113
11 12 Legendary Collection 110
14 13 Expedition Base Set 165
15 14 Aquapolis 186
16 14 Aquapolis 186
17 15 Skyridge 182
18 15 Skyridge 182
19 16 EX Ruby & Sapphire 109
20 17 EX Sandstorm 100
21 18 EX Dragon 100
22 19 EX Team Magma vs Team Aqua 97
23 20 EX Hidden Legends 102
24 21 EX FireRed & LeafGreen 116
25 22 EX Team Rocket Returns 111
26 23 EX Deoxys 108
27 24 EX Emerald 107
28 25 EX Unseen Forces 145
29 26 EX Delta Species 114
30 27 EX Legend Maker 93
31 28 EX Holon Phantoms 111
32 29 EX Crystal Guardians 100
33 30 EX Dragon Frontiers 101
34 31 EX Power Keepers 108
35 32 Diamond & Pearl 130
36 33 Mysterious Treasures 124
37 34 Secret Wonders 132
38 35 Great Encounters 106
39 36 Majestic Dawn 100
40 37 Legends Awakened 146
41 38 Stormfront 106
42 40 Rising Rivals 120
43 41 Supreme Victors 153
44 42 Arceus 111
45 43 HeartGold & SoulSilver 124
46 44 Unleashed 96
47 45 Undaunted 91
48 46 Triumphant 103
49 47 Call of Legends 106
50 48 Black & White 115
51 49 Emerging Powers 98
52 50 Noble Victories 102
53 51 Next Destinies 103
54 52 Dark Explorers 111
55 53 Dragons Exalted 128
56 54 Boundaries Crossed 153
57 55 Plasma Storm 138
58 56 Plasma Freeze 122
59 57 Plasma Blast 105
60 58 Legendary Treasures 138
61 59 XY 146
62 60 Flashfire 109
63 61 Furious Fists 113
64 62 Phantom Forces 122
65 63 Primal Clash 150
With a bit more complexity the first column of numbers can be removed.
x <- as.matrix(format(pokemon_set_dataframe))
rownames(x) <- rep("", nrow(x))
print(x, quote=FALSE)
EngNumber EngSet EngCardCount
1 Base Set 102
2 Jungle 64
3 Fossil 62
4 Base Set 2 130
5 Team Rocket 83
7 Gym Challenge 132
8 Neo Genesis 111
9 Neo Discovery 75
10 Neo Revelation 66
11 Neo Destiny 113
12 Legendary Collection 110
13 Expedition Base Set 165
14 Aquapolis 186
14 Aquapolis 186
15 Skyridge 182
15 Skyridge 182
16 EX Ruby & Sapphire 109
17 EX Sandstorm 100
18 EX Dragon 100
19 EX Team Magma vs Team Aqua 97
20 EX Hidden Legends 102
21 EX FireRed & LeafGreen 116
22 EX Team Rocket Returns 111
23 EX Deoxys 108
24 EX Emerald 107
25 EX Unseen Forces 145
26 EX Delta Species 114
27 EX Legend Maker 93
28 EX Holon Phantoms 111
29 EX Crystal Guardians 100
30 EX Dragon Frontiers 101
31 EX Power Keepers 108
32 Diamond & Pearl 130
33 Mysterious Treasures 124
34 Secret Wonders 132
35 Great Encounters 106
36 Majestic Dawn 100
37 Legends Awakened 146
38 Stormfront 106
40 Rising Rivals 120
41 Supreme Victors 153
42 Arceus 111
43 HeartGold & SoulSilver 124
44 Unleashed 96
45 Undaunted 91
46 Triumphant 103
47 Call of Legends 106
48 Black & White 115
49 Emerging Powers 98
50 Noble Victories 102
51 Next Destinies 103
52 Dark Explorers 111
53 Dragons Exalted 128
54 Boundaries Crossed 153
55 Plasma Storm 138
56 Plasma Freeze 122
57 Plasma Blast 105
58 Legendary Treasures 138
59 XY 146
60 Flashfire 109
61 Furious Fists 113
62 Phantom Forces 122
63 Primal Clash 150
And we can plot the number of cards per set against the set number.
plot(pokemon_set_dataframe[c(1,3)])
The EngCardCount column is actually a character data type, which is not correct. The transform method changes the datatype.
pokemon_set_dataframe <- transform(pokemon_set_dataframe, EngCardCount = as.numeric(EngCardCount))
Now it’s possible to sum the card counts.
noquote(format(sum(pokemon_set_dataframe$EngCardCount), big.mark=","))
[1] 7,372
http://www.cyberciti.biz/faq/bash-shell-change-the-color-of-my-shell-prompt-under-linux-or-unix/
https://github.com/medined/D4M_Schema provides a step-by-step introduction to the D4M nosql schema used by many organizations.
D4M is a breakthrough in computer programming that combines the advantages of five distinct processing technologies (sparse linear algebra, associative arrays, fuzzy algebra, distributed arrays, and triple-store/NoSQL databases such as Hadoop HBase and Apache Accumulo) to provide a database and computation system that addresses the problems associated with Big Data.
Recently I wanted to provide the same configuration file to two different Docker containers. I choose to solve this using a Docker volume. The configuration file will be sourced from within each container and looks like this:
$ cat bridge-env.sh
export BRIDGENAME=brbob
export IMAGENAME=bob
export IPADDR=10.0.10.1/24
Before any explanations, let's look at the files we'll be using:
./configuration/build_image.sh - wrapper for _docker build_.
./configuration/run_image.sh - wrapper for _docker run_.
./configuration/Dockerfile - control file for Docker image.
./configuration/files/bridge-env.sh - environment setting script.
All of the files are fairly small. Since our main topic today is Docker, let's look at the Docker configuration file first.
$ cat Dockerfile
FROM stackbrew/busybox:latest
MAINTAINER David Medinets <david.medinets@gmail.com>
RUN mkdir /configuration
VOLUME /configuration
ADD files /configuration
And you can build this image.
$ cat build_image.sh
sudo DOCKER_HOST=$DOCKER_HOST docker build --rm=true -t medined/shared-configuration .
I set up my docker to use a port instead of a UNIX socket. Therefore my DOCKER_HOST is "tcp://0.0.0.0:4243". Since sudo is being used, the environment variable needs to be set inside the sudo environment. If you want to use the default UNIX socket, leave DOCKER_HOST empty. The command will still work.
Then run it.
$ cat run_image.sh
sudo DOCKER_HOST=$DOCKER_HOST docker run --name shared-configuration -t medined/shared-configuration true
This command runs a docker container called shared-configuration. You'll notice that the _true_ command is run, which exits immediately. Since this container will only hold files, it's OK that there are no processes running in it. However, be very careful not to delete this container. Here is the output from docker ps showing the container.
$ docker ps -a
CONTAINER ID IMAGE COMMAND CREATED STATUS PORTS NAMES
d4a2aa46b5d9 medined/shared-configuration:latest true 7 seconds ago Exited (0) 7 seconds ago shared-configuration
Now it's time to spin up two plain Ubuntu containers that can access the shared file.
$ sudo DOCKER_HOST=$DOCKER_HOST docker run --name A --volumes-from=shared-configuration -d -t ubuntu /bin/bash
94638de8b615f356f1240bbe602c0b7862e0589f1711fbff242b6d6f74c7de7d
$ sudo DOCKER_HOST=$DOCKER_HOST docker run --name B --volumes-from=shared-configuration -d -t ubuntu /bin/bash
How can we see the shared file? Let's turn to a very useful tool called nsenter (or namespace enter). The following command installs nsenter if it isn't already installed.
hash nsenter 2>/dev/null \
|| { echo >&2 "Installing nsenter"; \
sudo DOCKER_HOST=$DOCKER_HOST \
docker run -v /usr/local/bin:/target jpetazzo/nsenter; }
I use a little script file to make nsenter easier to use:
$ cat enter_image.sh
#!/bin/bash
IMAGENAME=$1
usage() {
echo "Usage: $0 [image name]"
exit 1
}
if [ -z "$IMAGENAME" ]
then
echo "Error: missing image name parameter."
usage
fi
PID=$(sudo DOCKER_HOST=$DOCKER_HOST docker inspect --format {{.State.Pid}} $IMAGENAME)
sudo nsenter --target $PID --mount --uts --ipc --net --pid
This script is used by specifying the image name to use. For example,
$ ./enter_image.sh A
root@94638de8b615:/# cat /configuration/bridge-env.sh
export BRIDGENAME=brbob
export IMAGENAME=bob
export IPADDR=10.0.10.1/24
root@94638de8b615:/# exit
logout
$ ./enter_image.sh B
root@925365faded2:/# cat /configuration/bridge-env.sh
export BRIDGENAME=brbob
export IMAGENAME=bob
export IPADDR=10.0.10.1/24
root@925365faded2:/# exit
logout
We see the same information in both containers. Let's prove that the bridge-env.sh file is shared instead of being two copies.
$ ./enter_image.sh A
root@94638de8b615:/# echo "export NEW_VARIABLE=VALUE" >> /configuration/bridge-env.sh
root@94638de8b615:/# exit
logout
$ ./enter_image.sh B
root@925365faded2:/# cat /configuration/bridge-env.sh
export BRIDGENAME=brbob
export IMAGENAME=bob
export IPADDR=10.0.10.1/24
export NEW_VARIABLE=VALUE
We changed the file in the first container and saw the changes in the second container. As an alternative to using nsenter, you can simply run a container to list the files.
$ docker run --volumes-from shared-configuration busybox ls -al /configuration
Based on the work by sroegner, I have a github project at https://github.com/medined/docker-accumulo which lets you run multiple single-node Accumulo instances using Docker.
First, create the image.
git clone https://github.com/medined/docker-accumulo.git
cd docker-accumulo/single_node
./make_image.sh
Now start your first container.
export HOSTNAME=bellatrix
export IMAGENAME=bellatrix
export BRIDGENAME=brbellatrix
export SUBNET=10.0.10
export NODEID=1
export HADOOPHOST=10.0.10.1
./make_container.sh $HOSTNAME $IMAGENAME $BRIDGENAME $SUBNET $NODEID $HADOOPHOST yes
export HOSTNAME=rigel
export IMAGENAME=rigel
export BRIDGENAME=brrigel
export SUBNET=10.0.11
export NODEID=1
export HADOOPHOST=10.0.11.1
./make_container.sh $HOSTNAME $IMAGENAME $BRIDGENAME $SUBNET $NODEID $HADOOPHOST no
export HOSTNAME=saiph
export IMAGENAME=saiph
export BRIDGENAME=brbellatrix
export SUBNET=10.0.12
export NODEID=1
export HADOOPHOST=10.0.12.1
./make_container.sh $HOSTNAME $IMAGENAME $BRIDGENAME $SUBNET $NODEID $HADOOPHOST no
The SUBNET is different for all containers. This isolates the Accumulo containers from each other.
Look at the running containers:
$ docker ps
CONTAINER ID        IMAGE                     COMMAND                CREATED          STATUS          PORTS                                                                                                                                                                                                                                                                                                                                       NAMES
41da6f17261f        medined/accumulo:latest   /docker/run.sh saiph   4 seconds ago    Up 2 seconds    0.0.0.0:49179->19888/tcp, 0.0.0.0:49180->2181/tcp, 0.0.0.0:49181->50070/tcp, 0.0.0.0:49182->50090/tcp, 0.0.0.0:49183->8141/tcp, 0.0.0.0:49184->10020/tcp, 0.0.0.0:49185->22/tcp, 0.0.0.0:49186->50095/tcp, 0.0.0.0:49187->8020/tcp, 0.0.0.0:49188->8025/tcp, 0.0.0.0:49189->8030/tcp, 0.0.0.0:49190->8050/tcp, 0.0.0.0:49191->8088/tcp   saiph
23692dfe3f1e        medined/accumulo:latest   /docker/run.sh rigel   10 seconds ago   Up 9 seconds    0.0.0.0:49166->19888/tcp, 0.0.0.0:49167->2181/tcp, 0.0.0.0:49168->50070/tcp, 0.0.0.0:49169->8025/tcp, 0.0.0.0:49170->8088/tcp, 0.0.0.0:49171->10020/tcp, 0.0.0.0:49172->22/tcp, 0.0.0.0:49173->50090/tcp, 0.0.0.0:49174->50095/tcp, 0.0.0.0:49175->8020/tcp, 0.0.0.0:49176->8030/tcp, 0.0.0.0:49177->8050/tcp, 0.0.0.0:49178->8141/tcp   rigel
63f8f1a7141f        medined/accumulo:latest   /docker/run.sh bella   21 seconds ago   Up 20 seconds   0.0.0.0:49153->19888/tcp, 0.0.0.0:49154->50070/tcp, 0.0.0.0:49155->8020/tcp, 0.0.0.0:49156->8025/tcp, 0.0.0.0:49157->8030/tcp, 0.0.0.0:49158->8050/tcp, 0.0.0.0:49159->8088/tcp, 0.0.0.0:49160->8141/tcp, 0.0.0.0:49161->10020/tcp, 0.0.0.0:49162->2181/tcp, 0.0.0.0:49163->22/tcp, 0.0.0.0:49164->50090/tcp, 0.0.0.0:49165->50095/tcp   bellatrix
You can connect to running instances using the public ports. Especially useful is the public zookeeper port. Rather than searching through the ports listed above, here is an easier way.
$ docker port saiph 2181
0.0.0.0:49180
$ docker port rigel 2181
0.0.0.0:49167
$ docker port bellatrix 2181
0.0.0.0:49162
Having '0.0.0.0' in the response means that any IP address can connect.
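That output is also easy to consume from a script. A small sketch; the zkCli.sh line only illustrates handing the port to whatever Zookeeper client you use:
ZK_HOSTPORT=$(docker port saiph 2181)
echo "Zookeeper for saiph is reachable at ${ZK_HOSTPORT}"
# e.g. zkCli.sh -server "127.0.0.1:${ZK_HOSTPORT##*:}"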
You can enter the namespace of a container (i.e., access a bash shell) this way.
$ ./enter_image.sh rigel
-bash-4.1# hdfs dfs -ls /
Found 2 items
drwxr-xr-x   - accumulo accumulo            0 2014-07-12 09:13 /accumulo
drwxr-xr-x   - hdfs     supergroup          0 2014-07-11 21:06 /user
-bash-4.1# accumulo shell -u root -p secret
Shell - Apache Accumulo Interactive Shell
-
- version: 1.5.1
- instance name: accumulo
- instance id: bb713243-3546-487f-b6d6-cfaa272efb30
-
- type 'help' for a list of available commands
-
root@accumulo> tables
!METADATA
Now let's start an edge node. For my purposes, an edge node can connect to Hadoop, Zookeeper and Accumulo without running any of those processes. All of the edge node's resources are dedicated to client work.
export HOSTNAME=rigeledge
export IMAGENAME=rigeledge
export BRIDGENAME=brrigel
export SUBNET=10.0.11
export NODEID=2
export HADOOPHOST=10.0.11.1
./make_container.sh $HOSTNAME $IMAGENAME $BRIDGENAME $SUBNET $NODEID $HADOOPHOST no
As this container is started, the 'no' means that the supervisor configuration files will be deleted. So while supervisor will be running, it won't be managing any processes. This is not a best practice. It's just the way I chose for this prototype.
After I spin up Accumulo in a Docker container, well-known ports (like 2181 for Zookeeper) are not well-known any more. The internal private port (i.e., 2181) is exposed as a different public port (i.e., 49143). A Java program trying to connect to Accumulo must find the public port numbers automatically.
The Java code below finds the public port for Zookeeper for a Docker container named "walt". I don't know why the slash is needed in the container name.
int wantedPublicPort = -1;
String wantedContainerName = "/walt";
int wantedPrivatePort = 2181;

String dockerURL = "http://127.0.0.1:4243";
String dockerUser = "medined";
String dockerPassword = "XXXXX";
String dockerEmail = "david.medinets@gmail.com";

DockerClient docker = new DockerClient(dockerURL);
docker.setCredentials(dockerUser, dockerPassword, dockerEmail);

List<Container> containers = docker.listContainersCmd().exec();
for (Container container : containers) {
    for (String name : container.getNames()) {
        if (name.equals(wantedContainerName)) {
            for (Container.Port port : container.getPorts()) {
                if (port.getPrivatePort() == wantedPrivatePort) {
                    wantedPublicPort = port.getPublicPort();
                }
            }
        }
    }
}
System.out.println("Zookeeper Port: " + wantedPublicPort);
The Maven dependency for the docker-java client:
<dependency>
  <groupId>com.github.docker-java</groupId>
  <artifactId>docker-java</artifactId>
  <version>0.9.0</version>
</dependency>
As a simple lay programmer, I sometimes have trouble figuring out where log files are stored on unix systems. Sometimes logs are within application directories. Other times they are in /var/log. With Docker containers, this uncertainty is eliminated. How? By the 'docker diff' command. I will show why. When connecting to a Docker-based system, you can see the running containers:
$ docker ps
CONTAINER ID        IMAGE                     COMMAND        CREATED       STATUS       PORTS                                                                                                                                                                                                                                                                                                                                       NAMES
90a9f7122c02        medined/accumulo:latest   /run.sh walt   9 hours ago   Up 9 hours   0.0.0.0:49153->50070/tcp, 0.0.0.0:49154->50090/tcp, 0.0.0.0:49155->50095/tcp, 0.0.0.0:49156->8025/tcp, 0.0.0.0:49157->8030/tcp, 0.0.0.0:49158->8088/tcp, 0.0.0.0:49159->10020/tcp, 0.0.0.0:49160->19888/tcp, 0.0.0.0:49161->2181/tcp, 0.0.0.0:49162->22/tcp, 0.0.0.0:49163->8020/tcp, 0.0.0.0:49164->8050/tcp, 0.0.0.0:49165->8141/tcp   walt
$ docker diff walt
...
D /data1/hdfs/dn/current/BP-1274135865-172.17.0.10-1404767453280/current/finalized/blk_1073741825_1001.meta
...
A /var/log/supervisor/accumulo-gc-stderr---supervisor-5H7Rr7.log
A /var/log/supervisor/accumulo-gc-stdout---supervisor-LK8wDU.log
...
A /var/log/supervisor/namenode-stdout---supervisor-mciN4u.log
A /var/log/supervisor/secondarynamenode-stderr---supervisor-EaluLZ.log
A /var/log/supervisor/secondarynamenode-stdout---supervisor-Ap4Fri.log
C /var/log/supervisor/supervisord.log
A /var/log/supervisor/zookeeper-stderr---supervisor-CCwUGw.log
A /var/log/supervisor/zookeeper-stdout---supervisor-lDiuIF.log
C /var/run
C /var/run/sshd.pid
C /var/run/supervisord.pid
Here is another quick note, this time about Docker: detaching from and reattaching to a running container.
# Run the standard Ubuntu image
docker run --name=bash -i -t ubuntu /bin/bash

# Do something ...

# Detach by typing Ctrl-p and Ctrl-q.

# Look at the container while on the host system.
docker ps

# Reattach to the Ubuntu container
docker attach bash
This note shows the difference between an Accumulo query without and with a WholeRowIterator. The code snippet below picks up the narrative after you've initialized a Connector object. First, we can see what a plain scan looks like:
// Read from the tEdge table of the D4M schema.
String tableName = "tEdge";
// Read from 5 tablets at a time.
int numQueryThreads = 5;

Text startRow = new Text("6000");
Text endRow = new Text("6001");
List<Range> range = Collections.singletonList(new Range(startRow, endRow));

BatchScanner scanner = connector.createBatchScanner(tableName, new Authorizations(), numQueryThreads);
scanner.setRanges(range);
for (Entry<Key, Value> entry : scanner) {
    System.out.println(entry.getKey());
}
scanner.close();
600006a870bb4c8471a27c9bd0f3f064265d062d :a00100|0.0001 [] 1401023353637 false
600006a870bb4c8471a27c9bd0f3f064265d062d :a00200|0.0001 [] 1401023353637 false
...
600006a870bb4c8471a27c9bd0f3f064265d062d :state|UT [] 1401023353637 false
600006a870bb4c8471a27c9bd0f3f064265d062d :zipcode|84521 [] 1401023353637 false
6000338cbf2daede3efd4355165c98771b3e2b66 :a00100|29673.0000 [] 1401023273694 false
6000338cbf2daede3efd4355165c98771b3e2b66 :a00200|20421.0000 [] 1401023273694 false
...
6000338cbf2daede3efd4355165c98771b3e2b66 :state|OR [] 1401023273694 false
6000338cbf2daede3efd4355165c98771b3e2b66 :zipcode|97365 [] 1401023273694 false
Now the same scan, this time with a WholeRowIterator added:

BatchScanner scanner = connector.createBatchScanner(tableName, new Authorizations(), numQueryThreads);
scanner.setRanges(range);

// Wrap each row into a single entry.
IteratorSetting iteratorSetting = new IteratorSetting(1, WholeRowIterator.class);
scanner.addScanIterator(iteratorSetting);

for (Entry<Key, Value> entry : scanner) {
    System.out.println(entry.getKey());
}
scanner.close();
600006a870bb4c8471a27c9bd0f3f064265d062d : [] 9223372036854775807 false
6000338cbf2daede3efd4355165c98771b3e2b66 : [] 9223372036854775807 false
Each row now comes back as a single entry whose Value encodes all of the row's columns. Use WholeRowIterator.decodeRow to unpack them:

for (Entry<Key, Value> entry : scanner) {
    try {
        SortedMap<Key, Value> wholeRow = WholeRowIterator.decodeRow(entry.getKey(), entry.getValue());
        System.out.println(wholeRow);
    } catch (IOException e) {
        throw new RuntimeException(e);
    }
}
{600006a870bb4c8471a27c9bd0f3f064265d062d :a00100|0.0001 [] 1401023353637 false=1,
 600006a870bb4c8471a27c9bd0f3f064265d062d :a00200|0.0001 [] 1401023353637 false=1,
 ...
 600006a870bb4c8471a27c9bd0f3f064265d062d :state|UT [] 1401023353637 false=1,
 600006a870bb4c8471a27c9bd0f3f064265d062d :zipcode|84521 [] 1401023353637 false=1}
{6000338cbf2daede3efd4355165c98771b3e2b66 :a00100|29673.0000 [] 1401023273694 false=1,
 6000338cbf2daede3efd4355165c98771b3e2b66 :a00200|20421.0000 [] 1401023273694 false=1,
 ...
 6000338cbf2daede3efd4355165c98771b3e2b66 :state|OR [] 1401023273694 false=1,
 6000338cbf2daede3efd4355165c98771b3e2b66 :zipcode|97365 [] 1401023273694 false=1}
Here is a longer note about how Accumulo organizes data into tablets. At its simplest, Accumulo stores key-value pairs:

-----------  ---------
|   key   |  | value |
-----------  ---------
| [nothing here yet] |
-----------  ---------
What is a Key? See below. First, notice that every entry belongs to a tablet; a new table has a single tablet named "default".
-----------  -----------  ---------
| tablet  |  |   key   |  | value |
-----------  -----------  ---------
| default |  |         |  |       |
-----------  -----------  ---------
A tablet is responsible for a range of keys, from a start key to an end key. The first tablet covers every possible key:

-infinity ==> ALL DATA <== +infinity

This concept of start and end keys can be shown in our tablet depiction as well.
-----------  -----------  ---------
| tablet  |  |   key   |  | value |
-----------  -----------  ---------
|      start key: -infinity      |
----------------------------------
| default |  |         |  |      |
----------------------------------
|      end key: +infinity        |
-----------  -----------  ---------

After inserting three records into a new table, you'll have the following situation. Notice that Accumulo always stores keys in lexically sorted order. So far, the start and end keys have not been changed.
-----------  -------  ---------
| tablet  |  | key |  | value |
-----------  -------  ---------
| default |  | 01  |  |   X   |
| default |  | 03  |  |   X   |
| default |  | 05  |  |   X   |
-----------  -------  ---------

Accumulo stores all entries for a tablet on a single node in the cluster. Since our table has only one tablet, the information can't spread beyond one node. In order to distribute information, you'll need to create more than one tablet for your table.
The tablet's range is still from -infinity to +infinity. That hasn't changed yet.
Split point - the place where one tablet becomes two.

Let's add two split points to see what happens. As the split points are added, new tablets are created.
-----------  -------  ---------
| tablet  |  | key |  | value |
-----------  -------  ---------
|    A    |  | 01  |  |   X   |   range: -infinity to 02 (inclusive)
|       split point 02        |
|    B    |  | 03  |  |   X   |   range: 02 (exclusive) to +infinity
|    B    |  | 05  |  |   X   |
-----------  -------  ---------

The split point does not need to exist as an entry. This feature means that you can pre-split a table by simply giving Accumulo a list of split points.
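Pre-splitting can also be done programmatically. Here is a minimal sketch using the Java TableOperations API, assuming an initialized Connector (as in the scan examples above) and a hypothetical table named "people":

import java.util.SortedSet;
import java.util.TreeSet;
import org.apache.hadoop.io.Text;

// Hand Accumulo the split points up front. The keys "02" and "04"
// do not need to exist as entries in the table.
SortedSet<Text> splits = new TreeSet<Text>();
splits.add(new Text("02"));
splits.add(new Text("04"));
connector.tableOperations().addSplits("people", splits);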
Let's look at the same progression from the Tablet Server's point of view. Initially, a single server holds the single tablet.

--------------------------------
|        Tablet Server         |
--------------------------------
|                              |
|  -- Tablet ----------------  |
|  | -infinity to +infinity |  |
|  --------------------------  |
|                              |
--------------------------------

Then the first split point is added. Now there are two tablets. However, they are still on a single server. And this also makes sense. Think about adding a split point to a table with millions of entries. While the two tablets reside on one server, adding a split is just an accounting change.
-----------------------------------------------------------------------
|                            Tablet Server                            |
-----------------------------------------------------------------------
|                                                                     |
|  -- Tablet ---------------------   -- Tablet ---------------------  |
|  | -infinity to 02 (inclusive) |   | 02 (exclusive) to +infinity |  |
|  -------------------------------   -------------------------------  |
|                                                                     |
-----------------------------------------------------------------------

At some future point, Accumulo might move the second tablet to another Tablet Server.
-------------------------------------   -------------------------------------
|           Tablet Server           |   |           Tablet Server           |
-------------------------------------   -------------------------------------
|                                   |   |                                   |
|  -- Tablet ---------------------  |   |  -- Tablet ---------------------  |
|  | -infinity to 02 (inclusive) |  |   |  | 02 (exclusive) to +infinity |  |
|  -------------------------------  |   |  -------------------------------  |
|                                   |   |                                   |
-------------------------------------   -------------------------------------
-----------  -------  ---------
| tablet  |  | key |  | value |
-----------  -------  ---------
|    A    |  | 01  |  |   X   |   range: -infinity to 02 (inclusive)
|       split point 02        |
|    B    |  | 03  |  |   X   |   range: 02 (exclusive) to 04 (inclusive)
|       split point 04        |
|    C    |  | 05  |  |   X   |   range: 04 (exclusive) to +infinity
-----------  -------  ---------

The table now has three tablets. When enough tablets are created, some process inside Accumulo moves one or more tablets onto different nodes. Once that happens, the data is distributed. Hopefully, you can now figure out which tablet any specific key inserts into. For example, key "00" goes into tablet "A".
-----------  -------  ---------
| tablet  |  | key |  | value |
-----------  -------  ---------
|    A    |  | 00  |  |   X   |   range: -infinity to 02 (inclusive)
|    A    |  | 01  |  |   X   |
|       split point 02        |
|    B    |  | 03  |  |   X   |   range: 02 (exclusive) to 04 (inclusive)
|       split point 04        |
|    C    |  | 05  |  |   X   |   range: 04 (exclusive) to +infinity
-----------  -------  ---------

Internally, the first tablet ("A") has a starting key of -infinity. Any entry with a key between -infinity and "02" inserts into the first tablet. The last tablet has an ending key of +infinity, so any key between "05" and +infinity inserts into the last tablet. Accumulo automatically creates split points under some conditions; for example, when a tablet grows too large. However, that's a whole 'nother conversation.
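How large is "too large"? That's controlled by the per-table property table.split.threshold (1G by default). As a small sketch, again assuming an initialized Connector and the hypothetical "people" table, you could lower the threshold to encourage automatic splits:

// Once a tablet's data files exceed this size, Accumulo picks a
// median key and splits the tablet in two.
connector.tableOperations().setProperty("people", "table.split.threshold", "256M");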
-------------------------------------------------------------------
| row | column family | column qualifier | visibility | timestamp |
-------------------------------------------------------------------

These five components, combined, make up the _Key_.
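When writing data you don't build a Key directly; a Mutation collects the components for you. Here is a minimal sketch using the book example below; the "public" visibility label and the timestamp are made-up values:

import org.apache.accumulo.core.data.Mutation;
import org.apache.accumulo.core.data.Value;
import org.apache.accumulo.core.security.ColumnVisibility;
import org.apache.hadoop.io.Text;

Mutation mutation = new Mutation(new Text("book"));  // row
mutation.put(new Text("140122317"),                  // column family
             new Text("Batman: Hush"),               // column qualifier
             new ColumnVisibility("public"),         // visibility
             1401023353637L,                         // timestamp
             new Value("1".getBytes()));             // value
// The mutation would then be handed to a BatchWriter.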
-------------------------------------------------
| row  | column family | column qualifier       |
-------------------------------------------------
| book | 140122317     | Batman: Hush           |
| book | 1401216676    | Batman: A Killing Joke |
-------------------------------------------------

You can see how the _book_ row could grow to millions of entries, potentially causing memory issues inside your TServer. Many people add a _shard_ value to the row to introduce potential split points. With shard values, the above table might look like this:
---------------------------------------------------
| row    | column family | column qualifier       |
---------------------------------------------------
| book_0 | 140122317     | Batman: Hush           |
| book_5 | 1401216676    | Batman: A Killing Joke |
---------------------------------------------------

With this style of row values, Accumulo could use book_5 as a split point so that the rows no longer grow unmanageably large. Of course, this technique adds a bit of complexity to the query process. I'll leave the query issue to a future note. Let's explore how shard values can be generated.
---------------------------------------
|      RELATIONAL REPRESENTATION      |
---------------------------------------
|  SK  | First Name | Last Name | Age |
---------------------------------------
| 1001 | John       | Kloplick  | 36  |
---------------------------------------

Key-value databases spread information across several rows, using the synthetic key (SK) to tie them together. In simplified form, the information is stored in three key-value combinations (or three entries).
----------------------------------
|   KEY VALUE REPRESENTATION     |
----------------------------------
| ROW  | CF         | CQ         |
----------------------------------
| 1001 | first_name | John       |
| 1001 | last_name  | Kloplick   |
| 1001 | age        | 36         |
----------------------------------

If the coin flip sharding strategy were used (assigning each entry a random shard), the information might look like the following. The potential split point shows that the entries can be spread across two tablets.
-------------------------------------
| ROW     | CF         | CQ         |
-------------------------------------
| 1001_01 | first_name | John       |
| 1001_01 | age        | 36         |
| 1001_02 | last_name  | Kloplick   |  <-- potential split point
-------------------------------------

To retrieve the information, you'd need to scan both servers! This coin flip sharding technique is not going to scale. Imagine information about a person spread over 40 servers. Collating that information would be prohibitively time-consuming.
A better strategy derives the shard value from the data itself. Suppose we base the row on the first name and compute the shard by hashing it. If hashing "John" yields 2,314,539 and we use five shards:

2,314,539 modulo 5 = 4
-------------------------------------
| ROW     | CF         | CQ         |
-------------------------------------
| John_04 | first_name | John       |
| John_04 | age        | 36         |
| John_04 | last_name  | Kloplick   |
-------------------------------------
Note that the shard value is _not_ related to any specific node. It's just a potential split point for Accumulo. It's time to look at a specific use case to see if this sharding strategy is sound. What if we need to add a set of friends for John? It's unlikely that the information about John's friends includes his first name, but very likely that his synthetic key of 1001 is there. We can now see that choosing the first_name field as the base of the sharding strategy was unwise.
Hashing the synthetic key instead avoids that problem. If hashing 1001 yields 1,507,424 and we use 997 shards:

1,507,424 modulo 997 = 957
--------------------------------------
| ROW      | CF         | CQ         |
--------------------------------------
| 1001_957 | first_name | John       |
| 1001_957 | age        | 36         |
| 1001_957 | last_name  | Kloplick   |
--------------------------------------

Using this technique makes it simple to add a height field.
---------------------------------------------
| ROW      | CF               | CQ          |
---------------------------------------------
| 1001_957 | height_in_inches | 68          |
---------------------------------------------
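To wrap up, here is a minimal sketch of a shard helper based on the synthetic key. The 997 shard count, the zero-padded formatting, and the use of String.hashCode are my assumptions; any deterministic hash works:

// Computes a sharded row value such as "1001_957". The bitmask keeps
// the hash non-negative (Math.abs fails for Integer.MIN_VALUE).
public static String shardedRow(String syntheticKey, int shardCount) {
    int shard = (syntheticKey.hashCode() & Integer.MAX_VALUE) % shardCount;
    return syntheticKey + "_" + String.format("%03d", shard);
}

Because the hash is deterministic, every entry for synthetic key 1001 gets the same suffix and therefore lands in the same tablet, while different keys scatter across the 997 potential split points.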