[How To] Recover a Fuse ESB/MQ fabric cluster from a crash

Earlier I blogged about how to create a fabric cluster on Fuse ESB/MQ. Today I will explain how to perform disaster recovery of that fabric cluster, should it go down.
A Fuse ESB/FuseMQ/FMC 7.1.0 fabric cluster can crash when a new patch is added to an existing ensemble. There is a bug report at http://fusesource.com/issues/browse/FABRIC-365 which discusses the issue. The bug has been fixed in JBoss Fuse 6.0. However, in Fuse ESB 7.1.0 this fix is only available as a patch upgrade (ironic!).
Patching aside, a fabric cluster may break for other reasons, and you can use this method to recover the setup in those cases too. It should work on any OS that uses the same folder structure. Tested on RHEL and Ubuntu.
This tutorial assumes that:
  1.    You have basic knowledge of fabric server, cluster and ensemble and its functions
  2.    You can navigate through Linux servers and know the basic commands
  3.    You are authorised to make changes on the servers (please test locally first)
  4.    All fuse fabric processes are shut down

How do you know the Zookeeper server cluster is not running/broken?

A fabric cluster should show a list of all containers in the cluster when the container-list command is run. The following errors are hints that the cluster is not working as expected. On a broken cluster, container-list will instead return:

org.fusesource.fabric.api.FabricException: java.lang.IllegalStateException: Error waiting for ZooKeeper connection

In the FuseESB/FuseMQ root container logs you will see the following error:
05:47:11,059 | WARN | 0.0/ | NIOServerCnxn | 51 - org.fusesource.fabric.fabric-linkedin-zookeeper - 7.1.0.fuse-047 | Exception causing close of session 0x0 due to java.io.IOException: ZooKeeperServer not running


If your Zookeeper server is not running, you will be limited to running your child containers, and only if you installed them as a service. If you use FMC or the Fuse console to run them, they will not work. You will not be able to recover any data, and recreating the cluster will result in complete loss of information. Hence, it is important to recover a fabric cluster rather than create a new one.
Disclaimer: Please tread carefully beyond this point. I shall not be held responsible if you lose data or burn down your server. These steps are completely mine and have worked for me. They may or may not work for your server. You have been warned!

Disaster Recovery

Before starting the recovery process, it is essential to confirm that at least one of the three fabric servers is still running. In this crashed cluster, one Zookeeper server was running. You can quickly find the ports used by the Zookeeper servers by looking at the instances.properties file in the instances folder of either FuseESB or FuseMQ.

Once you have the ports, run netstat to see which of them are active. By default, a Fabric cluster server will use the ports 2181, 2182 and 2183.

fuse@VBoxSvr1:$netstat -nltp | grep :2182

This will give you the port number, the process id and the process name
tcp    0   0 :::2182       :::*       LISTEN      28767/java 
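
Rather than grepping for one port at a time, all three default ports can be checked in one pass. The following is a minimal sketch, not part of the original procedure: zk_listeners is a hypothetical helper name, and it assumes the column layout of the netstat -nltp output shown above (local address in column 4, state in column 6, PID/program in column 7).

```shell
#!/bin/sh
# Filter `netstat -nltp`-style output (read from stdin) down to listeners on
# the default Fabric ZooKeeper client ports: 2181, 2182 and 2183.
# Prints "<port> <pid/program>" for each matching line.
zk_listeners() {
    awk '$6 == "LISTEN" {
        n = split($4, parts, ":")   # local address; the port is the last piece
        port = parts[n]
        if (port == 2181 || port == 2182 || port == 2183)
            print port, $7
    }'
}
```

It could be used as: netstat -nltp 2>/dev/null | zk_listeners. If your ensemble uses non-default ports, adjust the list inside the awk script.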

You can also find out which of the two root containers is the server by searching for the PID (28767/java)
fuse@VBoxSvr1:~$ ps -eaf | grep 28767
fuse       892   864  0 18:28 pts/0    00:00:00 grep --color=auto 28767
fuse     28767 28765  0 05:28 ?        00:04:07 java -Dkaraf.home=/opt/FuseESBEnterprise-7.1.0 -Dkaraf.base=/opt/FuseESBEnterprise-7.1.0 -Dkaraf.data=/opt/FuseESBEnterprise-7.1.0/data -Dcom.sun.management.jmxremote -Dkaraf.startLocalConsole=false -Dkaraf.startRemoteShell=true -Djava.endorsed.dirs=/usr/java/jdk1.7.0_17/jre/lib/endorsed:/usr/java/jdk1.7.0_17/lib/endorsed:/opt/FuseESBEnterprise-7.1.0/lib/endorsed -Djava.ext.dirs=/usr/java/jdk1.7.0_17/jre/lib/ext:/usr/java/jdk1.7.0_17/lib/ext:/opt/FuseESBEnterprise-7.1.0/lib/ext -Xmx512m -Djava.library.path=/opt/FuseESBEnterprise-7.1.0/lib/ -classpath /opt/FuseESBEnterprise-7.1.0/lib/karaf-wrapper.jar:/opt/FuseESBEnterprise-7.1.0/lib/karaf.jar:/opt/FuseESBEnterprise-7.1.0/lib/karaf-jaas-boot.jar:/opt/FuseESBEnterprise-7.1.0/lib/karaf-wrapper-main.jar -Dwrapper.key=aUKIyldpL0xq60Xs -Dwrapper.port=32001 -Dwrapper.jvm.port.min=31000 -Dwrapper.jvm.port.max=31999 -Dwrapper.pid=28765 -Dwrapper.version=3.2.3 -Dwrapper.native_library=wrapper -Dwrapper.service=TRUE -Dwrapper.cpu.timeout=10 -Dwrapper.jvmid=1 org.apache.karaf.shell.wrapper.Main
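
The part of that long command line that identifies the root container is the -Dkaraf.home argument. As a rough sketch (karaf_home_of is a hypothetical helper name, and it assumes the install path contains no spaces), the value can be pulled out like this:

```shell
#!/bin/sh
# Print the value of the first -Dkaraf.home=... argument found in a java
# command line read from stdin. Assumes arguments are separated by single
# spaces and that the path itself contains no spaces.
karaf_home_of() {
    tr ' ' '\n' | sed -n 's/^-Dkaraf\.home=//p' | head -n 1
}
```

For the process above, ps -o args= -p 28767 | karaf_home_of would print the install directory of the container that owns the Zookeeper port.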

For example, if you have port 2182 running, which was the case in this scenario, you will need to go to VBoxSvr1 and look into FuseESB folders. The files to look for are

  1.     org_apache_felix_cm_impl_DynamicBindings.config
  2.     zookeeper.config
  3.     factory.config
  4.     {uniqueservername}.config

If you see that more than one server is active, your job becomes easier.
The second and fourth configs are very important. No (2) defines the hostnames/IPs and port numbers of the physical servers and No (4) defines the server configuration. The server names are unique for each ensemble, so it's very important to make sure you are using the correct one. The paths for each of these files are given below.


View the server configuration file for the running server
cat /opt/FuseESBEnterprise-7.1.0/data/cache/bundle5/data/config/org/fusesource/fabric/zookeeper/server/af327e34-501b-1576-9abf-7ad6eb7eb582.config

Note the server.id and the service.pid. They are unique to each fabric server.

The zookeeper config can be found in the following directory
fuse@VBoxSvr0:$cat /opt/FuseESBEnterprise-7.1.0/data/cache/bundle5/data/config/org/fusesource/fabric/zookeeper.config 

The contained information should look like this
zookeeper.url="VBoxSvr0.localdomain:2182,VBoxSvr1.localdomain:2181,VBoxSvr1.localdomain:2182"
fabric.zookeeper.pid="org.fusesource.fabric.zookeeper"

The factory configuration can be found further down inside the zookeeper folder
fuse@VBoxSvr1:$cat /opt/FuseESBEnterprise-7.1.0/data/cache/bundle5/data/config/org/fusesource/fabric/zookeeper/server/factory.config 

The information in this file links the config file to the unique server id.

What to look for?

In your initial research you should try to find the following
  1.     How many Zookeeper config files are missing?
  2.     How many server config files are missing?
  3.     How many factory.config files are missing or contain incorrect information?
  4.     How many DynamicBindings.config files are missing the server information?
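
The first pass over these questions can be scripted. The sketch below only reports which files exist; it does not judge their contents. check_fabric_configs is a hypothetical helper name, and the bundle5 path is the one used on the servers in this post, so treat it as an assumption for your own install.

```shell
#!/bin/sh
# Report which Fabric ZooKeeper config files exist under one root container.
# $1 is the install directory, e.g. /opt/FuseESBEnterprise-7.1.0.
# The data/cache/bundle5 location is taken from this post; it may differ elsewhere.
check_fabric_configs() {
    cfg="$1/data/cache/bundle5/data/config"
    for f in \
        "$cfg/org_apache_felix_cm_impl_DynamicBindings.config" \
        "$cfg/org/fusesource/fabric/zookeeper.config" \
        "$cfg/org/fusesource/fabric/zookeeper/server/factory.config"
    do
        if [ -f "$f" ]; then echo "ok      $f"; else echo "MISSING $f"; fi
    done
    # Per-server configs are named <unique-id>.config next to factory.config.
    found=0
    for f in "$cfg"/org/fusesource/fabric/zookeeper/server/*.config; do
        [ -f "$f" ] || continue
        [ "$(basename "$f")" = "factory.config" ] && continue
        echo "ok      $f"
        found=1
    done
    [ "$found" -eq 1 ] || echo "MISSING $cfg/org/fusesource/fabric/zookeeper/server/<unique-id>.config"
}
```

Run it once per root container on each server and note every MISSING line before you start editing.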


Once that is sorted, you can start looking for the information to add to the above mentioned files. First up is the unique server name string. To find this unique string for the fabric servers which are not running, search the log files of the Fuse ESB and MQ root containers. The example below searches for the FuseMQ fabric server string.
fuse@VBoxSvr1:$ cat /opt/FuseMQEnterprise-7.1.0/data/log/karaf.log | grep org.fusesource.fabric.zookeeper.server.

The string to look for is the long alphanumeric string appended to org.fusesource.fabric.zookeeper.server. in the log entries. In the log excerpt below, the string is 56084def-f94c-4af6-bc5a-0a818848883d

2013-06-30 19:54:48,068 | INFO | okeeper.server]) | ZKServerFactoryBean | ternal.BaseManagedServiceFactory 69 | 50 - org.fusesource.fabric.fabric-zookeeper - 7.1.0.fuse-047 | Configuration org.fusesource.fabric.zookeeper.server.56084def-f94c-4af6-bc5a-0a818848883d  updated: {server.3=VBoxSvr1:2889:3889,
server.2=VBoxSvr1:2888:3888, server.1=VBoxSvr0:2888:3888, server.id=2, clientport=2181, service.factorypid=org.fusesource.fabric.zookeeper.server,
ticktime=2000, fabric.zookeeper.pid=org.fusesource.fabric.zookeeper.server-0001, synclimit=5, initlimit=10,
service.pid=org.fusesource.fabric.zookeeper.server.56084def-f94c-4af6-bc5a-0a818848883d, datadir=data/zookeeper/0001}
2013-06-30 19:54:48,111 | INFO | pool-10-thread-1 | ZKServerFactoryBean | per.internal.ZKServerFactoryBean 44 | 50 - org.fusesource.fabric.fabric-zookeeper - 7.1.0.fuse-047 | Creating zookeeper server with properties: {server.3=VBoxSvr1:2889:3889, server.2=VBoxSvr1:2888:3888, server.1=VBoxSvr0:2888:3888, server.id=2, clientport=2181, service.factorypid=org.fusesource.fabric.zookeeper.server, ticktime=2000,
fabric.zookeeper.pid=org.fusesource.fabric.zookeeper.server-0001, synclimit=5, initlimit=10, service.pid=org.fusesource.fabric.zookeeper.server.56084def-f94c-4af6-bc5a-0a818848883d, datadir=data/zookeeper/0001}
2013-06-30 19:54:49,010 | INFO | use-047-thread-2 | ZooKeeperConfigAdminBridge | admin.ZooKeeperConfigAdminBridge 361 | 149 - org.fusesource.fabric.fabric-configadmin - 7.1.0.fuse-047 | Deleting configuration org.fusesource.fabric.zookeeper.server.56084def-f94c-4af6-bc5a-0a818848883d
2013-06-30 19:54:49,012 | INFO | 5a-0a511818883d) | ZKServerFactoryBean | ternal.BaseManagedServiceFactory 88 | 50 - org.fusesource.fabric.fabric-zookeeper - 7.1.0.fuse-047 | Configuration org.fusesource.fabric.zookeeper.server.56084def-f94c-4af6-bc5a-0a818848883d delete
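
Picking the string out of long log lines by eye is error-prone. The following sketch lists every distinct server string found in a log. zk_server_ids is a hypothetical helper name, and it assumes the string always has the 8-4-4-4-12 hexadecimal shape seen in the excerpt above and that your grep supports -o (GNU and BSD grep do).

```shell
#!/bin/sh
# Read log text on stdin and print each distinct unique server string that
# follows "org.fusesource.fabric.zookeeper.server." (one per line, sorted).
# Entries such as "...zookeeper.server-0001" are ignored because they do not
# have the dotted 8-4-4-4-12 hexadecimal suffix.
zk_server_ids() {
    grep -oE 'org\.fusesource\.fabric\.zookeeper\.server\.[0-9a-f]{8}(-[0-9a-f]{4}){3}-[0-9a-f]{12}' \
        | sed 's/.*\.//' \
        | sort -u
}
```

Feed it the karaf.log of the root container whose fabric server is down, for example: zk_server_ids < data/log/karaf.log from inside that container's install directory.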

If you do not find any information in the logs, search the full folder using grep
grep -irl "org.fusesource.fabric.zookeeper.server.*" /opt/FuseMQEnterprise-7.1.0/*

To update the server config, create a new config file named after the string you found in the previous step.

vim /opt/FuseMQEnterprise-7.1.0/data/cache/bundle5/data/config/org/fusesource/fabric/zookeeper/server/56084def-f94c-4af6-bc5a-0a818848883d.config

Then add the following information. Note that this is the same as the server config mentioned earlier, except that server.id and service.pid are changed. The server.id must correspond to one of the server.x entries, i.e. the first three server descriptions in the config file. In other words, if server.1 describes the current server, server.id should be 1; for server.2, server.id should be 2, and so on. The service.pid contains the unique server name. This alphanumeric string should be the same as the config filename.
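
Both rules are easy to get wrong when copying a config between servers, so a quick check after editing can help. This is only a sketch under a loud assumption: it expects one key=value pair per line with optional double quotes around the value, as in the zookeeper.config shown earlier. If your file stores values differently, adjust get_value. Both function names are hypothetical.

```shell
#!/bin/sh
# get_value KEY_REGEX FILE: print the value of the first matching key,
# with double quotes and spaces stripped (assumed key="value" layout).
get_value() {
    grep "^$1 *=" "$2" | head -n 1 | cut -d= -f2- | tr -d '" '
}

# check_server_config FILE: verify the two rules described above.
#   1. service.pid ends with the config file name (without .config)
#   2. server.id has a matching server.<id> entry
# Prints a message and returns 1 if either rule is broken.
check_server_config() {
    file="$1"
    name=$(basename "$file" .config)
    pid=$(get_value 'service\.pid' "$file")
    id=$(get_value 'server\.id' "$file")
    status=0
    case "$pid" in
        *".$name") ;;
        *) echo "service.pid '$pid' does not end with file name '$name'"; status=1 ;;
    esac
    if [ -z "$id" ] || [ -z "$(get_value "server\\.$id" "$file")" ]; then
        echo "server.id '$id' has no matching server.$id entry"; status=1
    fi
    return $status
}
```

Call it with the full path of the <unique-id>.config file you just created; no output means both rules hold.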



The zookeeper.config information should be the same for all three servers. You can copy and paste it into the respective zookeeper.config file locations. Usually the file location is data/cache/bundle5/data/config/org/fusesource/fabric/
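
Once the copies are in place, it is worth confirming they really are identical. A minimal sketch, assuming you have fetched each server's zookeeper.config to local files (the helper name same_zookeeper_config is hypothetical):

```shell
#!/bin/sh
# same_zookeeper_config REFERENCE OTHER...
# Compare every OTHER file byte-for-byte against REFERENCE.
# Prints the first file that differs and returns 1; returns 0 if all match.
same_zookeeper_config() {
    ref="$1"
    shift
    for f in "$@"; do
        cmp -s "$ref" "$f" || { echo "differs: $f"; return 1; }
    done
}
```

Use the copy from the server that is still running as the reference, since that is the one known to work.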


The DynamicBindings config contains information about the mvn location of the fabric-zookeeper jars. Edit the config file
fuse@VBoxSvr1:$ vim /opt/FuseMQEnterprise-7.1.0/data/cache/bundle5/data/config/org_apache_felix_cm_impl_DynamicBindings.config

and add the line given below. Then save and close.


Edit the factory.config file, and add the information for the corresponding server name

vim /opt/FuseMQEnterprise-7.1.0/data/cache/bundle5/data/config/org/fusesource/fabric/zookeeper/server/factory.config

Here the factory.pidList value should be the same as the service.pid in the server config. Both of these files should be in the same location.

Repeat the above steps for the second server as well. Once all the steps are completed, you can start the FuseESB and FuseMQ root services on both the clustered servers.

service fuseesb-service start
service fusemq-service start

Tail the logs to see if you are getting the desired results. Initially there may be errors, but they should settle down after some time.
tail -f /opt/FuseESBEnterprise-7.1.0/data/log/fuseesb.log

tail -f /opt/FuseMQEnterprise-7.1.0/data/log/fusemq.log
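
To tell whether the errors are settling, you can count how often the two failure messages quoted at the start of this post still appear in the most recent log lines. A small sketch (zk_error_count is a hypothetical helper name):

```shell
#!/bin/sh
# Read log text on stdin and print how many lines contain either of the two
# ZooKeeper failure messages quoted earlier in this post.
zk_error_count() {
    grep -c -e 'ZooKeeperServer not running' \
            -e 'Error waiting for ZooKeeper connection'
}
```

For example, tail -n 200 on the root container log piped into zk_error_count, repeated every minute or so, should trend towards 0 once the ensemble has formed.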

Try to log in to the console using a client connection
ssh admin@VBoxSvr0 -p 8101

Then give container-list to view all the containers.
FuseMQ:admin@fusemq-root-node02> container-list | grep true
FuseManagementConsole              1.3       true    fmc
fusemq-root-node01                 1.3       true    fabric, fabric-ensemble-0001-1     success
fusemq-root-node02*                1.3       true    fabric, fabric-ensemble-0001-2     success
fuseesb-root-node01                1.3       true    fabric, fuse-esb-full              success
  fuseesb-child-node01             1.3       true    fabric, default                    success
  fuseesb-child2-node01            1.3       true    fabric, default                    success
fuseesb-root-node02                1.3       true    fabric, fabric-ensemble-0001-3     success
FuseMQ:admin@fusemq-root-node02> ensemble-list

That's all. Your fabric server will be back in action! Let me know your views, or any other ways you have of achieving this task, in the comments below.
