Troubleshoot CQ Clustering
Here are some common questions and their answers to help you understand clustering better.
- Question: Where does the crx.xxx directory get created?
- Answer: On a slave node, when it first joins the cluster. It then becomes that node's current repository. Note that the existence of this directory does not mean that the node is currently a slave. See below for how to determine which node is the master.
- Question: How can I determine which node is the master?
- Answer: Go to http://<host>:<port>/crx/config/cluster.jsp on any node.
- Question: All my instances are down. How can I determine which one was the last master?
- Answer: If all instances are down and everything is fine, the clustered.txt file is present only on slave nodes. The instance that does not have a clustered.txt file is the master node.
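The marker-file check above can be sketched in a few lines of Python. This is only an illustration, assuming you pass in each node's current repository directory (for example a copy or mount of it); the function names are hypothetical and not part of CRX.

```python
from pathlib import Path


def has_cluster_marker(repository_dir):
    """Return True if clustered.txt exists directly in the repository directory."""
    return (Path(repository_dir) / "clustered.txt").is_file()


def likely_last_master(repository_dirs):
    """Return the repository directories that lack clustered.txt.

    Exactly one result is the healthy case described above. Zero or several
    results mean the marker files alone cannot identify the last master.
    """
    return [d for d in repository_dirs if not has_cluster_marker(d)]
```

If the returned list is empty, every node has the marker file, which is the power-failure case covered in the last question of this post.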
- Question: How can I determine which is my current repository directory?
- Answer: Check the repository.home property in bootstrap.properties on the node. If there is no crx.xxx directory, then crx-quickstart is the current directory.
- Question: Are writes always performed through the master?
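As a rough sketch, the repository.home lookup can be scripted as below. It assumes bootstrap.properties uses simple key=value lines; it does not handle the full Java properties syntax (":" separators, escapes, line continuations). The function name is hypothetical, and the location of the file is left to the caller because it depends on your installation.

```python
def read_repository_home(bootstrap_text):
    """Return the repository.home value from bootstrap.properties text, or None."""
    for raw_line in bootstrap_text.splitlines():
        line = raw_line.strip()
        # Skip blank lines and the two comment styles used by .properties files.
        if not line or line.startswith(("#", "!")):
            continue
        key, separator, value = line.partition("=")
        if separator and key.strip() == "repository.home":
            return value.strip()
    return None
```

For example, call `read_repository_home(open(path).read())` with the path of the node's bootstrap.properties. A result of `None` means the property is not set in that file.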
- Answer: Yes.
- Question: What if the master goes down?
- Answer: The slave will become the master. If the cluster has multiple slaves, one of them becomes the master based on an election. To become the master, the slave removes the clustered.txt file from /crx-quickstart/crx.XXX and switches to master mode.
- Question: What if the old master then comes back online?
- Answer: The current master continues to be the master of the cluster. The old master becomes a slave; you can verify this by the existence of clustered.txt under its /repository folder.
- Question: What if the current master (the old slave) goes down again?
- Answer: The current slave will become the master. (If there are multiple slaves, one of them becomes the master based on an election.)
- Question: What is the best way to install a hotfix (HF) on a cluster node?
- Answer: Install it on the master (use the method above to determine which node is the master) -> let it sync to the slave -> check the package manager on the slave to make sure it is installed -> for a CRX hotfix package, click the reinstall option in the slave's package manager -> stop the master -> make sure it is down -> stop the slave -> make sure it is down -> on an instance that has a crx.XXX folder, check the current repository in the bootstrap.properties file and then copy crx-quickstart/crx.XXX/patches to crx-quickstart/repository (or manually install the jar file on the slave instance) -> start the master -> make sure it is up -> start the slave -> check the repository version by going to the repository configuration and searching for jcr.repository.version.
- Question: At some point I want to run as a standalone system and make crx-quickstart my current directory. What should I do?
- Answer: If you want to do this on the master instance, where there is no crx.xxx folder, you probably don't have to do anything. If you want to do this on a slave instance that has a crx.xxx folder, first confirm which directory is the current repository (check the bootstrap.properties file). Then make sure that your system is stopped -> rename the repository folder under crx-quickstart -> rename crx.xxx to repository -> move it to the crx-quickstart folder -> delete the bootstrap.properties file -> delete cluster* under crx-quickstart/repository -> delete revision.log -> delete tarJournal -> restart the system. Note that if you are happy to keep crx.xxx as the current directory, you don't have to do anything.
- Question: What about tar optimization on a cluster instance? (We will cover this later)
- Answer: Tar optimization always runs on the master node in a cluster environment. If you are optimizing tar files in a cluster, you need to ensure that the tar optimization times are set to the same value on all cluster nodes. For example, <param name="autoOptimizeAt" value="1:00-4:00"/>
- Question: What about Datastore garbage collection? (We will cover this later)
- Answer: See http://dev.day.com/content/kb/home/Crx/CrxSystemAdministration/DataStoreGarbageCollection.html for that.
- Question: What if the master is stopped in the middle of a sync process?
- Answer: If it is a graceful stop, the master gives the slave 60000 ms to sync up. If the slave finishes syncing before that, the master stops as soon as the sync is complete. Check the cluster system properties to see how to configure this time.
- Question: How can I make sure that a particular node is always the master in a cluster?
- Answer: You need to set "preferredMaster" to "true" for that node. For more information, please check http://dev.day.com/docs/en/crx/current/administering/persistence_managers.html
- Question: How does replication work in a clustered environment?
- Answer: As with write operations, replication is delegated to the master if it is triggered from a slave.
- Question: OK, I understand the normal scenario, but what happens to the cluster when there is a network issue?
- Answer: If you are not sure about the network connections, or there are frequent network problems between cluster nodes, shared-nothing clustering is not recommended. If you choose shared-nothing clustering anyway, the slave will try to read from the master, and after some time, when it is unable to do so, you will get a "Read from master timed out." error and the slave will be disconnected.
- Question: Then what should I do when there is a network issue?
- Answer: You should stop the slave and restart it when the network is back to normal. The other option is to set the "becomeMasterOnTimeout" parameter on the slave (in repository.xml). This makes the slave the master when the timeout happens. The problem here is that you will have two masters at the same time, so this is not highly recommended.
- Question: What happens if my cluster instances are in different time zones?
- Answer: It is not recommended to have cluster instances in different time zones. It can create problems with tar optimization, backup and restore, garbage collection, and time stamp mismatches in the data tar files.
- Question: How do I recover from a power failure, or from a case where both nodes have a clustered.txt file and it is difficult to decide which one was the last master?
- Answer: If the file 'clustered.txt' exists on all cluster nodes (for example because a power failure caused all cluster nodes to stop at the same time, or an online backup was restored on all cluster nodes), then the file needs to be deleted on one of the cluster nodes. To find out where to delete the marker file, compare the size of the last data*.tar files of the default workspace, the version workspace, and the tarJournal directory. The file clustered.txt needs to be deleted on the cluster node that has more data than the other cluster nodes (any cluster node if all cluster nodes have the same amount of data).
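The size comparison above can be sketched as follows. This only gathers the numbers; deciding which node has more data is still left to you. The subdirectory names in SUBDIRS are an assumed repository layout and are not taken from the text above, so adjust them to your installation. The sketch also assumes that data*.tar file names sort in creation order (for example zero-padded numbers); all names here are hypothetical.

```python
from pathlib import Path

# Assumed locations of the default workspace, the version workspace and the
# tarJournal directory, relative to the current repository directory.
SUBDIRS = ("workspaces/crx.default", "version", "tarJournal")


def last_tar(directory):
    """Return (file name, size in bytes) of the last data*.tar file, or None."""
    tars = sorted(Path(directory).glob("data*.tar"), key=lambda p: p.name)
    if not tars:
        return None
    newest = tars[-1]
    return newest.name, newest.stat().st_size


def node_report(repository_dir, subdirs=SUBDIRS):
    """Map each subdirectory to the name and size of its last data*.tar file."""
    return {sub: last_tar(Path(repository_dir) / sub) for sub in subdirs}
```

Run `node_report` against the current repository directory of each node and compare the results side by side before deleting clustered.txt on the node with more data.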