In some scenarios, typically container applications, it is quite common to hit the error "WFLYCTL0348: Timeout after [300] seconds waiting for service container stability". This article will guide you through the steps to find the root cause and solve the issue.
What is container stability?
Firstly, let's start with the following message, which you may find after starting JBoss / WildFly or Keycloak:
ERROR [org.jboss.as.controller.management-operation] (Controller Boot Thread) WFLYCTL0348: Timeout after [300] seconds waiting for service container stability. Operation will roll back. Step that first updated the service container was 'add' at address '[ ("core-service" => "management"), ("management-interface" => "http-interface") ]'
This kind of error usually happens following a deployment, although it is not strictly a deployment timeout. It simply means that the container has not reached stability within the default timeout (300 seconds).
The application itself may contain bugs, logical errors, or compatibility issues that only manifest after deployment. These errors can result in crashes, exceptions, or incorrect behavior that affects the stability of the application server.
As a result, all applications will be undeployed and the container will shut down.
In this state, with some container services not yet ready, you could expect unpredictable failures. Even worse, security-related services might not be fully active, leaving the server exposed to attacks. For this reason, a container shut-down follows.
Things we can do
Before trying the obvious fix (increasing the blocking timeout), it is crucial to understand why your container takes so long to reach the ready state. For example, collect a set of thread dumps during the deployment of the application, as shown in the sketch below. Then, search for any thread which might be BLOCKED. The following article can guide you through the inspection of a thread dump: How to inspect a Thread Dump like a pro
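As a minimal sketch, assuming the JDK's jps and jstack tools are available inside the container and the server runs as the standard jboss-modules process, you could capture a series of thread dumps while the deployment is in progress:

# Find the WildFly process (the server's main jar is jboss-modules.jar)
PID=$(jps -l | grep jboss-modules | awk '{print $1}')
# Capture a thread dump every 5 seconds during the deployment
for i in 1 2 3 4 5; do
  jstack "$PID" > /tmp/threaddump-$i.txt
  sleep 5
done
# Search the dumps for threads stuck in the BLOCKED state
grep -B 1 "BLOCKED" /tmp/threaddump-*.txt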
If you cannot find the root cause of the issue, you can use the following checklist:
Network checks:
First of all, we need to check whether some misconfiguration is delaying the core services. For example, check if the block is due to network host resolution. You can usually troubleshoot this kind of issue through the following checklist of options/tools (see the examples right after the list):
- Check TCP/IP Settings.
- Flush the DNS Cache.
- Release and Renew DHCP Server IP.
- Change to Public DNS Servers.
- Query DNS records with dig.
- Run nslookup.
- Run host.
- Query an address with traceroute or tracert.
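As a quick reference, here is how you might invoke the lookup tools from the list above; db.example.internal is a placeholder for a host your server resolves at boot:

dig db.example.internal         # full DNS answer, including the query time
nslookup db.example.internal    # lookup against the default resolver
host db.example.internal        # simple forward lookup
traceroute db.example.internal  # trace the network path (tracert on Windows)

If any of these commands hangs or takes several seconds to return, slow name resolution is a likely contributor to the startup delay.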
Lack of resources?
Then, another possibility is that your JBoss / WildFly container simply does not have enough resources. Have a look at the memory/CPU metrics for your container and compare them with your max settings. The following article can help you change the JVM settings on a Kubernetes/OpenShift platform: Configuring JVM settings on Openshift
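For example, on Kubernetes you can compare live usage with the configured limits; a minimal sketch, where my-wildfly-pod is a placeholder pod name and the metrics-server add-on is assumed to be installed:

# Show the pod's current CPU and memory usage
kubectl top pod my-wildfly-pod
# Compare against the resource limits declared on the pod
kubectl describe pod my-wildfly-pod | grep -i -A 2 limits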
External resources?
Another possibility is that external resources, such as databases, are blocking the initialization. Example: a database lock. Use the SQL dialect of your database to verify whether this is the cause. The following is an SQL statement you can use on PostgreSQL to detect locks on tables:
SELECT pid, state, usename, query, query_start
FROM pg_stat_activity
WHERE pid IN (
  SELECT pid FROM pg_locks l
  JOIN pg_class t ON l.relation = t.oid
  WHERE t.relkind = 'r'
);
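If the query reveals a blocking session, PostgreSQL lets you cancel or terminate it by pid. A hypothetical example via psql, where the connection details and the pid 12345 are placeholders taken from the query output above:

# Politely cancel the running query of the blocking session
psql -h dbhost -U appuser -d appdb -c "SELECT pg_cancel_backend(12345);"
# If the session does not react, forcibly terminate it
psql -h dbhost -U appuser -d appdb -c "SELECT pg_terminate_backend(12345);"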
Increasing the blocking timeout
Finally, if you cannot remove the external constraints that are delaying your startup, you can increase the blocking timeout. To do that, set the property jboss.as.management.blocking.timeout to a higher value, for example 600 seconds. With the CLI, you can apply the property permanently as follows:
/system-property=jboss.as.management.blocking.timeout:add(value=600)
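You can then verify that the property was stored by reading it back; for example, from the shell, assuming a default local management interface:

$JBOSS_HOME/bin/jboss-cli.sh -c --command="/system-property=jboss.as.management.blocking.timeout:read-resource"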
Alternatively, you can also set it as a startup parameter:
standalone.sh -Djboss.as.management.blocking.timeout=600
For applications running in Kubernetes/OpenShift, you can set the property through the JAVA_OPTS_APPEND environment variable:
JAVA_OPTS_APPEND=-Djboss.as.management.blocking.timeout=600
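For instance, one way to apply the variable to a running Deployment is with kubectl; my-wildfly-app is a placeholder deployment name:

# Set the env var on the Deployment; this triggers a rolling restart
kubectl set env deployment/my-wildfly-app JAVA_OPTS_APPEND="-Djboss.as.management.blocking.timeout=600"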
Conclusion
This article was a walk through the resolution of a common issue that can happen when your container is not able to reach stability following a deployment. We have covered a checklist of conditions that can cause this issue, although capturing thread dumps of the server activity should always be your first step.