I investigated #zombie processes on my home #linux server today. Here is a #thread
It started with: "why is the server slow today?" after I ssh'd into it and ran `ls`
I looked at the last 7 days of the #NodeExporter dashboard in #grafana but found nothing. Well, in fact there was a small memory leak (+1% per day), but I didn't notice it at first sight.
1/n
Then I checked the running processes.
- Why the hell do I have 30K tasks?
- Oops, it's 30997 #zombie processes!
2/n
Definition: A #zombie process is a process that has completed execution but still has an entry in the process table.
Causes: Zombie processes occur when child processes have completed execution, and their exit status needs to be read by the parent process.
Effects: A zombie has already released its memory and file descriptors. What it still holds is its entry in the process table, and therefore its PID, so a large number of them can exhaust the available PIDs.
The presence of a few zombie processes is usually harmless, but having too many can indicate a bug in the parent process.
3/n
Let's kill them!
I can't: #Zombie processes cannot be killed with signals, not even `SIGKILL`, since they are already dead.
That explains the name: 'zombie process' is a metaphor for an 'undead' process, one that has died but has not yet been 'reaped'.
To remove zombie processes, the parent process should be signaled (e.g., SIGCHLD) to read the child's exit status, or the parent process can be terminated if it is unresponsive.
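To see this for yourself, here is a minimal sketch (not taken from my server, just a hypothetical demo that assumes a Linux box with procps `ps`). The helper name `child_of` and the sleep durations are made up for the example. It creates one zombie on purpose, shows that SIGKILL changes nothing, then removes the parent:

```shell
# The inner shell forks a short-lived child, then exec's into `sleep 30`,
# which never calls wait() on the child it inherited.
sh -c 'sleep 1 & exec sleep 30' &
parent=$!
sleep 2   # give the child time to exit

# child_of PPID: print "STAT PID" for every process whose parent is PPID.
child_of() { ps -A -o stat,pid,ppid | awk -v p="$1" '$3 == p { print $1, $2 }'; }

set -- $(child_of "$parent"); zstate=$1; zpid=$2
echo "before SIGKILL: pid=$zpid state=$zstate"

kill -9 "$zpid"   # the signal is accepted, but there is nothing left to kill
set -- $(child_of "$parent"); after=$1
echo "after SIGKILL:  pid=$zpid state=$after"

# Once the parent is gone, the zombie is handed to PID 1 (or a subreaper),
# which is expected to reap it.
kill "$parent"
```

Both states should start with `Z`. On a normal host the zombie should be gone right after the last `kill`.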
4/n
`ps -A -ostat,pid,ppid | grep -e '[zZ]' | tail -10`
I used tail to avoid listing all 30k processes, and ppid to show the parent process ID
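With 30k zombies it is handier to count them per parent than to scroll through them. A small sketch, assuming the same `stat,pid,ppid` column order as the command above (`zombies_by_parent` is a name I made up for this example):

```shell
# Count zombie processes per parent PID, busiest parent first.
# Reads "STAT PID PPID" rows on stdin and prints "COUNT PPID" rows.
zombies_by_parent() {
  awk '$1 ~ /^Z/ { n[$3]++ } END { for (p in n) print n[p], p }' | sort -rn
}
```

Usage: `ps -A -o stat,pid,ppid | zombies_by_parent | head -5`. The header line is skipped automatically because `STAT` does not start with `Z`.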
Then I killed the parents, keeping one for investigation
`sudo kill -9 240816 236637`
5/n
Time for investigation.
I checked one of the parents and found [ssl_client] <defunct>
I checked a second parent and found [wget] <defunct>
This reminded me that the last change I made was to enable #https using #letsencrypt certificate for most services
#wget is used in the healthcheck section inside #dockerCompose, but that doesn't explain the zombies. Or maybe it does?
6/n
I checked that I had set the parameter to skip certificate verification, as I'm using 127.0.0.1 instead of the FQDN and #letsencrypt doesn't provide a #certificate for IP addresses
I exec'd into the #docker container to run `wget --no-check-certificate` manually
It worked correctly
When I removed the healthcheck section in #DockerCompose, there were no more #zombie processes
Root cause found: it's the wget used by the healthcheck that creates the zombies!
7/n
To summarize: I have #zombie processes created by the #wget command when it makes an https request in the #healthcheck section of #DockerCompose
Zombie processes occur when child processes have completed execution, and their exit status needs to be read by the parent process.
A process in a #container is still a process on the host, so it takes up a PID on the host. Whatever you run in a container runs as PID 1 inside the container's PID namespace, which means it only receives a signal such as SIGTERM if it has installed a handler for that signal.
8/n
The first thing to understand is that an init process doesn't magically remove zombies. A (normal) init is designed to reap zombies when the parent process that failed to wait on them exits and the zombies hang around. The init process then becomes the zombies' parent and they can be cleaned up.
9/n
Next, a #container is a #cgroup of processes running in their own PID namespace. This cgroup is cleaned up when the container is stopped. Any zombies that are in a container are removed on stop. They don't reach the host's init.
10/n
Third is the different ways containers are used. Most run one main process and nothing else. If there is another process spawned it is usually a child of that main process. So until the parent exits, the zombie will exist. Then see point 2 (the zombies will be cleared on #container exit).
11/n
The other role an #init process can provide is to install signal handlers so that signals sent from the host can be passed on to the container process. PID 1 is a bit special, as it only receives a signal if the process listens for it.
If you can install a SIGINT and SIGTERM signal handler in your PID 1 process then an init process doesn't add much here.
!!! Those explanations come from this superb article in #stackoverflow
https://stackoverflow.com/questions/49162358/docker-init-zombies-why-does-it-matter !!!
12/n
The solution to avoid #zombie processes in the #container is to use an #init process. One is bundled with #docker thanks to #tini, but it has to be enabled explicitly
https://github.com/krallin/tini
13/n
The syntax in #DockerCompose is:
init: true
What is the advantage of #tini? https://github.com/krallin/tini/issues/8
14/n
@slamp Time to get out the shotguns when there are that many zombies
@slamp hello yes I would like to subscribe to this
@slamp Great deduction, love how you analyze the issue. Subbing!
@lgeurts Thanks a lot! I already knew what zombie processes are and that the problem lies with their parent. That helped me with the investigation.
@lgeurts I was surprised when I googled the problem and found that many of the proposed solutions were: "remove the health check"
This is not a solution, it's a bad workaround.
Additionally, I still need to investigate why the problem only occurs when using https (I didn't have it before), if it is only the case for #wget or also for #curl and if it only happens on a docker image based on #alpine
@slamp I'm curious. And you're right, a workaround should be considered a temporary solution, so it's always bad to implement something like that for the long term. About the https, could you explain what you mean by not having it before? Is this a new problem?
@lgeurts
I didn't have the issue when doing a GET using http
This configuration didn't create zombies
healthcheck:
  test: ["CMD-SHELL", "wget -q -O - http://127.0.0.1"]
  interval: 60s
  timeout: 5s