Hachyderm @hachyderm

I investigated #zombie processes on my home #linux server today. Here is a #thread

It started with: "why is the server slow today?" after I ssh'd into it and ran `ls`

I looked at the last 7 days of the #NodeExporter dashboard in #grafana but found nothing. Well, in fact there was a small memory leak (+1% per day), but I didn't notice it at first sight.

1/n

Screenshot of the Grafana dashboard for node-exporter

May 28, 2024, 08:04 PM · 0 boosts · 3 favorites

**slamp** @slamp · May 28, 2024

Then I checked the running processes.
- Why the hell do I have 30K tasks?
- Oops, it's 30997 #zombie processes!

2/n

Screenshot of top showing 31226 tasks and 30997 zombies

**slamp** @slamp · May 28, 2024

Definition: A #zombie process is a process that has completed execution but still has an entry in the process table.

Causes: A zombie appears when a child process has finished but its parent has not yet read its exit status.

Effects: A zombie has already released its memory and file descriptors; it only holds a process table entry and its PID. In large numbers, zombies can exhaust the available PIDs and prevent new processes from starting.

The presence of a few zombie processes is usually harmless, but having too many can indicate a bug in the parent process.

3/n
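As an illustration of the definition above (added here, not part of the original thread), this is a minimal Python sketch, assuming Linux with `/proc` mounted. The helper names `proc_state` and `make_and_reap_zombie` are made up for this example. It forks a child that exits at once, observes the `Z` state while the parent has not yet waited, then reaps the child with `os.waitpid`.

```python
import os
import time
from typing import Optional, Tuple


def proc_state(pid: int) -> Optional[str]:
    """Return the one-letter state from /proc/<pid>/stat, or None if the PID is gone."""
    try:
        with open(f"/proc/{pid}/stat") as f:
            data = f.read()
    except (FileNotFoundError, ProcessLookupError):
        return None
    # Format is "pid (comm) state ppid ..."; comm may contain spaces or
    # parentheses, so split on the last ")".
    return data.rsplit(")", 1)[1].split()[0]


def make_and_reap_zombie(timeout: float = 5.0) -> Tuple[Optional[str], Optional[str]]:
    """Create one zombie, then reap it. Returns (state before wait, state after wait)."""
    pid = os.fork()
    if pid == 0:
        os._exit(0)  # child: exit immediately, nothing else to do

    # Parent: deliberately do NOT wait yet, so the child lingers as a zombie.
    deadline = time.monotonic() + timeout
    while proc_state(pid) != "Z" and time.monotonic() < deadline:
        time.sleep(0.01)
    before = proc_state(pid)

    os.waitpid(pid, 0)  # read the exit status: this is what "reaping" means
    after = proc_state(pid)
    return before, after
```

On a typical Linux system, `make_and_reap_zombie()` should report `Z` before the wait and no entry at all afterwards, because the process table entry disappears once the exit status has been read.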

**slamp** @slamp · May 28, 2024

Let's kill them!
I can't: #Zombie processes cannot be killed with signals, not even `SIGKILL`, since they are already dead.

That explains their name: the term 'zombie process' is a metaphor for an 'undead' process that has died but has not yet been 'reaped'.

To remove zombie processes, the parent process should be prompted (e.g., by sending it `SIGCHLD`) to read the child's exit status. If the parent is unresponsive, it can be terminated instead, and init then adopts and reaps the zombies.

4/n
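One way a parent can avoid leaving zombies behind is sketched below in Python (an illustration added here, not code from the thread; `reap_children` and `spawn_and_reap` are hypothetical names). It assumes a Unix-like system and installs a `SIGCHLD` handler that calls `os.waitpid(-1, os.WNOHANG)` in a loop, since several child exits can be coalesced into a single signal.

```python
import os
import signal
import time

reaped = []  # PIDs collected by the handler


def reap_children(signum, frame):
    """SIGCHLD handler: collect every child that has already exited."""
    while True:
        try:
            pid, _status = os.waitpid(-1, os.WNOHANG)
        except ChildProcessError:
            return  # no children left at all
        if pid == 0:
            return  # children exist, but none has exited yet
        reaped.append(pid)


def spawn_and_reap(n: int = 3, timeout: float = 5.0) -> list:
    """Fork n short-lived children and let the handler reap them."""
    signal.signal(signal.SIGCHLD, reap_children)
    pids = []
    for _ in range(n):
        pid = os.fork()
        if pid == 0:
            os._exit(0)  # child: exit immediately
        pids.append(pid)

    # Parent: do other work (here, just sleep) until the handler has seen them all.
    deadline = time.monotonic() + timeout
    while len(reaped) < n and time.monotonic() < deadline:
        time.sleep(0.01)
    return pids
```

The parent never blocks on its children here; it only collects exit statuses when the kernel reports that something has exited.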

**slamp** @slamp · May 28, 2024 *

`ps -A -ostat,pid,ppid | grep -e '[zZ]' | tail -10`

I used `tail` to avoid listing all 30k processes, and `ppid` to show the parent process ID.
Then I killed the parents, keeping one for investigation.

`sudo kill -9 240816 236637`

5/n

Result of ps command and then the sudo kill command
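With 30k zombies it can help to group them by parent instead of reading the tail of a `ps` listing. The following Python sketch is an addition, not something used in the thread; it assumes Linux with `/proc` mounted, and `zombies_by_parent` is a hypothetical helper name. It reads the state and parent PID from each `/proc/<pid>/stat` file and counts the entries in state `Z`.

```python
import os
from collections import Counter


def zombies_by_parent(proc_root: str = "/proc") -> Counter:
    """Count zombie processes per parent PID by scanning <proc_root>/<pid>/stat."""
    counts = Counter()
    for entry in os.listdir(proc_root):
        if not entry.isdigit():
            continue  # skip non-process entries such as "self" or "meminfo"
        try:
            with open(os.path.join(proc_root, entry, "stat")) as f:
                # Format is "pid (comm) state ppid ..."; split on the last ")"
                # because comm may itself contain spaces or parentheses.
                fields = f.read().rsplit(")", 1)[1].split()
            state, ppid = fields[0], int(fields[1])
        except (OSError, IndexError, ValueError):
            continue  # process vanished or stat was unreadable
        if state == "Z":
            counts[ppid] += 1
    return counts
```

For example, `zombies_by_parent().most_common(10)` would list the ten parent PIDs holding the most zombies, which are the candidates to inspect first.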

**slamp** @slamp · May 28, 2024 *

Time for investigation.

I checked one of the parents and found [ssl_client] <defunct>

I checked a second parent and found [wget] <defunct>

This reminded me that the last change I made was to enable #https using a #letsencrypt certificate for most services.

#wget is used in the healthcheck section of #dockerCompose, but that doesn't explain the zombies. Or maybe it does?

6/n

screenshot showing the process with <defunct>

**slamp** @slamp · May 28, 2024 *

I checked that I had set the option to skip certificate verification, since I'm using 127.0.0.1 instead of the FQDN and #letsencrypt doesn't provide a #certificate for IP addresses.

I exec'd into the #docker container to run `wget --no-check-certificate` manually.

It works correctly.

When I remove the healthcheck section in #DockerCompose, there are no more #zombie processes.

Root cause found: it's the wget used by the healthcheck that creates the zombies!

7/n

healthcheck:
  test: ["CMD-SHELL", "wget --no-check-certificate -q -O - https://127.0.0.1:9090/api/v1/status/runtimeinfo | grep status > /dev/null"]
  interval: 60s
  timeout: 5s
  retries: 5
  start_period: 20s

Result of the command wget --no-check-certificate -q -O - https://127.0.0.1:9090/api/v1/status/runtimeinfo | grep status

**slamp** @slamp · May 28, 2024

To summarize: I have #zombie processes created by the #wget command when it makes an https request in the #healthcheck section of #DockerCompose.

Zombie processes occur when child processes have finished and their exit status has not yet been read by the parent process.

A process in a #container is still a process on the host, so it takes up a PID on the host. Whatever you run in a container is PID 1 inside it, which means it has to install a handler for a signal in order to receive it.

8/n

**slamp** @slamp · May 28, 2024

#zombies

The first thing to understand is that an init process doesn't magically remove zombies. A (normal) init is designed to reap zombies that are left hanging around when the parent process that failed to wait on them exits. The init process then becomes the zombies' parent and they can be cleaned up.

9/n

**slamp** @slamp · May 28, 2024

Next, a #container is a #cgroup of processes running in their own PID namespace. This cgroup is cleaned up when the container is stopped. Any zombies that are in a container are removed on stop. They don't reach the host's init.

10/n

**slamp** @slamp · May 28, 2024

Third is the different ways containers are used. Most run one main process and nothing else. If there is another process spawned it is usually a child of that main process. So until the parent exits, the zombie will exist. Then see point 2 (the zombies will be cleared on #container exit).

11/n

**slamp** @slamp · May 28, 2024

#Signals

The other role an #init process can provide is to install signal handlers so signals sent from the host can be passed on to the container process. PID 1 is a bit special: a signal is only delivered to it if the process has installed a handler for that signal.

If you can install a SIGINT and SIGTERM signal handler in your PID 1 process then an init process doesn't add much here.

!!! These explanations come from this superb #stackoverflow post
https://stackoverflow.com/questions/49162358/docker-init-zombies-why-does-it-matter !!!

12/n
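To make the two roles described above concrete (reaping and signal forwarding), here is a deliberately small Python sketch of an init-like wrapper. It is an added illustration, not how tini or Docker implement it, and `run_as_init` is a hypothetical name. It assumes a Unix-like system, and orphaned processes are only re-parented to it when it really runs as PID 1 of the container (or is otherwise registered as a subreaper).

```python
import os
import signal


def run_as_init(argv: list) -> int:
    """Run argv as a child, forward SIGTERM/SIGINT to it, reap every exited
    child, and return the main child's exit code."""
    child = os.fork()
    if child == 0:
        try:
            os.execvp(argv[0], argv)
        finally:
            os._exit(127)  # only reached if exec failed

    def forward(signum, frame):
        # Pass the signal on to the main process instead of acting on it here.
        try:
            os.kill(child, signum)
        except ProcessLookupError:
            pass  # main process is already gone

    previous = {s: signal.signal(s, forward) for s in (signal.SIGTERM, signal.SIGINT)}
    try:
        while True:
            # Blocks until ANY child exits, including orphans adopted by PID 1.
            pid, status = os.waitpid(-1, 0)
            if pid == child:
                if os.WIFEXITED(status):
                    return os.WEXITSTATUS(status)
                return 128 + os.WTERMSIG(status)  # shell-style code for "killed by signal"
            # Any other pid was an adopted process; waitpid has just reaped it.
    finally:
        for sig, old in previous.items():
            signal.signal(sig, signal.SIG_DFL if old is None else old)
```

A wrapper like this would be used as the container entrypoint, for example `sys.exit(run_as_init(sys.argv[1:]))` in a small script, so that the real service never has to be PID 1 itself.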

**slamp** @slamp · May 28, 2024 *

The solution to avoid #zombie processes in the #container is to use an #init process. One ships with #docker, thanks to #tini, but it has to be enabled explicitly.
https://github.com/krallin/tini

13/n

**slamp** @slamp · May 28, 2024 *

The syntax in #DockerCompose is:
init: true

What is the advantage of #tini? https://github.com/krallin/tini/issues/8

14/n

screenshot of docker compose yaml file for service grafana showing the syntax of init: true

**slamp** @slamp · May 28, 2024

To finish, I checked the #grafana dashboard again, and this time I saw the memory #leak!

End of the #thread on #zombie processes on #linux
I hope you enjoyed it!

15/15

Grafana panel showing a decrease in memory used

**Thorium** @Thorium@social.linux.pizza · May 28, 2024

@slamp Time to get out the shotguns when there are that many zombies

**Ariel ( arc)** @arichtman@eigenmagic.net · May 28, 2024

@slamp hello yes I would like to subscribe to this

**Flous** @lgeurts@fosstodon.org · May 29, 2024

@slamp Great deduction, love how you analyze the issue. Subbing!

**slamp** @slamp · May 29, 2024

@lgeurts Thanks a lot! I already knew what zombie processes are and the fact that they are waiting for their parent. That helped me with the investigation.

**slamp** @slamp · May 29, 2024

@lgeurts I was surprised when I googled the problem and found that many of the proposed solutions were "remove the health check".
This is not a solution, it's a bad workaround.
Additionally, I still need to investigate why the problem only occurs with https (I didn't have it before), whether it affects only #wget or also #curl, and whether it only happens on Docker images based on #alpine

**Flous** @lgeurts@fosstodon.org · May 29, 2024

@slamp Am curious. And you're right, a workaround should be considered a temporary solution, so it's always bad to rely on one for the long term. About the https: could you explain what you mean by not having it before? Is this a new problem?

**slamp** @slamp · May 29, 2024

@lgeurts
I didn't have the issue when doing a GET over http.

This configuration didn't create zombies:

healthcheck:
  test: ["CMD-SHELL", "wget -q -O - http://127.0.0.1"]
  interval: 60s
  timeout: 5s
