+++
title = "[solved] Zombies created by Gitea"
date = 2022-06-04
description = "Gitea can use process groups to kill its children using a negative PID to never create zombies."
[taxonomies]
tags = ['hostea', 'gitea', 'troubleshoot', 'problem']

[extra]
author = 'dachary'
+++

Gitea can [create zombies](/blog/zombies), for instance if a Git mirror takes too long. When updating a mirror, Gitea relies on the `git remote update` command, which creates a child process, `git-remote-https`, to fetch data from the remote repository. Gitea has an internal timeout that kills its child process (here `git remote update`) when it takes too long, but it does not kill the grandchild. The grandchild becomes an orphan and keeps running forever, or until its own timeout expires, which is about two minutes with git version 2.25.

```
$ time git clone https://4.4.4.4
Cloning into '4.4.4.4'...
fatal: unable to access 'https://4.4.4.4/': Failed to connect to 4.4.4.4 port 443: Connection timed out

real 2m9.753s
user 0m0.001s
sys 0m0.009s
```

As explained in the [diagnostic blog post regarding Gitea zombies](/blog/zombies/#killing-a-child-process-and-all-its-children), there fortunately is a very simple way to avoid this: make sure each Gitea child is a [process group leader](https://en.wikipedia.org/wiki/Process_group). That first step was [introduced in Gitea 1.17](https://github.com/go-gitea/gitea/pull/19865) and [backported to Gitea 1.16.9](https://github.com/go-gitea/gitea/pull/19865). The actual bug fix can now be implemented.

### Using negative process id to kill children
When Gitea times out on a child, it relies on [os.Process.Kill](https://github.com/golang/go/blob/f8a53df314e4af8cd350eedb0dae77d4c4fc30d0/src/os/exec/exec.go#L650), which uses the kill(2) system call to send a SIGKILL signal and unconditionally terminate the child: `kill(pid, SIGKILL)`. Using a negative pid with `kill(-pid, SIGKILL)` also terminates every process created by Gitea's child that is still in its process group, without Gitea knowing when or why they were created. From the kill(2) manual page:

> If pid is less than -1, then sig is sent to every process in the process group whose ID is -pid.

This is implemented as follows in the [Friendly Forge Format library](https://lab.forgefriends.org/friendlyforgeformat/gofff/-/blob/a9603c7cc934fccd4382b7f4309b75c852742480/util/exec.go#L130):

> `syscall.Kill(-cmd.Process.Pid, syscall.SIGKILL)`

### Not using the default Go CommandContext
Since [CommandContext](https://pkg.go.dev/os/exec#CommandContext) does not allow sending a signal to the negative pid of the child process, this has to be implemented by Gitea itself, in a way similar to how the [Friendly Forge Format library](https://lab.forgefriends.org/friendlyforgeformat/gofff/-/blob/a9603c7cc934fccd4382b7f4309b75c852742480/util/exec.go#L75-82) does it:

```
ctxErr := watchCtx(ctx, cmd.Process.Pid)
err = cmd.Wait()
interruptErr := <-ctxErr
// If cmd.Wait returned an error, prefer that.
// Otherwise, report any error from the interrupt goroutine.
if interruptErr != nil && err == nil {
	err = interruptErr
}
```

### Testing the bug is fixed and stays fixed
Long-standing bugs that are difficult to reproduce manually, such as this one, require robust testing to ensure that:

* the diagnosis identifying the root cause is correct
* the bug fix works
* it does not resurface insidiously because of a subtle regression introduced years later

Such a test is easy to implement, as can be seen in the [Friendly Forge Format library](https://lab.forgefriends.org/friendlyforgeformat/gofff/-/blob/a9603c7cc934fccd4382b7f4309b75c852742480/util/exec_test.go#L44-76). In a nutshell:

* [git clone https://4.4.4.4](https://lab.forgefriends.org/friendlyforgeformat/gofff/-/blob/a9603c7cc934fccd4382b7f4309b75c852742480/util/exec_test.go#L53), which will hang because of firewall rules
* [wait for the git-remote-https](https://lab.forgefriends.org/friendlyforgeformat/gofff/-/blob/a9603c7cc934fccd4382b7f4309b75c852742480/util/exec_test.go#L60-65) grandchild process to be spawned
* [cancel the context and wait for the goroutine to terminate](https://lab.forgefriends.org/friendlyforgeformat/gofff/-/blob/a9603c7cc934fccd4382b7f4309b75c852742480/util/exec_test.go#L67-68)
* [verify the git-remote-https is killed](https://lab.forgefriends.org/friendlyforgeformat/gofff/-/blob/a9603c7cc934fccd4382b7f4309b75c852742480/util/exec_test.go#L70-75)

And with that... no more zombies!