zombies part 2

pull/36/head
Loïc Dachary 2022-06-04 16:52:27 +02:00
parent 90d1fc0d54
commit eddda717a7
Signed by: dachary
GPG Key ID: 992D23B392F9E4F2
2 changed files with 70 additions and 1 deletions

@ -1,5 +1,5 @@
+++
title = "[solved] Zombies created by Gitea"
title = "[diagnostic] Zombies created by Gitea"
date = 2022-06-02
description = "An increasing number of zombie processes are created by Gitea because it only kills its direct children on timeout."
[taxonomies]

@ -0,0 +1,69 @@
+++
title = "[solved] Zombies created by Gitea"
date = 2022-06-04
description = "Gitea can use process groups and a negative PID to kill its children together with their descendants, so that zombies are never created."
[taxonomies]
tags = ['hostea', 'gitea', 'troubleshoot', 'problem']
[extra]
author = 'dachary'
+++
Gitea can [create zombies](zombies), for instance when updating a Git mirror takes too long. To update a mirror, Gitea runs the `git remote update` command, which spawns a child process, `git-remote-https`, to fetch data from the remote repository. Gitea has an internal timeout that kills its child process (e.g. `git remote update`) when it takes too long, but it does not kill the grandchild. The grandchild becomes an orphan and runs forever or until its own timeout expires, which is about two minutes with git version 2.25.
```
$ time git clone https://4.4.4.4
Cloning into '4.4.4.4'...
fatal: unable to access 'https://4.4.4.4/': Failed to connect to 4.4.4.4 port 443: Connection timed out
real 2m9.753s
user 0m0.001s
sys 0m0.009s
```
As explained in the [diagnostic blog post regarding Gitea zombies](zombies/#killing-a-child-process-and-all-its-children), there is fortunately a very simple way to avoid this: making sure each Gitea child is a [process group leader](https://en.wikipedia.org/wiki/Process_group). That first step was [introduced in Gitea 1.17](https://github.com/go-gitea/gitea/pull/19865) and [backported to Gitea 1.16.9](https://github.com/go-gitea/gitea/pull/19865). The actual bug fix can now be implemented.
### Using negative process id to kill children
When Gitea times out on a child, it relies on [os.Process.Kill](https://github.com/golang/go/blob/f8a53df314e4af8cd350eedb0dae77d4c4fc30d0/src/os/exec/exec.go#L650), which translates into using the kill(2) system call to send a SIGKILL signal that unconditionally terminates it: `kill(pid, SIGKILL)`. Using a negative pid with `kill(-pid, SIGKILL)` also terminates all processes created by Gitea's child, without Gitea needing to know when or why they were created. From the kill(2) manual page:
> If pid is less than -1, then sig is sent to every process in the process group whose ID is -pid.
This is implemented as follows in the [Friendly Forge Format library](https://lab.forgefriends.org/friendlyforgeformat/gofff/-/blob/f42a29284a5262d3e6f94801089369626c5197f6/util/exec.go#L79):
> `syscall.Kill(-cmd.Process.Pid, syscall.SIGKILL)`
### Not using the default Go CommandContext
Since [CommandContext](https://pkg.go.dev/os/exec#CommandContext) does not allow sending a signal to the negative pid of the child process, this has to be implemented by Gitea itself, in a way that is similar to how the [Friendly Forge Format library](https://lab.forgefriends.org/friendlyforgeformat/gofff/-/blob/f42a29284a5262d3e6f94801089369626c5197f6/util/exec.go#L71-87) does it:
```
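err := cmd.Start()
...
// Kill the whole process group as soon as the context is done.
go func() {
	<-ctx.Done()
	// A negative pid targets every process in the child's process group.
	if killErr := syscall.Kill(-cmd.Process.Pid, syscall.SIGKILL); killErr == nil {
		...
	}
}()
err = cmd.Wait()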
```
### Testing the bug is fixed and stays fixed
Long-standing bugs that are difficult to reproduce manually, such as this one, require robust testing to ensure that:
* the diagnostic identifying the root cause is correct
* the bug fix works
* it does not resurface insidiously because of a subtle regression introduced years later
Such a test is easy to implement, as can be seen in the [Friendly Forge Format library](https://lab.forgefriends.org/friendlyforgeformat/gofff/-/blob/f42a29284a5262d3e6f94801089369626c5197f6/util/exec_test.go#L44-76). In a nutshell:
* [git clone https://4.4.4.4](https://lab.forgefriends.org/friendlyforgeformat/gofff/-/blob/f42a29284a5262d3e6f94801089369626c5197f6/util/exec_test.go#L53) which will hang because of firewall rules
* [wait for the git-remote-https](https://lab.forgefriends.org/friendlyforgeformat/gofff/-/blob/f42a29284a5262d3e6f94801089369626c5197f6/util/exec_test.go#L60-65) grandchild process to be spawned
* [cancel the context and wait for the goroutine to terminate](https://lab.forgefriends.org/friendlyforgeformat/gofff/-/blob/f42a29284a5262d3e6f94801089369626c5197f6/util/exec_test.go#L67-68)
* [verify the git-remote-https is killed](https://lab.forgefriends.org/friendlyforgeformat/gofff/-/blob/f42a29284a5262d3e6f94801089369626c5197f6/util/exec_test.go#L70-75)
And with that... no more zombies!