+++
title = "[solved] Zombies created by Gitea"
date = 2022-06-04
description = "Gitea can use process groups to kill its children using a negative PID to never create zombies."

[taxonomies]
tags = ['hostea', 'gitea', 'troubleshoot', 'problem']

[extra]
author = 'dachary'
+++
Gitea can create zombies, for instance if a Git mirror takes too long. When updating a mirror, Gitea relies on the git remote update command, which creates a child process, git-remote-https, to fetch data from the remote repository. Gitea has an internal timeout that kills the child process (e.g. git remote update) when it takes too long, but it does not kill the grandchild. The grandchild becomes an orphan and runs forever or until its own timeout expires, which is about two minutes with git version 2.25.
$ time git clone https://4.4.4.4
Cloning into '4.4.4.4'...
fatal: unable to access 'https://4.4.4.4/': Failed to connect to 4.4.4.4 port 443: Connection timed out
real 2m9,753s
user 0m0,001s
sys 0m0,009s
As explained in the diagnostic blog post regarding Gitea zombies, there is fortunately a very simple way to avoid this: making sure each Gitea child is a process group leader. That first step was introduced in Gitea 1.17 and backported to Gitea 1.16.9. The actual bug fix can now be implemented.
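To illustrate what "process group leader" means in Go, here is a minimal sketch. It assumes a Unix system and uses the standard `Setpgid` field of `syscall.SysProcAttr`; the helper names `startAsGroupLeader` and `isGroupLeader` are hypothetical and are not taken from the Gitea code base.

```go
package main

import (
	"fmt"
	"os/exec"
	"syscall"
)

// startAsGroupLeader starts the command in a new process group.
// With Setpgid set and Pgid left at zero, the ID of the new group equals
// the PID of the child, which makes the child the process group leader.
func startAsGroupLeader(name string, args ...string) (*exec.Cmd, error) {
	cmd := exec.Command(name, args...)
	cmd.SysProcAttr = &syscall.SysProcAttr{Setpgid: true}
	if err := cmd.Start(); err != nil {
		return nil, err
	}
	return cmd, nil
}

// isGroupLeader reports whether pid leads its own process group.
func isGroupLeader(pid int) (bool, error) {
	pgid, err := syscall.Getpgid(pid)
	if err != nil {
		return false, err
	}
	return pgid == pid, nil
}

func main() {
	cmd, err := startAsGroupLeader("sleep", "1")
	if err != nil {
		panic(err)
	}
	leader, err := isGroupLeader(cmd.Process.Pid)
	if err != nil {
		panic(err)
	}
	fmt.Println("child is a process group leader:", leader)
	_ = cmd.Wait() // reap the child so it does not linger as a zombie
}
```

Any process the child spawns later inherits this process group unless it explicitly moves to another one, which is what makes the group a convenient handle on the whole subtree.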
Using negative process id to kill children
When Gitea times out on a child, it relies on os.Process.Kill, which uses the kill(2) system call to send a SIGKILL signal and unconditionally terminate it: kill(pid, SIGKILL). Using a negative pid with kill(-pid, SIGKILL) also terminates every process in the child's process group, that is, all processes created by Gitea's child, without Gitea knowing when or why they were created. From the kill(2) manual page:
If pid is less than -1, then sig is sent to every process in the process group whose ID is -pid.
This is implemented as follows in the Friendly Forge Format library:
syscall.Kill(-cmd.Process.Pid, syscall.SIGKILL)
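The line above can be placed in context with a small, self-contained sketch. It assumes a Unix system with `sh` and `sleep` available; the shell command is only a stand-in for a Git command spawning a helper, and the names `killGroup` and `startAndKill` are hypothetical.

```go
package main

import (
	"errors"
	"fmt"
	"os"
	"os/exec"
	"syscall"
)

// killGroup sends SIGKILL to every process in the group led by pid.
func killGroup(pid int) error {
	return syscall.Kill(-pid, syscall.SIGKILL)
}

// startAndKill starts a shell (child) that spawns a sleep (grandchild),
// kills the whole group, and reports whether the child died from SIGKILL.
func startAndKill() (bool, error) {
	cmd := exec.Command("sh", "-c", "sleep 60 & wait")
	// The child must lead its own group, otherwise -pid names no group.
	cmd.SysProcAttr = &syscall.SysProcAttr{Setpgid: true}
	if err := cmd.Start(); err != nil {
		return false, err
	}
	if err := killGroup(cmd.Process.Pid); err != nil {
		return false, err
	}
	// Wait reaps the child and returns an *exec.ExitError for a killed process.
	err := cmd.Wait()
	var exitErr *exec.ExitError
	if !errors.As(err, &exitErr) {
		return false, err
	}
	status, ok := exitErr.Sys().(syscall.WaitStatus)
	return ok && status.Signaled() && status.Signal() == syscall.SIGKILL, nil
}

func main() {
	killed, err := startAndKill()
	if err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}
	fmt.Println("child terminated by SIGKILL:", killed)
}
```

Without `Setpgid`, the child would stay in the process group of the calling program and `kill(-pid, SIGKILL)` would fail because no group with that ID exists. This is why the process group leader step had to come first.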
Not using the default Go CommandContext
Since CommandContext does not allow sending a signal to the negative pid of the child process, the timeout handling has to be implemented by Gitea itself, in a way that is similar to how the Friendly Forge Format library does it:
ctxErr := watchCtx(ctx, cmd.Process.Pid)
err = cmd.Wait()
interruptErr := <-ctxErr
// If cmd.Wait returned an error, prefer that.
// Otherwise, report any error from the interrupt goroutine.
if interruptErr != nil && err == nil {
	err = interruptErr
}
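The snippet above calls a watchCtx helper whose body is not shown here. The following sketch is one possible shape for it, modeled on the pattern of a goroutine that either hands back nil once the caller is done waiting or kills the process group when the context ends. It is an assumption about how such a helper could look, not the actual Gitea or Friendly Forge Format code; `run` and `demo` are hypothetical names.

```go
package main

import (
	"context"
	"fmt"
	"os/exec"
	"syscall"
	"time"
)

// watchCtx kills the process group led by pid when ctx ends.
// The returned channel yields exactly one value after cmd.Wait() returns.
func watchCtx(ctx context.Context, pid int) <-chan error {
	errc := make(chan error)
	go func() {
		select {
		case errc <- nil:
			// The caller finished cmd.Wait() before the context ended.
			return
		case <-ctx.Done():
		}
		err := syscall.Kill(-pid, syscall.SIGKILL)
		switch {
		case err == nil:
			errc <- ctx.Err()
		case err == syscall.ESRCH:
			errc <- nil // the group is already gone
		default:
			errc <- err
		}
	}()
	return errc
}

// run starts the command as a process group leader and waits for it,
// killing the whole group if ctx ends first.
func run(ctx context.Context, name string, args ...string) error {
	cmd := exec.Command(name, args...)
	cmd.SysProcAttr = &syscall.SysProcAttr{Setpgid: true}
	if err := cmd.Start(); err != nil {
		return err
	}
	ctxErr := watchCtx(ctx, cmd.Process.Pid)
	err := cmd.Wait()
	interruptErr := <-ctxErr
	// If cmd.Wait returned an error, prefer that.
	if interruptErr != nil && err == nil {
		err = interruptErr
	}
	return err
}

// demo runs a 60 second command under a 200 millisecond timeout.
func demo() (time.Duration, error) {
	ctx, cancel := context.WithTimeout(context.Background(), 200*time.Millisecond)
	defer cancel()
	start := time.Now()
	err := run(ctx, "sh", "-c", "sleep 60 & wait")
	return time.Since(start), err
}

func main() {
	elapsed, err := demo()
	fmt.Printf("returned after %v: %v\n", elapsed.Round(time.Millisecond), err)
}
```

The unbuffered channel is what keeps the goroutine from leaking: it blocks until either the context ends or the caller receives from the channel after cmd.Wait().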
Testing the bug is fixed and stays fixed
Long-standing bugs that are difficult to reproduce manually, such as this one, require robust testing to ensure that:
- the diagnosis identifying the root cause is correct
- the bug fix works
- it does not resurface insidiously because of a subtle regression introduced years later
Such a test is easy to implement, as can be seen in the Friendly Forge Format library. In a nutshell:
- run git clone https://4.4.4.4, which will hang because of firewall rules
- wait for the git-remote-https grandchild process to be spawned
- cancel the context and wait for the goroutine to terminate
- verify that the git-remote-https process is killed
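The four steps above can be sketched without network access by replacing git clone with a shell that spawns a background sleep. This is an illustration under that assumption, not the test from the Friendly Forge Format library; all function names are hypothetical. The grandchild is detected through a pipe it inherits: the read end reaches EOF only once every process holding the write end has exited.

```go
package main

import (
	"bufio"
	"context"
	"fmt"
	"io"
	"os"
	"os/exec"
	"syscall"
	"time"
)

// killChildOnly signals only the direct child, like os.Process.Kill does.
func killChildOnly(pid int) error { return syscall.Kill(pid, syscall.SIGKILL) }

// killGroup signals every process in the group led by pid.
func killGroup(pid int) error { return syscall.Kill(-pid, syscall.SIGKILL) }

// grandchildSurvives reports whether the grandchild is still running after
// the context is cancelled and the child is killed with the given function.
func grandchildSurvives(kill func(pid int) error) (bool, error) {
	r, w, err := os.Pipe()
	if err != nil {
		return false, err
	}
	defer r.Close()

	// Step 1: stand-in for a hanging git clone and its git-remote-https.
	cmd := exec.Command("sh", "-c", "sleep 60 & echo started; wait")
	cmd.Stdout = w
	cmd.SysProcAttr = &syscall.SysProcAttr{Setpgid: true}
	err = cmd.Start()
	w.Close() // drop our copy of the write end
	if err != nil {
		return false, err
	}
	pid := cmd.Process.Pid

	// Step 2: "started" is printed after the grandchild has been spawned.
	out := bufio.NewReader(r)
	if _, err := out.ReadString('\n'); err != nil {
		return false, err
	}

	// Step 3: cancel the context and wait for the killing goroutine.
	ctx, cancel := context.WithCancel(context.Background())
	killed := make(chan error, 1)
	go func() {
		<-ctx.Done()
		killed <- kill(pid)
	}()
	cancel()
	if err := <-killed; err != nil {
		return false, err
	}
	_ = cmd.Wait() // reap the child

	// Step 4: EOF within the grace period means the grandchild is gone.
	eof := make(chan struct{})
	go func() {
		io.Copy(io.Discard, out)
		close(eof)
	}()
	select {
	case <-eof:
		return false, nil
	case <-time.After(500 * time.Millisecond):
		_ = syscall.Kill(-pid, syscall.SIGKILL) // clean up the leaked grandchild
		return true, nil
	}
}

func main() {
	for _, c := range []struct {
		name string
		kill func(int) error
	}{{"child only", killChildOnly}, {"process group", killGroup}} {
		survives, err := grandchildSurvives(c.kill)
		if err != nil {
			fmt.Fprintln(os.Stderr, err)
			os.Exit(1)
		}
		fmt.Printf("kill %s: grandchild survives = %v\n", c.name, survives)
	}
}
```

Running both variants covers the first two goals of the list: the child-only kill reproduces the leak, which confirms the diagnosis, and the group kill shows that the fix removes it.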
And with that... no more zombies!