+++
title = "[solved] Zombies created by Gitea"
date = 2022-06-04
description = "Gitea can use process groups to kill its children using a negative PID to never create zombies."

[taxonomies]
tags = ['gna', 'gitea', 'troubleshoot', 'problem']

[extra]
author = 'dachary'
+++

Gitea can create zombies, for instance when a Git mirror update takes too long. When updating a mirror, Gitea relies on the git remote update command, which creates a child process, git-remote-https, to fetch data from the remote repository. Gitea has an internal timeout that kills its child process (e.g. git remote update) when it takes too long, but it does not kill the grandchild (git-remote-https). The grandchild becomes an orphan and runs forever, or until its own timeout expires, which is about two minutes with git version 2.25.

$ time git clone https://4.4.4.4
Cloning into '4.4.4.4'...
fatal: unable to access 'https://4.4.4.4/': Failed to connect to 4.4.4.4 port 443: Connection timed out

real	2m9.753s
user	0m0.001s
sys	0m0.009s

As explained in the diagnostic blog post regarding Gitea zombies, there is fortunately a very simple way to avoid this: making sure each Gitea child is a process group leader. That first step was introduced in Gitea 1.17 and backported to Gitea 1.16.9. The actual bug fix can now be implemented.
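To make "process group leader" concrete, here is a minimal sketch for Unix-like systems. It assumes the usual Go mechanism, the `Setpgid` field of `syscall.SysProcAttr`; the helper name `isGroupLeader` is made up for this illustration and is not Gitea code.

```go
package main

import (
	"fmt"
	"os/exec"
	"syscall"
)

// isGroupLeader (hypothetical helper) starts a short-lived child in a new
// process group and reports whether the child's group ID equals its PID.
func isGroupLeader() (bool, error) {
	cmd := exec.Command("sleep", "1")
	// Setpgid with the default Pgid of 0 creates a new process group
	// whose ID is the PID of the child.
	cmd.SysProcAttr = &syscall.SysProcAttr{Setpgid: true}
	if err := cmd.Start(); err != nil {
		return false, err
	}
	defer cmd.Wait() // reap the child so it does not linger as a zombie

	pgid, err := syscall.Getpgid(cmd.Process.Pid)
	if err != nil {
		return false, err
	}
	return pgid == cmd.Process.Pid, nil
}

func main() {
	leader, err := isGroupLeader()
	fmt.Println("child is a process group leader:", leader, "error:", err)
}
```

Because the group ID is the child's PID, the parent does not need to track anything else to address the whole group later.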

Using a negative process ID to kill children

When Gitea times out on a child, it relies on os.Process.Kill, which uses the kill(2) system call to send a SIGKILL signal and unconditionally terminate it: kill(pid, SIGKILL). Using a negative pid with kill(-pid, SIGKILL) also terminates all processes created by Gitea's child, without Gitea having to know when or why they were created. From the kill(2) manual page:

If pid is less than -1, then sig is sent to every process in the process group whose ID is -pid.

This is implemented as follows in the Friendly Forge Format library:

syscall.Kill(-cmd.Process.Pid, syscall.SIGKILL)
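For context, the one-liner can be placed in a small self-contained program. This is only a sketch for Unix-like systems: the function name `killGroupDemo` and the `sh -c "sleep 300 & wait"` command are assumptions chosen to produce a child (the shell) with a grandchild (the sleep), not something taken from the library.

```go
package main

import (
	"fmt"
	"os/exec"
	"syscall"
)

// killGroupDemo (hypothetical) starts a shell with a background grandchild,
// kills the whole process group and returns the signal that ended the shell.
func killGroupDemo() (syscall.Signal, error) {
	cmd := exec.Command("sh", "-c", "sleep 300 & wait")
	// The shell must lead its own group, otherwise -pid is not a valid group.
	cmd.SysProcAttr = &syscall.SysProcAttr{Setpgid: true}
	if err := cmd.Start(); err != nil {
		return 0, err
	}
	// Negative PID: the signal goes to every process in the group.
	if err := syscall.Kill(-cmd.Process.Pid, syscall.SIGKILL); err != nil {
		return 0, err
	}
	_ = cmd.Wait() // reaps the shell; the error only says it was killed

	status, ok := cmd.ProcessState.Sys().(syscall.WaitStatus)
	if !ok || !status.Signaled() {
		return 0, fmt.Errorf("child was not terminated by a signal")
	}
	return status.Signal(), nil
}

func main() {
	sig, err := killGroupDemo()
	fmt.Println("shell terminated by signal:", sig, "error:", err)
}
```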

Not using the default Go CommandContext

Since CommandContext does not allow sending a signal to the negative pid of the child process, this has to be implemented by Gitea itself, in a way that is similar to how the Friendly Forge Format library does it:

	ctxErr := watchCtx(ctx, cmd.Process.Pid)
	err = cmd.Wait()
	interruptErr := <-ctxErr
	// If cmd.Wait returned an error, prefer that.
	// Otherwise, report any error from the interrupt goroutine.
	if interruptErr != nil && err == nil {
		err = interruptErr
	}
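The excerpt does not show `watchCtx`. One way such a helper could look is sketched below; it is an assumption, not the code from Gitea or the library. In particular, this version takes an extra `done` channel (not present in the excerpt) so that the goroutine can stop when the command exits before the context does, and `runWithGroupKill` and `demo` are hypothetical names.

```go
package main

import (
	"context"
	"errors"
	"fmt"
	"os/exec"
	"syscall"
	"time"
)

// watchCtx (hypothetical) kills the process group led by pid once ctx is
// done. Closing done tells it that the command has already exited. The
// returned channel yields exactly one value.
func watchCtx(ctx context.Context, pid int, done <-chan struct{}) <-chan error {
	errc := make(chan error, 1)
	go func() {
		select {
		case <-done:
			errc <- nil
		case <-ctx.Done():
			// Negative pid: the child and all its descendants in the group.
			err := syscall.Kill(-pid, syscall.SIGKILL)
			if err != nil && !errors.Is(err, syscall.ESRCH) {
				errc <- err
				return
			}
			errc <- ctx.Err()
		}
	}()
	return errc
}

// runWithGroupKill (hypothetical) runs a command in its own process group
// and kills the whole group if ctx expires first.
func runWithGroupKill(ctx context.Context, name string, args ...string) error {
	cmd := exec.Command(name, args...) // deliberately not CommandContext
	cmd.SysProcAttr = &syscall.SysProcAttr{Setpgid: true}
	if err := cmd.Start(); err != nil {
		return err
	}
	done := make(chan struct{})
	ctxErr := watchCtx(ctx, cmd.Process.Pid, done)
	err := cmd.Wait()
	close(done)
	interruptErr := <-ctxErr
	// If cmd.Wait returned an error, prefer that.
	if interruptErr != nil && err == nil {
		err = interruptErr
	}
	return err
}

// demo runs a 30 second command under a 200ms timeout.
func demo() (time.Duration, error) {
	ctx, cancel := context.WithTimeout(context.Background(), 200*time.Millisecond)
	defer cancel()
	start := time.Now()
	err := runWithGroupKill(ctx, "sh", "-c", "sleep 30 & wait")
	return time.Since(start), err
}

func main() {
	elapsed, err := demo()
	fmt.Println("returned after", elapsed.Round(100*time.Millisecond), "with error:", err)
}
```

The buffered channel lets the goroutine finish even if nobody reads its result, and ignoring ESRCH covers the case where the group has already disappeared by the time the timeout fires.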

Testing that the bug is fixed and stays fixed

Long-standing bugs that are difficult to reproduce manually, such as this one, require robust testing to ensure that:

  • the diagnosis identifying the root cause is correct
  • the bug fix works
  • it does not resurface insidiously because of a subtle regression introduced years later

It is easy to implement, as can be seen in the Friendly Forge Format library. In a nutshell:
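The library's own test is not reproduced here. As a rough sketch of the idea, under assumptions of my own: a shell starts a background sleep that inherits the write end of a pipe, and the pipe only reaches EOF once every process holding it is gone. The function name `grandchildSurvives` and the one second grace period are hypothetical choices.

```go
package main

import (
	"fmt"
	"io"
	"os"
	"os/exec"
	"syscall"
	"time"
)

// grandchildSurvives (hypothetical) kills either the shell alone or its
// whole process group and reports whether the grandchild is still alive.
func grandchildSurvives(killGroup bool) (bool, error) {
	r, w, err := os.Pipe()
	if err != nil {
		return false, err
	}
	defer r.Close()

	cmd := exec.Command("sh", "-c", "sleep 300 & echo started; wait")
	cmd.Stdout = w
	cmd.SysProcAttr = &syscall.SysProcAttr{Setpgid: true}
	err = cmd.Start()
	w.Close() // from now on only the shell and the sleep hold the write end
	if err != nil {
		return false, err
	}
	pid := cmd.Process.Pid

	// Wait for "started\n" so the grandchild exists before anything is killed.
	if _, err := io.ReadFull(r, make([]byte, len("started\n"))); err != nil {
		syscall.Kill(-pid, syscall.SIGKILL)
		cmd.Wait()
		return false, err
	}

	if killGroup {
		err = syscall.Kill(-pid, syscall.SIGKILL) // the fix
	} else {
		err = cmd.Process.Kill() // the bug: only the child is killed
	}
	if err != nil {
		return false, err
	}
	_ = cmd.Wait() // reap the shell

	eof := make(chan struct{})
	go func() {
		io.Copy(io.Discard, r) // returns once every writer is gone
		close(eof)
	}()
	select {
	case <-eof:
		return false, nil
	case <-time.After(time.Second):
		syscall.Kill(-pid, syscall.SIGKILL) // clean up the leaked sleep
		return true, nil
	}
}

func main() {
	for _, killGroup := range []bool{false, true} {
		survives, err := grandchildSurvives(killGroup)
		fmt.Printf("kill group=%v: grandchild survives=%v error=%v\n", killGroup, survives, err)
	}
}
```

Running both variants covers the first two points of the list: killing only the child should leave the grandchild alive, which confirms the diagnosis, and killing the group should not. Moving the same check into a `_test.go` file would cover the third.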

And with that... no more zombies!