ERROR: Job failed (system failure): Error: No such container: <container_id>
### Summary

This happens at random (for about 10% of the jobs), 2-5 minutes into the job. The job simply fails with this error. Restarting the job once or twice generally solves the problem.

- System: CentOS 7
- Runner: 10.2.0 / 10.4.0 (DiD)
- Docker: 17.06.2-ce
- GitLab: 10.4.2

### Example

The job starts normally, without any apparent problem.

![b101error](/uploads/7d696b46a7664b3a0312f83ed7814f4b/b101error.png)

But sometimes it just fails with:

`ERROR: Job failed (system failure): Error: No such container: 858565f6d59f647f9546f785e539884259de09d3364b804b57f7fe5e618f22c9`

Note that the container ID is not present anywhere in the logs, not even in the Docker logs.

## Proposal

- [Use Docker volumes](https://gitlab.com/gitlab-org/gitlab-runner/-/merge_requests/1989): instead of getting the volume mappings from previously used containers, we will use the volumes directly. This will help in 80% of the cases.
- [Make container watching more reliable](https://gitlab.com/gitlab-org/gitlab-runner/-/merge_requests/1990): this solves a scenario where the system is under load, we start the container, and the script finishes executing before we have started watching the container. By the time we start watching, the container is already gone, which results in a 404.
- [Retry stage when the container is not found for Docker executor](https://gitlab.com/gitlab-org/gitlab-runner/-/merge_requests/1995): this retries the stage from the beginning if the container was removed for some reason, which makes our executor a lot more resilient to the ephemeral nature of containers.

## Dev logs

<details>
<summary> 2020-04-01 </summary>

- Start with a PoC to add a retry mechanism: https://gitlab.com/gitlab-org/gitlab-runner/-/merge_requests/1983#discussion-for-poc
- Investigate Events to see if they are a more reliable way to check whether containers are ready.
  After investigating the Docker codebase, this is not the case: https://gitlab.com/gitlab-org/gitlab-runner/-/merge_requests/1983#note_315679300
- Found a good solution to add a retry mechanism: https://gitlab.com/gitlab-org/gitlab-runner/-/merge_requests/1983#note_315491245

</details>

<details>
<summary> 2020-04-02 </summary>

- Investigate why a 404 can be returned from `CreateContainer`: https://gitlab.com/gitlab-org/gitlab-runner/-/issues/4450#note_316034514
- Take a look at the data we have and see where it's failing: https://gitlab.com/gitlab-org/gitlab-runner/-/issues/4450#note_316082090

</details>

<details>
<summary> 2020-04-03 </summary>

- Work on how we manage volumes, so that we create volumes directly instead of creating containers and taking the volumes from them: https://gitlab.com/gitlab-org/gitlab-runner/-/merge_requests/1989/diffs?commit_id=61b396e76ca0741a6b26486e27c32f4062853a39

</details>

<details>
<summary> 2020-04-06 </summary>

- Finish https://gitlab.com/gitlab-org/gitlab-runner/-/merge_requests/1989 and get it ready for review. This should fix the [`ContainerCreate`](https://gitlab.com/gitlab-org/gitlab-runner/-/issues/4450#note_316034514) scenario.
- Open https://gitlab.com/gitlab-org/gitlab-runner/-/merge_requests/1990 to help with the issue where the container might have exited and been cleaned up before we actually start watching it.
- Start work on https://gitlab.com/gitlab-org/gitlab-runner/-/merge_requests/1995 to retry a stage if it failed because the container was not found.

</details>

<details>
<summary> 2020-04-07 </summary>

Keep working on https://gitlab.com/gitlab-org/gitlab-runner/-/merge_requests/1995 so that if a container is removed mid-execution, we restart the stage (for example `before_script+script`) from the beginning, making us more resilient to the ephemeral nature of Docker containers.
Container removed during execution with `docker rm -f $CONTAINER_ID`:

![Screen_Shot_2020-04-07_at_18.21.21](/uploads/b1ea8d53589aa22379fbe37569b31904/Screen_Shot_2020-04-07_at_18.21.21.png)

Container removed multiple times, and we try up to 3 times:

![Screen_Shot_2020-04-07_at_18.19.20](/uploads/d09295fd5b0b51b1494e75db8bac9887/Screen_Shot_2020-04-07_at_18.19.20.png)

This should be ready for review tomorrow, since the test on Windows is not behaving as expected and I had a ton of problems trying to figure out how to create an integration test for this.

</details>

<details>
<summary> 2020-04-08 </summary>

### 2020-04-08

All fixes have been implemented and are in the review stage. To reiterate what we are doing to fix this issue:

- [Use Docker volumes](https://gitlab.com/gitlab-org/gitlab-runner/-/merge_requests/1989): instead of getting the volume mappings from previously used containers, we will use the volumes directly. This will help in 80% of the cases.
- [Make container watching more reliable](https://gitlab.com/gitlab-org/gitlab-runner/-/merge_requests/1990): this solves a scenario where the system is under load, we start the container, and the script finishes executing before we have started watching the container. By the time we start watching, the container is already gone, which results in a 404.
- [Retry stage when the container is not found for Docker executor](https://gitlab.com/gitlab-org/gitlab-runner/-/merge_requests/1995): this retries the stage from the beginning if the container was removed for some reason, which makes our executor a lot more resilient to the ephemeral nature of containers.

In the current issue description we have:

> Every time we get a 404 we will print all the container IDs that the Runner has and all the container IDs that the container daemon has; this way we get a "snapshot" of the current state of things when this kind of issue presents.
>
> The retry logic should be created using the [retry library](https://gitlab.com/gitlab-org/gitlab-runner/-/tree/master/helpers%2Fretry)

We will not do this. If Docker returns a 404, logging the containers will not help us debug the problem; with the fixes above we confirmed that these are logic bugs on our end rather than Docker misbehaving. I will update the issue description with the new proposal.

</details>

## Regressions

All the regressions that the fixes for this issue have caused:

1. https://gitlab.com/gitlab-org/gitlab-runner/-/issues/25438
1. https://gitlab.com/gitlab-org/gitlab-runner/-/issues/25428
1. https://gitlab.com/gitlab-org/gitlab/-/issues/215037
1. https://gitlab.com/gitlab-org/gitlab-runner/-/issues/25440
1. https://gitlab.com/gitlab-org/gitlab-runner/-/issues/25432