NextCloud crashing #37

Closed
opened 2025-07-28 02:26:25 +00:00 by eric · 3 comments
Owner

NextCloud (along with several other services) is crashing frequently, seemingly as a result of Redis crashing. Here are some log messages:

Redis:

```
Asynchronous AOF fsync is taking too long (disk is busy?). Writing the AOF buffer without waiting for fsync to complete, this may slow down Redis.
```

NextCloud:

```
[Mon Jul 28 01:31:02.852886 2025] [mpm_prefork:notice] [pid 1:tid 1] AH00170: caught SIGWINCH, shutting down gracefully
10.244.9.1 - - [28/Jul/2025:01:30:57 +0000] "GET /status.php HTTP/1.1" 200 1622 "-" "kube-probe/1.31"
```
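The Redis warning is emitted when the background fsync of the append-only file stalls. If the root cause is disk latency rather than Redis itself, one stopgap is to relax the AOF fsync policy. A minimal sketch of the relevant redis.conf directives, expressed as Bitnami chart values (this assumes the chart's `commonConfiguration` parameter; verify against the deployed chart version):

```yaml
# values-redis.yaml -- sketch only; commonConfiguration is the
# bitnami/redis mechanism for injecting raw redis.conf directives.
commonConfiguration: |-
  appendonly yes
  # fsync at most once per second instead of on every write
  appendfsync everysec
  # don't block on fsync while an AOF rewrite is saturating the disk;
  # trades a small durability window for better write latency
  no-appendfsync-on-rewrite yes
```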

If too many file descriptors are open, adding more processes to the stack may not help. It would depend on a number of factors...

If I added another VM, running on the R720, to the cluster, and ran Redis on that new instance, Redis could use that machine's underlying filesystem, alleviating the disk pressure.
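If it comes to that, the pods would not move on their own; a node selector in the chart values could pin Redis to the new VM. A sketch, assuming the bitnami/redis chart layout and a hypothetical node name for the R720-backed instance:

```yaml
# values-redis.yaml -- "r720-vm-1" is a hypothetical node name; use the
# actual name reported by `kubectl get nodes`.
master:
  nodeSelector:
    kubernetes.io/hostname: r720-vm-1
replica:
  nodeSelector:
    kubernetes.io/hostname: r720-vm-1
```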

Author
Owner

Using an NFS Client in a Kubernetes Helm Chart

<video controls type="video/mp4" src="https://minio.eom.dev/public/Videos/2025-08-01_12-20-52.mp4"></video>

In this first pass at addressing the issue described above, I attempted to configure the Bitnami Helm chart for Redis to use the NFS client deployed in DevOps/software-infrastructure#23. Unfortunately, I encountered two issues in doing so.

The first was that the Redis deployment uses a Kubernetes StatefulSet, which does not allow updates to the underlying storage class. I believe I will need to delete this StatefulSet in order to use the new NFS provisioner; however, I am apprehensive about doing so for fear of creating an unrecoverable situation. The attempted change also caused the Redis pods to be shuffled onto the R720 nodes, so I am curious to see whether the problem is resolved by this unintended side effect.

The second problem encountered during this update was an incompatibility between PostgreSQL versions. I was able to pin the required version, but that can only be a temporary solution. I will need to consult the Nextcloud documentation to see how they recommend upgrading PostgreSQL versions when using Helm.
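For reference, pointing the chart's persistence at the NFS provisioner should only require a storage-class override in the values. A minimal sketch, assuming the provisioner from DevOps/software-infrastructure#23 registered a storage class named `nfs-client` (check the actual name with `kubectl get storageclass`):

```yaml
# values-redis.yaml -- sketch only; "nfs-client" is an assumed storage
# class name and must match whatever the NFS provisioner registered.
global:
  storageClass: nfs-client
```

Because a StatefulSet's volumeClaimTemplates are immutable, applying this requires recreating the StatefulSet. One non-destructive approach is `kubectl delete statefulset <name> --cascade=orphan`, which deletes the StatefulSet object while leaving its pods and PVCs running, after which the chart can be re-applied with the new storage class.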

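As for the PostgreSQL pin, charts that pull in PostgreSQL as a Bitnami subchart usually expose the tag under the `postgresql` key. A sketch of the temporary pin, where both the key path and the tag value are assumptions (the tag must match the major version of the existing data directory):

```yaml
# values-nextcloud.yaml -- hypothetical pin; the key path assumes a
# bitnami postgresql subchart, and the tag must match the on-disk
# PostgreSQL major version until a proper pg_upgrade is performed.
postgresql:
  image:
    tag: "16.4.0"   # assumed version; verify against the existing cluster
```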
eric added spent time 2025-08-01 17:57:11 +00:00
1 hour 5 minutes
Author
Owner

The Redis pods have not restarted since the deployment. It is still too soon to tell, but cycling these pods onto the new R720 nodes may have been sufficient to mitigate the issue. I am going to give it a full 24 hours before closing the issue.

Author
Owner

After 48 hours running on the new node, I am satisfied that this is resolved and that switching to the second NFS client was unnecessary.

eric closed this issue 2025-08-03 22:34:58 +00:00
Reference: DevOps/ansible-role-eom#37