ESP32-POE-ISO Reconnection Issue & Potential Solutions

I recently ran into a problem with my Olimex ESP32-POE-ISO devices: they failed to reconnect after a brief network outage. It surfaced during a power outage that took my network switches down and then brought them back online. The Olimex ESPs did not reconnect to Home Assistant (HA) on their own, whereas my other ESPs, which use WiFi, did. I decided to investigate and document my findings and potential solutions for anyone else facing the same problem.

First, I noticed that rebooting the ESP devices (either by power cycling or through the web interface) restored the connection. This suggests that the devices do not recover on their own after a network disruption. The Home Assistant logs showed the following warning:

WARNING <ESP_NAME>: Connection error occurred: [Errno 104] Connection reset by peer

On the ESP console, I saw the following logs:

12:42:03 [D] [api:102] Accepted 192.168.x.x
12:42:03 [W] [api.connection:070] 192.168.x.x: Network unavailable, disconnecting
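
For anyone who wants to watch for this warning automatically, here is a minimal Python sketch that recognizes it in captured log output. The line layout (time, level, component:line, message) is inferred only from the two lines quoted above, and the function name is my own; real logs may carry color codes or other prefixes that this pattern does not handle.

```python
import re

# Layout inferred from the log lines quoted above:
#   HH:MM:SS [LEVEL] [component:line] message
LOG_LINE = re.compile(
    r"^(?P<time>\d{2}:\d{2}:\d{2}) "
    r"\[(?P<level>[A-Z])\] "
    r"\[(?P<component>[\w.]+):(?P<line>\d+)\] "
    r"(?P<message>.*)$"
)


def is_network_unavailable(line: str) -> bool:
    """Return True if the line is the api.connection 'Network unavailable' warning."""
    match = LOG_LINE.match(line.strip())
    if match is None:
        return False  # not in the expected layout
    return (
        match["level"] == "W"
        and match["component"] == "api.connection"
        and "Network unavailable" in match["message"]
    )
```

Feeding it the two lines above should flag only the second one.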

I am using the esp32-poe-iso board type, which appears to be correct according to the documentation. The issue occurs whether I power the ESPs via PoE or USB. I also checked the reboot_timeout setting under api, which is enabled by default with a value of 15 minutes. However, the default setting did not resolve the issue.

After some research, I see two possible approaches. One is to shorten the reboot_timeout setting so that a device that has lost its API connection reboots itself sooner, though I am not sure this addresses the root cause. The other is a more robust external monitoring script or service that actively checks the connection status of the ESP devices and initiates a reboot when a disconnection is detected.
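
To illustrate the monitoring idea, here is a minimal sketch of the decision logic only, assuming you want to reboot a device after several consecutive failed checks rather than on the first one. The class name, the threshold of 3, and the way check results are fed in are all my own assumptions; how you actually probe and reboot a device is left out.

```python
from dataclasses import dataclass, field


@dataclass
class RebootPolicy:
    """Tracks consecutive failed checks per host (hypothetical helper)."""

    threshold: int = 3  # assumed value; tune to your check interval
    failures: dict = field(default_factory=dict)

    def record(self, host: str, reachable: bool) -> bool:
        """Record one check result; return True when the host should be rebooted."""
        if reachable:
            self.failures[host] = 0
            return False
        count = self.failures.get(host, 0) + 1
        if count >= self.threshold:
            # Reset the counter so the next failure does not trigger again at once.
            self.failures[host] = 0
            return True
        self.failures[host] = count
        return False
```

Requiring a few failures in a row avoids rebooting a device over a single dropped check, at the cost of a slower reaction.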

For now, I have implemented a temporary workaround by setting up a cron job that periodically checks the ESP devices’ status and reboots them if they are unreachable. While this is not an ideal long-term solution, it has helped minimize downtime until a more permanent fix is found.
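
A rough sketch of the kind of check such a cron job could run is below. It assumes the devices expose the ESPHome native API on its default port 6053 and simply tests whether a TCP connection can be opened; the function names are my own, and the reboot step is deliberately left out because it depends on your setup. Note that in the logs above the device still accepted the connection before dropping it, so a plain connect test may not catch this exact failure mode; treat the probe as a placeholder.

```python
import socket

API_PORT = 6053  # ESPHome native API default port


def is_reachable(host: str, port: int = API_PORT, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port can be opened."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


def unreachable_hosts(hosts, probe=is_reachable):
    """Return the hosts for which the probe fails, in input order."""
    return [host for host in hosts if not probe(host)]


def main(hosts, probe=is_reachable) -> int:
    """Print each unreachable host; return 1 if any host is down, else 0."""
    down = unreachable_hosts(hosts, probe)
    for host in down:
        print(f"{host}: unreachable")
    return 1 if down else 0
```

From cron, the script would end with something like sys.exit(main(sys.argv[1:])), so that a non-zero exit status can trigger whatever reboot mechanism you use.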

I would appreciate any insights or suggestions from the community on how to resolve this issue permanently. If anyone has encountered similar problems or has successfully implemented a solution, please share your experiences below!

Update: After further testing, I found that shortening reboot_timeout to 5 minutes significantly reduced how often this issue occurs. It does not eliminate the problem completely, but the ESP devices’ network connections are now more reliable.

Stay tuned for more updates as I continue to explore potential solutions!