Okay, the number of devices probably isn’t the main factor; around 50 devices really shouldn’t be a problem. Your issues are more likely related to how the devices are spread out and how far they are from the coordinator. In theory, you should be able to solve this by placing mains-powered Zigbee devices, like smart plugs, between the coordinator and the distant devices, because those act as routers and extend the Zigbee mesh (similar to how it works with Thread). However, if you already have several Thread Border Routers, making the switch makes a lot of sense too.
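To illustrate why a router between the coordinator and a distant sensor helps, here is a minimal sketch in Python. It is only a toy model: it assumes every radio has the same fixed range and ignores walls, interference, and Zigbee's real routing logic. The function name, device names, and coordinates are all made up for the example.

```python
from collections import deque
from math import dist


def reachable_devices(coordinator, routers, end_devices, radio_range):
    """Return names of end devices that can reach the coordinator,
    either directly or by hopping through mains-powered routers."""
    # Nodes that can relay traffic: the coordinator plus every router.
    relays = {"coordinator": coordinator, **routers}
    connected = {"coordinator"}
    queue = deque(["coordinator"])
    # Breadth-first search: a router joins the mesh if it is within
    # range of any relay that is already connected.
    while queue:
        current = queue.popleft()
        for name, pos in relays.items():
            if name not in connected and dist(relays[current], pos) <= radio_range:
                connected.add(name)
                queue.append(name)
    # An end device is covered if any connected relay is within range.
    return sorted(
        name
        for name, pos in end_devices.items()
        if any(dist(pos, relays[r]) <= radio_range for r in connected)
    )


# Hypothetical layout in metres, with an assumed 10 m radio range.
sensors = {"garden": (18, 0), "hall": (5, 0)}
print(reachable_devices((0, 0), {}, sensors, 10))                 # ['hall']
print(reachable_devices((0, 0), {"plug": (9, 0)}, sensors, 10))   # ['garden', 'hall']
```

In this toy layout the garden sensor is out of range on its own and only becomes reachable once the smart plug sits roughly halfway between it and the coordinator.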
I don’t know why your HA would go down, but a system health monitoring integration (like HAGHS, Watchman, or Spook) might help you track it down, or you could try letting an AI analyze your setup and logs.
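If you want a quick look without installing anything, a rough sketch like the one below lists entities that are currently unavailable, using Home Assistant's REST API (GET /api/states with a long-lived access token). This is not how the integrations above work, just a simple check. The base URL and token are placeholders you would have to replace, and the helper names are my own.

```python
import json
import urllib.request


def fetch_states(base_url, token):
    """Fetch all entity states from Home Assistant's REST API."""
    request = urllib.request.Request(
        f"{base_url}/api/states",
        headers={"Authorization": f"Bearer {token}"},
    )
    with urllib.request.urlopen(request, timeout=10) as response:
        return json.load(response)


def unavailable_entities(states):
    """Return the sorted entity IDs whose state is 'unavailable' or 'unknown'."""
    return sorted(
        s["entity_id"]
        for s in states
        if s.get("state") in ("unavailable", "unknown")
    )


# Offline example with made-up entities; against a real instance you would
# call something like:
#   states = fetch_states("http://homeassistant.local:8123", "YOUR_TOKEN")
sample = [
    {"entity_id": "sensor.garden_temperature", "state": "unavailable"},
    {"entity_id": "light.hall", "state": "on"},
    {"entity_id": "binary_sensor.door", "state": "unknown"},
]
print(unavailable_entities(sample))
```

Running this on a schedule and comparing the results over time can show whether the same devices keep dropping out, which usually points to a range or mesh problem rather than to HA itself.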
My setup is built quite differently to avoid exactly these kinds of dropouts:
Network Structure & Bridging: I use multiple Aqara Hubs (Zigbee + TBR), and every sensor is directly connected to its nearest hub. I also use several Zigbee smart plugs to extend the signal within their respective meshes. Each of these hubs is then integrated into Home Assistant (HA) as a Matter Bridge. For any non-Aqara devices, I run an additional, dedicated Zigbee coordinator and a Z-Wave stick directly on my HA server.
Redundancy & Fallbacks: On top of that, I have several Apple Home Hubs acting as Thread Border Routers. If one fails, the others seamlessly take over. I am currently also considering getting an Aqara M3 Hub. It would take over the lead role in the Aqara network, and if it ever goes down, my existing hubs would just act as fallbacks.
Ecosystems & Automations: All my sensors and devices are connected to Aqara Home, Apple Home, and HA simultaneously. I distribute my automations across the different ecosystems depending on the specific use case. This makes a complete system failure highly unlikely; if one ecosystem goes down, it only affects the small part of the setup that depends on it.
Hardware: My HA instance runs on relatively powerful hardware (a Synology+ NAS) and has literally never gone down. Theoretically, I could set it up to be completely fail-safe with high availability, but honestly, it runs so stably that doing so would basically just be a waste of money.