I just wanted to make sure you were aware: there seem to be long load times, especially when loading the community pages (posts load fairly quickly).
I’ve not tried the onion instance since reporting the data loss issue, but in principle the onion host could be a good candidate for read-only access (scraping).
Would it perhaps make sense to redirect the greedy subnet to the onion instance? I wonder if it’s even possible. The privacyinternational website used to auto-detect requests from Tor exit nodes and automatically redirect them to its onion site. In the case of mander, it would do the same for the subnet causing problems. They are obviously not using Tor to visit your site, but they could have Tor installed. You would effectively be sending the message “hey, plz do your scraping on the onion node,” which is gentler than blocking, in case there is more legitimate traffic from the same subnet. That assumes your problem is not scraping in general, but just that they are hogging bandwidth that competes with regular users. The Tor network supposedly has some built-in anti-DDoS logic now, so they would naturally get bottlenecked, IIUC.
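Just to sketch the idea (untested; the subnet and onion address below are placeholders, not real values):

```python
import ipaddress

# Placeholder values: the real problem subnet and onion address would come
# from the server operator's own observations.
GREEDY_SUBNET = ipaddress.ip_network("203.0.113.0/24")  # TEST-NET-3 stand-in
ONION_HOST = "exampleonionaddressxyz.onion"             # placeholder

def redirect_target(client_ip: str, path: str) -> str | None:
    """Return an onion URL for clients inside the greedy subnet, else None."""
    if ipaddress.ip_address(client_ip) in GREEDY_SUBNET:
        return f"http://{ONION_HOST}{path}"
    return None

# redirect_target("203.0.113.7", "/communities")
# -> "http://exampleonionaddressxyz.onion/communities"
```

In practice this check would probably live in the reverse proxy (nginx’s geo module can map a subnet to a variable, and a redirect can be returned from there) rather than in application code.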
I guess the next question is whether the onion site has a separate bandwidth allocation. But even if it doesn’t, Tor has a natural bottleneck because traffic can only move as fast as the slowest of the three hops in the circuit.
I have experienced issues both over Tor and over the clearnet. The Tor front-end runs on its own server, but it connects to the mander server. So, the server that hosts the front-end via Tor will see the exit node connecting to it, and then the mander server gets the requests via that Tor server. Ultimately some bandwidth is used on both servers, because the data travels from mander to the Tor front-end and then to the exit node. There is also another server that hosts and serves the images.
What I see is not a bandwidth problem, though. It seems like the database queries are the bottleneck. There is a limited number of connections to the database, and some of the queries are complex and CPU-intensive. It is the intense searching through the database that appears to throttle the website.
> So, the server that hosts the front-end via Tor will see the exit node connecting to it
The onion eliminates the use of exit nodes. But I know what you mean.
I appreciate the explanation. It sounds like replicating the backend and DB on the Tor node would help. Not sure how complex it would be to have the DBs synchronise during idle moments.
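I have no idea what the real setup looks like, but the crude version could just be a dump-and-restore during the idle window. A minimal sketch, assuming PostgreSQL and placeholder host/database names (proper streaming or logical replication would be the sturdier answer):

```python
import subprocess

# Very crude one-way "sync": dump the primary database and restore it on
# the Tor-side mirror. Hosts and the database name are placeholders, and
# this assumes passwordless auth (e.g. via a .pgpass file).
def nightly_sync() -> None:
    dump = subprocess.run(
        ["pg_dump", "-Fc", "-h", "primary.internal", "site_db"],
        check=True,
        capture_output=True,
    )
    subprocess.run(
        ["pg_restore", "--clean", "--if-exists",
         "-h", "mirror.internal", "-d", "site_db"],
        input=dump.stdout,
        check=True,
    )
```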
Perhaps a bit radical, but I wonder if it would be interesting to do a nightly DB export to JSON or CSV files that are reachable from the onion front end. Scrapers would prefer that to crawling the pages, and it would be less intrusive on the website. Though I don’t know how tricky it would be to exclude non-public data from the dataset.
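Something like this, maybe. A sketch assuming PostgreSQL; the table and column names are guesses, which is exactly where the non-public-data risk lives:

```python
import subprocess

# Hypothetical nightly export of public rows only. The table and column
# names are made up; the real schema would need auditing so that nothing
# private (emails, deleted or removed content, DMs) ends up in the dump.
EXPORT_QUERY = """
COPY (
    SELECT id, title, url, published
    FROM post
    WHERE NOT deleted AND NOT removed
) TO STDOUT WITH CSV HEADER
"""

def export_public_posts(path: str) -> None:
    """Write the filtered posts as CSV to a file the onion front end serves."""
    with open(path, "wb") as out:
        subprocess.run(["psql", "-d", "site_db", "-c", EXPORT_QUERY],
                       stdout=out, check=True)

# export_public_posts("/var/www/onion/dumps/posts.csv")
```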
Got a few minutes of 503 gateway timeout errors here, but the pages just loaded back up again.
This morning I woke up and a new IP subnet (43.173.0.0/16) was hitting the site excessively from multiple IPs, probably scraping, making the site unresponsive. I blocked that subnet and the site is responsive again.
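In case it’s useful to anyone, a rough way to spot a subnet like this in the access logs is to bucket client IPs by /16 and count requests. A sketch, assuming the client IP is the first field of each log line (the default in nginx/Apache combined logs) and a placeholder log path:

```python
import ipaddress
from collections import Counter

def top_subnets(log_path: str, n: int = 5) -> list[tuple[str, int]]:
    """Count requests per /16 and return the n busiest subnets."""
    counts: Counter[str] = Counter()
    with open(log_path) as log:
        for line in log:
            ip = line.split(" ", 1)[0]
            try:
                net = ipaddress.ip_network(f"{ip}/16", strict=False)
            except ValueError:
                continue  # skip malformed lines
            counts[str(net)] += 1
    return counts.most_common(n)

# print(top_subnets("/var/log/nginx/access.log"))
```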
Working great now, thank you!