It’s that interoperability of unique instances that makes the Fediverse resistant to scraping. The posts are all public, but crawling it all and categorizing everything is probably like untangling a cotton ball.
Don’t really see the problem. If you pick up the content while web crawling, you will end up with a lot of duplicates, but that’s normal. If you wanted to scrape the Fediverse in particular, you’d know the structure of the data.
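To make the duplicates point concrete, here is a minimal sketch, assuming the crawler has already parsed each post into a dict that carries the ActivityPub `id` URI of the original object. The function names, the `seen_on` field, and the `a.example` / `b.example` hosts are made up for illustration.

```python
from urllib.parse import urlsplit


def dedupe_by_origin_id(posts):
    """Keep the first copy of each post, keyed on its ActivityPub `id` URI."""
    seen = set()
    unique = []
    for post in posts:
        post_id = post.get("id")
        # Skip records with no id, and copies we have already kept.
        if post_id is None or post_id in seen:
            continue
        seen.add(post_id)
        unique.append(post)
    return unique


def origin_instance(post):
    """Return the host part of the post's `id`, i.e. the instance it came from."""
    return urlsplit(post["id"]).netloc


# Hypothetical crawl result: the same post picked up on two instances.
crawled = [
    {"id": "https://a.example/notes/1", "content": "hello", "seen_on": "a.example"},
    {"id": "https://a.example/notes/1", "content": "hello", "seen_on": "b.example"},
    {"id": "https://b.example/notes/7", "content": "hi", "seen_on": "b.example"},
]

unique = dedupe_by_origin_id(crawled)
```

Since a federated copy keeps the `id` of the original object, keying on that field collapses the copies no matter which instance the crawler found them on.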
Or you can host your own instance and let the other servers push their public posts to you through federation (though instances can still defederate from yours).
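As a rough sketch of what the receiving side could look like: in ActivityPub, servers deliver activities by POSTing JSON to an inbox, and a new post typically arrives as a `Create` activity wrapping a `Note`. The function name and sample payload below are made up for illustration, and this leaves out everything a real instance needs, such as verifying the sender and actually getting other servers to deliver to you.

```python
import json


def note_from_inbox_body(body):
    """Return the embedded Note from a delivered Create activity, else None."""
    activity = json.loads(body)
    if activity.get("type") != "Create":
        return None
    obj = activity.get("object")
    # The object may also be a bare URI string; this sketch ignores that case.
    if isinstance(obj, dict) and obj.get("type") == "Note":
        return obj
    return None


# Hypothetical request body, shaped like a Create activity from a.example.
sample_body = json.dumps({
    "type": "Create",
    "actor": "https://a.example/users/alice",
    "object": {
        "id": "https://a.example/notes/1",
        "type": "Note",
        "content": "hello",
    },
})

note = note_from_inbox_body(sample_body)
```

Anything that isn't a `Create` with an embedded `Note` just comes back as `None` here, so likes, boosts, and follows are ignored.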