Over the years, I've made a handful of maps of various things in Cambridge; I've collected some, but not all, of them on this page about housing things in Cambridge.
This includes maps of where you could legally build a fourplex (short answer: not many places!), the distribution of tax paid per parcel (Kendall Square pays a lot!), and more.
Fun fact: sharing this link on Mastodon caused my server to serve 112,772,802 bytes of data, in 430 requests, over the 60 seconds after I posted it (>7 r/s). Not because humans wanted them, but because of the LinkFetchWorker, which kicks off 1-60 seconds after Mastodon indexes a post (and possibly before it's ever seen by a human).
Every Mastodon instance fetches and stores its own local copy of my 750 KB preview image.
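As a quick sanity check on those figures (a minimal sketch in Python; the inputs are just the numbers quoted above):

    # Rough arithmetic on the traffic burst described above.
    total_bytes = 112_772_802
    requests = 430
    window_s = 60

    print(f"avg response: {total_bytes / requests / 1024:.0f} KiB")    # ~256 KiB
    print(f"request rate: {requests / window_s:.1f} r/s")              # ~7.2 r/s
    print(f"bandwidth: {total_bytes / window_s / 1024**2:.1f} MiB/s")  # ~1.8 MiB/s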
(I was inspired to look by @jwz's post: https://mastodon.social/@jwz/109411593248255294.)
@crschmidt well, this sounds like a p0 bug. Mastodon is going into robots.txt on many servers once this gets noticed widely.
@cshabsin Don't worry! I just confirmed that Mastodon doesn't respect robots.txt for any of these fetches, so even if it's added to robots.txt, it will have no effect!
@crschmidt how "convenient"
@crschmidt @cshabsin that definitely seems ... inappropriate.
@tw is there a P higher than 0? Or maybe before this detail, it was only p1, and now it's p0.
@cshabsin ah, the classic "P negative one"
@crschmidt robots.txt is so much cheaper though...
@cshabsin depends! Is robots.txt a static file? Or is it a URL served by your content management system, which runs a full stack of URL resolution, middleware lookups, etc. just to determine that, yes, that is a 404, because it doesn't have a robots.txt?
For me it’s certainly the first one, but that’s not universally true.
(Still think respecting it is probably correct, just noting that checking it isn’t free.)
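(A minimal sketch of the cheap path, using Flask as a stand-in for whatever app server fronts the site; the point is just to answer robots.txt before any CMS routing runs.)

    from flask import Flask, Response

    app = Flask(__name__)

    # Serve robots.txt from memory, short-circuiting URL resolution,
    # middleware, and database lookups. Contents are illustrative.
    ROBOTS_TXT = "User-agent: *\nAllow: /\n"

    @app.route("/robots.txt")
    def robots():
        return Response(ROBOTS_TXT, mimetype="text/plain")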
@crschmidt I think given the purpose of the file, any web server where robots.txt is expensive to serve is badly implemented.
@tw @crschmidt @cshabsin link preview bots all ignore robots.txt, so Mastodon is at least following precedent here.
Except that I think Mastodon's implementation is wrong: on a centralized network the preview is created at the 'request' of the person sharing, so robots.txt doesn't apply. But here it's created fully automatically, so it really should apply. The fix would be to capture the preview at sharing time and send it along with the post, which is also more efficient (though prone to abuse?)
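(A sketch of what "send it along" could look like: a preview card embedded in the outgoing post, generated once by the author's own server. The field names here are invented for illustration, not an existing ActivityPub or Mastodon schema.)

    # Hypothetical post payload with an inline preview card. Receiving
    # instances could render this card instead of each fetching the URL.
    post = {
        "content": "Check out these Cambridge housing maps!",
        "url": "https://example.com/cambridge-maps",
        "preview_card": {
            "title": "Housing things in Cambridge",
            "description": "Maps of zoning, taxes, and more.",
            "image": "https://origin.example.social/cache/preview.webp",
        },
    }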
@jefftk @tw @cshabsin yeah, "prone to abuse" and "hard to standardize across all implementations" are the reasons it was rejected in 2017 and has languished as an untouched feature request since 2020, respectively. Time to rethink that. (I don't love that a single implementation is 95% of the fediverse, but it is; standardization is frankly secondary to making sure the core implementation works well.)
@gme @crschmidt @cshabsin @jefftk That's a pretty dismissive take on software violating an agreed-upon Internet standard...
@gme @crschmidt @tw @cshabsin where do you see that in the blog post? I agree that scraping a preview isn't crawling if you do it at send time, but doing it automatically at retrieve time is
@gme @crschmidt @tw @cshabsin @jefftk let's put this straight: you think everyone, down to the smallest blog with a dozen human visits per day tops, should use a CDN because Mastodon can't come up with a reasonable mechanism to share website previews?
@gme stop saying Cloudflare.
@gme @crschmidt @tw @cshabsin @jefftk I thought one of the points of the Fediverse was to get back some of the freedom and independence we allowed big corporations to take from us. If the consequence of Mastodon taking off is to kill the possibility of self-hosting without relying on another big corporation's service, I'm not sure I'm in anymore. Also, a CDN service at no cost? You're the product.
@corpsmoderne @gme @crschmidt @tw @cshabsin @jefftk
Well it is high time to get serious about breaking up the large monopolies and oligopolies.
Money is like gravity: it tends to clump together in ever greater amounts, concentrating power in the hands of very few individuals.
Governments should act as a countervailing force by regulating and breaking up the big conglomerates.
https://en.m.wikipedia.org/wiki/Monopoly
https://en.m.wikipedia.org/wiki/History_of_IBM
https://mattstoller.substack.com/
https://mattstoller.substack.com/s/monopoly-bites
@corpsmoderne @gme @crschmidt @tw @cshabsin @jefftk
Re: CDN/WAF: this sounds like a protection scheme.
Regardless, you know how many people know how to set that up?
Of the people most likely to be vulnerable to a Mastodon DDoS (small businesses, independent bloggers) how many of _them_ even know what Cloudflare is, much less a CDN/WAF?
1/2
@corpsmoderne @gme @crschmidt @tw @cshabsin @jefftk
There are over 2.6k instances; it's crazy to expect that every website any of the 7m+ Mastodon users links to now has to worry about an accidental DDoS over something they have no part in.
2/2
@gme @crschmidt @tw @cshabsin in the portion of the post that you've screenshotted I'm only talking about fetching at posting time
@gme @crschmidt @tw @cshabsin saying "Mastodon is doing the wrong thing here" doesn't mean Pleroma isn't also doing the wrong thing!
@gme @crschmidt @tw @cshabsin does the existence of CDNs mean Googlebot should also feel free to ignore robots.txt?
@gme it's really not. It's the same argument. Your argument is "My service is impacting you? That's your problem, not mine."
@gme looks like it went from "widely adopted standard" to "published rule" in September. https://en.m.wikipedia.org/wiki/Robots_exclusion_standard
@gme @crschmidt @tw @cshabsin I don't see how? It seems like the position you were staking out is that it doesn't matter whether the server is behaving contrary to spec, all that matters is that there is a technical workaround available to the site?
@gme it's obviously not a security measure, but your point seems valid. It's a method for being a good citizen of the Internet. If Mastodon, as a platform, doesn't care about following these kinds of norms, then I'm not sure how all these otherwise tech savvy people will be able to continue to support it. The social network is trivially weaponizable. Even for people with CDNs, traffic isn't free.
@gme @crschmidt @tw @cshabsin @jefftk I will likely regret wading in here but this is a rather bizarre thread. Spec or not, it’s poor design. It’s irresponsible. It’s messy. It’s resource intensive. To say “put a CDN on it” doesn’t change the waste - it hides the bug/implementation and shifts the responsibility. @jwz and friends are correct to bring attention to it. Don’t bury it. Fix it. @Gargron
@gme @crschmidt @tw @cshabsin @jefftk
Lack of alt text in a text-only image... I think many clients even offer to OCR it for you. This small thing goes a long way toward making this place more open and accessible!
@crschmidt @cshabsin They at least identify themselves at the user-agent level, though, so you can filter on the server side
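(Mastodon's fetches send a User-Agent along the lines of "http.rb/5.x (Mastodon/4.x; +https://instance.example/)". A minimal sketch of that server-side filtering as WSGI middleware; matching on the "Mastodon/" substring is an assumption about that format.)

    # Minimal WSGI middleware that rejects any request whose User-Agent
    # identifies a Mastodon instance's link fetcher.
    def block_mastodon_previews(app):
        def middleware(environ, start_response):
            ua = environ.get("HTTP_USER_AGENT", "")
            if "Mastodon/" in ua:
                start_response("429 Too Many Requests",
                               [("Content-Type", "text/plain")])
                return [b"Preview fetching rate-limited.\n"]
            return app(environ, start_response)
        return middleware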
@Snausages @crschmidt Well, that's going to make for a nice user experience when Mastodon servers can no longer fetch previews anywhere because everyone has figured out that Mastodon is a DoS engine...
@cshabsin @crschmidt Don't have to block, just bwlimit
@Snausages @cshabsin bandwidth is not the only problem. Many web services are designed to serve their expected usage; even for popular-ish blogs, that is more like "one request per hour" than "50 requests per second". Having 973 requests come in simultaneously and then trickling out the response bytes is not solving the problem on the server side.
@Snausages @cshabsin but even so, the vector for abuse here is massive, every reply to George Takei is literally just a distributed network of 1000+ nodes to make requests to any URL you provide! Reply to him 10 times and you've generated 10,000+ requests to your target over 60 seconds... How many servers aren't designed for 133 qps? How about 266 qps?
Also, how long would it take whatever server he is hosted on to respond to an abuse claim if one was reported?
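(A back-of-the-envelope version of that amplification; the instance count and fetch window are round-number assumptions:)

    # Rough amplification math for the abuse scenario sketched above.
    instances = 1000   # nodes that each fetch the linked URL
    replies = 10       # replies containing the target link
    window_s = 60      # LinkFetchWorker's 1-60 second delay window

    requests = instances * replies
    print(f"{requests} requests, ~{requests / window_s:.0f} r/s at the target")
    # -> 10000 requests, ~167 r/s at the target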
@Snausages and how many admins are going to take this into account when they shut mastodon out? How many are going to be paying enough attention to remove the block once Mastodon announces they've fixed this?
@crschmidt @cshabsin why can’t the server that the link was posted to retrieve the preview, and other servers retrieve it from the server to which it was posted?
@guysherman @cshabsin It can and should. It just needs a champion to implement it. (And work on some standardization across the Fediverse to make the change as effective as possible.) https://github.com/mastodon/mastodon/issues/12738
@crschmidt This needs more than "someone to implement it". There's design work here on multiple fronts:
* a method for including the preview inline with the post (generated by the initiating server)
* a method for users to identify a preview as fraudulent/abusive and have their local server regenerate the preview (without admin interaction)
* a moderator interface for reviewing such requests, to get visibility into patterns of fraudulent previews
Anything else?
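(One hedged sketch of how the first two items might fit together: treat the sender-supplied card as a hint, and only refetch locally once a user reports it. Every name here is hypothetical; none of this exists in Mastodon today.)

    # Hypothetical receiving-side logic: trust the inline card by default,
    # regenerate locally on report, and surface the pair to moderators.
    def card_for(post, reported, fetch_preview, moderation_queue):
        if not reported:
            return post["preview_card"]        # zero extra requests to the site
        card = fetch_preview(post["url"])      # single local refetch
        moderation_queue.append((post, card))  # visibility into abuse patterns
        return card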