Over the years, I made a handful of maps of various things in Cambridge; I have collected some, but not all of them, on this page about housing things in Cambridge.

This includes things like maps of where you could legally build a fourplex (short answer: not many places!), the distribution of tax paid per parcel (Kendall Square pays a lot!), and more.

crschmidt.net/housing/cambridg

[Link preview: crschmidt.net, "Housing Explorations in Cambridge": Housing-related explorations in Cambridge.]

Fun fact: sharing this link on Mastodon caused my server to serve 112,772,802 bytes of data, in 430 requests, over the 60 seconds after I posted it (>7 r/s). Not because humans wanted them, but because of the LinkFetchWorker, which kicks off 1-60 seconds after Mastodon indexes a post (and possibly before it's ever seen by a human).

Every Mastodon instance fetches and stores its own local copy of my 750 KB preview image.
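
For scale, here's a back-of-the-envelope sketch of that fan-out; the numbers at the top are from the post, but the 50 KB page size and the 2,600-instance count are made-up assumptions, not measurements:

```python
# Rough fan-out arithmetic for the numbers quoted above.
# Assumption: each federated instance fetches the page HTML plus the
# preview image exactly once. The page size below is a guess.

PREVIEW_IMAGE_BYTES = 750 * 1024   # the ~750 KB preview image
OBSERVED_BYTES = 112_772_802       # served in the 60 s after posting
OBSERVED_REQUESTS = 430

print(f"observed rate: {OBSERVED_REQUESTS / 60:.1f} requests/second")
print(f"observed average: {OBSERVED_BYTES / OBSERVED_REQUESTS / 1024:.0f} KB/request")

def fanout_bytes(instances: int, page_bytes: int = 50 * 1024) -> int:
    """Bytes served if every instance fetches page + image once."""
    return instances * (page_bytes + PREVIEW_IMAGE_BYTES)

# Hypothetical: a post that federates to 2,600 instances.
print(f"2,600-instance fan-out: ~{fanout_bytes(2600) / 1e6:.0f} MB for one link")
```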

(I was inspired to look by @jwz's post: mastodon.social/@jwz/109411593.)

[Link preview: jwz (@jwz@mastodon.social): "Mastodon stampede. 'Federation' now apparently means 'DDoS yourself.' Every time I do a new blog post, within a second I have over a thousand simultaneous hits of that URL on my web server from unique IPs. Load goes over 100, and mariadb stops..." https://jwz.org/b/yj6w]

@crschmidt well, this sounds like a p0 bug. Mastodon is going into robots.txt on many servers once this gets noticed widely.

@cshabsin Don't worry! I just confirmed that Mastodon doesn't respect robots.txt for any of these fetches, so even if a rule is added to robots.txt, it will have no effect!
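
For reference, a minimal sketch of the check Mastodon is skipping, using Python's stdlib robots.txt parser purely as an illustration (Mastodon itself is Ruby, so this is not its actual code):

```python
# Illustrative only: what honoring robots.txt before a preview fetch
# could look like. Not Mastodon's real implementation.
from urllib.parse import urlsplit
from urllib.robotparser import RobotFileParser

def may_fetch_preview(url: str, user_agent: str = "Mastodon") -> bool:
    parts = urlsplit(url)
    robots = RobotFileParser()
    robots.set_url(f"{parts.scheme}://{parts.netloc}/robots.txt")
    robots.read()  # note: this check is itself one more HTTP request per server
    return robots.can_fetch(user_agent, url)
```

Note the comment on the `read()` call: the check itself costs a request, which is exactly the trade-off that comes up further down the thread.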

@crschmidt @cshabsin that definitely seems ... inappropriate.

@tw is there a P higher than 0? Or maybe before this detail, it was only p1, and now it's p0.

@cshabsin ah, the classic "P negative one" 🤣

@tw @cshabsin all I can say is that following robots.txt would add another request to the pile for each server, so I’m rather happier it didn’t! (But I acknowledge this is a personal preference.)

@crschmidt robots.txt is so much cheaper though...

@cshabsin depends! Is robots.txt a static file? Or is it a URL served by your content management system, which has a full stack of URL resolution, middleware lookups, etc. in order to determine that, yes, that is a 404, because it doesn't have a robots.txt?
For me it’s certainly the first one, but that’s not universally true.
(Still think respecting it is probably correct, just noting that checking it isn’t free.)

@crschmidt I think given the purpose of the file, any web server where robots.txt is expensive to serve is badly implemented.

@tw @crschmidt @cshabsin link preview bots all ignore robots.txt, so Mastodon is at least following precedent here.

Except that I think Mastodon's implementation is wrong: on a centralized network the preview is created at the 'request' of the person sharing, so robots.txt doesn't apply. But here it's created fully automatically, so it really should apply. The fix would be to capture the preview at sharing time and send it along in the post, which is also more efficient (though prone to abuse?)
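
A hypothetical sketch of what "send it along in the post" could look like on the wire; the `previewCard` field and everything in it are invented for illustration, not part of ActivityPub or any Mastodon release:

```python
# Invented example: the author's server builds the preview card once at
# posting time and federates it inline with the post. Field names are
# hypothetical, not from any actual spec.
post = {
    "type": "Note",
    "content": "Housing explorations in Cambridge: https://crschmidt.net/housing/",
    "previewCard": {
        "url": "https://crschmidt.net/housing/",
        "title": "Housing Explorations in Cambridge",
        "description": "Housing-related explorations in Cambridge.",
        # receiving servers render this cached copy instead of hitting the origin
        "image": "https://hachyderm.io/system/preview_cards/example.png",
    },
}
```

Receiving servers would render the card directly: one fetch by the originating server instead of one per instance, at the cost of having to trust (or verify) the card's contents, which is the abuse concern.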

@jefftk @tw @cshabsin yeah, the prone-to-abuse concern and the "hard to standardize across all implementations" problem are the reasons it was rejected in 2017 and has languished as an untouched feature request since 2020, respectively. Time to rethink that. (I don't love that a single implementation is 95% of the fediverse, but it is; standardization is frankly secondary to making sure the core implementation works well.)

@gme @crschmidt @cshabsin @jefftk That's a pretty dismissive take on software violating an agreed-upon Internet standard...

I read the blog post and at the very top OP even admits that Mastodon is not a crawler. So what "standard" is being broken?

@gme @crschmidt @tw @cshabsin where do you see that in the blog post? I agree that scraping a preview isn't crawling if you do it at send time, but doing it automatically at retrieve time is.

Let's accept your argument as true for a moment.

Doesn't change the fact that there exists a technical solution to the problem you present in your argument.

Put the site behind a CDN.

@gme @crschmidt @tw @cshabsin @jefftk let's put this straight: you think everyone will the smallest blog site which will have a dozen of human visits per day tops should use a CDN because Mastodon can't come up with a reasonable mechanism to share website previews?

> let's get this straight: you think everyone with even the smallest blog site, which will have a dozen human visits per day tops, should use a CDN because Mastodon can't come up with a reasonable mechanism to share website previews?

Yes. If you want to prevent your site from getting DDoS'd, which is exactly what we're talking about here. Cloudflare offers a free WAF and CDN to everybody.

If you have the option of deploying a no-cost WAF and CDN in front of your website, what's the excuse for not deploying it?

@gme @crschmidt @tw @cshabsin @jefftk I thought one of the points of the Fediverse was to get back some of the freedom and independence we allowed big corporations to take from us. If the consequence of Mastodon taking off is to kill the possibility of self-hosting without using another big corporation's service, I'm not sure I'm in anymore. Also, a CDN service at no cost? You're the product.

@corpsmoderne @gme @crschmidt @tw @cshabsin @jefftk

Well it is high time to get serious about breaking up the large monopolies and oligopolies.

Money is like gravity. It tends to clump together in ever greater amounts, concentrating power in the hands of very few individuals.

Governments should act as a countervailing force by regulating and breaking up the big conglomerates.

en.m.wikipedia.org/wiki/Monopo
en.m.wikipedia.org/wiki/Histor
mattstoller.substack.com/
mattstoller.substack.com/s/mon

@corpsmoderne @gme @crschmidt @tw @cshabsin @jefftk

Re: CDN/WAF : this sounds like a protection scheme.

Regardless, you know how many people know how to set that up?

Of the people most likely to be vulnerable to a Mastodon DDoS (small businesses, independent bloggers) how many of _them_ even know what Cloudflare is, much less a CDN/WAF?

1/2

@corpsmoderne @gme @crschmidt @tw @cshabsin @jefftk

There are over 2.6k instances; it's crazy to assume that every website any of the 7m+ Mastodon users links to now has to worry about an accidental DDoS over something they have no part in.

2/2

@gme @crschmidt @tw @cshabsin in the portion of the post that you've screenshotted I'm only talking about fetching at posting time

You also wrongly assume that the fediverse is only made up of Mastodon servers. When I receive a post on Pleroma, my Pleroma instance also fetches the URL to generate a preview.

Again, the technical solution to this technical problem is for a site to be behind a CDN. If a site is getting hammered to the point where it can't handle the legitimate traffic to it, then it should be placed behind a CDN.

@gme @crschmidt @tw @cshabsin saying "Mastodon is doing the wrong thing here" doesn't mean Pleroma isn't also doing the wrong thing!

@gme @crschmidt @tw @cshabsin does the existence of CDNs mean Googlebot should also feel free to ignore robots.txt?

@gme it's really not. It's the same argument. Your argument is "My service is impacting you? That's your problem, not mine."

Yes, and you have two choices, right?

You can block the requests in your WAF.

You can place your site behind a CDN to absorb the load.

As for Google not respecting robots.txt, that's immaterial.

First, robots.txt is a convention. It's not a rule. It's not a law.

Second, plenty of bots and crawlers don't respect robots.txt. Some malicious, sure. But most are benign!

So either block the mastodon user-agent from hammering your site in your WAF, or place your site behind a CDN to absorb and spread the load.

This isn't complicated.
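
For anyone curious what "block the user-agent in your WAF" amounts to in practice, here's a sketch: a tiny Python WSGI middleware that rejects fetchers whose User-Agent mentions a fediverse server. The token list is an assumption based on these fetchers identifying themselves; check your own logs for the exact strings:

```python
# Hypothetical server-side filter: reject known fediverse preview fetchers
# by User-Agent before doing any expensive work. Token strings are
# assumptions; verify against your own access logs.
BLOCKED_UA_TOKENS = ("Mastodon", "Pleroma")

def block_preview_fetchers(app):
    """Wrap a WSGI app so matching User-Agents get a cheap 403."""
    def middleware(environ, start_response):
        user_agent = environ.get("HTTP_USER_AGENT", "")
        if any(token in user_agent for token in BLOCKED_UA_TOKENS):
            start_response("403 Forbidden", [("Content-Type", "text/plain")])
            return [b"Link-preview fetchers are blocked here.\n"]
        return app(environ, start_response)
    return middleware
```

Of course, as noted upthread, each blocked fetch is still a request your server has to answer; this caps the work per request, not the request count.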

@gme @crschmidt @tw @cshabsin I don't see how? It seems like the position you were staking out is that it doesn't matter whether the server is behaving contrary to spec, all that matters is that there is a technical workaround available to the site?

What spec? What standard? What law? What regulation?

[RFC 9309](https://www.rfc-editor.org/rfc/rfc9309.html)?

Section 1 of RFC 9309 very clearly states:

> These rules are not a form of access authorization.

Section 2.3.1.1 states:

> If the crawler successfully downloads the robots.txt file, the crawler MUST follow the parseable rules.

But nothing in the RFC mandates that the robots.txt file actually be downloaded.

In fact, Section 3 of the RFC goes to great lengths to point out:

> The Robots Exclusion Protocol is not a substitute for valid content security measures.

@gme it's obviously not a security measure, but your point seems valid. It's a method for being a good citizen of the Internet. If Mastodon, as a platform, doesn't care about following these kinds of norms, then I'm not sure how all these otherwise tech savvy people will be able to continue to support it. The social network is trivially weaponizable. Even for people with CDNs, traffic isn't free.

At least Cloudflare does have a free plan so there really isn't an excuse. If a site is personal and isn't making any money or generating any revenue there's no fiscal reason why it can't be hosted behind Cloudflare's free CDN & WAF which should mitigate these issues.

Every single one of my sites is behind Cloudflare.

For the sites I actually make money from, I gladly pay the $20 a month.

For the sites that I don't make any money on and that are personal and non-commercial, I'm using the free plan.

Simple.

@gme @crschmidt @tw @cshabsin @jefftk I will likely regret wading in here but this is a rather bizarre thread. Spec or not, it’s poor design. It’s irresponsible. It’s messy. It’s resource intensive. To say “put a CDN on it” doesn’t change the waste - it hides the bug/implementation and shifts the responsibility. @jwz and friends are correct to bring attention to it. Don’t bury it. Fix it. @Gargron

@shanselman I will also likely regret wading in on this, but:

1: Using robots.txt requires also fetching robots.txt first, so honoring it would only marginally reduce the total number of requests (see the rough arithmetic sketched after this list).

2: Caching and CDNs are already a well-established pattern on the web, and necessary for a lot of things that have nothing to do with Mastodon/fediverse. Solving that problem other ways will likely create new problems, so at least that falls back on existing solutions that are known to work.
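
To make point 1 concrete, a toy model with hypothetical counts (the instance number is made up):

```python
# Toy model for point 1: robots.txt checks add requests rather than
# removing them, unless servers actually opt out. Counts are hypothetical.
instances = 1000                 # made-up number of federated instances
without_robots = instances * 2   # page HTML + preview image, per instance
with_robots = instances * 3      # robots.txt + page HTML + preview image

print(f"ignoring robots.txt:  {without_robots} requests")
print(f"honoring robots.txt:  {with_robots} requests (+50% in this model)")
# A site that *disallows* fetching saves the page+image fetches but still
# serves every robots.txt request: 1000 requests instead of 2000.
```
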
But also, I'm reading back up this thread... come on everyone, be nice! We are all on the same side!

@gme @crschmidt @tw @cshabsin @jefftk
Lack of alt text on a text-only image... 😭 I think many clients even offer to OCR it for you. This small thing goes a long way to making this place more open and accessible! 🙏

@crschmidt @cshabsin They at least identify themselves at the user-agent level, though, so you can filter on the server side.

@Snausages @crschmidt Well, that's going to make for a nice user experience when Mastodon servers can no longer fetch previews anywhere because everyone has figured out that Mastodon is a DoS engine...

@Snausages @cshabsin bandwidth is not the only problem. Many web services are designed to serve their expected usage; even for popular-ish blogs, that is more like "one request per hour" than "50 requests per second". Having 973 requests come in simultaneously and then trickling out the response bytes is not solving the problem on the server side.

@Snausages @cshabsin but even so, the vector for abuse here is massive: every reply to George Takei effectively commands a distributed network of 1000+ nodes to make requests to any URL you provide! Reply to him 10 times and you've generated 10,000+ requests to your target over 60 seconds... How many servers aren't designed for 133 qps? How about 266 qps?

Also, how long would it take whatever server he is hosted on to respond to an abuse claim if one was reported?

@Snausages and how many admins are going to take this into account when they shut mastodon out? How many are going to be paying enough attention to remove the block once Mastodon announces they've fixed this?

@crschmidt @cshabsin why can’t the server that link was posted to retrieve the preview, and other servers retrieve it from the server to which it was posted?

@crschmidt This needs more than "someone to implement it". There's design work here on multiple fronts:
* a method for including the preview inline with the post (generated by the initiating server)
* a method for users to identify a preview as fraudulent/abusive and have their local server regenerate the preview (without admin interaction)
* a moderator interface for reviewing such requests, to get visibility into patterns of fraudulent previews

Anything else?
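
For what it's worth, a hypothetical data-model sketch of the pieces listed above; all names and fields are invented to make the design concrete, not taken from any existing Mastodon code or proposal:

```python
# Invented data model for the design work listed above.
from dataclasses import dataclass, field
from datetime import datetime, timezone

@dataclass
class PreviewCard:
    url: str
    title: str
    image_url: str
    generated_by: str      # domain of the server that built the card

@dataclass
class PreviewReport:
    card: PreviewCard
    reporter: str          # user who flagged the card as fraudulent/abusive
    reported_at: datetime = field(default_factory=lambda: datetime.now(timezone.utc))
    regenerated: bool = False  # set once the local server rebuilds the card

# A moderator view could then group PreviewReports by card.generated_by to
# surface servers that repeatedly federate fraudulent previews.
```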