Hachyderm @hachyderm

Recent searches

Search options

Only available when logged in.

**Christopher Schmidt** @crschmidt@better.boston · Nov 26, 2022

Nov 26, 2022

Christopher Schmidt @crschmidt@better.boston

Over the years, I made a handful of maps of various things in Cambridge; I have collected some, but not all of them, on this page about housing things in Cambridge.

This includes things like maps of where you could legally build a fourplex (short answer: not many places!); the distribution of tax paid per parcel (Kendall Square pays a lot!) and more.

https://crschmidt.net/housing/cambridge/

crschmidt.netHousing Explorations in CambridgeHousing-related explorations in Cambridge.

#Projects #CambMA #Cambridge

**Christopher Schmidt** @crschmidt@better.boston · Nov 26, 2022

Nov 26, 2022

Christopher Schmidt @crschmidt@better.boston

Fun fact: sharing this link on Mastodon caused my server to serve 112,772,802 bytes of data, in 430 requests, over the 60 seconds after I posted it (>7 r/s). Not because humans wanted them, but because of the LinkFetchWorker, which kicks off 1-60 seconds after Mastodon indexes a post (and possibly before it's ever seen by a human).

Every Mastodon instance fetches and stores their own local copy of my 750kb preview image.

(I was inspired by to look by @jwz's post: https://mastodon.social/@jwz/109411593248255294.)

Mastodonjwz (@jwz@mastodon.social)Mastodon stampede. "Federation" now apparently means "DDoS yourself." Every time I do a new blog post, within a second I have over a thousand simultaneous hits of that URL on my web server from unique IPs. Load goes over 100, and mariadb stops... https://jwz.org/b/yj6w

Chris Shabsin @cshabsin@hachyderm.io

@crschmidt well, this sounds like a p0 bug. Mastodon is going into robots.txt on many servers once this gets noticed widely.

Nov 26, 2022, 10:02 PM··Web

0boosts·13favorites

**Christopher Schmidt** @crschmidt@better.boston · Nov 26, 2022

Nov 26, 2022

Christopher Schmidt @crschmidt@better.boston

@cshabsin Don't worry! I just confirmed that Mastodon doesn't respect robots.txt for any of these fetches, so even if it's added to robots.txt, it will have no effect!

**Chris Shabsin** @cshabsin · Nov 26, 2022

Nov 26, 2022

Chris Shabsin @cshabsin

@crschmidt how "convenient"

**Tim W** @tw@cantos.social · Nov 26, 2022

Nov 26, 2022

Tim W @tw@cantos.social

@crschmidt @cshabsin that definitely seems ... inappropriate.

**Chris Shabsin** @cshabsin · Nov 27, 2022

Nov 27, 2022

Chris Shabsin @cshabsin

@tw is there a P higher than 0? Or maybe before this detail, it was only p1, and now it's p0.

**Tim W** @tw@cantos.social · Nov 27, 2022

Nov 27, 2022

Tim W @tw@cantos.social

@cshabsin ah, the classic "P negative one"

**Christopher Schmidt** @crschmidt@better.boston · Nov 27, 2022

Nov 27, 2022

Christopher Schmidt @crschmidt@better.boston

@tw @cshabsin all I can say is that following robots.txt would add another request to the pile for each server, so I’m rather happier it didn’t! (But I acknowledge this is a personal preference.)

**Chris Shabsin** @cshabsin · Nov 27, 2022

Nov 27, 2022

Chris Shabsin @cshabsin

@crschmidt robots.txt is so much cheaper though...

**Christopher Schmidt** @crschmidt@better.boston · Nov 27, 2022

Nov 27, 2022

Christopher Schmidt @crschmidt@better.boston

@cshabsin depends! Is robots.txt a static file? Or is it a URL served by your content management system, which has a full stack of URL resolution, middleware lookups, etc. in order to determine that, yes, that is a 404, because it don’t have a robots.txt?
For me it’s certainly the first one, but that’s not universally true.
(Still think respecting it is probably correct, just noting that checking it isn’t free.)

**Chris Shabsin** @cshabsin · Nov 27, 2022

Nov 27, 2022

Chris Shabsin @cshabsin

@crschmidt I think given the purpose of the file, any web server where robots.txt is expensive to serve is badly implemented.

**Jeff Kaufman** @jefftk@mastodon.mit.edu · Nov 27, 2022

Nov 27, 2022

Jeff Kaufman @jefftk@mastodon.mit.edu

@tw @crschmidt @cshabsin link preview bots all ignore robots.txt, so mastodon is at least following precedent here.

Except that I think Mastodon's implementation is wrong: on a centralized network the preview is created at the 'request' of the person sharing, so robots.txt doesn't apply. But here it's created fully automatically, so it really should apply. The fix would be to capture the site at sharing time and send it along in the post, which is also more efficient (though prone to abuse?)

**Christopher Schmidt** @crschmidt@better.boston · Nov 27, 2022

Nov 27, 2022

Christopher Schmidt @crschmidt@better.boston

@jefftk @tw @cshabsin yeah, the prone to abuse and "hard to standardize across all implementations" are the reasons it was rejected in 2017, and has languished as an untouched feature request since 2020 (respectively). Time to rethink that. (I don't love that a single implementation is 95% of the fediverse, but it is; standardization is frankly secondary to making sure the core implementation works well.

**Jeff Kaufman** @jefftk@mastodon.mit.edu · Nov 27, 2022

Nov 27, 2022

Jeff Kaufman @jefftk@mastodon.mit.edu

@crschmidt @tw @cshabsin Follow-up https://mastodon.mit.edu/@jefftk/109416209502343043 https://www.jefftk.com/p/mastodons-dubious-crawler-exemption https://github.com/mastodon/mastodon/issues/21738

mastodon.mit.eduJeff Kaufman (@jefftk@mastodon.mit.edu)Either Mastodon's link preview bot should obey robots.txt or Mastodon needs O(1) link previews: https://www.jefftk.com/p/mastodons-dubious-crawler-exemption

**George Ellenburg (he/him/his)** @gme@bofh.social · Nov 27, 2022

Nov 27, 2022

George Ellenburg (he/him/his) @gme@bofh.social

It's 2022. Use a CDN. Cloudflare is free.

**Tim W** @tw@cantos.social · Nov 27, 2022

Nov 27, 2022

Tim W @tw@cantos.social

@gme @crschmidt @cshabsin @jefftk That's a pretty dismissive take on software violating an agreed-upon Internet standard...

**George Ellenburg (he/him/his)** @gme@bofh.social · Nov 27, 2022

Nov 27, 2022

George Ellenburg (he/him/his) @gme@bofh.social

I read the blog post and at the very top OP even admits that Mastodon is not a crawler. So what "standard" is being broken?

**Jeff Kaufman** @jefftk@mastodon.mit.edu · Nov 27, 2022

Nov 27, 2022

Jeff Kaufman @jefftk@mastodon.mit.edu

@gme @crschmidt @tw @cshabsin where do you see that in the blog post? I agree that scraping a preview isn't crawling if you do it at send time, but doing it at automatically at retrieve time is

**George Ellenburg (he/him/his)** @gme@bofh.social · Nov 27, 2022

Nov 27, 2022

George Ellenburg (he/him/his) @gme@bofh.social

Let's accept your argument to be true for a moment.

Doesn't change the fact that there exists a technical solution to the problem you present in your argument.

Put the site behind a CDN.

**Marc** @corpsmoderne@mamot.fr · Nov 28, 2022

Nov 28, 2022

Marc @corpsmoderne@mamot.fr

@gme @crschmidt @tw @cshabsin @jefftk let's put this straight: you think everyone will the smallest blog site which will have a dozen of human visits per day tops should use a CDN because Mastodon can't come up with a reasonable mechanism to share website previews?

**George Ellenburg (he/him/his)** @gme@bofh.social · Nov 28, 2022

Nov 28, 2022

George Ellenburg (he/him/his) @gme@bofh.social

> let's put this straight: you think everyone will the smallest blog site which will have a dozen of human visits per day tops should use a CDN because Mastodon can't come up with a reasonable mechanism to share website previews?

Yes. If you want to prevent your site from getting DDOS'd, which is exactly what we're talking about here. Cloudflare offers a free WAF and CDN to everybody.

If you had an option of deploying a no-cost WAF and CDN in front of your website what's the excuse for not deploying it?

**Chris Shabsin** @cshabsin · Nov 28, 2022

Nov 28, 2022

Chris Shabsin @cshabsin

@gme stop saying Cloudflare.

**Marc** @corpsmoderne@mamot.fr · Nov 28, 2022

Nov 28, 2022

Marc @corpsmoderne@mamot.fr

@gme @crschmidt @tw @cshabsin @jefftk I thought one of the point of the Fediverse was to get back some of the freedom and independence we allowed big corporation to take from us. If the consequence of Mastodon taking off is to kill the possibility of self-hosting without using another big corporation service, I'm not sure I'm in anymore. Also a CDN service at no cost? You're the product.

**antipode77** @antipode77@mastodon.nl · Nov 28, 2022

Nov 28, 2022

antipode77 @antipode77@mastodon.nl

@corpsmoderne @gme @crschmidt @tw @cshabsin @jefftk

Well it is high time to get serious about breaking up the large monopolies and oligopolies.

Money is like gravity. It tends to clump together in ever greates amounts. Concentrating power in the hands of very few individuals.

Governments should act as a countervailing force by regulating and breaking up the big conglomerates.

https://en.m.wikipedia.org/wiki/Monopoly
https://en.m.wikipedia.org/wiki/History_of_IBM
https://mattstoller.substack.com/
https://mattstoller.substack.com/s/monopoly-bites

IBM History
Market power and Anti Trust

https://en.m.wikipedia.org/wiki/United_States_antitrust_law

**raynor (not verified)** @raynor@raynor.haus · Nov 28, 2022

Nov 28, 2022

raynor (not verified) @raynor@raynor.haus

@corpsmoderne @gme @crschmidt @tw @cshabsin @jefftk

Re: CDN/WAF : this sounds like a protection scheme.

Regardless, you know how many people know how to set that up?

Of the people most likely to be vulnerable to a Mastodon DDoS (small businesses, independent bloggers) how many of _them_ even know what Cloudflare is, much less a CDN/WAF?

1/2

**raynor (not verified)** @raynor@raynor.haus · Nov 28, 2022

Nov 28, 2022

raynor (not verified) @raynor@raynor.haus

@corpsmoderne @gme @crschmidt @tw @cshabsin @jefftk

There are over 2.6k instances, it's crazy to assume that every website any of the 7m+ Mastodon users links to now has to worry about an accidental DDoS for something that they have no part in.

2/2

**George Ellenburg (he/him/his)** @gme@bofh.social · Nov 27, 2022

Nov 27, 2022

George Ellenburg (he/him/his) @gme@bofh.social

Also:

**Jeff Kaufman** @jefftk@mastodon.mit.edu · Nov 27, 2022

Nov 27, 2022

Jeff Kaufman @jefftk@mastodon.mit.edu

@gme @crschmidt @tw @cshabsin in the portion of the post that you've screenshotted I'm only talking about fetching at posting time

**George Ellenburg (he/him/his)** @gme@bofh.social · Nov 27, 2022

Nov 27, 2022

George Ellenburg (he/him/his) @gme@bofh.social

You also wrongly assume that the fediverse is only made up of Mastodon servers. When I receive a post on Pleroma my Pleroma instance also fetches the URL to generate a preview.

Again, the technical solution to this technical problem is for a site to be behind a CDN. If a site is getting hammered where it can't handle the legitimate traffic to it then it should be placed behind a CDN.

**Jeff Kaufman** @jefftk@mastodon.mit.edu · Nov 27, 2022

Nov 27, 2022

Jeff Kaufman @jefftk@mastodon.mit.edu

@gme @crschmidt @tw @cshabsin saying "Mastodon is doing the wrong thing here" doesn't mean Pleroma isn't also doing the wrong thing!

**George Ellenburg (he/him/his)** @gme@bofh.social · Nov 27, 2022

Nov 27, 2022

George Ellenburg (he/him/his) @gme@bofh.social

It's not wrong.

Use a CDN.

**Jeff Kaufman** @jefftk@mastodon.mit.edu · Nov 27, 2022

Nov 27, 2022

Jeff Kaufman @jefftk@mastodon.mit.edu

@gme @crschmidt @tw @cshabsin does the existence of CDNs mean Googlebot should also feel free to ignore robots.txt?

**George Ellenburg (he/him/his)** @gme@bofh.social · Nov 27, 2022

Nov 27, 2022

George Ellenburg (he/him/his) @gme@bofh.social

That's a straw man.

**Chris Shabsin** @cshabsin · Nov 27, 2022

Nov 27, 2022

Chris Shabsin @cshabsin

@gme it's really not. It's the same argument. Your argument is "My service is impacting you? That's your problem, not mine."

**George Ellenburg (he/him/his)** @gme@bofh.social · Nov 27, 2022

Nov 27, 2022

George Ellenburg (he/him/his) @gme@bofh.social

Yes, and you have two choices right?

You can block the requests in your WAF.

You can place your site behind a CDN to absorb the load.

As for Google not respecting robots.txt that's immaterial.

First, robots.txt is a custom. It's not a rule. It's not a law.

Second, plenty of bots and crawlers don't respect robots.txt. Some malicious, sure. But most are benign!

So either block the mastodon user-agent from hammering your site in your WAF, or place your site behind a CDN to absorb and spread the load.

This isn't complicated.

**Chris Shabsin** @cshabsin · Nov 27, 2022

Nov 27, 2022

Chris Shabsin @cshabsin

@gme looks like it went from "widely adopted standard" to "published rule" in September. https://en.m.wikipedia.org/wiki/Robots_exclusion_standard

en.m.wikipedia.orgRobots exclusion standard - Wikipedia

**Jeff Kaufman** @jefftk@mastodon.mit.edu · Nov 27, 2022

Nov 27, 2022

Jeff Kaufman @jefftk@mastodon.mit.edu

@gme @crschmidt @tw @cshabsin I don't see how? It seems like the position you were staking out is that it doesn't matter whether the server is behaving contrary to spec, all that matters is that there is a technical workaround available to the site?

**George Ellenburg (he/him/his)** @gme@bofh.social · Nov 27, 2022

Nov 27, 2022

George Ellenburg (he/him/his) @gme@bofh.social

What spec? What standard? What law? What regulation?

[RFC 9309](https://www.rfc-editor.org/rfc/rfc9309.html)?

Section 1 of RFC 9309 very clearly states:

> These rules are not a form of access authorization.

Section 2.3.1.1 states:

> If the crawler successfully downloads the robots.txt file, the crawler MUST follow the parseable rules.

But noting in the RFC mandates that the robots.txt file actually be downloaded.

In fact, Section 3 of the RFC goes to great lengths to point out:

> The Robots Exclusion Protocol is not a substitute for valid content security measures.

**Chris Shabsin** @cshabsin · Nov 27, 2022

Nov 27, 2022

Chris Shabsin @cshabsin

@gme it's obviously not a security measure, but your point seems valid. It's a method for being a good citizen of the Internet. If Mastodon, as a platform, doesn't care about following these kinds of norms, then I'm not sure how all these otherwise tech savvy people will be able to continue to support it. The social network is trivially weaponizable. Even for people with CDNs, traffic isn't free.

**George Ellenburg (he/him/his)** @gme@bofh.social · Nov 27, 2022

Nov 27, 2022

George Ellenburg (he/him/his) @gme@bofh.social

At least Cloudflare does have a free plan so there really isn't an excuse. If a site is personal and isn't making any money or generating any revenue there's no fiscal reason why it can't be hosted behind Cloudflare's free CDN & WAF which should mitigate these issues.

Every single one of my sites are behind Cloudflare.

For the sites I actually make money from them I gladly pay the $20 a month.

For the sites that I don't make any money on and that are personal and non-commercial, I'm using the free plan.

Simple.

**Scott Hanselman** @shanselman · Nov 28, 2022

Nov 28, 2022

Scott Hanselman @shanselman

@gme @crschmidt @tw @cshabsin @jefftk I will likely regret wading in here but this is a rather bizarre thread. Spec or not, it’s poor design. It’s irresponsible. It’s messy. It’s resource intensive. To say “put a CDN on it” doesn’t change the waste - it hides the bug/implementation and shifts the responsibility. @jwz and friends are correct to bring attention to it. Don’t bury it. Fix it. @Gargron

**Aaron Parecki** @aaronpk@aaronparecki.com · Nov 28, 2022

Nov 28, 2022

Aaron Parecki @aaronpk@aaronparecki.com

@shanselman I will also likely regret wading in on this, but:

1: Using robots.txt requires also fetching robots.txt first, so that only marginally reduces the total number of requests.

2: Caching and CDNs are already a well-established pattern on the web, and necessary for a lot of things that have nothing to do with Mastodon/fediverse. Solving that problem other ways will likely create new problems, so at least that falls back on existing solutions that are known to work.

**Aaron Parecki** @aaronpk@aaronparecki.com · Nov 28, 2022

Nov 28, 2022

Aaron Parecki @aaronpk@aaronparecki.com

but also, I'm reading back up this thread... come on everyone, be nice! We are all on the same side!

**haliphax** @haliphax · Nov 28, 2022

Nov 28, 2022

haliphax @haliphax

@gme @crschmidt @tw @cshabsin @jefftk
Lack of alt text in a text only image... I think many clients even offer to OCR it for you. This small thing goes a long way to making this place more open and accessible!

**Snausages** @Snausages@infosec.exchange · Nov 27, 2022

Nov 27, 2022

Snausages @Snausages@infosec.exchange

@crschmidt @cshabsin They at least identify themselves at the useragent level though so you can filter on the server side

**Chris Shabsin** @cshabsin · Nov 27, 2022

Nov 27, 2022

Chris Shabsin @cshabsin

@Snausages @crschmidt Well, that's going to make for a nice user experience when Mastodon servers can no longer fetch previews anywhere because everyone has figured out that Mastodon is a DoS engine...

**Snausages** @Snausages@infosec.exchange · Nov 27, 2022 *

Nov 27, 2022 *

Snausages @Snausages@infosec.exchange

@cshabsin @crschmidt Dont have to block, just bwlimit

**Christopher Schmidt** @crschmidt@better.boston · Nov 27, 2022

Nov 27, 2022

Christopher Schmidt @crschmidt@better.boston

@Snausages @cshabsin bandwidth is not the only problem. Many webservices are designed to serve their expected usage; even for popular ish blogs, that is more like "one request per hour" than "50 requests per second". Having 973 requests come in simultaneously and then trickling out the response bytes is not solving the problem on the server side.

**Christopher Schmidt** @crschmidt@better.boston · Nov 27, 2022

Nov 27, 2022

Christopher Schmidt @crschmidt@better.boston

@Snausages @cshabsin but even so, the vector for abuse here is massive, every reply to George Takei is literally just a distributed network of 1000+ nodes to make requests to any URL you provide! Reply to him 10 times and you've generated 10,000+ requests to your target over 60 seconds... How many servers aren't designed for 133 qps? How about 266 qps?

Also, how long would it take whatever server he is hosted on to respond to an abuse claim is one was reported?

**Chris Shabsin** @cshabsin · Nov 28, 2022

Nov 28, 2022

Chris Shabsin @cshabsin

@Snausages and how many admins are going to take this into account when they shut mastodon out? How many are going to be paying enough attention to remove the block once Mastodon announces they've fixed this?

**ACAB for Cutie** @CassOfDunshire@social.stlouist.com · Nov 28, 2022

Nov 28, 2022

ACAB for Cutie @CassOfDunshire@social.stlouist.com

@crschmidt
Oh my
@cshabsin

**Guy Sherman** @guysherman@mastodon.nz · Nov 28, 2022

Nov 28, 2022

Guy Sherman @guysherman@mastodon.nz

@crschmidt @cshabsin why can’t the server that link was posted to retrieve the preview, and other servers retrieve it from the server to which it was posted?

**Christopher Schmidt** @crschmidt@better.boston · Nov 28, 2022

Nov 28, 2022

Christopher Schmidt @crschmidt@better.boston

@guysherman @cshabsin It can and should. It just needs a champion to implement it. (And work on some standardization across the Fediverse to make the change as effective as possible.) https://github.com/mastodon/mastodon/issues/12738

GitHubFetch link metadata on sender instance rather than receiver instances · Issue #12738 · mastodon/mastodonBy BenLubar

**Chris Shabsin** @cshabsin · Nov 28, 2022

Nov 28, 2022

Chris Shabsin @cshabsin

@crschmidt This needs more than "someone to implement it". There's design work here on multiple fronts:
* a method for including the preview inline with the post (generated by the initiating server)
* a method for users to identify a preview as fradulent/abusive and have their local server regenerate the preview (without admin interaction)
* a moderator interface for reviewing such requests, to get visibility into patterns of fradulent previews

Anything else?

Drag & drop to upload

Recent searches

Search options

Administered by:

Server stats:

Recent searches

Search options

Administered by:

Server stats:

Back