Question for the #datascience and #dataengineering folks. What would you say is the most important task or activity that data engineers do or fulfill on a regular basis? Is that also the thing that takes the most time or causes the most resource constraints?
@alexkyllo solving people problems
@stephanbuys "my app has a glitch" — sorry, that was meant as a reply to you
@mariah it all comes down to people in the end. I see on your profile that you're in the solar industry. Do you use mqtt/modbus/opcua at all?
In the 15+ years that I've been in what is now called #dataEngineering, the people part of the process has always been pivotal. (Where's the data, who owns it, where should it be, what should it look like, why doesn't it look the same anymore, what answers do you need, etc, etc, all comes back to people :-) )
@stephanbuys I’d say the most important is finding ways to communicate connections and concepts, but the thing that takes the longest, IMO, is often formatting the data
@Tiamat precisely! Communication is key. As for "formatting the data", it is one of the things we're actively trying to make easier over at https://hotrod.app.
@stephanbuys For me it's "how do I select x by y in $lang/$datastructure?"
@qrios well said I've got some hope for things built around the `arrow` ecosystem.
@stephanbuys I'm probably not a proper data engineer. But I do build tools to support a data science team. In practice, this means a lot of time spent moving and transforming data at scale, and designing/managing batch job systems.
Rust is really fantastic for a lot of this stuff, and we've open sourced several tools. But at scale, data munging starts to blend back into (distributed) software engineering. Which is a really fun challenge.
@emk I've always appreciated your work around #rust and #opensource!
We had a similar trajectory. We started with Docker and NodeJS, then began building Rust tools. Eventually patterns emerged, and one of the main problems was "how do we manage all of this?" We've built some of our answers into our app, but there are always more challenges. Management of the "data estate" is a huge problem in its own right.
@stephanbuys Yeah, wrangling data at scale is just endlessly challenging, but in an interesting way.
We don't talk enough about the open source #RustLang stuff we've built. But we should!
Our data mover: http://www.dbcrossbar.org/
Our itty-bitty Pachyderm replacement: https://github.com/faradayio/falconeri/blob/main/guide/src/SUMMARY.md
Tiny CSV stuff: https://github.com/faradayio/csv-tools/
Geocoding manager: https://github.com/faradayio/geocode-csv/
Rust makes this stuff so easy and so utterly reliable. Many thanks to crate authors!
@stephanbuys
What many spend the most time on:
- putting out fires and reacting when data pipelines break.
What drives the most value:
- Steering the company to develop data infrastructure, data assets, and a data model that enables 1) a clear snapshot of truth, 2) quicker iterations for product development, and 3) scalability within the next 2 years.
I interviewed a leader on this: https://tinyurl.com/58ajm5cf
@joereis and Matt Housley's book goes into this!