Question for the #datascience and #dataengineering folks. What would you say is the most important task or activity that data engineers do or fulfill on a regular basis? Is that also the thing that takes the most time or causes the most resource constraints?
@alexkyllo solving people problems
@stephanbuys "my app has a glitch" — sorry, that was meant as a reply to you
@mariah it all comes down to people in the end. I see on your profile that you're in the solar industry. Do you use mqtt/modbus/opcua at all?
In the 15+ years that I've been in what is now called #dataEngineering, the people part of the process has always been pivotal. (Where's the data, who owns it, where should it be, what should it look like, why doesn't it look the same anymore, what answers do you need, etc, etc, all comes back to people :-) )
@stephanbuys I’d say the most important is finding ways to communicate connections and concepts, but the thing that takes the longest, IMO, is often formatting the data
@Tiamat precisely! Communication is key. As for "formatting the data", it is one of the things we're actively trying to make easier over at https://hotrod.app.
@stephanbuys For me it's "how do I select x by y in $lang/$datastructure?"
@qrios well said I've got some hope for things built around the `arrow` ecosystem.
@stephanbuys I'm probably not a proper data engineer. But I do build tools to support a data science team. In practice, this means a lot of time spent moving and transforming data at scale, and designing/managing batch job systems.
Rust is really fantastic for a lot of this stuff, and we've open sourced several tools. But at scale, data munging starts to blend back into (distributed) software engineering. Which is a really fun challenge.
@emk I've always appreciated your work around #rust and #opensource!
We had a similar trajectory. We started with Docker and NodeJS, then began building Rust tools. Eventually patterns emerged, and one of the main problems was "how do we manage all of this?" We've built some of our answers into our app, but there are always more challenges. Management of the "data estate" is a huge problem in its own right.
@stephanbuys Yeah, wrangling data at scale is just endlessly challenging, but in an interesting way.
We don't talk enough about the open source #RustLang stuff we've built. But we should!
Our data mover: http://www.dbcrossbar.org/
Our itty-bitty Pachyderm replacement: https://github.com/faradayio/falconeri/blob/main/guide/src/SUMMARY.md
Tiny CSV stuff: https://github.com/faradayio/csv-tools/
Geocoding manager: https://github.com/faradayio/geocode-csv/
Rust makes this stuff so easy and so utterly reliable. Many thanks to crate authors!
@stephanbuys
What many spend the most time on:
- putting out fires and reacting when data pipelines break.
What drives the most value:
- Steering the company to develop data infrastructure, data assets, and a data model that enables 1) a clear snapshot of truth, 2) quicker iterations for product development, and 3) scalability within the next 2 years.
I interviewed a leader on this: https://tinyurl.com/58ajm5cf
@joereis and Matt Housley's book goes into this!