What I Learned After One Year of Building A Data Platform From Scratch - by Jeremy Surget - Medium
Photo by Luke Chesser on Unsplash
One year ago, I joined a French start-up called Allowa, which is on a mission
to be the marketplace for real estate services. I joined as the first data guy to
help structure all their data and ultimately extract value from it.
The data stack is a typical ELT stack, almost 100% open source, hosted on
AWS.
You don’t have to maintain custom scripts: these tools come with 300+ connectors, basic scheduling, and error handling.
We used Airbyte for data extraction and to schedule dbt transformations, since it comes with simple scheduling out of the box. We also used AWS EventBridge to run Python jobs as scheduled ECS tasks. This approach was effective and uncomplicated: it let us prioritize simplicity while still meeting our core needs.
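To make the EventBridge-plus-ECS setup concrete, here is a minimal sketch of how such a schedule could be wired up. All names and ARNs below (cluster, task definition, role) are placeholders I invented for illustration, not values from our actual stack; the function only builds the request payloads you would pass to boto3's `events` client (`put_rule` / `put_targets`).

```python
# Sketch: build the EventBridge requests that schedule a Python job as an
# ECS Fargate task. ARNs and names are hypothetical placeholders.

def build_schedule(rule_name: str, cron: str, cluster_arn: str,
                   task_def_arn: str, role_arn: str) -> tuple[dict, dict]:
    """Return the (put_rule, put_targets) payloads for boto3's events client:
    client.put_rule(**rule) then client.put_targets(**targets)."""
    rule = {
        "Name": rule_name,
        "ScheduleExpression": f"cron({cron})",  # e.g. "0 6 * * ? *" = daily 06:00 UTC
        "State": "ENABLED",
    }
    targets = {
        "Rule": rule_name,
        "Targets": [{
            "Id": f"{rule_name}-target",
            "Arn": cluster_arn,   # EventBridge targets the ECS *cluster* ARN
            "RoleArn": role_arn,  # role letting events.amazonaws.com call RunTask
            "EcsParameters": {
                "TaskDefinitionArn": task_def_arn,
                "LaunchType": "FARGATE",
            },
        }],
    }
    return rule, targets

rule, targets = build_schedule(
    "nightly-python-job", "0 6 * * ? *",
    "arn:aws:ecs:eu-west-1:123456789012:cluster/data",
    "arn:aws:ecs:eu-west-1:123456789012:task-definition/etl:1",
    "arn:aws:iam::123456789012:role/eventbridge-ecs",
)
```

The nice part of this design is that the whole "orchestrator" is two API calls and an IAM role: no scheduler to host, patch, or monitor.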
The KISS (Keep It Simple and Stupid) principle enabled us to make progress
without overcomplicating our workflow. We can now consider whether more
complex scheduling and orchestrations would be beneficial as the stack
grows and the team scales.
Let’s be honest: unless you have massive amounts of data, more than hundreds of terabytes, fancy data warehouses like Redshift just aren’t worth the cost. Redshift isn’t open source, so you can’t run a complete mini data stack on your local computer for testing purposes. Plus, since Redshift was built on top of Postgres 8, it sometimes lacks the nice features that newer releases of Postgres have.
When you are getting started, at least apply the basics of data security. If you neglect the security of your infrastructure, it may come back to bite you later.
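As one example of what "the basics" can mean in practice, here is a sketch of a least-privilege IAM-style policy: the pipeline role can read one raw-data bucket and nothing else. The bucket name and the exact set of actions are hypothetical, invented for illustration.

```python
import json

# Illustrative least-privilege policy: restrict a pipeline role to
# read-only access on a single S3 bucket. Bucket name is a placeholder.

def read_only_bucket_policy(bucket: str) -> dict:
    return {
        "Version": "2012-10-17",
        "Statement": [{
            "Effect": "Allow",
            "Action": ["s3:GetObject", "s3:ListBucket"],
            "Resource": [
                f"arn:aws:s3:::{bucket}",      # ListBucket applies to the bucket
                f"arn:aws:s3:::{bucket}/*",    # GetObject applies to the objects
            ],
        }],
    }

policy = read_only_bucket_policy("raw-data-landing")
print(json.dumps(policy, indent=2))
```

Starting from "deny everything, then allow what the job needs" is much easier than trying to claw back broad permissions later.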
Data Quality
“Garbage in, garbage out”
While I could have filed this under the data stack section, data quality is so important that it deserves its own section.
People will always question the validity of the metrics presented to them,
and they might be right if you cannot demonstrate the accuracy of the data.
Fixing data quality issues takes time, and during this process it may seem like we are not delivering tangible value to stakeholders. This is why we sometimes rush to get a dashboard in front of them, as it offers more palpable value than data quality work. However, this approach only leads to a lack of confidence in the data team due to poor data quality. Confidence in data is hard to earn but easy to lose. You should avoid at all costs showing inaccurate data to stakeholders; otherwise, their confidence in the data will fade very quickly. Taking care of data quality is an investment worth making as early as possible.
That’s why, even before knowing whether there are data quality issues (there are, always), you should establish a framework for checking and monitoring data across the organization. This can also serve as a first step in giving business people ownership of the data: showing them what is wrong with their data and how they could fix it.
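A minimal sketch of what such a framework could look like: declarative checks run against rows, producing a report that business owners can actually read. The column names, rules, and sample data here are invented for illustration, not our real checks.

```python
# Tiny data quality check runner: each check is a predicate applied to one
# column, and the report counts the rows that fail it.

def check(rows, column, rule, description):
    """Run `rule` on `column` for every row; count failures."""
    failures = [r for r in rows if not rule(r.get(column))]
    return {"check": description, "failures": len(failures), "passed": not failures}

rows = [
    {"id": 1, "email": "a@example.com", "price": 120_000},
    {"id": 2, "email": None,            "price": 95_000},
    {"id": 3, "email": "c@example.com", "price": -1},
]

report = [
    check(rows, "email", lambda v: v is not None, "email is present"),
    check(rows, "price", lambda v: v is not None and v > 0, "price is positive"),
]
# Both checks fail here (one missing email, one negative price) -- exactly
# the kind of finding to walk through with the data's business owners.
```

In practice you would hang checks like these off the pipeline itself (dbt tests fill the same role in a dbt project), but the shape of the output stays the same: a list of named checks and failure counts you can put in front of people.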
At first, you may have to solve surface-level problems with the data, but hopefully people will want to go deeper and ask you to solve more exciting problems with it.
Of course, one presentation isn’t enough, and you have to constantly remind people about best practices concerning their data. Communicating often helps get more and more people in the company invested in data, and eventually some of them will become data champions, spreading the word across the company. Cherish your data champions: they are your greatest allies in creating a more data-driven company culture.
In most cases, you won’t find neatly structured data waiting for you in an
SQL database. That’s why it’s crucial to remain flexible and adaptable when
working with data.
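What "flexible and adaptable" looks like in code is usually a small normalization layer between raw records and anything downstream. This is a hedged sketch with invented field names and cleanup rules (including the French-style decimal comma you meet in real estate data), not an actual snippet from our pipeline.

```python
# Sketch: coerce loosely structured records into a predictable shape
# before they reach the warehouse. Field names are illustrative.

def _to_float(value):
    """Parse '42.5' or the French-style '42,5'; return None when unparseable."""
    if value is None:
        return None
    try:
        return float(str(value).replace(",", ".").strip())
    except ValueError:
        return None

def normalize(record: dict) -> dict:
    return {
        "city": (record.get("city") or "").strip().title() or None,
        "surface_m2": _to_float(record.get("surface")),
    }

raw = [
    {"city": "  paris ", "surface": "42,5"},
    {"city": None, "surface": "n/a"},
]
clean = [normalize(r) for r in raw]
# clean[0] == {"city": "Paris", "surface_m2": 42.5}
# clean[1] == {"city": None, "surface_m2": None}
```

The point is less the specific rules than the pattern: unparseable values become explicit `None`s instead of crashing the pipeline or silently propagating garbage.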