
What I learned after one year of building a Data Platform from scratch
Jeremy Surget
9 min read · Nov 14, 2023

Photo by Luke Chesser on Unsplash

One year ago, I joined a French start-up called Allowa, which is on a mission
to be the marketplace for real estate services. I joined as the first data guy to
help structure all their data and ultimately extract value from it.

Building a data platform from scratch is an amazing experience and I wanted to share the lessons that I learned along the way.

Here are some of the key takeaways I’ll share:

You don’t need a fancy data stack to get started

KISS — Keep It Simple and Stupid at first, then improve if needed

Data quality is the root of all your problems

Tech is easy, people are challenging

It takes time to get traction around data

The data stack


Disclaimer: This section is a bit technical

The data stack is a typical ELT stack, almost 100% open source, hosted on
AWS.

Simplicity allows you to deliver value to stakeholders early on.



High-level design of the stack

Using an Extract & Load tool is essential


In today’s data world, there are plenty of EL tool options that spare you from developing your own extraction scripts and help you gain a LOT of time.

Fivetran, Mage, and Airbyte, to mention a few.

You don’t have to maintain custom scripts: these tools come with 300+ connectors, basic scheduling, and error handling.

Among these options, my personal favorite is Airbyte. It is easily deployable, manageable, and has an amazing community. While it’s not perfect, it does exactly what I need it to do: efficiently move data from my sources to my data warehouse.

Although some argue that using an EL tool is slower in extracting data compared to custom scripts, the choice ultimately lies with you. Would you prefer to maintain over 50 extraction scripts, handle testing, deployments, and manage secrets? Or would you rather have a streamlined Extraction and Loading process with no additional overhead when building your data stack?
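To give an idea of how little glue code this leaves you with, here is a minimal sketch of triggering an Airbyte sync from Python against a self-hosted instance. The host, connection ID, and exact endpoint are assumptions to adapt to your own deployment; in day-to-day use, Airbyte’s built-in scheduler does this for you.

```python
import requests

# Hypothetical values: replace with your own Airbyte host and connection ID.
AIRBYTE_HOST = "http://localhost:8000"
CONNECTION_ID = "00000000-0000-0000-0000-000000000000"

def trigger_sync(host: str, connection_id: str) -> dict:
    """Ask a self-hosted Airbyte instance to run a sync for one connection."""
    response = requests.post(
        f"{host}/api/v1/connections/sync",
        json={"connectionId": connection_id},
        timeout=30,
    )
    response.raise_for_status()
    return response.json()  # describes the job Airbyte just queued

if __name__ == "__main__":
    print(trigger_sync(AIRBYTE_HOST, CONNECTION_ID))
```

An explicit call like this only becomes useful later, when an orchestrator needs to drive the syncs itself.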

You don’t need an orchestrator


I know, the diagram above shows an orchestrator, but it was deployed just
recently. When the stack was first launched, the orchestrator was not yet a
part of the infrastructure. Instead, simple scheduling methods were used to
manage data extraction and transformation jobs. This was manageable since
there were few components to oversee.

We used Airbyte for data extraction and for scheduling dbt transformations, since it comes with simple scheduling out of the box. We also used AWS’s
EventBridge to schedule Python jobs via ECS tasks. This method was
effective and uncomplicated, and it allowed us to prioritize simplicity while
ensuring that our core needs were met.

Our simple scheduling stack
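As an illustration of how simple that scheduling was, here is a sketch of wiring an EventBridge cron rule to an ECS task with boto3. All names, ARNs, subnets, and the schedule are placeholders; in practice you would rather declare this with Infrastructure as Code (more on that below).

```python
import boto3

events = boto3.client("events")

# Placeholder names and ARNs: replace with your own resources.
RULE_NAME = "nightly-python-job"
CLUSTER_ARN = "arn:aws:ecs:eu-west-1:123456789012:cluster/data-platform"
TASK_DEF_ARN = "arn:aws:ecs:eu-west-1:123456789012:task-definition/python-job:1"
ROLE_ARN = "arn:aws:iam::123456789012:role/eventbridge-ecs-runner"
SUBNETS = ["subnet-0123456789abcdef0"]

# 1. A cron rule: run every day at 06:00 UTC.
events.put_rule(
    Name=RULE_NAME,
    ScheduleExpression="cron(0 6 * * ? *)",
    State="ENABLED",
)

# 2. Point the rule at an ECS task so the Python job runs on Fargate.
events.put_targets(
    Rule=RULE_NAME,
    Targets=[
        {
            "Id": "python-job",
            "Arn": CLUSTER_ARN,
            "RoleArn": ROLE_ARN,
            "EcsParameters": {
                "TaskDefinitionArn": TASK_DEF_ARN,
                "TaskCount": 1,
                "LaunchType": "FARGATE",
                "NetworkConfiguration": {
                    "awsvpcConfiguration": {
                        "Subnets": SUBNETS,
                        "AssignPublicIp": "DISABLED",
                    }
                },
            },
        }
    ],
)
```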

The KISS (Keep It Simple and Stupid) principle enabled us to make progress
without overcomplicating our workflow. We can now consider whether more
complex scheduling and orchestrations would be beneficial as the stack
grows and the team scales.

The trade-off between quick wins and the long term


Sometimes as someone with a technical background, it’s hard to do
something that you know doesn’t scale well. I have the bias of doing things
that can scale. But in reality, the thing you provisioned for scaling might
never need to scale. So you end up with something complicated that could
have been a lot easier. Again, KISS.

Redshift was a mistake


I mean, as much as I love AWS services, setting up Redshift as our data warehouse was a mistake and Postgres would have been a much better alternative.

Let’s be honest, unless you have massive amounts of data, more than hundreds of terabytes, all these fancy data warehouses like Redshift just
aren’t worth the cost. Redshift isn’t open source, so you can’t have a complete
mini-data stack on your local computer for testing purposes. Plus, Redshift,
being built on top of Postgres 8, sometimes lacks the cool features that the
newer releases of Postgres have.

I know Postgres is a transactional database, but I think it’s a solid first approach for a data warehouse. If you’re dealing with tables with less than 50
million rows and under 10 terabytes of data (which is the case for most
startups), Postgres might outperform Redshift. And the best part is, you can
have it up and running on your local computer, making it incredibly
convenient for quick iterations.
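To illustrate that convenience, here is a minimal sketch of a local “mini warehouse”: a throwaway Postgres container you can model against before anything reaches production. The container settings, schema, and table are made up for the example.

```python
# Spin up a throwaway warehouse on your laptop first, e.g.:
#   docker run -d --name local-dwh -e POSTGRES_PASSWORD=postgres -p 5432:5432 postgres:16
import psycopg2

conn = psycopg2.connect(
    host="localhost", port=5432, dbname="postgres",
    user="postgres", password="postgres",  # local-only credentials
)

with conn, conn.cursor() as cur:
    # Iterate on your modelling locally before anything touches production.
    cur.execute("""
        CREATE SCHEMA IF NOT EXISTS staging;
        CREATE TABLE IF NOT EXISTS staging.deals (
            deal_id    BIGINT PRIMARY KEY,
            amount     NUMERIC(12, 2),
            created_at TIMESTAMPTZ DEFAULT now()
        );
    """)
conn.close()
```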

And a later migration to a “proper” data warehouse, if planned correctly, can be done smoothly.

Don’t forget the security of your infrastructure


Having strict security rules when you want to go fast can be a big constraint, but after all, we are dealing with data, and data is a valuable asset that needs to be protected.

When you get started, at least apply the basics of data security:

Never expose a database or your warehouse to the internet

Use encryption at rest and in transit whenever possible


Use a secret manager such as AWS Secrets Manager to securely handle tokens and database passwords (see the sketch after this list)

Do not expose the SSH port of your instances to the internet
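As a sketch of the secret manager point, here is what fetching warehouse credentials from AWS Secrets Manager with boto3 can look like instead of hard-coding a password. The secret name and its JSON layout are assumptions, not a prescription.

```python
import json
import boto3

def get_db_credentials(secret_name: str) -> dict:
    """Fetch database credentials from AWS Secrets Manager instead of hard-coding them."""
    client = boto3.client("secretsmanager")
    response = client.get_secret_value(SecretId=secret_name)
    return json.loads(response["SecretString"])

# Hypothetical secret name: adapt to your own naming convention.
creds = get_db_credentials("data-platform/warehouse")
# creds might look like {"host": "...", "username": "...", "password": "..."}
```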

If you neglect the security of your infrastructure, it might come back to bite you one day.

Other tech learnings I won’t discuss in detail here


Logging: Don’t forget basic logging. It will keep you from scratching your head because you can’t get a proper stack trace for your errors

Slack is a perfect place to start for alerting (see the sketch after this list)

Infrastructure as Code might be hard to get on track in the beginning but is definitely worth it. I used Terraform and Ansible but switched to Pulumi in a recent project
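For the alerting point above, here is a minimal sketch of the “Slack first” approach: standard Python logging plus a message to a Slack incoming webhook when a scheduled job fails. The webhook URL and the run_daily_ingestion job are placeholders you would replace with your own.

```python
import json
import logging
import os
import urllib.request

logging.basicConfig(level=logging.INFO, format="%(asctime)s %(levelname)s %(message)s")
logger = logging.getLogger("data-platform")

# Incoming webhook URL, created in your Slack workspace and stored as a secret.
SLACK_WEBHOOK_URL = os.environ["SLACK_WEBHOOK_URL"]

def alert_slack(message: str) -> None:
    """Post a plain-text alert to a Slack channel via an incoming webhook."""
    payload = json.dumps({"text": message}).encode("utf-8")
    request = urllib.request.Request(
        SLACK_WEBHOOK_URL, data=payload, headers={"Content-Type": "application/json"}
    )
    urllib.request.urlopen(request, timeout=10)

def run_daily_ingestion() -> None:
    """Placeholder for whatever job you are scheduling."""
    ...

try:
    run_daily_ingestion()
except Exception as exc:
    logger.exception("Daily ingestion failed")
    alert_slack(f"Daily ingestion failed: {exc}")
    raise
```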

Data Quality
“Garbage in, garbage out”

While I could have added this under the data stack part, Data Quality is so important that it deserves its own section.

Without data quality, there is no point in having data at all


One of the first metrics that I shared with stakeholders turned out to be
inaccurate. This inaccuracy was a direct result of the low quality of the
underlying data. Common data quality issues include missing information,
incorrect data types, and no foreign key for data linkage.
I learned that improving data quality and monitoring it along the way is a priority.

People will always question the validity of the metrics presented to them,
and they might be right if you cannot demonstrate the accuracy of the data.

Fixing data quality issues takes time and during this process, it may seem
like we are not delivering tangible value to stakeholders. This is why we
sometimes rush into getting a dashboard in front of them, as it offers a more
palpable value than data quality. However, this approach only leads to a
lack of confidence in the data team due to poor data quality. Confidence in
data is hard to get but is easy to lose. You should avoid at all costs showing
inaccurate data to stakeholders, otherwise, their confidence in data will fade
very quickly. Taking care of data quality is an investment that is worth doing
as early as possible.

That’s why, even before knowing if there are data quality issues (there are,
always) you should establish a framework for checking and monitoring data
across the organization. This can also serve as an initial step in giving business people ownership of the data, showing them what is wrong with their data and how they could fix it.
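In our stack the natural place for such checks is dbt tests, but to keep the illustration self-contained, here is a sketch of a few warehouse-level checks in plain Python. The table, columns, and rules are invented for the example; the point is to run them on a schedule and surface the results to the data owners.

```python
import psycopg2

# Made-up table and rules for illustration; adapt to your own models.
CHECKS = {
    "null deal ids": "SELECT count(*) FROM staging.deals WHERE deal_id IS NULL",
    "duplicate deal ids": """
        SELECT count(*) FROM (
            SELECT deal_id FROM staging.deals GROUP BY deal_id HAVING count(*) > 1
        ) AS dupes
    """,
    "negative amounts": "SELECT count(*) FROM staging.deals WHERE amount < 0",
}

def run_checks(conn) -> dict:
    """Return the number of offending rows for each data quality rule."""
    results = {}
    with conn.cursor() as cur:
        for name, query in CHECKS.items():
            cur.execute(query)
            results[name] = cur.fetchone()[0]
    return results

if __name__ == "__main__":
    conn = psycopg2.connect("postgresql://postgres:postgres@localhost:5432/postgres")
    failures = {name: n for name, n in run_checks(conn).items() if n > 0}
    if failures:
        print("Data quality issues:", failures)  # or send them to Slack
```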

Spreading data culture across the company


Tech is good, but I find that the hardest thing is to spread a data culture across
the company. Getting people to understand that the data they produce is a
valuable asset for them and the company is a journey that requires time and
effort.

Communicate early and often


The month after I arrived, I delivered a presentation about the importance and benefits of data in our company. It helped people realize how valuable their data was and what they could do with it.

At first, you may only have surface-level problems to solve with data, but hopefully people will want to go deeper and ask you to solve more exciting problems using data.

Of course, one presentation isn’t enough and you have to constantly remind
people about best practices concerning their data. Communicating often
helps you to get more and more people concerned about data in the
company and eventually, you will have people becoming data champions
and spreading the word all over the company. Cherish your data champions
as they are your greatest allies in creating a more data-driven company
culture.

I hate Excel, but it holds a lot of value


The ultimate goal of data is to create value, right? Sometimes you have to make trade-offs to prove what data can do. I hate Excel as much as you probably do, but some teams keep their data in Excel with no immediate way to migrate it to a database or some kind of platform. At first, I didn’t want to ingest Excel data into the warehouse, because, well, it’s Excel. But this Excel data holds significant value for the business, and as a Data Engineer, my goal is to extract value from data. So what? Let’s ingest this data.

It can be quite challenging to obtain valuable data from Excel. However, by implementing efficient processes and educating the team on data-related guidelines for Excel, we managed to make it work. We created a template for the Excel files to enforce data quality and validation rules, cleaning up header
names, columns, and merged cells. Now, anyone with a spreadsheet who wishes to have their data in the warehouse knows the necessary rules and the format their file should follow.
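As a sketch of what that template enforcement can look like, here is a small Python check run before any spreadsheet is ingested. The expected columns and rules are made up for illustration; the real template obviously depends on the team’s data.

```python
import pandas as pd

# Made-up template: the columns every submitted spreadsheet must contain.
EXPECTED_COLUMNS = {"property_id", "city", "service_type", "signed_at", "amount"}

def load_team_spreadsheet(path: str) -> pd.DataFrame:
    """Load an Excel file and reject it if it does not follow the agreed template."""
    df = pd.read_excel(path)

    # Normalise headers the way the template asks for (lowercase, no spaces).
    df.columns = [str(c).strip().lower().replace(" ", "_") for c in df.columns]

    missing = EXPECTED_COLUMNS - set(df.columns)
    if missing:
        raise ValueError(f"Spreadsheet does not follow the template, missing: {missing}")

    # Basic validation rules before the data is allowed into the warehouse.
    if df["property_id"].isna().any():
        raise ValueError("Every row needs a property_id")
    df["signed_at"] = pd.to_datetime(df["signed_at"], errors="raise")
    return df
```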

Of course, this is not a sustainable long-term solution. However, what I discovered is that people quickly take ownership of their data in Excel. These
files are an integral part of their daily work, and they become emotionally
attached to them. That’s the reason why they take responsibility and clean
the data every day.

In most cases, you won’t find neatly structured data waiting for you in an
SQL database. That’s why it’s crucial to remain flexible and adaptable when
working with data.

Bad processes lead to no data


Sometimes things can get pretty chaotic, with data scattered all over the
place, and not properly organized or structured. As a data person, your role
is to be a facilitator, finding solutions to ensure that the right data reaches
the right person. But sometimes you also have to change the process by which data is collected, because a broken process leads to bad data, or no data at all. You will break things, but it’s fine, as long as it is for the
greater good.

Building traction around data takes time


At first, I thought that it was a matter of 2 months before getting the
company to use data in their daily work life. It was not. For all the reasons I
mentioned earlier, the tech, the people, the processes, the data quality…
Building traction around data takes time.
The first dashboard was released after 3 months. Some dashboards were being used here and there, but it was only after 7 months of being in the company and preaching data that we managed to release dashboards that the business and people started to use every day. They now actively manage some reporting on Metabase and follow key metrics for their daily job.

So, be patient and persistent in promoting data usage.

So, what comes next?


It’s been a remarkable year of growth. Building a data platform is a never-ending journey. There is still a lot to do and a lot to learn along the way.

Some focus areas will remain for the following year:

Encouraging a data-driven culture

Enhancing and Monitoring Data Quality (of course)

Maintaining a stable data platform to meet the increasing demand for data

And new ones will be developed:

Establishing a data governance system for the company

Starting to push self-serve data, so teams can be autonomous

Building a data platform is something that can feel overwhelming at first, but with the right principles, persistence, and a commitment to data quality,
you can unlock the full potential of your data to drive meaningful insights
and decisions within your organization.
Thanks for reading, and feel free to share your thoughts about this article!
