
What I learned after one year of building a Data Platform from scratch
Jeremy Surget
9 min read · Nov 14, 2023

Photo by Luke Chesser on Unsplash

One year ago, I joined a French start-up called Allowa, which is on a mission
to be the marketplace for real estate services. I joined as the first data guy to
help structure all their data and ultimately extract value from it.

Building a data platform from scratch is an amazing experience and I wanted to share the lessons that I learned along the way.

Here are some of the key takeaways I’ll share:

You don’t need a fancy data stack to get started

KISS — Keep It Simple and Stupid at first, then improve if needed

Data quality is the root of all your problems

Tech is easy, people are challenging

It takes time to get traction around data

The data stack


Disclaimer: This section is a bit technical

The data stack is a typical ELT stack, almost 100% open source, hosted on
AWS.

Simplicity allows you to deliver value to stakeholders early on.



High-level design of the stack

Using an Extract & Load tool is essential


In today’s data world, there are plenty of EL tool options that spare you from developing your own extraction scripts and help you gain a LOT of time.

Fivetran, Mage, and Airbyte, to mention a few.

You don’t have to maintain custom scripts: these tools come with 300+ connectors, basic scheduling, and error handling.

Among these options, my personal favorite is Airbyte. It is easily deployable, manageable, and has an amazing community. While it’s not perfect, it does exactly what I need it to do: efficiently move data from my sources to my data warehouse.

Although some argue that using an EL tool is slower in extracting data compared to custom scripts, the choice ultimately lies with you. Would you prefer to maintain over 50 extraction scripts, handle testing, deployments, and manage secrets? Or would you rather have a streamlined Extraction and Loading process with no additional overhead when building your data stack?
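To give an idea of how little glue code this leaves you with, here is a minimal sketch of triggering an Airbyte sync from Python against a self-hosted instance. The host, connection ID, and exact endpoint are assumptions to adapt to your own deployment; in day-to-day use, Airbyte’s built-in scheduler does this for you.

```python
import requests

# Hypothetical values: replace with your own Airbyte host and connection ID.
AIRBYTE_HOST = "http://localhost:8000"
CONNECTION_ID = "00000000-0000-0000-0000-000000000000"

def trigger_sync(host: str, connection_id: str) -> dict:
    """Ask a self-hosted Airbyte instance to run a sync for one connection."""
    response = requests.post(
        f"{host}/api/v1/connections/sync",
        json={"connectionId": connection_id},
        timeout=30,
    )
    response.raise_for_status()
    return response.json()  # describes the job Airbyte just queued

if __name__ == "__main__":
    print(trigger_sync(AIRBYTE_HOST, CONNECTION_ID))
```

An explicit call like this only becomes useful later, when an orchestrator needs to drive the syncs itself.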

You don’t need an orchestrator


I know, the diagram above shows an orchestrator, but it was deployed just
recently. When the stack was first launched, the orchestrator was not yet a
part of the infrastructure. Instead, simple scheduling methods were used to
manage data extraction and transformation jobs. This was manageable since
there were few components to oversee.

We used Airbyte for data extraction and for scheduling dbt transformations, since it comes with simple scheduling out of the box. We also used AWS’s
EventBridge to schedule Python jobs via ECS tasks. This method was
effective and uncomplicated, and it allowed us to prioritize simplicity while
ensuring that our core needs were met.

Our simple scheduling stack
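As an illustration of how simple that scheduling was, here is a sketch of wiring an EventBridge cron rule to an ECS task with boto3. All names, ARNs, subnets, and the schedule are placeholders; in practice you would rather declare this with Infrastructure as Code (more on that below).

```python
import boto3

events = boto3.client("events")

# Placeholder names and ARNs: replace with your own resources.
RULE_NAME = "nightly-python-job"
CLUSTER_ARN = "arn:aws:ecs:eu-west-1:123456789012:cluster/data-platform"
TASK_DEF_ARN = "arn:aws:ecs:eu-west-1:123456789012:task-definition/python-job:1"
ROLE_ARN = "arn:aws:iam::123456789012:role/eventbridge-ecs-runner"
SUBNETS = ["subnet-0123456789abcdef0"]

# 1. A cron rule: run every day at 06:00 UTC.
events.put_rule(
    Name=RULE_NAME,
    ScheduleExpression="cron(0 6 * * ? *)",
    State="ENABLED",
)

# 2. Point the rule at an ECS task so the Python job runs on Fargate.
events.put_targets(
    Rule=RULE_NAME,
    Targets=[
        {
            "Id": "python-job",
            "Arn": CLUSTER_ARN,
            "RoleArn": ROLE_ARN,
            "EcsParameters": {
                "TaskDefinitionArn": TASK_DEF_ARN,
                "TaskCount": 1,
                "LaunchType": "FARGATE",
                "NetworkConfiguration": {
                    "awsvpcConfiguration": {
                        "Subnets": SUBNETS,
                        "AssignPublicIp": "DISABLED",
                    }
                },
            },
        }
    ],
)
```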

The KISS (Keep It Simple and Stupid) principle enabled us to make progress
without overcomplicating our workflow. We can now consider whether more
complex scheduling and orchestrations would be beneficial as the stack
grows and the team scales.

The trade-off between quick wins and the long term


Sometimes as someone with a technical background, it’s hard to do
something that you know doesn’t scale well. I have the bias of doing things
that can scale. But in reality, the thing you provisioned for scaling might
never need to scale. So you end up with something complicated that could
have been a lot easier. Again, KISS.

Redshift was a mistake


I mean, as much as I love AWS services, setting up Redshift as our data warehouse was a mistake and Postgres would have been a much better alternative.

Let’s be honest, unless you have massive amounts of data, more than hundreds of terabytes, all these fancy data warehouses like Redshift just
aren’t worth the cost. Redshift isn’t open source, so you can’t have a complete
mini-data stack on your local computer for testing purposes. Plus, Redshift,
being built on top of Postgres 8, sometimes lacks the cool features that the
newer releases of Postgres have.

I know Postgres is a transactional database, but I think it’s a solid first approach for a data warehouse. If you’re dealing with tables with less than 50
million rows and under 10 terabytes of data (which is the case for most
startups), Postgres might outperform Redshift. And the best part is, you can
have it up and running on your local computer, making it incredibly
convenient for quick iterations.
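To illustrate that convenience, here is a minimal sketch of a local “mini warehouse”: a throwaway Postgres container you can model against before anything reaches production. The container settings, schema, and table are made up for the example.

```python
# Spin up a throwaway warehouse on your laptop first, e.g.:
#   docker run -d --name local-dwh -e POSTGRES_PASSWORD=postgres -p 5432:5432 postgres:16
import psycopg2

conn = psycopg2.connect(
    host="localhost", port=5432, dbname="postgres",
    user="postgres", password="postgres",  # local-only credentials
)

with conn, conn.cursor() as cur:
    # Iterate on your modelling locally before anything touches production.
    cur.execute("""
        CREATE SCHEMA IF NOT EXISTS staging;
        CREATE TABLE IF NOT EXISTS staging.deals (
            deal_id    BIGINT PRIMARY KEY,
            amount     NUMERIC(12, 2),
            created_at TIMESTAMPTZ DEFAULT now()
        );
    """)
conn.close()
```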

And a later migration to a “proper” data warehouse, if planned correctly, can be done smoothly.

Don’t forget the security of your infrastructure


Having strict security rules when you want to go fast can be a big constraint, but after all, we are dealing with data, and data is a valuable asset that needs to be protected.

When you get started, at least apply the basics of data security:

Never expose a database or your warehouse to the internet

Use encryption at rest and in transit whenever possible


Use a secret manager such as AWS Secrets Manager to securely handle tokens and database passwords (see the sketch after this list)

Do not expose the SSH port of your instances to the internet
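As a sketch of the secret manager point, here is what fetching warehouse credentials from AWS Secrets Manager with boto3 can look like instead of hard-coding a password. The secret name and its JSON layout are assumptions, not a prescription.

```python
import json
import boto3

def get_db_credentials(secret_name: str) -> dict:
    """Fetch database credentials from AWS Secrets Manager instead of hard-coding them."""
    client = boto3.client("secretsmanager")
    response = client.get_secret_value(SecretId=secret_name)
    return json.loads(response["SecretString"])

# Hypothetical secret name: adapt to your own naming convention.
creds = get_db_credentials("data-platform/warehouse")
# creds might look like {"host": "...", "username": "...", "password": "..."}
```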

If you neglect the security of your infrastructure, it might come back to bite you one day.

Other tech learnings I won’t discuss in detail here


Logging: Don’t forget basic logging. It will keep you from scratching your head because you can’t get a proper stack trace for your errors

Slack is a perfect place to start for alerting (see the sketch after this list)

Infrastructure as Code might be hard to get on track in the beginning but is definitely worth it. I used Terraform and Ansible but switched to Pulumi in a recent project
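For the alerting point above, here is a minimal sketch of the “Slack first” approach: standard Python logging plus a message to a Slack incoming webhook when a scheduled job fails. The webhook URL and the run_daily_ingestion job are placeholders you would replace with your own.

```python
import json
import logging
import os
import urllib.request

logging.basicConfig(level=logging.INFO, format="%(asctime)s %(levelname)s %(message)s")
logger = logging.getLogger("data-platform")

# Incoming webhook URL, created in your Slack workspace and stored as a secret.
SLACK_WEBHOOK_URL = os.environ["SLACK_WEBHOOK_URL"]

def alert_slack(message: str) -> None:
    """Post a plain-text alert to a Slack channel via an incoming webhook."""
    payload = json.dumps({"text": message}).encode("utf-8")
    request = urllib.request.Request(
        SLACK_WEBHOOK_URL, data=payload, headers={"Content-Type": "application/json"}
    )
    urllib.request.urlopen(request, timeout=10)

def run_daily_ingestion() -> None:
    """Placeholder for whatever job you are scheduling."""
    ...

try:
    run_daily_ingestion()
except Exception as exc:
    logger.exception("Daily ingestion failed")
    alert_slack(f"Daily ingestion failed: {exc}")
    raise
```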

Data Quality
“Garbage in, garbage out”

While I could have added this under the data stack part, Data Quality is so important that it deserves its own section.

Without data quality, there is no point in having data at all


One of the first metrics that I shared with stakeholders turned out to be
inaccurate. This inaccuracy was a direct result of the low quality of the
underlying data. Common data quality issues include missing information,
incorrect data types, and no foreign key for data linkage.
I learned that improving data quality and monitoring it along the way is a priority.

People will always question the validity of the metrics presented to them,
and they might be right if you cannot demonstrate the accuracy of the data.

Fixing data quality issues takes time and during this process, it may seem
like we are not delivering tangible value to stakeholders. This is why we
sometimes rush into getting a dashboard in front of them, as it offers a more
palpable value than data quality. However, this approach only leads to a
lack of confidence in the data team due to poor data quality. Confidence in
data is hard to get but is easy to lose. You should avoid at all costs showing
inaccurate data to stakeholders, otherwise, their confidence in data will fade
very quickly. Taking care of data quality is an investment that is worth doing
as early as possible.

That’s why, even before knowing if there are data quality issues (there are,
always) you should establish a framework for checking and monitoring data
across the organization. This can also serve as an initial step in giving business people ownership of the data, showing them what is wrong with their data and how they could fix it.
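In our stack the natural place for such checks is dbt tests, but to keep the illustration self-contained, here is a sketch of a few warehouse-level checks in plain Python. The table, columns, and rules are invented for the example; the point is to run them on a schedule and surface the results to the data owners.

```python
import psycopg2

# Made-up table and rules for illustration; adapt to your own models.
CHECKS = {
    "null deal ids": "SELECT count(*) FROM staging.deals WHERE deal_id IS NULL",
    "duplicate deal ids": """
        SELECT count(*) FROM (
            SELECT deal_id FROM staging.deals GROUP BY deal_id HAVING count(*) > 1
        ) AS dupes
    """,
    "negative amounts": "SELECT count(*) FROM staging.deals WHERE amount < 0",
}

def run_checks(conn) -> dict:
    """Return the number of offending rows for each data quality rule."""
    results = {}
    with conn.cursor() as cur:
        for name, query in CHECKS.items():
            cur.execute(query)
            results[name] = cur.fetchone()[0]
    return results

if __name__ == "__main__":
    conn = psycopg2.connect("postgresql://postgres:postgres@localhost:5432/postgres")
    failures = {name: n for name, n in run_checks(conn).items() if n > 0}
    if failures:
        print("Data quality issues:", failures)  # or send them to Slack
```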

Spreading data culture across the company


Tech is good, but I find that the hardest thing is to spread a data culture across
the company. Getting people to understand that the data they produce is a
valuable asset for them and the company is a journey that requires time and
effort.

Communicate early and often


The month after I arrived, I delivered a presentation about the importance and benefits of data in our company. It helped people realize how valuable their data was and what they could do with it.

At first, you may only have surface-level problems to solve with data, but hopefully people will want to go deeper and ask you to solve more exciting problems using data.

Of course, one presentation isn’t enough and you have to constantly remind
people about best practices concerning their data. Communicating often
helps you to get more and more people concerned about data in the
company and eventually, you will have people becoming data champions
and spreading the word all over the company. Cherish your data champions
as they are your greatest allies in creating a more data-driven company
culture.

I hate Excel, but it holds a lot of value


The ultimate goal of data is to create value, right? Sometimes you have to make trade-offs to prove what data can do. I hate Excel as much as you probably do, but some teams keep their data in Excel with no immediate way to migrate it to a database or some kind of platform. At first, I didn’t want to ingest Excel data into the warehouse, because, well, it’s Excel. But this Excel data holds significant value for the business, and as a Data Engineer, my goal is to extract value from data. So what? Let’s ingest this data.

It can be quite challenging to obtain valuable data from Excel. However, by implementing efficient processes and educating the team on data-related guidelines for Excel, we managed to make it work. We created a template for the Excel files to enforce data quality and validation rules, cleaning up header
names, columns, and merged cells. Now, anyone with a spreadsheet who wishes to have their data in the warehouse knows the necessary rules and the format their file should follow.
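As a sketch of what that template enforcement can look like, here is a small Python check run before any spreadsheet is ingested. The expected columns and rules are made up for illustration; the real template obviously depends on the team’s data.

```python
import pandas as pd

# Made-up template: the columns every submitted spreadsheet must contain.
EXPECTED_COLUMNS = {"property_id", "city", "service_type", "signed_at", "amount"}

def load_team_spreadsheet(path: str) -> pd.DataFrame:
    """Load an Excel file and reject it if it does not follow the agreed template."""
    df = pd.read_excel(path)

    # Normalise headers the way the template asks for (lowercase, no spaces).
    df.columns = [str(c).strip().lower().replace(" ", "_") for c in df.columns]

    missing = EXPECTED_COLUMNS - set(df.columns)
    if missing:
        raise ValueError(f"Spreadsheet does not follow the template, missing: {missing}")

    # Basic validation rules before the data is allowed into the warehouse.
    if df["property_id"].isna().any():
        raise ValueError("Every row needs a property_id")
    df["signed_at"] = pd.to_datetime(df["signed_at"], errors="raise")
    return df
```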

Of course, this is not a sustainable long-term solution. However, what I discovered is that people quickly take ownership of their data in Excel. These
files are an integral part of their daily work, and they become emotionally
attached to them. That’s the reason why they take responsibility and clean
the data every day.

In most cases, you won’t find neatly structured data waiting for you in an
SQL database. That’s why it’s crucial to remain flexible and adaptable when
working with data.

Bad processes lead to no data


Sometimes things can get pretty chaotic, with data scattered all over the
place, and not properly organized or structured. As a data person, your role
is to be a facilitator, finding solutions to ensure that the right data reaches
the right person. But sometimes you also have to change the process by which data is collected, because a broken process leads to bad data, or no data at all. You will break things, but it’s fine, as long as it is for the
greater good.

Building traction around data takes time


At first, I thought that it was a matter of 2 months before getting the
company to use data in their daily work life. It was not. For all the reasons I
mentioned earlier, the tech, the people, the processes, the data quality…
Building traction around data takes time.
The first dashboard was released after 3 months. Some dashboards were being used here and there, but it was only after 7 months of being in the company and preaching data that we managed to release dashboards that the business and people started to use every day. They now actively manage some reporting on Metabase and follow key metrics for their daily job.

So, be patient and persistent in promoting data usage.

So, what comes next?


It’s been a remarkable year of growth. Building a data platform is a never-ending journey. There is still a lot to do and a lot to learn along the way.

Some focus areas will remain for the following year:

Encouraging a data-driven culture

Enhancing and Monitoring Data Quality (of course)

Maintaining a stable data platform to meet the increasing demand for data

And new ones will be developed:

Establishing a data governance system for the company

Starting to push self-serve data, so teams can be autonomous

Building a data platform is something that can feel overwhelming at first, but with the right principles, persistence, and a commitment to data quality,
you can unlock the full potential of your data to drive meaningful insights
and decisions within your organization.
Thanks for reading, and feel free to share your thoughts about this article!
