ETL for Data infra-2

The document outlines the importance of building a data infrastructure using Cloud ETL tools to enhance a company's profitability through effective data integration and analysis. It details the roles, components, and steps necessary for establishing such an infrastructure, emphasizing the need for quick decision-making, operational efficiency, and a competitive advantage. Additionally, it discusses the challenges faced by organizations of different sizes in implementing and maintaining data infrastructure.

Uploaded by Ashlesha Shetty

Cloud ETL Tools Make It Possible:
Step-by-Step Guide to Building a Data Infrastructure

Introduction
Building a data infrastructure is key to maximizing a company's profitability. By integrating and analyzing data, you can understand customer needs and market trends and make decisions quickly. It also contributes to improving sales efficiency, enhancing customer experience, and discovering new business opportunities, thereby strengthening competitiveness and reducing costs. Data infrastructure is a strategic asset that supports corporate growth.

The purpose of this document, "Understanding Data Infrastructure Construction Achieved with Cloud ETL Tools," is to provide a simple explanation of the steps that those in charge of utilizing and analyzing data should take to build a data infrastructure.
What is data infrastructure?

Data infrastructure refers to "the overall foundation that allows an organization or system to collect, manage, process, and analyze data."

Data infrastructure can be built in two types of environment: "on-premise" and "cloud." In an on-premise environment, servers are installed and operated in-house, whereas in a cloud environment, services and other resources provided by other companies are used.

While the ease of customization in an on-premise environment is an advantage, disadvantages exist, such as high initial costs and ongoing operation and maintenance costs. The cloud environment, on the other hand, incurs no initial costs and requires fewer man-hours for operation and maintenance than on-premise, but it does not allow for detailed customization.

Cost: on-premise environments carry the high cost of installing servers and other hardware; cloud environments have low implementation costs because you use an environment built by the service vendor.
Operational burden: on-premise requires in-house system maintenance and upkeep; in the cloud, the vendor provides maintenance and upkeep.
Customizability: on-premise lets you freely customize functions and systems; cloud customizability is limited.
Data integrity: on-premise requires regular data backups; the cloud provides automatic backups.
Security: on-premise requires building your own security environment; the cloud depends on the vendor's security structure.

Since each has different characteristics, it is important to choose the one that best fits your company's situation.
Four main roles of the data infrastructure

A data infrastructure is the foundation that enables an organization to effectively collect, manage, and analyze data. This allows data to be centralized, maintained in quality, and used to make decisions quickly and accurately. The data infrastructure supports data-driven decision-making by preparing the data needed for business intelligence, predictive analysis, and operational optimization.

Data collection and integration
The primary role of a data infrastructure is to collect data from a variety of internal and external data sources and to integrate the information. Stream and batch data are collected and integrated in one place to increase the versatility of data utilization.

Data storage and management
It is responsible for storing vast amounts of data accurately and securely. By managing information through data governance, organization, cleansing, and other data management systems, companies can comply with laws and regulations and enhance security.

Data conversion and processing
The data infrastructure is responsible for converting the collected data into a format suitable for analysis and utilization, and for eliminating problems such as inconsistencies, missing data, and duplication. This process improves data quality and ensures the reliability of analytical results. High-quality data is an important factor supporting accurate decision-making and efficient business operations.

Data provision and utilization
By building a data infrastructure, you can provide analytical results and processed data to users who lack expertise. By using BI tools and APIs, you can provide visualized data to other departments, who can gain useful insights from the data provided.
Three reasons why you need a data infrastructure

A data infrastructure is needed to support quick and accurate decision-making, to improve operational efficiency and reduce costs, and to promote data-driven management and gain competitive advantage. Having this foundation in place is essential for modern companies to achieve sustainable growth.

Quick and accurate data-driven decision making is required.

In today's business environment, fast and accurate decision-making is essential to remain competitive and grow. However, the data owned by companies is
diverse and often siloed, and in some cases lacks consistency and reliability. By integrating and centrally managing data from disparate systems and
formats, data infrastructures provide companies with instant access to the data they need, when they need it.

To improve operational efficiency and reduce costs

When each department of a company manages its own data, data duplication and inconsistencies occur, requiring significant effort and cost to resolve.
In addition, manual data processing is prone to errors and inefficient. Building a data infrastructure can solve these problems.

To promote data-driven management and gain competitive advantage

Data is also referred to as the "new oil" in modern corporate activities, and its use determines competitive advantage. By developing a data infrastructure,
companies can strategically utilize data to gain a competitive edge in the marketplace. They can analyze customer purchasing behavior to develop
personalized marketing and utilize real-time data to plan production in response to demand.

Difficulty of data infrastructure operation work

Building a data infrastructure on-premises requires a significant number of data-engineer man-hours at any organization size. Especially in a start-up environment with few engineers, implementation can take a long time because it is sometimes handled part-time by an engineer who is not dedicated to the project.

Data engineer at a start-up-scale organization
Infrastructure and operations: does not know best practices for operations; one person cannot cover all the required skills; unknown problems or errors often occur; cannot respond immediately to renewal processing and requests; starts small and expands gradually, but this takes time.
Busyness: lack of manpower and in-house knowledge; no time to create manuals and other materials.
Cooperation: the team is still not large enough to handle many simple requests.

Data engineer at a large organization
Infrastructure and operations: the infrastructure is already in place, and the main focus is on operations; the division of labor is so fine-grained that it is difficult to acquire a range of skills; hard problems specific to large scale are encountered; satisfactory processes cannot be created when they fall under the jurisdiction of a different department; the on-premise environment makes it difficult to build a data infrastructure.
Busyness: handling inquiries from each department is burdensome; many costs are incurred for manuals and other materials; explaining data use to related parties is costly.
Cooperation: data processing management is burdensome due to the high volume of data processing.
Data Infrastructure Challenges by Organization Size

Start-up-scale organization
● Difficult to invest in expensive tools
● Difficulty securing full-time personnel, resulting in one-person operations
● Difficulty securing resources and delays in responding to requests

Large organization
● Cannot replace current tools or the on-premise environment
● Because the staffing plan is fixed, it is difficult to increase headcount even when there is a shortage
● Since many people are expected to use the system, an environment that is easy to use must be developed

Establishing a data infrastructure requires substantial resources, personnel, and an environment conducive to utilization.
Components of Data Infrastructure

Data infrastructure is the collective term for the technologies and processes that enable companies to efficiently collect, manage, process, and analyze data. Its components are categorized as follows:

(1) Data collection
▼ Role: collect data from a wide variety of data sources. (APIs, logs, databases, ETL)

(2) Data processing
▼ Role: convert, clean, and integrate data into a format that can be analyzed and used. (ETL, real-time processing)

(3) Data storage and management
▼ Role: efficiently store collected data and make it available for later use. (data lake, DWH)

(4) Data analysis and visualization
▼ Role: use data to gain insights and support decision-making. (BI, machine learning platforms)

(5) Data quality and governance
▼ Role: manage data quality, security, and privacy. (data catalog, metadata management)

(6) Monitoring and operation
▼ Role: monitor and manage the entire data infrastructure to ensure it operates properly. (log management and performance monitoring tools)
Overall diagram of data infrastructure construction and utilization

The following is an example of an overall architecture diagram of a data infrastructure. Some companies develop their own pipelines using open source, while others use external tools to achieve this.

[Figure: Data Source → ETL/ELT → DWH → BI. Data sources (CRM, SFA, and manually updated form entries and spreadsheets) are read into object storage (S3) and synchronized to Snowflake. Data conversion is performed with dbt (dbt-core, with generated YML configuration). Execution logs are recorded, and Slack notifications are sent at the time of failure. The resulting Snowflake tables feed BI tools.]
Three Steps for Building a Data Infrastructure
STEP 1: Clarify objectives and requirements

Clarification of objectives and requirements is one of the most important steps in data infrastructure design. If this process is inadequate, it will have a significant impact on later design, implementation, and operation. By clearly defining the objectives and requirements, the data infrastructure can best meet the needs of the business and maximize its value.

▼ Clarification of objectives

Understand your business goals
Identify the company's strategic goals. Examples include developing new products, improving customer satisfaction, and reducing operating costs. Identify the business areas (e.g., sales, marketing, manufacturing, human resources) that the data infrastructure should support.

Determine specific data application methods
Consider how the data can be used: for example, creating dashboards for decision support, marketing strategies using predictive analytics, and optimizing operations based on real-time data.

Understand business processes
Understand the current business processes and how the data infrastructure relates to them. This clarifies how the infrastructure will actually help.

▼ Clarification of requirements

Data collection: establish which sources the data will be collected from and in what format.
Data storage: where will the data be stored, such as a data lake or data warehouse? Define storage requirements based on the amount, retention period, and frequency of data to be stored.
Data processing: which should be prioritized, batch processing or real-time processing? Define the required ETL (Extract, Transform, Load) process.
Data analysis: how will the data be visualized (BI tools or dashboards)?
Data governance: review policies to ensure data quality, consistency, and integrity, and compliance with privacy regulations (GDPR, CCPA, etc.).
STEP 2: Evaluation of current data

It is necessary to identify what information is currently in the company, how it is being used, and what the issues are. This allows you to prioritize which information is necessary and which is not.

Customer information: company name, address, phone number, e-mail address, name of representative, capital, number of employees, listed/unlisted, industry, business, website URL, department, position, etc.
Activity log information: web access history, mail opens, email clicks, clicks on in-page links, number of events, advertising banner clicks, ad delivery, form entries, material downloads, contact inquiries, free trials, etc.
Sales activity information: support status, lost order history, number of calls, number of business negotiations, appointment handling, mail distribution status, number of orders, number of appointments, order receipt rate, gross profit margin, unit sales, etc.
Company-internal information: CS support, follow-up history, employee information, inventory information, labor information, accounting information, etc.
STEP 3: Architecture Design - ① Data Collection

Architectural design is the critical design of the overall structure for efficiently collecting, processing, storing, and analyzing data based on a company's data strategy. Successful architecture design maximizes data utilization, improves operational efficiency, and enhances business outcomes.

▼ Objective
The primary purpose of data collection is to ensure that decisions are data-driven. By collecting accurate and relevant data, companies can gain an accurate understanding of their current situation and make forward-looking decisions. For example, customer purchase history and behavioral data can be used to analyze which products are popular and which campaigns are effective. In this way, data supports scientific decision-making that does not rely on qualitative impressions or guesswork.

▼ To be considered
■ Data sources
Internal data: databases, CRM systems, ERP systems, log files, etc.
External data: social media, Web APIs, third-party data sources, etc.
■ Collection processing methods
Batch processing: collecting data in batches on a regular basis, for example one run per day.
Real-time processing: capturing data as soon as it is generated (e.g., real-time analysis of customer behavior in online stores).
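The daily batch pattern above is often implemented as an incremental pull against a "last collected" watermark. The following is a minimal sketch under assumed names; the in-memory `SOURCE` list stands in for a real database query or API call:

```python
# Hypothetical "source system" records; in practice this would be a
# database query or a CRM export. Timestamps are ISO 8601 strings,
# which compare correctly as plain strings in this fixed format.
SOURCE = [
    {"id": 1, "updated_at": "2024-01-01T09:00:00", "name": "Acme"},
    {"id": 2, "updated_at": "2024-01-02T10:30:00", "name": "Globex"},
    {"id": 3, "updated_at": "2024-01-03T08:15:00", "name": "Initech"},
]

def batch_extract(records, watermark):
    """Collect only records updated after the last run (incremental batch)."""
    new = [r for r in records if r["updated_at"] > watermark]
    # Advance the watermark so the next daily run skips rows already collected.
    next_watermark = max((r["updated_at"] for r in new), default=watermark)
    return new, next_watermark

batch, wm = batch_extract(SOURCE, "2024-01-01T23:59:59")
print([r["id"] for r in batch])  # → [2, 3]
print(wm)                        # → 2024-01-03T08:15:00
```

Real-time processing, by contrast, would push each record through the same logic the moment it is generated rather than on a daily schedule.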
STEP 3: Architecture Design - ② Data Storage and Management

▼ Objective
The most basic purpose of data storage is to securely store the vast amount of data collected by a company. Data such as customer information, financial data, transaction records, and product information are very important assets in corporate activities. Storage systems are designed to protect these data from loss or corruption. For example, by providing data backup functions and disaster recovery measures, they provide an environment in which data can be preserved and business operations can continue even in the event of unexpected problems.

▼ To be considered
■ Data lake: stores raw data (structured, semi-structured, and unstructured) as is. Large-scale storage for future analysis and machine learning.
■ Data warehouse: stores structured data, especially for business intelligence (BI) and report analysis. Optimized for fast query processing.
■ Database (OLTP): stores data for transaction processing. Used to retrieve data directly from business systems.
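The lake/warehouse split above can be sketched in a few lines: raw records are kept unchanged in a lake-style location, while a typed, structured subset goes into a warehouse table for queries. All names and the directory layout here are illustrative, and SQLite stands in for a real warehouse:

```python
import json
import pathlib
import sqlite3
import tempfile

# 1) Data lake: store the raw events unchanged, for future analysis or ML.
lake = pathlib.Path(tempfile.mkdtemp()) / "lake"
lake.mkdir()
raw_events = [{"user": "a", "amount": "120", "note": "first order"},
              {"user": "b", "amount": "80"}]
(lake / "events_2024-01-01.json").write_text(json.dumps(raw_events))

# 2) Data warehouse: store only structured, typed columns optimized for BI queries.
wh = sqlite3.connect(":memory:")
wh.execute("CREATE TABLE sales (user TEXT, amount INTEGER)")
wh.executemany("INSERT INTO sales VALUES (?, ?)",
               [(e["user"], int(e["amount"])) for e in raw_events])
total = wh.execute("SELECT SUM(amount) FROM sales").fetchone()[0]
print(total)  # → 200
```

Note the asymmetry: the lake keeps the free-form `note` field it may never need, while the warehouse drops it and enforces an integer type on `amount`.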
STEP 3: Architecture Design - ③ Data Processing

▼ Objective
Because of the diversity of data collection sources and formats, problems such as missing, duplicate, and inconsistent data can occur. Data processing aims to correct or eliminate these problems and prepare high-quality data. For example, it unifies duplicate customer information or completes missing values in numerical data, thereby improving the accuracy of analysis and decision-making.

▼ To be considered
■ ETL (Extract, Transform, Load): the process of extracting data, transforming (cleansing, formatting, and aggregating) it as needed, and finally loading it into the desired storage (e.g., a data warehouse).
■ ELT (Extract, Load, Transform): a technique in which data is extracted, first loaded into storage, and then transformed in storage as needed. It is particularly suited to processing large amounts of data.
■ Streaming processing: real-time data processing techniques, used when an event-driven architecture or real-time analysis is required.
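The ETL flow described above can be sketched end to end. This is a minimal illustration, not any particular tool's implementation; the transform step performs exactly the two fixes the text mentions (unifying duplicate customer records and completing missing values), and SQLite stands in for the warehouse:

```python
import sqlite3

def extract():
    # Toy source data containing a duplicate customer and a missing value.
    return [{"id": 1, "city": "Tokyo"},
            {"id": 1, "city": "Tokyo"},   # duplicate record
            {"id": 2, "city": None}]      # missing value

def transform(rows):
    seen, clean = set(), []
    for r in rows:
        if r["id"] in seen:
            continue                               # unify duplicate customer info
        seen.add(r["id"])
        r["city"] = r["city"] or "unknown"         # complete the missing value
        clean.append(r)
    return clean

def load(rows, conn):
    conn.execute("CREATE TABLE customers (id INTEGER, city TEXT)")
    conn.executemany("INSERT INTO customers VALUES (:id, :city)", rows)

conn = sqlite3.connect(":memory:")
load(transform(extract()), conn)
print(conn.execute("SELECT COUNT(*) FROM customers").fetchone()[0])  # → 2
```

In the ELT variant, `load` would run before `transform`, and the cleansing would instead be expressed as SQL executed inside the warehouse.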
STEP 3: Architecture Design - ④ Data Analysis and Visualization

▼ Objective
The fundamental purpose of data analysis is to discover meaningful patterns and trends and gain insights from large amounts of data. Visualization presents these insights in visual form, such as graphs and charts, to provide an intuitive understanding of trends and outliers in the data. Visualization also presents complex data in a concise, easy-to-understand format to support quick and accurate decision-making.

▼ To be considered
■ Visualization with BI tools: tools for creating dashboards and reports. They are also useful for collaboration with upper management and other departments, since various information can be visualized easily. Examples: Tableau, Power BI, Looker, Qlik, etc.
■ Analysis of visualized information: visualized information is analyzed with BI tools and used to plan the next measures. Accumulated aggregate data and machine learning models also enable more sophisticated analysis. Examples: Python, R, Azure Machine Learning, AWS SageMaker, etc.
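Before a BI tool can chart anything, raw records are usually rolled up into a summary series. A minimal sketch, with illustrative data, of the aggregation a monthly-sales dashboard would sit on top of:

```python
from collections import Counter

# Raw order records: (month, amount). In practice these would come from
# the warehouse table populated by the ETL step.
orders = [("2024-01", 100), ("2024-01", 150), ("2024-02", 90)]

# Roll up to one value per month; this summary is what a bar chart plots.
monthly = Counter()
for month, amount in orders:
    monthly[month] += amount

print(dict(monthly))  # → {'2024-01': 250, '2024-02': 90}
```

A BI tool performs the same kind of grouping internally when you drag a date field and a measure onto a chart.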
Key points in architectural design of data infrastructure

The key points are as follows:

Scalability: it is important to design the system to handle increased data volume and traffic. Scalability can be ensured by utilizing cloud services and distributed systems.
Availability: redundancy and backup designs are required to minimize system downtime. High system availability is required especially when real-time business activities rely on it.
Performance: fast data processing speed and query response times are important for business intelligence and real-time analysis. Appropriate index design and distributed processing are effective in accelerating data processing.
Flexibility and extensibility: as the company grows, the data infrastructure must be designed for easy expansion. A modular architecture is desirable to accommodate the addition of new data sources and analysis methods.
STEP 3: Determining the schedule

The specific schedule will vary depending on the size of the project and its requirements, but can be roughly estimated as follows:

Small projects (2-4 months)
Purpose: small data analysis infrastructure, simple BI environment.
Scale: limited data sources (a few), mostly batch processing.
Examples: data analysis platform for SMEs; SaaS data integration and visualization environment; etc.

Medium-scale projects (6-12 months)
Purpose: building data lakes and data warehouses, integrating multiple data sources.
Scale: structured and unstructured data, both real-time and batch processing.
Examples: integrated analysis platform for marketing information; real-time monitoring system using IoT data; etc.

Large-scale projects (1-2+ years)
Purpose: enterprise-wide data platform, enhanced security.
Scale: integration of large data sources, multinational users, machine learning infrastructure.
Examples: integrated data platform for overseas companies; advanced risk management platform for financial institutions; etc.
Building a data infrastructure using cloud ETL tools

Effective ETL tools for data platforms

ETL tools are software and platforms that automate and streamline the three processes of Extract, Transform, and Load. They are responsible for extracting data from different systems and sources (databases, files, APIs, etc.), transforming and processing it according to the purpose, and then loading it into the target system (data warehouse, BI tool, etc.) for analysis and use. They are mainly used to build a foundation for business intelligence (BI) and data analysis, and are characterized by their ability to efficiently integrate and quality-control data.

■ As-Is: aggregating data manually
[Figure: data is extracted from Salesforce and processed by hand into CSV files, Excel, and spreadsheets (30 minutes to 1 hour for extraction and processing), then manually updated back into CRM/SFA and MA tools (another 30 minutes to 1 hour per update).]

■ To-Be: utilizing an ETL service
[Figure: extraction, processing, conversion, and updating between Salesforce, MA, and CRM/SFA are performed automatically by the ETL service.]
Improve operational efficiency by using ETL tools and a DWH

Building a data infrastructure in-house takes a lot of time. Even among non-engineers, there are surely people in charge who want to perform small-scale analysis by linking Salesforce and Kintone, or Marketo and Kintone, and so on. For such small-scale analysis, smooth integration is possible by using ETL tools. However, there is a limit to data linkage only between service A and service B. By introducing a DWH that accumulates data in between, it becomes possible to utilize complex information and conduct more effective analysis.

■ Using ETL to move data between services
[Figure: information from one service (e.g., Salesforce) is visualized or exported into another service (MA, CRM/SFA); integrating, visualizing, and exporting complex information across services is difficult.]

■ Using ETL to aggregate multiple data sources into a single DWH
[Figure: sales information, customer information, behavioral information, and business information from CRM/SFA and MA tools are aggregated into a data warehouse (DWH). Information in the DWH can be visualized directly with BI tools, enabling integrated visualization and export of complex information.]
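The value of the DWH pattern above is that data from two services can be joined for an analysis neither service could produce alone. A minimal sketch with hypothetical CRM and MA extracts, using SQLite in place of a real warehouse:

```python
import sqlite3

# Illustrative extracts: customer segments from a CRM, email-click counts
# from an MA tool. Neither system knows about the other's columns.
crm = [("acme", "Enterprise"), ("globex", "SMB")]   # (customer, segment)
ma = [("acme", 5), ("globex", 2), ("acme", 3)]      # (customer, clicks)

dwh = sqlite3.connect(":memory:")
dwh.execute("CREATE TABLE crm (customer TEXT, segment TEXT)")
dwh.execute("CREATE TABLE ma (customer TEXT, clicks INTEGER)")
dwh.executemany("INSERT INTO crm VALUES (?, ?)", crm)
dwh.executemany("INSERT INTO ma VALUES (?, ?)", ma)

# Cross-source analysis: engagement per customer segment.
rows = dwh.execute("""
    SELECT crm.segment, SUM(ma.clicks)
    FROM crm JOIN ma ON crm.customer = ma.customer
    GROUP BY crm.segment ORDER BY crm.segment
""").fetchall()
print(rows)  # → [('Enterprise', 8), ('SMB', 2)]
```

A point-to-point sync between the CRM and the MA tool could copy either column across, but only a shared store makes this join, and the BI chart built on it, straightforward.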
Overall diagram of data infrastructure construction and ETL tool utilization

The architecture diagram introduced at the beginning of this section can be implemented with an ETL tool as follows.

[Figure: the same Data Source → ETL/ELT → DWH → BI pipeline, with the ETL tool handling reads from object storage (S3), synchronization and writing to Snowflake, data conversion with dbt (dbt-core), execution logs, and Slack notifications at the time of failure. Data sources include CRM and SFA systems.]
Summary

Building a data infrastructure is an important factor in increasing a company's competitiveness and supporting its growth. Centralizing data and utilizing it quickly and accurately improves decision-making speed and operational efficiency. To achieve data-driven management, a data infrastructure is an indispensable investment, increasing a company's flexibility and enabling it to respond quickly to a fast-changing market.

In this document, we have briefly explained everything from the importance of a data infrastructure to the elements necessary to build it. We hope it has been of some help to you.

Introduction of the cloud ETL tool "TROCCO"

What is TROCCO?
TROCCO® is a cloud ETL service that has been adopted by more than 2,000 companies and organizations. In addition to ETL (data transfer and data conversion) functions, it provides workflow functions, permission management, and other functions necessary for building and operating data infrastructures. TROCCO automates a series of data engineering processes to help you make the most of your data.

Eliminate the tedious parts of data engineering and focus on "offensive work."

TROCCO® is a SaaS that supports the construction and operation of analysis infrastructure, covering data engineering areas such as ETL/data transfer, data mart generation, job management, and data governance. It automates the linkage, maintenance, and operation of all types of data to quickly build a data analysis infrastructure, creating an environment that makes it easy to gain insight and frees up time to focus on analytical work.

● Easily deploy professional data-engineer-level ETL/ELT pipelines in as little as 5 minutes
● Set up workflows from the GUI to keep the data environment fresh and well organized at any time
● The "Data Catalog" makes operations in the analysis area dramatically more efficient
Role of TROCCO® in building data infrastructure

TROCCO handles the linkage part required to build a data infrastructure. It can be implemented with low code and has a low barrier to adoption, making it possible to build a data infrastructure without stress.

[Figure: sales data and merchandise/purchase data from the core system, customer data from CRM and MA tools, advertising costs, and activity logs (site activity logs, email opening logs) flow into a DWH composed of a data lake, warehouse, and mart, with data modeling and workflow automation. From there, data feeds BI tools for monitoring and analysis, applications that update and utilize data, and the service DB for reflecting recommendation measures.]

With TROCCO®, you can build your data infrastructure at explosive speed.
Advantages of introducing TROCCO®

Reduced development man-hours and costs, freeing up resources
Engineers no longer need to be tied to operations and can focus their efforts on the measures where they would normally devote their resources.

Reduced analysis lead time and faster data utilization
The data you want is immediately linked and automatically updated. Low learning costs and the ability for non-engineers to acquire data on their own make it easier to obtain data and speed up analysis.

A well-developed data environment and greater democratization of data
Since everyone can refer to well-maintained, unified data, gaps in recognition and similar problems are eliminated. In addition, because everyone can easily access the centralized data, awareness of data utilization takes root in the organization.

Data utilization creates the groundwork for easily generating business value.
90% reduction in integration work time with TROCCO®

Comparison of data integration work time before and after implementation (duration: 1 year):

Embulk (in-house): initial development 480 h, building and maintenance 960 h (1,440 h total)
TROCCO: initial development 5 h, building and maintenance 60 h (65 h total)

Over 90% of the time data integration takes is eliminated.
More than 100 types of data sources can be linked

Core systems, MA/SFA tools, advertising data, spreadsheets: scattered data can be collected automatically and stably. TROCCO is also characterized by its wide range of support, including Japanese domestic services.

List of Companies

A wide variety of companies and organizations, regardless of industry or type of business, use our services.

© primeNumber Inc.