ETL for Data infra-2
Introduction.
What is data infrastructure?
Data infrastructure refers to the overall foundation that allows an organization or system to collect, manage, process, and analyze data.
■On-premise vs. cloud environments
➡Security: on-premise lets you build your own security environment, while the cloud depends on the vendor's security structure.
➡Customization: on-premise allows detailed customization; the cloud does not.
➡Data integrity: on-premise requires regular data backups; the cloud backs up automatically.
➡Cost and operation: the cloud incurs no initial costs and requires fewer man-hours for operation and maintenance than on-premise.
A data infrastructure is the foundation that enables an organization to effectively collect, manage, and analyze data. It allows data to be centralized, kept at a consistent quality, and used to make decisions quickly and accurately. The data infrastructure supports data-driven decision making by preparing the data needed for business intelligence, predictive analysis, and operational optimization.
The primary role of a data infrastructure is to collect data from a variety of internal and external data sources and to integrate the information. Stream and batch data are collected and integrated in one place to increase the versatility of data utilization. It is also responsible for storing vast amounts of data accurately and securely. By managing information through data governance, organization, cleansing, and other data management practices, companies can comply with laws and regulations and enhance security.
A data infrastructure is needed to support quick and accurate decision making, to improve operational efficiency and reduce costs, and to promote data-driven management and gain competitive advantage. Having the foundation in place to achieve these goals is essential for modern companies to achieve sustainable growth.
In today's business environment, fast and accurate decision-making is essential to remain competitive and grow. However, the data owned by companies is
diverse and often siloed, and in some cases lacks consistency and reliability. By integrating and centrally managing data from disparate systems and
formats, data infrastructures provide companies with instant access to the data they need, when they need it.
When each department of a company manages its own data, data duplication and inconsistencies occur, requiring significant effort and cost to resolve.
In addition, manual data processing is prone to errors and inefficient. Building a data infrastructure can solve these problems.
Data is also referred to as the "new oil" in modern corporate activities, and its use determines competitive advantage. By developing a data infrastructure,
companies can strategically utilize data to gain a competitive edge in the marketplace. They can analyze customer purchasing behavior to develop
personalized marketing and utilize real-time data to plan production in response to demand.
Difficulty of data infrastructure operations
Building a data infrastructure on-premises requires significant data-engineering man-hours at any organization size.
Especially in a start-up environment with few engineers, implementation can take a long time because it is often handled part-time by an engineer who is not dedicated to the project.
Data Infrastructure Challenges by Organization Size
■Small organizations (building the infrastructure from scratch)
➡Best practices for operations are unknown, and one person cannot cover all the required skills.
➡Unknown problems and errors occur frequently, and there is no capacity to respond immediately to renewal processing and ad-hoc requests.
➡The approach is to start small and expand gradually, but this takes time, and manpower and in-house knowledge are lacking.
➡There is no time to create manuals and other materials.
■Large organizations (infrastructure already in place; the main focus is on operations)
➡The division of labor is so fine-grained that it is difficult to acquire a broad range of skills.
➡Hard problems specific to large scale are encountered, and satisfactory processes cannot always be created when responsibility sits with a different department.
➡An on-premise environment makes it difficult to build out the data infrastructure.
➡The cost of handling inquiries from each department is burdensome, and many costs are incurred for manuals and other materials.
Components of Data Infrastructure
Data infrastructure is the collective term for the technologies and processes that enable companies to efficiently collect, manage, process, and analyze data.
Its components are categorized as follows:
■(1) Data collection (APIs, logs, databases, ETL)
■(2) Data processing (ETL, real-time processing)
■(3) Data storage (data lake, DWH)
■(4) Data analysis and visualization (BI, machine learning platforms)
▼ Role: use data to gain insights and support decision-making.
■(5) Data quality and governance (data catalog, metadata management)
▼ Role: manage data quality, security, and privacy.
■(6) Monitoring and operation (log management and performance monitoring tools)
▼ Role: monitor and manage the entire data infrastructure to ensure that it is operating properly.
Overall diagram of data infrastructure construction and utilization
[Architecture diagram: data from CRM and SFA sources is read into S3 objects, synchronized to Snowflake, transformed with dbt (including YML generation), and visualized with BI tools; execution logs are recorded and a notice is sent to Slack at time of failure.]
3 steps for building a data infrastructure
STEP 1: Clarify objectives and requirements
"Clarification of objectives and requirements" is one of the most important steps in data infrastructure design. If this process is inadequate, it will have a significant impact on later design, implementation, and operation. By clearly defining the objectives and requirements, the data infrastructure can best meet the needs of the business and maximize its value.
▼ Clarification of objectives
▼ Clarification of requirements
STEP 2: Evaluation of current data
It is necessary to identify what information the company currently holds, how it is being used, and what the issues are. This makes it possible to sort out which information is necessary and which is unnecessary, and to prioritize accordingly.
[Examples of company-internal information: customer information, activity log information, sales activity information.]
STEP 3: Architectural design – ① Data collection
Architectural design is the critical design of the overall structure for efficiently collecting, processing, storing, and analyzing data based on a company's data strategy. Successful architecture design maximizes data utilization, improves operational efficiency, and enhances business outcomes.
The primary purpose of data collection is to ensure that decisions are data-driven. By collecting accurate and relevant data, companies can gain an accurate understanding of their current situation and make forward-looking decisions. For example, customer purchase history and behavioral data can be used to analyze which products are popular and which campaigns are effective. In this way, data supports scientific decision-making that does not rely on qualitative senses or guesswork.
■Data sources
➡Internal data: databases, CRM systems, ERP systems, log files, etc.
➡External data: SNS, Web APIs, third-party data sources, etc.
■Collection processing methods
➡Batch processing: collecting data in batches on a regular basis, for example one run per day (see the sketch below).
➡Real-time processing: capturing data as soon as it is generated (e.g., real-time analysis of customer behavior in online stores).
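To make the batch pattern concrete, here is a minimal sketch in Python. The endpoint, the table layout, and the use of SQLite as a landing zone are all assumptions for illustration, not a prescribed implementation:

import sqlite3
from datetime import date, timedelta

import requests  # third-party HTTP client (pip install requests)

# Hypothetical source endpoint; replace with your real API.
API_URL = "https://api.example.com/orders"

def extract_yesterday() -> list[dict]:
    """Pull one day's worth of records in a single batch run."""
    target = date.today() - timedelta(days=1)
    resp = requests.get(API_URL, params={"date": target.isoformat()}, timeout=30)
    resp.raise_for_status()
    return resp.json()

def load(rows: list[dict]) -> None:
    """Land the raw batch in a staging table (SQLite stands in for a DWH)."""
    con = sqlite3.connect("staging.db")
    con.execute(
        "CREATE TABLE IF NOT EXISTS raw_orders"
        " (id TEXT PRIMARY KEY, amount REAL, ordered_at TEXT)"
    )
    con.executemany(
        "INSERT OR REPLACE INTO raw_orders VALUES (:id, :amount, :ordered_at)",
        rows,
    )
    con.commit()
    con.close()

if __name__ == "__main__":
    # Scheduled once per day (e.g., by cron), matching the batch pattern above.
    load(extract_yesterday())

Real-time processing would replace the daily trigger with a stream consumer that handles each event as it arrives; the load step stays conceptually the same.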
STEP 3: Architectural design – ② Data storage and management
STEP 3: Architectural design – ③ Data processing
STEP 3: Architectural design – ④ Data analysis and visualization
Key points in architectural design of data infrastructure
■Scalability: design the system to handle increased data volume and traffic. Scalability can be ensured by utilizing cloud services and distributed systems.
■Availability: redundancy and backup designs are required to minimize system downtime. High availability is required especially when real-time business activities rely on the system.
■Performance: fast data processing and query response times are important for business intelligence and real-time analysis. Appropriate index design and distributed processing are effective in accelerating data processing.
■Flexibility and extensibility: as the company grows, the data infrastructure must be designed to expand easily. A modular architecture is desirable to accommodate the addition of new data sources and analysis methods.
STEP 3: Determination of schedule
The details of the specific schedule will vary depending on the size and requirements of the project.
Building a data infrastructure using cloud ETL tools
Effective ETL tools for data platforms
ETL tools are software and platforms that automate and streamline the three processes of Extract, Transform, and Load. They are responsible
for extracting data from different systems and sources (databases, files, APIs, etc.), transforming and processing them according to the
purpose, and then loading them into the target system (data warehouse, BI tool, etc.) for analysis and use.
ETL tools are mainly used to build a foundation for business intelligence (BI) and data analysis, and are characterized by their ability to integrate data efficiently and control its quality.
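To make the three stages concrete, here is a minimal illustrative sketch in Python. The CSV source, column names, and SQLite target are assumptions standing in for a real source system and DWH:

import csv
import sqlite3

def extract(path: str) -> list[dict]:
    """Extract: read raw records from a source export (stand-in for a DB/API)."""
    with open(path, newline="", encoding="utf-8") as f:
        return list(csv.DictReader(f))

def transform(rows: list[dict]) -> list[tuple]:
    """Transform: normalize types and drop records that fail a basic quality check."""
    cleaned = []
    for r in rows:
        if not r.get("customer_id"):  # simple data-quality gate
            continue
        cleaned.append((r["customer_id"], r["email"].strip().lower(), float(r["amount"])))
    return cleaned

def load(rows: list[tuple]) -> None:
    """Load: write cleaned records into the analysis target (SQLite as a toy DWH)."""
    con = sqlite3.connect("dwh.db")
    con.execute("CREATE TABLE IF NOT EXISTS sales (customer_id TEXT, email TEXT, amount REAL)")
    con.executemany("INSERT INTO sales VALUES (?, ?, ?)", rows)
    con.commit()
    con.close()

if __name__ == "__main__":
    load(transform(extract("sales_export.csv")))

An ETL tool packages this same pattern behind a GUI: prebuilt connectors replace extract, mapping rules replace transform, and managed warehouse writers replace load.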
■As Is: aggregating data manually
➡Data is extracted from Salesforce (CRM/SFA) and MA tools as CSV, processed by hand, and pasted into Excel or a spreadsheet. Extraction and processing take 30 minutes to 1 hour, and updating takes another 30 minutes to 1 hour.
■To Be: utilizing an ETL service
➡Extraction, processing, conversion, and updating along the same selection → conversion → visualization/export flow are performed automatically.
Improve operational efficiency by using ETL tools and DWH
Building a data infrastructure in-house takes a lot of time. Among non-engineers, there are often people in charge who want to perform small-scale analysis, for example by linking Salesforce and Kintone, or Marketo and Kintone. For such small-scale analysis, an ETL tool makes smooth integration possible.
However, there is a limit to data linkage only between service A and service B. By introducing a DWH that can accumulate data in between, it becomes possible to utilize complex information and conduct more effective analysis.
■Utilizing ETL to move data between services
➡Basically, one service's information can be visualized/exported to another service (e.g., Salesforce to an MA tool via a selection → conversion → visualization/export flow). Integrating and visualizing/exporting complex information across services remains difficult.
■Utilizing ETL to aggregate multiple data sources into a single DWH
➡Sales information, customer information, behavioral information, and business information from CRM/SFA and MA tools are accumulated in a data warehouse (DWH). Information in the DWH can be visualized directly with BI tools, enabling integrated visualization/export of complex information (a sketch of this pattern follows).
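As a toy illustration of the DWH pattern, the sketch below lands extracts from two hypothetical sources in one SQLite "warehouse" and answers a cross-source question with a single query; the tables and fields are invented for the example:

import sqlite3

# Two extracts from hypothetical sources (in practice: CRM/SFA and MA connectors).
crm_customers = [("c1", "Alice"), ("c2", "Bob")]
ma_opens = [("c1", "2024-01-05"), ("c1", "2024-01-07"), ("c2", "2024-01-06")]

con = sqlite3.connect("dwh.db")
con.execute("CREATE TABLE IF NOT EXISTS customers (id TEXT PRIMARY KEY, name TEXT)")
con.execute("CREATE TABLE IF NOT EXISTS email_opens (customer_id TEXT, opened_on TEXT)")
con.executemany("INSERT OR REPLACE INTO customers VALUES (?, ?)", crm_customers)
con.executemany("INSERT INTO email_opens VALUES (?, ?)", ma_opens)
con.commit()

# Cross-source analysis becomes a single query once both sources live in the DWH.
for name, opens in con.execute(
    """SELECT c.name, COUNT(o.opened_on)
       FROM customers c LEFT JOIN email_opens o ON o.customer_id = c.id
       GROUP BY c.id"""
):
    print(name, opens)
con.close()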
Overall diagram of data infrastructure construction and ETL tool utilization
The architecture diagram introduced at the beginning of this section can be implemented with an ETL tool as follows.
[Architecture diagram: CRM and SFA data is read into S3, synchronized to Snowflake, transformed with dbt (dbt-core), and visualized with BI tools; execution logs are recorded and notifications are sent to Slack.]
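The failure-notification corner of this diagram can be approximated in a few lines of Python. Triggering dbt-core via its CLI with subprocess is just one simple option, and the Slack webhook URL below is a placeholder (a managed tool would handle both steps for you):

import subprocess

import requests  # used to post to a Slack incoming webhook

# Placeholder: create an incoming webhook in Slack and paste its URL here.
SLACK_WEBHOOK_URL = "https://hooks.slack.com/services/XXX/YYY/ZZZ"

def run_dbt_and_notify() -> None:
    """Run the dbt transformation step and report failures to Slack."""
    result = subprocess.run(["dbt", "run"], capture_output=True, text=True)
    if result.returncode != 0:
        # Send the tail of the execution log as the failure notice.
        requests.post(
            SLACK_WEBHOOK_URL,
            json={"text": f"dbt run failed:\n{result.stdout[-1000:]}"},
            timeout=10,
        )
        raise SystemExit(result.returncode)

if __name__ == "__main__":
    run_dbt_and_notify()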
Summary
Introduction of the cloud ETL tool "TROCCO"
What is TROCCO?
TROCCO® is a cloud ETL service that has been implemented by more than 2,000 companies
and organizations.
In addition to ETL (data transfer and data conversion) functions, it also provides workflow functions, permission management, and other functions necessary for building and operating data infrastructures.
TROCCO automates a series of data engineering processes to help you make the most of your data. Workflows can be set up from the GUI to keep the data environment fresh and well organized at all times, and the "Data Catalog" makes operations in the analysis area dramatically more efficient.
Role of TROCCO®︎ in building data infrastructure
TROCCO handles the linkage layer required to build a data infrastructure. Because it can be implemented with low code and has a low implementation hurdle, a data infrastructure can be built without stress.
[Diagram: sales data (merchandise and purchase data from the core system), customer data (CRM and MA tools), advertising cost data, and activity logs (site activity and email-open logs) flow through TROCCO into the DWH (data lake → warehouse → mart). BI tools use the DWH for monitoring and analysis, applications update and utilize the data, and recommendation measures are reflected back into the service DB. Data modeling and workflow automation support the pipeline.]
■Reduced development man-hours and costs; resources freed up
➡Engineers no longer need to be attached to operations and can focus their efforts on measures where their resources are genuinely needed.
■Reduced analysis lead time; faster data utilization
➡The data you want is immediately linked and automatically updated. Low learning costs and the ability for non-engineers to acquire data on their own make it easier to obtain data and speed up analysis.
■A well-developed data environment; growing democratization of data
➡Since everyone can refer to consistently maintained data, gaps in recognition and similar problems are eliminated. Because the centralized data is easily accessible to all, awareness of data utilization takes root in the organization.
In this way, data utilization lays the groundwork for creating business value.
90% reduction in integration work time with TROCCO®.
Comparison of data integration work time before and after implementation (duration: 1 year):
■Embulk: 480 h of initial development plus 960 h of building and maintenance.
■TROCCO: 5 h of initial development plus 60 h of building and maintenance.
TROCCO® reduces the time data integration takes by over 90%.
More than 100 types of data sources can be linked
Core systems, MA/SFA tools, advertising data, spreadsheets: scattered data can be collected automatically and stably. TROCCO is also characterized by its wide range of support, including Japanese domestic services.
List of Companies
A wide variety of companies and organizations, regardless of industry or type of business, use our services.
© primeNumber Inc.