Validate CSV Data Loading in Snowflake

The document provides a comprehensive guide on how to validate data while loading CSV files into Snowflake, detailing the use of various SQL commands and parameters. It covers the creation of file formats, customer tables, and the execution of copy commands with and without validation flags, highlighting the implications on compute costs and processing time. Additionally, it addresses potential errors and limitations associated with data validation during the loading process.

Uploaded by

ksnyogatuni

Validate Data In Snowflake While Loading

Summary

1. How To Validate Data In Snowflake While Loading CSV Files
2. Sample Data Set
3. SQL Used In This Article
   - The File Format To Support Allow Duplicate
   - Create Customer Table
   - Create Standard File Format & Put Command
   - SnowSQL Put Command & List User Stage
   - Run Copy Command (Without Validation Flag)
   - Run Copy Command with Validation Flag
   - Multiple Line error
   - One Line Many errors
   - Large Files & Validation Parameter

This blog explains how to run a validation process while loading small or large
CSV files into Snowflake. This is done with a parameter that has certain
limitations; if you know how to use it appropriately, you will save a lot of
compute cost.
How To Validate Data In Snowflake While Loading CSV Files
Data loading can take time, and if a conversion or parsing issue occurs while
loading data into a table, the cycle of fixing the issue and re-running the
copy command is time consuming. This chapter helps you understand how data
can be validated before it is loaded into Snowflake and which options are
available to debug data issues.

This episode describes each of the data validation options and helps you
solve a problem through a hands-on exercise. Once you complete this video,
you will be able to answer the following questions:

1. Does the file format have a data validation parameter?
2. Does the copy command have a data validation parameter?
3. What are the limitations of the data validation option, and when can it
not be used?
4. Does adding the validation parameter increase compute cost?
5. Does adding the validation parameter take a lot of time for large files?
6. How do you check past validation errors for all errors and partitioned files?
Sample Data Set
Here is the data set which is used in this guide.

SQL Used In This Article
The File Format To Support Allow Duplicate
Create Customer Table

-- customer table with 15 columns having different data types
create or replace transient table customer_validation (
    customer_pk number(38,0),
    salutation varchar(10),
    first_name varchar(50),
    last_name varchar(50),
    gender varchar(1),
    marital_status varchar(1),
    day_of_birth date,
    birth_country varchar(60),
    email_address varchar(50),
    city_name varchar(60),
    zip_code varchar(10),
    country_name varchar(20),
    gmt_timezone_offset number(10),
    preferred_cust_flag boolean,
    registration_time timestamp_ltz(9)
);

Create Standard File Format & Put Command

-- create a file format called csv_ff
create or replace file format csv_ff
    type = 'csv'
    compression = 'none'
    field_delimiter = ','
    record_delimiter = '\n'
    skip_header = 1
    field_optionally_enclosed_by = '\047';
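
The file format is a named object, so it can be inspected after it is created. A minimal sketch, assuming csv_ff was created in the current database and schema: the first statement lists the properties stored for the format, and the second finds it by name. Note that '\047' is the octal escape for the single-quote character.

```sql
-- show the properties recorded for the csv_ff file format
desc file format csv_ff;

-- list file formats whose name matches csv_ff
show file formats like 'csv_ff';
```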

SnowSQL Put Command & List User Stage

-- let's load the data using the put command
put
    [Link]
    @~/ch08/small-csv
    auto_compress=false;

-- list the user stage location
list @~/ch08/small-csv ;

Run Copy Command (Without Validation Flag)

-- run copy command to load data from stage to table
copy into customer_validation
    from @~/ch08/small-csv/customer_01_one_error.csv
    file_format = csv_ff;

-- now rerun the copy command with on_error = 'continue'
-- this is one additional property, which skips the rows in error and
-- loads the remaining rows
copy into customer_validation
    from @~/ch08/small-csv/customer_01_one_error.csv
    file_format = csv_ff
    on_error = 'continue';
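
The two runs above differ only in the on_error copy option. As a hedged sketch of one more variation, 'skip_file' rejects the whole file when it contains any error, instead of loading the good rows. Whether this fits depends on how you want partial files to be handled; force = true is added here only on the assumption that the file was already picked up by the earlier run.

```sql
-- skip the entire file if it contains any error
-- force = true asks copy to consider the file again even if it was loaded before
copy into customer_validation
    from @~/ch08/small-csv/customer_01_one_error.csv
    file_format = csv_ff
    on_error = 'skip_file'
    force = true;
```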

Run Copy Command with Validation Flag

-- Option-1
-- validation_mode = return_errors;
-- Returns all errors (parsing, conversion, etc.) across all files
-- specified in the COPY statement.

-- Option-2
-- validation_mode = return_n_rows;
-- Validates the specified number of rows if no errors are encountered;
-- otherwise fails at the first error encountered in the rows.

-- Option-3
-- validation_mode = return_all_errors;
-- Returns all errors across all files specified in the COPY statement,
-- including files that were partially loaded by an earlier copy run with
-- on_error = 'continue'.

-- validate only; no rows are loaded
copy into customer_validation
    from @~/ch08/small-csv/customer_01_one_error.csv
    file_format = csv_ff
    force = true
    validation_mode = return_all_errors;

-- Option-2
copy into customer_validation
    from @~/ch08/small-csv/customer_01_one_error.csv
    file_format = csv_ff
    validation_mode = return_1_rows;

-- Option-3
copy into customer_validation
    from @~/ch08/small-csv/customer_01_one_error.csv
    file_format = csv_ff
    validation_mode = return_all_errors;
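
A validation run returns its findings as a query result, which disappears once you move on. One possible way to keep it, sketched here under the assumption that result_scan can read the output of the validation run, is to copy that result into a scratch table straight away. The table name customer_validation_errors is hypothetical and not part of the original article.

```sql
-- run the validation-only copy (no rows are loaded)
copy into customer_validation
    from @~/ch08/small-csv/customer_01_one_error.csv
    file_format = csv_ff
    validation_mode = return_errors;

-- keep the rows returned by the previous statement in a scratch table;
-- customer_validation_errors is a hypothetical name used for illustration
create or replace temporary table customer_validation_errors as
    select * from table(result_scan(last_query_id()));

select * from customer_validation_errors;
```

The create statement has to run immediately after the copy, because last_query_id() refers to the most recent statement in the session.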

Multiple Line error


-- list the user stage location (if you have not seen my stage-related
-- videos, please watch them later)
list @~/ch08/small-csv ;

-- run copy command to load data from stage to table
-- we don't know where all the issues are
copy into customer_validation
    from @~/ch08/small-csv/customer_02_three_errors.csv
    file_format = csv_ff
    on_error = 'continue'
    force = true
    validation_mode = return_errors;

-- run without validation mode and skip the error records
copy into customer_validation
    from @~/ch08/small-csv/customer_02_three_errors.csv
    file_format = csv_ff
    on_error = 'continue'
    force = true;

-- check the table
select * from customer_validation;
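
After a load with on_error = 'continue', the skipped records are no longer visible in the table itself. A minimal sketch of how they might be reviewed, assuming the copy above was the last copy into this table in the current session: the validate table function accepts job_id => '_last' as a shortcut for that most recent run.

```sql
-- errors and rejected records from the most recent copy into
-- customer_validation executed in this session
select *
from table(validate(customer_validation, job_id => '_last'));
```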

One Line Many errors


list @~/ch08/small-csv ;

-- run copy command to load data from stage to table
-- we don't know where all the issues are
copy into customer_validation
    from @~/ch08/small-csv/customer_03_one_line_many_error.csv
    file_format = csv_ff
    on_error = 'continue'
    force = true
    validation_mode = return_errors;

-- run without validation mode and skip the error records
copy into customer_validation
    from @~/ch08/small-csv/customer_03_one_line_many_error.csv
    file_format = csv_ff
    on_error = 'continue'
    force = true;

-- check the table
select * from customer_validation;

Large Files & Validation Parameter

```sql
-- here are my large files (csv compressed with gzip)
list @~/ch08/csv/partition;

-- there are 10 data files in the user stage, a data set of close to 3.5m
-- 3 of the files have data errors
-- we don't know which rows in which files are affected

-- since they are gz files, we need a new file format
create or replace file format csv_gz_ff
    type = 'csv'
    compression = 'gzip'
    field_delimiter = ','
    field_optionally_enclosed_by = '\042'
    skip_header = 1;

-- copy command with validation_mode = return_all_errors
copy into customer_validation
    from @~/ch08/csv/partition
    file_format = csv_gz_ff
    on_error = 'continue'
    force = true
    pattern = '.*[.]csv[.]gz'
    validation_mode = return_all_errors;

select * from customer_validation limit 10;
-- let's check the result:
-- does it report data issues from all files (parsing as well as conversion),
-- and what is the total time taken to parse them?

-- what happens if we ask for the first 10 rows
copy into customer_validation
    from @~/ch08/csv/partition
    file_format = csv_gz_ff
    on_error = 'continue'
    force = true
    pattern = '.*[.]csv[.]gz'
    validation_mode = return_10_rows;

-- check the time taken, and which files are picked

select * from table(validate(customer_validation, job_id => 'query-id'));
-- 01a81f86-3200-94b8-0002-14d20005a25e

select * from table(validate(customer_validation, job_id => '01a81f81-3200-93d3-0002-14d20005d3a6'));

-- validation mode does not support transformation,
-- so this statement is expected to be rejected
copy into customer_validation
    from (
        select distinct * from @~/ch08/csv/partition t
    )
    file_format = csv_gz_ff
    on_error = 'continue'
    force = true
    pattern = '.*[.]csv[.]gz'
    validation_mode = return_all_errors;
```
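
Once the partitioned files are loaded for real (without validation_mode), one way to see which files were picked up and how many errors each one had is the copy_history table function. This is a minimal sketch, assuming the load ran within the last hour and that the table lives in the current database and schema; the one-hour window is an arbitrary choice for illustration.

```sql
-- per-file load status for customer_validation over the last hour
select file_name,
       status,
       row_parsed,
       row_count,
       error_count,
       first_error_message
from table(information_schema.copy_history(
    table_name => 'CUSTOMER_VALIDATION',
    start_time => dateadd(hours, -1, current_timestamp())
))
order by last_load_time;
```

Comparing row_parsed with row_count per file gives a quick view of how many records each file lost to errors.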
06 Nov 2022