
Banking data analysis

Banking data is loaded into MySQL tables from CSV files, imported into HDFS with Sqoop, and exposed through external Hive tables for analysis. The analyses include users with pending loan instalments, users with a healthy credit card but an outstanding loan, the maximum profit per share per date, survey ratings, and email data.

1. Loading data into MySQL


Entering the MySQL shell

mysql -u root -p

Creating the database bank in MySQL

CREATE DATABASE bank;


USE bank;

Creating tables in MySQL and inserting data into them

Creating table loan_info


CREATE TABLE loan_info (
loan_id int,
user_id int,
last_payment_date DATE,
payment_installation DOUBLE,
date_payable DATE
);

Inserting data into loan_info table

insert into loan_info values(1234,5678,'2017-02-20',509,'2017-03-20');
insert into loan_info values(1243,5687,'2016-02-18',9087,'2016-03-18');
insert into loan_info values(1324,5786,'2017-03-01',8976,'2017-04-01');
insert into loan_info values(4312,8976,'2017-01-18',9087,'2017-02-18');

Checking the data in loan_info table

select * from loan_info;

Creating table credit_card_info

CREATE TABLE credit_card_info (
cc_number bigint,
user_id int,
maximum_credit DOUBLE,
outstanding_balance DOUBLE,
due_date DATE
);

Inserting data into the credit_card_info table

insert into credit_card_info values(1234678753672899,1234,50000,35000,'2017-03-22');
insert into credit_card_info values(1234678753672900,1243,500000,500000,'2017-03-12');
insert into credit_card_info values(1234678753672902,1324,15000,12000,'2017-03-09');
insert into credit_card_info values(1234678753672908,4312,60000,60000,'2017-02-16');

Checking the data in credit_card_info table

select * from credit_card_info;

Creating table shares_info

CREATE TABLE shares_info (
share_id varchar(10),
company_name varchar(20),
gmt_timestamp bigint,
share_price DOUBLE
);

Inserting data into shares_info table

insert into shares_info values('S102',"MyCorp",1488412702,100);
insert into shares_info values('S102',"MyCorp",1488411802,110);
insert into shares_info values('S102',"MyCorp",1488411902,90);
insert into shares_info values('S102',"MyCorp",1488412502,80);
insert into shares_info values('S102',"MyCorp",1488411502,120);

Checking the data in shares_info table

select * from shares_info;

commit;

2. Importing data from MySQL into HDFS using Sqoop


Now we have data in MySQL and need to move it into HDFS. We will do this with Sqoop's import tool.

To let Sqoop connect to MySQL, we first create a dedicated user.

CREATE USER 'myuser'@'localhost' IDENTIFIED BY 'myuser';
grant all on *.* to 'myuser'@'localhost' with grant option;
flush privileges;
commit;
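Granting all privileges on every database is broader than these imports need. If you prefer least privilege, a grant scoped to the bank database is enough; a minimal alternative to the grant above:

GRANT ALL ON bank.* TO 'myuser'@'localhost';
FLUSH PRIVILEGES;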

Now let's transfer these tables into HDFS by writing sqoop jobs.

We will protect the MySQL password by saving it in a file.

echo -n "myuser">>sqoop_mysql_passwrd
You need to use the option -n. Otherwise, a new line will be created unknowingly and while reading
the password, Sqoop throws an error Access Denied for User.
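As a quick sanity check, the file should contain exactly the 6 bytes of "myuser" and nothing else, and Sqoop's documentation recommends restricting the password file's permissions:

wc -c sqoop_mysql_passwrd      # should print 6 (no trailing newline)
chmod 400 sqoop_mysql_passwrd  # readable only by the owner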

Sqoop job to transfer data in the loan_info table

Optionally, create a directory in HDFS to stage the loan_info table data (Sqoop creates the target directory itself during a normal import, so this step is not required):

hadoop fs -mkdir /bank
hadoop fs -mkdir /bank/loan_info_stg

Creating the sqoop job sqoop_loan_info

sqoop job --create sqoop_loan_info -- import --connect jdbc:mysql://localhost/bank --username myuser --table loan_info --password-file file:///home/acadgild/sqoop_mysql_passwrd --target-dir /bank/loan_info_stg -m 1

sqoop job --list

Executing the sqoop job sqoop_loan_info


sqoop job --exec sqoop_loan_info
Sqoop job to transfer data in the credit_card_info table

Create a directory in HDFS for storing credit_card_info table data

hadoop fs -mkdir /bank/credit_card_info_stg

Creating the sqoop job sqoop_credit_card_info

sqoop job --create sqoop_credit_card_info -- import --connect jdbc:mysql://localhost/bank --username myuser --table credit_card_info --password-file file:///home/acadgild/sqoop_mysql_passwrd --target-dir /bank/credit_card_info_stg -m 1

sqoop job --list

Executing the sqoop job sqoop_credit_card_info

sqoop job --exec sqoop_credit_card_info


Sqoop job to transfer data in the shares_info table

Creating HDFS directory to store shares_info table data

hadoop fs -mkdir /bank/shares_info_stg

Creating the sqoop job sqoop_shares_info

sqoop job --create sqoop_shares_info -- import --connect jdbc:mysql://localhost/bank --username myuser --table shares_info --password-file file:///home/acadgild/sqoop_mysql_passwrd --target-dir /bank/shares_info_stg -m 1

sqoop job --list

Executing the sqoop job sqoop_shares_info

sqoop job --exec sqoop_shares_info
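To confirm the imports landed, list the staging directories and inspect one of the part files (names like part-m-00000 are the default output of Sqoop's map-only import):

hadoop fs -ls /bank
hadoop fs -cat /bank/loan_info_stg/part-m-00000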


3. Creating external tables in Hive

Creating the database in Hive

CREATE DATABASE bank;

USE bank;

Creating table loan_info_stg

As this is an external table, we only need to point it at the location of the data in HDFS.

CREATE EXTERNAL TABLE loan_info_stg (
loan_id int,
user_id int,
last_payment_date string,
payment_installation double,
date_payable string
) ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
LOCATION '/bank/loan_info_stg';

Creating table credit_card_info_stg

CREATE EXTERNAL TABLE credit_card_info_stg (
cc_number string,
user_id int,
maximum_credit double,
outstanding_balance double,
due_date string
) ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
LOCATION '/bank/credit_card_info_stg';

Creating table shares_info_stg

CREATE EXTERNAL TABLE shares_info_stg (
share_id string,
company_name string,
gmt_timestamp bigint,
share_price double
) ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
LOCATION '/bank/shares_info_stg';
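A quick way to confirm the external tables can read Sqoop's output is to sample one of them, for example:

SELECT * FROM loan_info_stg LIMIT 5;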
4. Creating core tables and loading data into them from the stg tables

Registering the encryption UDFs in the Hive shell.
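The temporary functions below assume the jar containing the encryption.AESencrypt and encryption.AESdecrypt classes is already on the session classpath. If it is not, add it first; the path below is a placeholder for wherever your UDF jar actually lives:

ADD JAR /home/acadgild/encryption-udf.jar;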

CREATE TEMPORARY FUNCTION encrypt AS 'encryption.AESencrypt';
CREATE TEMPORARY FUNCTION decrypt AS 'encryption.AESdecrypt';

Creating the loan_info table (every column is a string because the values are stored encrypted)

CREATE TABLE loan_info (
loan_id string,
user_id string,
last_payment_date string,
payment_installation string,
date_payable string
) STORED AS ORC;

Inserting data into loan_info table

INSERT INTO TABLE loan_info
SELECT encrypt(loan_id),
encrypt(user_id),
encrypt(last_payment_date),
encrypt(payment_installation),
encrypt(date_payable)
FROM loan_info_stg;

Creating credit_card_info table

CREATE TABLE credit_card_info (
cc_number string,
user_id string,
maximum_credit string,
outstanding_balance string,
due_date string
) STORED AS ORC;

Inserting data into credit_card_info table


INSERT INTO TABLE credit_card_info
SELECT encrypt(cc_number),
encrypt(user_id),
encrypt(maximum_credit),
encrypt(outstanding_balance),
encrypt(due_date)
FROM credit_card_info_stg;

Creating shares_info table

CREATE TABLE shares_info (
share_id string,
company_name string,
gmt_timestamp string,
share_price string
) STORED AS ORC;

Inserting data into shares_info table

INSERT INTO TABLE shares_info
SELECT encrypt(share_id),
encrypt(company_name),
encrypt(gmt_timestamp),
encrypt(share_price)
FROM shares_info_stg;

Checking the data in the three tables

Because this is banking data, every value in the core tables is stored encrypted.

Once the core tables are loaded, the data in the stg tables can be cleared.
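Note that Hive's TRUNCATE TABLE works only on managed tables, and the stg tables are external, so one way to clear them is to delete the files under their HDFS locations:

hadoop fs -rm -r /bank/loan_info_stg/*
hadoop fs -rm -r /bank/credit_card_info_stg/*
hadoop fs -rm -r /bank/shares_info_stg/*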

5. Analysis
Decrypting the data for analysis

CREATE TEMPORARY FUNCTION encrypt AS 'encryption.AESencrypt';
CREATE TEMPORARY FUNCTION decrypt AS 'encryption.AESdecrypt';
CREATE TEMPORARY FUNCTION max_profit AS 'maxprofit.MaxProfit';

-- Disable automatic map-join conversion so the joins below run as common (reduce-side) joins
SET hive.auto.convert.join=false;

5.1. Find the list of users who have at least two loan instalments pending. The sample data uses monthly instalments, so a last payment made 60 or more days ago implies at least two instalments are pending.

SELECT decrypt(user_id)
FROM loan_info
WHERE datediff(from_unixtime(unix_timestamp(), 'yyyy-MM-dd'),
decrypt(last_payment_date)) >= 60;

5.2. Find the list of users who have a healthy credit card but an outstanding loan account. A healthy credit card means no outstanding balance.

SELECT decrypt(li.user_id)
FROM loan_info li INNER JOIN credit_card_info cci
ON decrypt(li.user_id) = decrypt(cci.user_id)
WHERE CAST(decrypt(cci.outstanding_balance) AS double) = 0.0
AND datediff(from_unixtime(unix_timestamp(), 'yyyy-MM-dd'), decrypt(li.last_payment_date)) >= 30;
5.3. For every share and for every date, find the maximum profit one could have made on that share. Bear in mind that the purchase must come before the sale, and if the share price falls throughout the day the maximum possible profit may be negative.

-- DISTRIBUTE BY sends each (share, date) group to one reducer and SORT BY orders the rows
-- by timestamp, so max_profit receives each group's prices in chronological order.
SELECT share_id, share_date, max_profit(collect_list(share_price))
FROM
(
SELECT decrypt(Share_id) AS share_id,
decrypt(Gmt_timestamp) AS gmt_timestamp,
from_unixtime(CAST(decrypt(Gmt_timestamp) AS int), 'yyyy-MM-dd') AS share_date,
CAST(decrypt(Share_price) AS double) AS share_price
FROM shares_info
DISTRIBUTE BY share_id,
from_unixtime(CAST(Gmt_timestamp AS int), 'yyyy-MM-dd')
SORT BY share_id,
CAST(Gmt_timestamp AS int)
) inner_1
GROUP BY share_id, share_date;
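For reference, the maxprofit.MaxProfit UDF presumably implements the classic single-pass "buy once, sell once" scan. A minimal sketch of that logic in Python, assuming the prices arrive sorted by timestamp as the query above arranges:

def max_profit(prices):
    # Best profit from one purchase followed by one later sale.
    # May be negative if prices only fall; None with fewer than two prices.
    best = None
    min_price = prices[0] if prices else None
    for price in prices[1:]:
        profit = price - min_price
        if best is None or profit > best:
            best = profit
        if price < min_price:
            min_price = price
    return best

On the sample data above, the prices in timestamp order are [120, 110, 90, 80, 100], so max_profit returns 20 (buy at 80, sell at 100).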

Output

6. Archival

7. Survey data analysis


We have 3 survey part files, so we will concatenate their contents into a single file using the Linux commands below.

cd /home/acadgild/survey_files
cat *.txt > survey_data
rm *.txt

Now we have the concatenated data in the survey_data file.

Creating a Hive table to load survey_data

CREATE TABLE survey_analysis (
survey_date string,
survey_question string,
rating int,
user_id int,
survey_id string
)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ',';

Loading data into survey_analysis table

LOAD DATA LOCAL INPATH '/home/acadgild/survey_files/survey_data' INTO TABLE bank.survey_analysis;

7.1. How many surveys got an average rating below 3, provided at least 10 distinct users gave a rating? (The query counts all ratings per survey, which equals the number of distinct users as long as each user rates a survey at most once; older Hive versions do not accept COUNT(DISTINCT ...) as a window function.)

SELECT survey_id, AVG(rating) FROM
(
SELECT survey_id, rating, COUNT(user_id) OVER (PARTITION BY survey_id) AS num_users
FROM bank.survey_analysis
) inner_1
WHERE num_users >= 10
GROUP BY survey_id
HAVING AVG(rating) < 3;

Output
7.2. Find the details of the survey that received the minimum rating, provided the survey was rated by at least 20 users.

SELECT survey_id, rank FROM
(
SELECT survey_id, RANK() OVER (ORDER BY avg_rating) AS rank
FROM
(
SELECT survey_id, AVG(rating) AS avg_rating FROM
(
SELECT survey_id, rating, COUNT(user_id) OVER (PARTITION BY survey_id) AS num_users
FROM bank.survey_analysis
) inner_1
WHERE num_users >= 20
GROUP BY survey_id
) inner_2
) inner_3
WHERE rank = 1;

Output

8. Email data analysis

The organisation also has lots of emails stored in small files.

The metadata describing the email structure is kept in an XML file, email_schema.xml. We read the XML to derive the table definition and then load all the email files into a Hive table.

Run the following in the Python shell (Python 2, since it uses the commands module).


import xml.etree.ElementTree as ET
import commands  # Python 2 only; replaced by subprocess in Python 3

# Normalize away tabs and spaces in the schema file before parsing
base_str = open("/home/acadgild/email_schema.xml", "r").read().replace("\t", "").replace(" ", "")
root = ET.fromstring(base_str)

# Collect "name type" pairs from every <column> element
structure_list = []
for each_col in root.findall("column"):
    name = each_col.find("name").text
    col_type = each_col.find("type").text
    structure_list.append(name + " " + col_type)

create_table = ("CREATE TABLE email_analysis (" + ",".join(structure_list) +
                ") ROW FORMAT DELIMITED FIELDS TERMINATED BY ',';")

# Write the DDL to an .hql file and run it through the Hive CLI
hive_file = open("/home/acadgild/hive_query.hql", "w")
hive_file.write("CREATE DATABASE IF NOT EXISTS bank;\n")
hive_file.write("USE bank;\n")
hive_file.write(create_table)
hive_file.close()
status, output = commands.getstatusoutput("hive -f " + hive_file.name)
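From the parsing logic, the script expects email_schema.xml to look roughly like the following (the column names here are inferred from the queries below; the real file may differ):

<columns>
<column><name>id</name><type>string</type></column>
<column><name>opened</name><type>string</type></column>
<column><name>closed</name><type>string</type></column>
<column><name>reporting_date</name><type>string</type></column>
</columns>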

This creates a file named hive_query.hql and a table named email_analysis in the bank database.

Concatenating the small files

cd /home/acadgild/email_files
cat *.txt > email_data
rm *.txt

hive -e "LOAD DATA LOCAL INPATH '/home/acadgild/email_files/email_data' INTO TABLE


bank.email_analysis"
Checking the data

8.1. Which is the longest-running email?

SELECT id FROM
(
SELECT id, RANK() OVER (ORDER BY datediff(closed_date, opened_date) DESC) AS rank
FROM
(
SELECT id,
MIN(IF(opened="YES",reporting_date,NULL)) AS opened_date,
MIN(IF(closed="YES",reporting_date,NULL)) AS closed_date
FROM email_analysis
GROUP BY id
) inner_1
WHERE opened_date IS NOT NULL AND closed_date IS NOT NULL
) inner_2
WHERE rank = 1;

8.2. Find the list of emails that were left unanswered.

SELECT id
FROM
(
SELECT id,
MIN(IF(opened="YES",reporting_date,NULL)) AS opened_date,
MIN(IF(closed="YES",reporting_date,NULL)) AS closed_date
FROM email_analysis
GROUP BY id
) inner_1
WHERE opened_date IS NULL AND closed_date IS NOT NULL;
