A Python implementation of JOIN, ORDERBY, COUNTBY, TAKE, SELECT and FROM that runs against .csv files.
Specifically:
- JOIN: offers inner join using Hash Join and Sort-Merge Join (see the sketch after this list and the speed comparison at the end).
- COUNTBY: equivalent to COUNT(*) ... GROUP BY <col>; appends a column named count.
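For reference, a minimal sketch of an inner hash join over rows represented as dictionaries. This is illustrative only; the function and variable names here are not the project's actual API.

```python
def hash_join(left_rows, right_rows, key):
    """Inner-join two lists of dict rows on a shared key column."""
    # Build phase: index one input by the join key.
    index = {}
    for row in right_rows:
        index.setdefault(row[key], []).append(row)
    # Probe phase: stream the other input and emit matching pairs.
    joined = []
    for row in left_rows:
        for match in index.get(row[key], []):
            merged = dict(row)
            merged.update({k: v for k, v in match.items() if k != key})
            joined.append(merged)
    return joined


left = [{"id": "1", "name": "Bulbasaur"}, {"id": "2", "name": "Ivysaur"}]
right = [{"id": "1", "hp": "45"}, {"id": "2", "hp": "60"}]
print(hash_join(left, right, "id"))
```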
The client and query runner server communicate through sockets. The message consists of:
- header (4 bytes, big endian): integer representing the size of the body
- body: byte string (query or query output)
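A minimal sketch of this framing using Python's socket and struct modules (the helper names are illustrative, not the project's actual functions):

```python
import socket
import struct


def recv_exactly(sock: socket.socket, n: int) -> bytes:
    # sock.recv may return fewer bytes than requested, so loop until done.
    data = b""
    while len(data) < n:
        chunk = sock.recv(n - len(data))
        if not chunk:
            raise ConnectionError("socket closed before full message arrived")
        data += chunk
    return data


def send_message(sock: socket.socket, body: bytes) -> None:
    # 4-byte big-endian length header followed by the body.
    sock.sendall(struct.pack(">I", len(body)) + body)


def recv_message(sock: socket.socket) -> bytes:
    (size,) = struct.unpack(">I", recv_exactly(sock, 4))
    return recv_exactly(sock, size)
```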
In one terminal, start the server:
python3 server.py -port <port_number>
In a new terminal, start the client:
python3 query_runner_client.py -host 127.0.0.1 -port <port_number>
The client starts an interactive shell where you can type in your query. The query should be a single line. Hit enter to print the retrieved records.
A query should start with FROM followed by the path to a .csv file. The query keywords can be mixed and matched.
FROM <path_to_csv_1> JOIN <path_to_csv_2> <join_column>
where join_column is a common column in both files. Multiple joins are supported. The order of columns is maintained based on the order of the files.
FROM <path_to_csv> SELECT <col_1>,<col_2>,...<col_n>
same as SQL SELECT.
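For example, using columns from the Pokemon data shown further down:
FROM ../data/pokemon.csv SELECT name,type1,hp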
FROM <path_to_csv> COUNTBY <column>
groups data by the specified column and counts rows within each group.
FROM <path_to_csv> ORDERBY <numerical_column>
sorts rows by the specified column.
FROM <path_to_csv> TAKE <num_rows>
returns the top n rows.
Data in ./data is from Pokemon - Weedle's Cave.
FROM ../data/pokemon.csv JOIN ../data/stats.csv id JOIN ../data/legendary.csv id TAKE 3
Output:
id,name,type1,type2,hp,attack,defense,special_attack,special_defence,speed,generation,is_legendary
1,Bulbasaur,Grass,Poison,45,49,49,65,65,45,1,False
2,Ivysaur,Grass,Poison,60,62,63,80,80,60,1,False
3,Venusaur,Grass,Poison,80,82,83,100,100,80,1,False
FROM ../data/pokemon.csv COUNTBY type1 ORDERBY count TAKE 5
Output:
type1,count
Water,112
Normal,98
Grass,70
Bug,69
Psychic,57
Data - two randomly generated datasets:
- dummy_1: 30,000 rows, 4 columns
- dummy_2: 20,000 rows, 3 columns
- the index column is randomly ordered
To generate the datasets:
```python
import numpy as np
import pandas as pd


def write_random_df(n_rows, col_names, write_pth):
    """
    Generate dataframe of random numbers
    with specified dimension. Shuffle rows then
    write to write_pth.
    """
    df = pd.DataFrame(
        data=np.random.rand(n_rows, len(col_names)),
        columns=col_names
    )
    df.reset_index(inplace=True)
    df = df.sample(n_rows)
    df.to_csv(write_pth, index=False)


write_random_df(
    n_rows=30000,
    col_names=["a", "b", "c"],
    write_pth="./data/dummy_1.csv"
)
write_random_df(
    n_rows=20000,
    col_names=["d", "e"],
    write_pth="./data/dummy_2.csv"
)
```
The query below is run 50 times for each algorithm using src/query_runner.py:
FROM ../data/dummy_1.csv JOIN ../data/dummy_2.csv index
On average, hash join took 167.6 ms to complete the query, with a standard deviation of 57.6 ms. Sort-merge join took 215.67 ms, with a standard deviation of 70.1 ms. A Mann-Whitney U test shows that this difference in speed is statistically significant.
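For reference, a sketch of how such a comparison could be checked with SciPy, assuming the 50 per-run timings for each algorithm are collected into two lists (the numbers below are placeholders, not the actual measurements):

```python
from scipy.stats import mannwhitneyu

# Placeholder timings in milliseconds; substitute the 50 measured runs per algorithm.
hash_join_ms = [150.2, 170.8, 162.5, 180.1, 155.9]
sort_merge_join_ms = [210.4, 225.7, 198.3, 240.0, 215.2]

# Two-sided Mann-Whitney U test on the two timing samples.
stat, p_value = mannwhitneyu(hash_join_ms, sort_merge_join_ms, alternative="two-sided")
print(f"U statistic: {stat:.1f}, p-value: {p_value:.4f}")
```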
