Prakhar - Software Engineer Intern

The document outlines the enhancement of Wrangler with new parsers for byte size and time duration units, detailing the tasks and changes made to the ANTLR4 grammar file. It includes a comprehensive list of directives for data transformation, parsing, and output formatting, along with links to documentation and demo videos. The GitHub repository is provided for further reference and collaboration.

Assignment: Enhance Wrangler with Byte Size and Time Duration Units Parsers

AI Prompts

1. "How to implement custom token parsing in ANTLR4?"
2. "Best practices for implementing aggregate directives in CDAP Wrangler"
3. "Java code to convert between different byte size units"
4. "Example of time duration parsing in Java"
5. "How to test ANTLR grammar changes in a Maven project?"
6. "Fix the errors in the terminal"

GitHub Repository: https://github.com/prakharrrrrrsingh/Wrangler-ps

Tasks:

The work centers on Wrangler's ANTLR4 grammar file, which defines the syntax for a directive-based language. The grammar includes:

1. Parser rules for:

   - Recipe structure with statements
   - Directives with various parameter types
   - Control structures (if-else statements)
   - Expressions and code blocks
   - Macros and pragmas
   - Property lists and value definitions

2. Lexer rules for:

   - Basic tokens (braces, operators, punctuation)
   - Data types (Bool, Number, String)
   - Identifiers and columns
   - Comments and whitespace
   - Special values for byte sizes and time durations


The grammar is designed for a domain-specific language used for data transformation directives, with support for conditional logic, property configurations, and various data manipulations. A sketch of the unit-conversion semantics introduced by the new byte-size and time-duration tokens follows below.
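
At their core, the new byte-size and time-duration tokens pair a numeric value with a unit suffix, which must be converted to a canonical unit (bytes, milliseconds) before aggregate directives can work with them. The following is a minimal Python sketch of that conversion logic, purely to illustrate the semantics; the actual Wrangler implementation lives in Java on top of the ANTLR4 lexer, and the unit tables here are assumptions.

```python
import re

# Canonical multipliers: byte sizes to bytes, time durations to milliseconds (assumed unit sets).
BYTE_UNITS = {'B': 1, 'KB': 1024, 'MB': 1024**2, 'GB': 1024**3, 'TB': 1024**4}
TIME_UNITS = {'ms': 1, 's': 1000, 'm': 60_000, 'h': 3_600_000}

def parse_unit_value(text, units):
    """Split a token like '10.5MB' or '2s' into (value, unit) and return the canonical amount."""
    match = re.fullmatch(r'(\d+(?:\.\d+)?)\s*([A-Za-z]+)', text.strip())
    if not match or match.group(2) not in units:
        raise ValueError(f'unrecognized unit value: {text!r}')
    return float(match.group(1)) * units[match.group(2)]

print(parse_unit_value('10KB', BYTE_UNITS))   # 10240.0 (bytes)
print(parse_unit_value('2.5s', TIME_UNITS))   # 2500.0 (milliseconds)
```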

Changes made for the assignment

README.md:
# Data Prep

![cm-available](https://cdap-users.herokuapp.com/assets/cm-available.svg)
![cdap-transform](https://cdap-users.herokuapp.com/assets/cdap-transform.svg)
[![Build Status](https://travis-ci.org/cdapio/hydrator-plugins.svg?branch=develop)](https://travis-ci.org/cdapio/hydrator-plugins)
[![Coverity Scan Build Status](https://scan.coverity.com/projects/11434/badge.svg)](https://scan.coverity.com/projects/hydrator-wrangler-transform)
[![Maven Central](https://maven-badges.herokuapp.com/maven-central/io.cdap.wrangler/wrangler-core/badge.svg)](https://maven-badges.herokuapp.com/maven-central/io.cdap.wrangler/wrangler-core)
[![Javadoc](https://javadoc-emblem.rhcloud.com/doc/io.cdap.wrangler/wrangler-core/badge.svg)](http://www.javadoc.io/doc/io.cdap.wrangler/wrangler-core)
[![License](https://img.shields.io/badge/License-Apache%202.0-blue.svg)](https://opensource.org/licenses/Apache-2.0)
[![Join CDAP community](https://cdap-users.herokuapp.com/badge.svg?t=wrangler)](https://cdap-users.herokuapp.com?t=1)

A collection of libraries, a pipeline plugin, and a CDAP service for performing data cleansing, transformation, and filtering using a set of data manipulation instructions (directives). These instructions are either generated using an interactive visual tool or are manually created.

* Data Prep defines a few concepts that might be useful if you are just getting started with it. Learn about them [here](wrangler-docs/concepts.md)
* The Data Prep Transform is [separately documented](wrangler-transform/wrangler-docs/data-prep-transform.md).
* [Data Prep Cheatsheet](wrangler-docs/cheatsheet.md)

## New Features

More [here](wrangler-docs/upcoming-features.md) on upcoming features.

* **User Defined Directives, also known as UDDs**, allow you to create custom functions to transform records within CDAP DataPrep, a.k.a. Wrangler. CDAP comes with a comprehensive library of functions. There are, however, some omissions, and some specific cases for which UDDs are the solution. Additional information on how you can build your custom directives is [here](wrangler-docs/custom-directive.md).
* Migrating directives from version 1.0 to version 2.0 [here](wrangler-docs/directive-migration.md)
* Information about the grammar [here](wrangler-docs/grammar/grammar-info.md)
* Various `TokenType`s supported by the system [here](../api/src/main/java/io/cdap/wrangler/api/parser/TokenType.java)
* Custom Directive Implementation Internals [here](wrangler-docs/udd-internal.md)
* A new capability that allows CDAP Administrators to **restrict the directives** that are accessible to their users. More information on configuring this can be found [here](wrangler-docs/exclusion-and-aliasing.md)

## Demo Videos and Recipes

Videos and screencasts are the best way to learn, so we have compiled simple, short screencasts that show some of the features of Data Prep. Additional videos can be found [here](https://www.youtube.com/playlist?list=PLhmsfNvXKJn-neqefOrcl4n7zU4TWmIr)

### Videos

* [SCREENCAST] [Creating Lookup Dataset and Joining](https://www.youtube.com/watch?v=Nc1b0rsELHQ)
* [SCREENCAST] [Restricted Directives](https://www.youtube.com/watch?v=71EcMQU714U)
* [SCREENCAST] [Parse Excel files in CDAP](https://www.youtube.com/watch?v=su5L1noGlEk)
* [SCREENCAST] [Parse File As AVRO File](https://www.youtube.com/watch?v=tmwAw4dKUNc)
* [SCREENCAST] [Parsing Binary Coded AVRO Messages](https://www.youtube.com/watch?v=Ix_lPo-PDJY)
* [SCREENCAST] [Parsing Binary Coded AVRO Messages & Protobuf messages using schema registry](https://www.youtube.com/watch?v=LVLIdWnUX1k)
* [SCREENCAST] [Quantize a column - Digitize](https://www.youtube.com/watch?v=VczkYX5SRtY)
* [SCREENCAST] [Data Cleansing capability with send-to-error directive](https://www.youtube.com/watch?v=aZd5H8hIjDc)
* [SCREENCAST] [Building Data Prep from the GitHub source](https://youtu.be/pGGjKU04Y38)
* [VOICE-OVER] [End-to-End Demo Video](https://youtu.be/AnhF0qRmn24)
* [SCREENCAST] [Ingesting into Kudu](https://www.youtube.com/watch?v=KBW7a38vlUM)
* [SCREENCAST] [Realtime HL7 CCDA XML from Kafka into Time Partitioned Parquet](https://youtu.be/0fqNmnOnD-0)
* [SCREENCAST] [Parsing JSON file](https://youtu.be/vwnctcGDflE)
* [SCREENCAST] [Flattening arrays](https://youtu.be/SemHxgBYIsY)
* [SCREENCAST] [Data cleansing with send-to-error directive](https://www.youtube.com/watch?v=aZd5H8hIjDc)
* [SCREENCAST] [Publishing to Kafka](https://www.youtube.com/watch?v=xdc8pvvlI48)
* [SCREENCAST] [Fixed length to JSON](https://www.youtube.com/watch?v=3AXu4m1swuM)

### Recipes

* [Parsing Apache Log Files](wrangler-demos/parsing-apache-log-files.md)
* [Parsing CSV Files and Extracting Column Values](wrangler-demos/parsing-csv-extracting-column-values.md)
* [Parsing HL7 CCDA XML Files](wrangler-demos/parsing-hl7-ccda-xml-files.md)

## Available Directives

These directives are currently available:

| Directive | Description |
| --------- | ----------- |
| **Parsers** | |
| [JSON Path](wrangler-docs/directives/json-path.md) | Uses a DSL (a JSON path expression) for parsing JSON records |
| [Parse as AVRO](wrangler-docs/directives/parse-as-avro.md) | Parsing an AVRO encoded message - either as binary or json |
| [Parse as AVRO File](wrangler-docs/directives/parse-as-avro-file.md) | Parsing an AVRO data file |
| [Parse as CSV](wrangler-docs/directives/parse-as-csv.md) | Parsing an input record as comma-separated values |
| [Parse as Date](wrangler-docs/directives/parse-as-date.md) | Parsing dates using natural language processing |
| [Parse as Excel](wrangler-docs/directives/parse-as-excel.md) | Parsing Excel files |
| [Parse as Fixed Length](wrangler-docs/directives/parse-as-fixed-length.md) | Parses as a fixed-length record with specified widths |
| [Parse as HL7](wrangler-docs/directives/parse-as-hl7.md) | Parsing Health Level 7 Version 2 (HL7 V2) messages |
| [Parse as JSON](wrangler-docs/directives/parse-as-json.md) | Parsing a JSON object |
| [Parse as Log](wrangler-docs/directives/parse-as-log.md) | Parses access log files from Apache HTTPD and nginx servers |
| [Parse as Protobuf](wrangler-docs/directives/parse-as-log.md) | Parses a Protobuf encoded in-memory message using a descriptor |
| [Parse as Simple Date](wrangler-docs/directives/parse-as-simple-date.md) | Parses date strings |
| [Parse XML To JSON](wrangler-docs/directives/parse-xml-to-json.md) | Parses an XML document into a JSON structure |
| [Parse as Currency](wrangler-docs/directives/parse-as-currency.md) | Parses a string representation of currency into a number |
| [Parse as Datetime](wrangler-docs/directives/parse-as-datetime.md) | Parses strings with datetime values to CDAP datetime type |
| **Output Formatters** | |
| [Write as CSV](wrangler-docs/directives/write-as-csv.md) | Converts a record into CSV format |
| [Write as JSON](wrangler-docs/directives/write-as-json-map.md) | Converts the record into a JSON map |
| [Write JSON Object](wrangler-docs/directives/write-as-json-object.md) | Composes a JSON object based on the fields specified |
| [Format as Currency](wrangler-docs/directives/format-as-currency.md) | Formats a number as currency as specified by locale |
| **Transformations** | |
| [Changing Case](wrangler-docs/directives/changing-case.md) | Changes the case of column values |
| [Cut Character](wrangler-docs/directives/cut-character.md) | Selects parts of a string value |
| [Set Column](wrangler-docs/directives/set-column.md) | Sets the column value to the result of an expression execution |
| [Find and Replace](wrangler-docs/directives/find-and-replace.md) | Transforms string column values using a "sed"-like expression |
| [Index Split](wrangler-docs/directives/index-split.md) | (_Deprecated_) |
| [Invoke HTTP](wrangler-docs/directives/invoke-http.md) | Invokes an HTTP service (_Experimental_, potentially slow) |
| [Quantization](wrangler-docs/directives/quantize.md) | Quantizes a column based on specified ranges |
| [Regex Group Extractor](wrangler-docs/directives/extract-regex-groups.md) | Extracts the data from a regex group into its own column |
| [Setting Character Set](wrangler-docs/directives/set-charset.md) | Sets the encoding and then converts the data to a UTF-8 string |
| [Setting Record Delimiter](wrangler-docs/directives/set-record-delim.md) | Sets the record delimiter |
| [Split by Separator](wrangler-docs/directives/split-by-separator.md) | Splits a column based on a separator into two columns |
| [Split Email Address](wrangler-docs/directives/split-email.md) | Splits an email ID into an account and its domain |
| [Split URL](wrangler-docs/directives/split-url.md) | Splits a URL into its constituents |
| [Text Distance (Fuzzy String Match)](wrangler-docs/directives/text-distance.md) | Measures the difference between two sequences of characters |
| [Text Metric (Fuzzy String Match)](wrangler-docs/directives/text-metric.md) | Measures the difference between two sequences of characters |
| [URL Decode](wrangler-docs/directives/url-decode.md) | Decodes from the `application/x-www-form-urlencoded` MIME format |
| [URL Encode](wrangler-docs/directives/url-encode.md) | Encodes to the `application/x-www-form-urlencoded` MIME format |
| [Trim](wrangler-docs/directives/trim.md) | Functions for trimming whitespace around string data |
| **Encoders and Decoders** | |
| [Decode](wrangler-docs/directives/decode.md) | Decodes a column value as one of `base32`, `base64`, or `hex` |
| [Encode](wrangler-docs/directives/encode.md) | Encodes a column value as one of `base32`, `base64`, or `hex` |
| **Unique ID** | |
| [UUID Generation](wrangler-docs/directives/generate-uuid.md) | Generates a universally unique identifier (UUID). Recommended for use with Wrangler version 4.4.0 and above due to an important bug fix [CDAP-17732](https://cdap.atlassian.net/browse/CDAP-17732) |
| **Date Transformations** | |
| [Diff Date](wrangler-docs/directives/diff-date.md) | Calculates the difference between two dates |
| [Format Date](wrangler-docs/directives/format-date.md) | Custom patterns for date-time formatting |
| [Format Unix Timestamp](wrangler-docs/directives/format-unix-timestamp.md) | Formats a UNIX timestamp as a date |
| **DateTime Transformations** | |
| [Current DateTime](wrangler-docs/directives/current-datetime.md) | Generates the current datetime using the given zone or UTC by default |
| [Datetime To Timestamp](wrangler-docs/directives/datetime-to-timestamp.md) | Converts a datetime value to timestamp with the given zone |
| [Format Datetime](wrangler-docs/directives/format-datetime.md) | Formats a datetime value to custom date-time pattern strings |
| [Timestamp To Datetime](wrangler-docs/directives/timestamp-to-datetime.md) | Converts a timestamp value to datetime |
| **Lookups** | |
| [Catalog Lookup](wrangler-docs/directives/catalog-lookup.md) | Static catalog lookup of ICD-9, ICD-10-2016, ICD-10-2017 codes |
| [Table Lookup](wrangler-docs/directives/table-lookup.md) | Performs lookups into Table datasets |
| **Hashing & Masking** | |
| [Message Digest or Hash](wrangler-docs/directives/hash.md) | Generates a message digest |
| [Mask Number](wrangler-docs/directives/mask-number.md) | Applies substitution masking on the column values |
| [Mask Shuffle](wrangler-docs/directives/mask-shuffle.md) | Applies shuffle masking on the column values |
| **Row Operations** | |
| [Filter Row if Matched](wrangler-docs/directives/filter-row-if-matched.md) | Filters rows that match a pattern for a column |
| [Filter Row if True](wrangler-docs/directives/filter-row-if-true.md) | Filters rows if the condition is true |
| [Filter Row Empty or Null](wrangler-docs/directives/filter-empty-or-null.md) | Filters rows that are empty or null |
| [Flatten](wrangler-docs/directives/flatten.md) | Separates the elements in a repeated field |
| [Fail on condition](wrangler-docs/directives/fail.md) | Fails processing when the condition evaluates to true |
| [Send to Error](wrangler-docs/directives/send-to-error.md) | Filtering of records to an error collector |
| [Send to Error And Continue](wrangler-docs/directives/send-to-error-and-continue.md) | Filtering of records to an error collector and continues processing |
| [Split to Rows](wrangler-docs/directives/split-to-rows.md) | Splits based on a separator into multiple records |
| **Column Operations** | |
| [Change Column Case](wrangler-docs/directives/change-column-case.md) | Changes column names to either lowercase or uppercase |
| [Changing Case](wrangler-docs/directives/changing-case.md) | Change the case of column values |
| [Cleanse Column Names](wrangler-docs/directives/cleanse-column-names.md) | Sanitizes column names, following specific rules |
| [Columns Replace](wrangler-docs/directives/columns-replace.md) | Alters column names in bulk |
| [Copy](wrangler-docs/directives/copy.md) | Copies values from a source column into a destination column |
| [Drop Column](wrangler-docs/directives/drop.md) | Drops a column in a record |
| [Fill Null or Empty Columns](wrangler-docs/directives/fill-null-or-empty.md) | Fills column value with a fixed value if null or empty |
| [Keep Columns](wrangler-docs/directives/keep.md) | Keeps specified columns from the record |
| [Merge Columns](wrangler-docs/directives/merge.md) | Merges two columns by inserting a third column |
| [Rename Column](wrangler-docs/directives/rename.md) | Renames an existing column in the record |
| [Set Column Header](wrangler-docs/directives/set-headers.md) | Sets the names of columns, in the order they are specified |
| [Split to Columns](wrangler-docs/directives/split-to-columns.md) | Splits a column based on a separator into multiple columns |
| [Swap Columns](wrangler-docs/directives/swap.md) | Swaps column names of two columns |
| [Set Column Data Type](wrangler-docs/directives/set-type.md) | Convert data type of a column |
Integration Assignment: Bidirectional ClickHouse & Flat File Data Ingestion Tool

GitHub Repository: https://github.com/prakharrrrrrsingh/BidirectionalClickHouse

AI Prompts:

Prompts Used for Development

* Initial Prompt: "Integration Assignment: Bidirectional ClickHouse & Flat File Data Ingestion Tool - Create a web-based application with a simple user interface that facilitates data ingestion between a ClickHouse database and Flat Files. Support bidirectional data flow, JWT token authentication, column selection, and record count reporting."
* Detailed Project Structure Planning: "Create a project structure for a Flask web application that will handle bidirectional data transfer between ClickHouse and flat files, with user authentication via JWT tokens."
* ClickHouse Client Implementation: "Implement a Python class for interacting with ClickHouse database that supports JWT token authentication, retrieving table schema, and efficient data ingestion."
* Flat File Handling Implementation: "Create a Python class for handling flat file operations including reading CSV files with custom delimiters, column selection, and saving data to files."
* Frontend UI Design: "Design a responsive HTML/CSS/JS user interface for a data ingestion tool that allows users to select source/target, configure connections, select columns, and view progress/results."
* API Endpoint Implementation: "Implement Flask API endpoints for connecting to ClickHouse, uploading flat files, previewing data, and handling bidirectional data ingestion processes." (see the sketch after this list)
* JavaScript Client-Side Logic: "Write JavaScript code to handle form submission, API calls, UI state management, and data visualization for a data ingestion web application."
* Testing and Documentation: "Create comprehensive README documentation explaining how to install, configure and use a ClickHouse and Flat File data ingestion tool, including examples and testing instructions."
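
As an illustration of where the API-endpoint prompt leads, here is a minimal Flask route sketch for the "Connect" step. The route path, payload fields, and JWT-as-password handling are assumptions for illustration, not the repository's actual code.

```python
# Hypothetical /api/connect endpoint (illustrative only).
from flask import Flask, jsonify, request
from clickhouse_driver import Client

app = Flask(__name__)

@app.route('/api/connect', methods=['POST'])
def connect():
    cfg = request.get_json()
    # JWT token passed in place of a password (assumption; depends on server auth setup).
    client = Client(host=cfg['host'], port=cfg['port'], database=cfg['database'],
                    user=cfg['user'], password=cfg['jwt_token'])
    tables = [t[0] for t in client.execute('SHOW TABLES')]
    return jsonify({'tables': tables})
```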

README.md:

# ClickHouse & Flat File Data Ingestion Tool

A web-based application for bidirectional data ingestion between ClickHouse databases and flat files.

## Features

- Bidirectional data flow:
  - ClickHouse to Flat File
  - Flat File to ClickHouse
- JWT token-based authentication for ClickHouse
- Column selection for ingestion
- Data preview
- Record count reporting
- Progress tracking

## Requirements

- Python 3.7+
- Flask
- clickhouse-driver
- pandas
- Other dependencies listed in requirements.txt
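
For reference, a minimal requirements.txt consistent with the list above might look like this; the version bounds are assumptions, not the repository's actual pins:

```
Flask>=2.0
clickhouse-driver>=0.2
pandas>=1.3
```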

## Installation

1. Clone the repository:

```
git clone https://github.com/yourusername/clickhouse-flat-file-tool.git
cd clickhouse-flat-file-tool
```

2. Create a virtual environment (optional but recommended):

```
python -m venv venv
source venv/bin/activate  # On Windows: venv\Scripts\activate
```

3. Install dependencies:
```
pip install -r requirements.txt
```

## Running the Application

1. Start the Flask application:

```
python app.py
```

2. Open your browser and navigate to:

```
http://localhost:5000
```

## Usage

### ClickHouse to Flat File

1. Select "ClickHouse" as the source and "Flat File" as the target


2. Enter ClickHouse connection details (Host, Port, Database, User, JWT Token)
3. Click "Connect" to establish connection
4. Select a table from the dropdown
5. Click "Load Columns" to view available columns
6. Select columns to ingest
7. (Optional) Click "Preview Data" to see a sample
8. Enter output file name and delimiter for the flat file
9. Click "Start Ingestion" to begin the data transfer
10. View the results including record count
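
Under the hood, this direction reduces to a query plus a delimited write. Below is a minimal sketch of that flow using `clickhouse-driver` and the standard `csv` module; the function name and the JWT-as-password convention are assumptions for illustration, not the application's actual code.

```python
# Sketch of the ClickHouse -> flat file path (illustrative only).
import csv
from clickhouse_driver import Client

def export_table(host, port, database, user, jwt_token, table, columns, out_path, delimiter=','):
    # Passing the JWT token where a password is expected is one common pattern;
    # whether the server accepts it depends on its authentication setup (assumption).
    client = Client(host=host, port=port, database=database, user=user, password=jwt_token)
    rows = client.execute(f"SELECT {', '.join(columns)} FROM {table}")
    with open(out_path, 'w', newline='') as f:
        writer = csv.writer(f, delimiter=delimiter)
        writer.writerow(columns)   # header row
        writer.writerows(rows)     # data rows
    return len(rows)               # record count for reporting
```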

### Flat File to ClickHouse

1. Select "Flat File" as the source and "ClickHouse" as the target


2. Upload a flat file and specify its delimiter
3. Select columns to ingest
4. (Optional) Click "Preview Data" to see a sample
5. Enter ClickHouse connection details if not already connected
6. Enter target table name in ClickHouse
7. Click "Start Ingestion" to begin the data transfer
8. View the results including record count
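
The reverse direction reads the file with pandas and batch-inserts the selected columns. Another rough sketch under the same assumptions; the all-`String` DDL keeps the example simple, where a real implementation would infer column types:

```python
# Sketch of the flat file -> ClickHouse path (illustrative only).
import pandas as pd
from clickhouse_driver import Client

def ingest_file(client: Client, file_path, delimiter, columns, target_table):
    df = pd.read_csv(file_path, sep=delimiter, usecols=columns)
    # Create the target table if it doesn't exist (matches the note in this README).
    cols_ddl = ', '.join(f'`{c}` String' for c in columns)
    client.execute(
        f"CREATE TABLE IF NOT EXISTS {target_table} ({cols_ddl}) "
        "ENGINE = MergeTree() ORDER BY tuple()"
    )
    data = df.astype(str).values.tolist()  # stringify to match the all-String DDL
    client.execute(f"INSERT INTO {target_table} ({', '.join(columns)}) VALUES", data)
    return len(data)                        # record count for reporting
```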

## Testing

The application can be tested with example ClickHouse datasets:

- uk_price_paid
- ontime

For more information on these datasets, visit:
https://clickhouse.com/docs/getting-started/example-datasets
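
A quick way to sanity-check connectivity against one of these datasets, assuming it has been loaded into a local ClickHouse instance (host and port here are assumptions):

```python
# Connectivity smoke test against the uk_price_paid example dataset (assumed to be loaded).
from clickhouse_driver import Client

client = Client(host='localhost', port=9000)   # default native-protocol port
count = client.execute('SELECT count() FROM uk_price_paid')[0][0]
print(f'uk_price_paid rows: {count}')
```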

## Project Structure

```
clickhouse-flat-file-tool/
├── app/                          # Application package
│   ├── __init__.py               # Flask app initialization
│   ├── main.py                   # Main routes & API endpoints
│   ├── models/                   # Data models
│   │   ├── clickhouse_client.py  # ClickHouse client
│   │   └── flat_file.py          # Flat file handling
│   ├── static/                   # Static assets
│   │   └── main.js               # Frontend JavaScript
│   ├── templates/                # HTML templates
│   │   └── index.html            # Main UI
│   └── uploads/                  # Directory for uploaded files
├── app.py                        # Application entry point
├── requirements.txt              # Python dependencies
└── README.md                     # This file
```

## Notes

- The application creates a directory `app/uploads` to store uploaded and generated files
- For ClickHouse JWT authentication, provide a valid JWT token
- When ingesting to ClickHouse, tables will be created if they don't exist
