Prakhar-Software Engineer Intern
AI Prompts
1. "How to implement custom token parsing in ANTLR4?"
Tasks:-
This is an ANTLR4 grammar file that defines the syntax for a directive-based language. The
grammar includes:
Readme.md-
# Data Prep


[](https://travis-ci.org/cdapio/hydrator-plugins)
[](https://scan.coverity.com/projects/hydrator-wrangler-transform)
[](https://mavenbadges.herokuapp.com/maven-central/io.cdap.wrangler/wrangler-core)
[](http://www.javadoc.io/doc/io.cdap.wrangler/wrangler-core)
[](https://opensource.org/licenses/Apache-2.0)
[](https://cdapusers.herokuapp.com?t=1)
* Data Prep defines a few concepts that are useful to know when you are just getting started. Learn about them [here](wrangler-docs/concepts.md).
* The Data Prep Transform is [separately documented](wranglertransform/wrangler-docs/data-prep-transform.md).
* [Data Prep Cheatsheet](wrangler-docs/cheatsheet.md)
## New Features
More [here](wrangler-docs/upcoming-features.md) on upcoming features.
* **User Defined Directives (UDDs)** allow you to create custom functions to transform records within CDAP Data Prep (a.k.a. Wrangler). CDAP comes with a comprehensive library of functions. There are, however, some omissions and some specific cases for which UDDs are the solution. Additional information on how to build your custom directives is available [here](wrangler-docs/custom-directive.md).
* Migrating directives from version 1.0 to version 2.0: [here](wrangler-docs/directive-migration.md)
* Information about Grammar [here](wrangler-docs/grammar/grammar-info.md)
* Various `TokenType` values supported by the system [here](../api/src/main/java/io/cdap/wrangler/api/parser/TokenType.java)
* Custom Directive Implementation Internals [here](wrangler-docs/uddinternal.md)
Videos and screencasts are the best way to learn, so we have compiled simple, short screencasts that show some of the features of Data Prep. Additional videos can be found [here](https://www.youtube.com/playlist?list=PLhmsfNvXKJn-neqefOrcl4n7zU4TWmIr).
### Videos
### Recipes
## Available Directives
| Directive | Description |
| --- | --- |
| **Parsers** | |
| [JSON Path](wrangler-docs/directives/json-path.md) | Uses a DSL (a JSON path expression) for parsing JSON records |
| [Parse as AVRO](wrangler-docs/directives/parse-asavro.md) | Parsing an AVRO encoded message, either as binary or JSON |
| [Parse as AVRO File](wrangler-docs/directives/parse-as-avrofile.md) | Parsing an AVRO data file |
| [Parse as CSV](wrangler-docs/directives/parse-as-csv.md) | Parsing an input record as comma-separated values |
| [Parse as Date](wrangler-docs/directives/parse-asdate.md) | Parsing dates using natural language processing |
| [Parse as Excel](wrangler-docs/directives/parse-asexcel.md) | Parsing an Excel file |
| [Parse as Fixed Length](wrangler-docs/directives/parse-as-fixedlength.md) | Parses as a fixed length record with specified widths |
| [Parse as HL7](wrangler-docs/directives/parse-ashl7.md) | Parsing Health Level 7 Version 2 (HL7 V2) messages |
| [Parse as JSON](wrangler-docs/directives/parse-asjson.md) | Parsing a JSON object |
| [Parse as Log](wrangler-docs/directives/parse-aslog.md) | Parses access log files such as those from Apache HTTPD and nginx servers |
| [Parse as Protobuf](wrangler-docs/directives/parse-aslog.md) | Parses a Protobuf encoded in-memory message using a descriptor |
| [Parse as Simple Date](wrangler-docs/directives/parse-as-simpledate.md) | Parses date strings |
| [Parse XML To JSON](wrangler-docs/directives/parse-xml-tojson.md) | Parses an XML document into a JSON structure |
| [Parse as Currency](wrangler-docs/directives/parse-ascurrency.md) | Parses a string representation of currency into a number |
| [Parse as Datetime](wrangler-docs/directives/parse-asdatetime.md) | Parses strings with datetime values to CDAP datetime type |
| **Output Formatters** | |
| [Write as CSV](wrangler-docs/directives/write-ascsv.md) | Converts a record into CSV format |
| [Write as JSON](wrangler-docs/directives/write-as-jsonmap.md) | Converts the record into a JSON map |
| [Write JSON Object](wrangler-docs/directives/write-as-jsonobject.md) | Composes a JSON object based on the fields specified |
| [Format as Currency](wrangler-docs/directives/format-ascurrency.md) | Formats a number as currency as specified by locale |
| **Transformations** | |
| [Changing Case](wrangler-docs/directives/changingcase.md) | Changes the case of column values |
| [Cut Character](wrangler-docs/directives/cutcharacter.md) | Selects parts of a string value |
| [Set Column](wrangler-docs/directives/setcolumn.md) | Sets the column value to the result of an expression execution |
| [Find and Replace](wrangler-docs/directives/find-andreplace.md) | Transforms string column values using a "sed"-like expression |
| [Index Split](wrangler-docs/directives/indexsplit.md) | (_Deprecated_) |
| [Invoke HTTP](wrangler-docs/directives/invokehttp.md) | Invokes an HTTP Service (_Experimental_, potentially slow) |
| [Quantization](wrangler-docs/directives/quantize.md) | Quantizes a column based on specified ranges |
| [Regex Group Extractor](wrangler-docs/directives/extract-regexgroups.md) | Extracts the data from a regex group into its own column |
| [Setting Character Set](wrangler-docs/directives/set-charset.md) | Sets the encoding and then converts the data to a UTF-8 String |
| [Setting Record Delimiter](wrangler-docs/directives/set-recorddelim.md) | Sets the record delimiter |
| [Split by Separator](wrangler-docs/directives/split-byseparator.md) | Splits a column based on a separator into two columns |
| [Split Email Address](wrangler-docs/directives/split-email.md) | Splits an email ID into an account and its domain |
| [Split URL](wrangler-docs/directives/spliturl.md) | Splits a URL into its constituents |
| [Text Distance (Fuzzy String Match)](wrangler-docs/directives/textdistance.md) | Measures the difference between two sequences of characters |
| [Text Metric (Fuzzy String Match)](wrangler-docs/directives/textmetric.md) | Measures the difference between two sequences of characters |
| [URL Decode](wrangler-docs/directives/urldecode.md) | Decodes from the `application/x-www-form-urlencoded` MIME format |
| [URL Encode](wrangler-docs/directives/urlencode.md) | Encodes to the `application/x-www-form-urlencoded` MIME format |
| [Trim](wrangler-docs/directives/trim.md) | Functions for trimming white spaces around string data |
| **Encoders and Decoders** | |
| [Decode](wrangler-docs/directives/decode.md) | Decodes a column value as one of `base32`, `base64`, or `hex` |
| [Encode](wrangler-docs/directives/encode.md) | Encodes a column value as one of `base32`, `base64`, or `hex` |
| **Unique ID** | |
| [UUID Generation](wrangler-docs/directives/generateuuid.md) | Generates a universally unique identifier (UUID). Recommended for use with Wrangler version 4.4.0 and above due to an important bug fix [CDAP-17732](https://cdap.atlassian.net/browse/CDAP-17732) |
| **Date Transformations** | |
| [Diff Date](wrangler-docs/directives/diffdate.md) | Calculates the difference between two dates |
| [Format Date](wrangler-docs/directives/formatdate.md) | Custom patterns for date-time formatting |
| [Format Unix Timestamp](wrangler-docs/directives/format-unixtimestamp.md) | Formats a UNIX timestamp as a date |
| **DateTime Transformations** | |
| [Current DateTime](wrangler-docs/directives/currentdatetime.md) | Generates the current datetime using the given zone or UTC by default |
| [Datetime To Timestamp](wrangler-docs/directives/datetime-totimestamp.md) | Converts a datetime value to timestamp with the given zone |
| [Format Datetime](wrangler-docs/directives/formatdatetime.md) | Formats a datetime value to custom date time pattern strings |
| [Timestamp To Datetime](wrangler-docs/directives/timestamp-todatetime.md) | Converts a timestamp value to datetime |
| **Lookups** | |
| [Catalog Lookup](wrangler-docs/directives/catalog-lookup.md) | Static catalog lookup of ICD-9, ICD-10-2016, ICD-10-2017 codes |
| [Table Lookup](wrangler-docs/directives/tablelookup.md) | Performs lookups into Table datasets |
| **Hashing & Masking** | |
| [Message Digest or Hash](wrangler-docs/directives/hash.md) | Generates a message digest |
| [Mask Number](wrangler-docs/directives/masknumber.md) | Applies substitution masking on the column values |
| [Mask Shuffle](wrangler-docs/directives/maskshuffle.md) | Applies shuffle masking on the column values |
| **Row Operations** | |
| [Filter Row if Matched](wrangler-docs/directives/filter-row-ifmatched.md) | Filters rows that match a pattern for a column |
| [Filter Row if True](wrangler-docs/directives/filter-row-iftrue.md) | Filters rows if the condition is true |
| [Filter Row Empty or Null](wrangler-docs/directives/filter-empty-ornull.md) | Filters rows that are empty or null |
| [Flatten](wrangler-docs/directives/flatten.md) | Separates the elements in a repeated field |
| [Fail on condition](wrangler-docs/directives/fail.md) | Fails processing when the condition evaluates to true |
| [Send to Error](wrangler-docs/directives/send-toerror.md) | Filtering of records to an error collector |
| [Send to Error And Continue](wrangler-docs/directives/send-to-error-andcontinue.md) | Filtering of records to an error collector and continues processing |
| [Split to Rows](wrangler-docs/directives/split-torows.md) | Splits based on a separator into multiple records |
| **Column Operations** | |
| [Change Column Case](wrangler-docs/directives/change-columncase.md) | Changes column names to either lowercase or uppercase |
| [Changing Case](wrangler-docs/directives/changingcase.md) | Changes the case of column values |
| [Cleanse Column Names](wrangler-docs/directives/cleanse-columnnames.md) | Sanitizes column names, following specific rules |
| [Columns Replace](wrangler-docs/directives/columnsreplace.md) | Alters column names in bulk |
| [Copy](wrangler-docs/directives/copy.md) | Copies values from a source column into a destination column |
| [Drop Column](wrangler-docs/directives/drop.md) | Drops a column in a record |
| [Fill Null or Empty Columns](wrangler-docs/directives/fill-null-orempty.md) | Fills column value with a fixed value if null or empty |
| [Keep Columns](wrangler-docs/directives/keep.md) | Keeps specified columns from the record |
| [Merge Columns](wrangler-docs/directives/merge.md) | Merges two columns by inserting a third column |
| [Rename Column](wrangler-docs/directives/rename.md) | Renames an existing column in the record |
| [Set Column Header](wrangler-docs/directives/setheaders.md) | Sets the names of columns, in the order they are specified |
| [Split to Columns](wrangler-docs/directives/split-to-columns.md) | Splits a column based on a separator into multiple columns |
| [Swap Columns](wrangler-docs/directives/swap.md) | Swaps column names of two columns |
| [Set Column Data Type](wrangler-docs/directives/settype.md) | Converts the data type of a column |

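
To make a few of the descriptions above concrete, here is a rough Python analogue of Split Email Address, URL Decode, and URL Encode. It only illustrates the described behaviour and is not Wrangler's implementation; the function names and the `email_account` / `email_domain` output column names are assumptions made for this sketch.

```python
from urllib.parse import quote_plus, unquote_plus


def split_email(address):
    """Split an email ID into (account, domain).

    Returns (None, None) when the value contains no '@'.
    """
    account, sep, domain = address.rpartition("@")
    if not sep:
        return None, None
    return account, domain


def url_decode(value):
    """Decode a string from the application/x-www-form-urlencoded format."""
    return unquote_plus(value)


def url_encode(value):
    """Encode a string to the application/x-www-form-urlencoded format."""
    return quote_plus(value)


# A record is modelled here as a plain dict of column name -> value.
record = {"email": "root@example.org", "query": "a+b%26c"}

# Output column names are hypothetical, chosen for this illustration.
record["email_account"], record["email_domain"] = split_email(record["email"])
record["query"] = url_decode(record["query"])
```

In this form-encoded format a `+` stands for a space, which is why `quote_plus` and `unquote_plus` are used rather than `quote` and `unquote`.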
Integration Assignment: Bidirectional ClickHouse & Flat File Data Ingestion Tool
AI Prompts:-
Prompts Used for Development
Initial Prompt
"Integration Assignment: Bidirectional ClickHouse & Flat File Data Ingestion Tool - Create a web-based
application with a simple user interface that facilitates data ingestion between a ClickHouse database
and Flat Files. Support bidirectional data flow, JWT token authentication, column selection, and
record count reporting."
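
The column selection and record count reporting named in this prompt can be sketched for the flat-file side using only the standard library. This is a minimal illustration under stated assumptions, not the project's actual code: `read_selected_columns`, `ingest`, and the `sink` callback are hypothetical names, and the sink stands in for whatever writes rows to ClickHouse.

```python
import csv
import io


def read_selected_columns(handle, columns, delimiter=","):
    """Yield one dict per data row, restricted to the requested columns."""
    reader = csv.DictReader(handle, delimiter=delimiter)
    missing = [c for c in columns if c not in (reader.fieldnames or [])]
    if missing:
        raise ValueError(f"columns not found in file: {missing}")
    for row in reader:
        yield {c: row[c] for c in columns}


def ingest(handle, columns, sink, delimiter=","):
    """Send the selected columns of every row to `sink`; return the record count."""
    count = 0
    for record in read_selected_columns(handle, columns, delimiter):
        sink(record)
        count += 1
    return count


# In-memory sample standing in for an uploaded flat file.
sample = io.StringIO("id,name,city\n1,Ada,London\n2,Linus,Helsinki\n")
received = []
total = ingest(sample, ["id", "city"], received.append)
print(f"{total} records ingested")
```

Passing the sink as a callable keeps the file-reading logic independent of the target, so the same function can feed a list in tests and a database writer in the application.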
Frontend UI Design
"Design a responsive HTML/CSS/JS user interface for a data ingestion tool that allows users to select
source/target, configure connections, select columns, and view progress/results."
Readme.md: -
## Features
## Requirements
- Python 3.7+
- Flask
- clickhouse-driver
- pandas
- Other dependencies listed in requirements.txt
## Installation
3. Install dependencies:
```
pip install -r requirements.txt
```
## Running the Application
## Usage
## Testing
## Project Structure
```
clickhouse-flat-file-tool/
│
├── app/                          # Application package
│   ├── __init__.py               # Flask app initialization
│   ├── main.py                   # Main routes & API endpoints
│   ├── models/                   # Data models
│   │   ├── clickhouse_client.py  # ClickHouse client
│   │   └── flat_file.py          # Flat file handling
│   ├── static/                   # Static assets
│   │   └── main.js               # Frontend JavaScript
│   ├── templates/                # HTML templates
│   │   └── index.html            # Main UI
│   └── uploads/                  # Directory for uploaded files
│
├── app.py                        # Application entry point
├── requirements.txt              # Python dependencies
└── README.md                     # This file
```
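
As an illustration of the ClickHouse-to-flat-file direction, the sketch below shows one way a helper in a module such as `clickhouse_client.py` could export selected columns to CSV and report the record count. It is an assumption-laden sketch, not the contents of that file: `build_select`, `export_to_csv`, and `FakeClient` are hypothetical names. The only thing it expects from the client is an `execute(query)` method that returns a list of row tuples, which is how `clickhouse_driver.Client.execute` behaves for a `SELECT`.

```python
import csv
import io
import re

# Only plain identifiers are accepted, so user-selected column names
# cannot inject SQL into the generated statement.
_IDENTIFIER = re.compile(r"[A-Za-z_][A-Za-z0-9_]*")


def build_select(table, columns):
    """Build a SELECT statement for the chosen columns of one table."""
    for name in [table, *columns]:
        if not _IDENTIFIER.fullmatch(name):
            raise ValueError(f"unsafe identifier: {name!r}")
    return f"SELECT {', '.join(columns)} FROM {table}"


def export_to_csv(client, table, columns, out):
    """Write the selected columns of `table` to `out` as CSV; return the record count."""
    rows = client.execute(build_select(table, columns))
    writer = csv.writer(out)
    writer.writerow(columns)   # header row
    writer.writerows(rows)     # one CSV line per record
    return len(rows)


class FakeClient:
    """Stand-in for a real client so the sketch runs without a server."""

    def execute(self, query):
        self.last_query = query
        return [(1, "London"), (2, "Helsinki")]


buffer = io.StringIO()
client = FakeClient()
count = export_to_csv(client, "users", ["id", "city"], buffer)
print(f"{count} records exported")
```

With the real driver installed, the fake would be replaced by something like `clickhouse_driver.Client(host="localhost")`; the host value here is a placeholder.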
## Notes