broad and deep

11:48 5th Apr 2024

Probabilistic Matching

In the context of ad tech, probabilistic matching refers to the process of connecting offline data (e.g., household data, purchase data) to online user profiles in the absence of direct user-level identifiers. There are a few common algorithms and techniques used for probabilistic matching:

Hashing-Based Matching:

This approach involves hashing offline identifiers (e.g., email addresses, phone numbers) using a cryptographic hash function such as SHA-256. MD5 still appears in older integrations, but it is no longer considered secure.
The hashed offline data is then matched against identifiers hashed the same way on the online side (e.g., hashed login emails associated with cookie IDs or device IDs) to establish connections between offline and online data.
Hashing keeps the raw personal data out of the exchange while still allowing records to be linked, although hashed identifiers are pseudonymous rather than anonymous.
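
A minimal sketch of the hashing flow described above, assuming email is the shared identifier and that the online side already holds SHA-256 digests of normalized emails keyed to cookie IDs. The function names and sample data are hypothetical, chosen only for illustration.

```python
import hashlib


def normalize_email(email: str) -> str:
    """Trim whitespace and lowercase so equal addresses hash identically."""
    return email.strip().lower()


def hash_identifier(value: str) -> str:
    """Return the SHA-256 hex digest of an identifier string."""
    return hashlib.sha256(value.encode("utf-8")).hexdigest()


def match_hashed(offline_emails, online_hash_to_cookie):
    """Map each offline email to the cookie ID whose stored hash matches."""
    matches = {}
    for email in offline_emails:
        digest = hash_identifier(normalize_email(email))
        cookie_id = online_hash_to_cookie.get(digest)
        if cookie_id is not None:
            matches[email] = cookie_id
    return matches


# Hypothetical sample data (not from any real system).
offline_emails = [" Alice@Example.com", "bob@example.com", "carol@example.com"]
online_hash_to_cookie = {
    hash_identifier("alice@example.com"): "cookie-123",
    hash_identifier("bob@example.com"): "cookie-456",
}

matches = match_hashed(offline_emails, online_hash_to_cookie)
print(matches)
```

Normalizing before hashing matters here: without it, " Alice@Example.com" and "alice@example.com" would produce different digests and never match.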

Probabilistic Linkage:

This technique uses statistical models and machine learning algorithms to infer relationships between offline and online data based on shared attributes, such as geographic location, demographic characteristics, or browsing behavior.
Common algorithms used for probabilistic linkage include logistic regression, decision trees, and ensemble methods like random forests.
These models are trained on a sample of known matches between offline and online data to learn the patterns and probabilities of connections.
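
As a rough illustration of the logistic regression case, here is a small pure-Python sketch trained by gradient descent. The three agreement features (same postcode, same age band, shared IP seen) and the toy labels are assumptions made up for this example; a production system would more likely use a library and far more data.

```python
import math


def sigmoid(z: float) -> float:
    return 1.0 / (1.0 + math.exp(-z))


def train_logistic(features, labels, lr=0.5, epochs=2000):
    """Fit weights and bias by batch gradient descent on log loss."""
    n_features = len(features[0])
    n = len(features)
    weights = [0.0] * n_features
    bias = 0.0
    for _ in range(epochs):
        grad_w = [0.0] * n_features
        grad_b = 0.0
        for x, y in zip(features, labels):
            pred = sigmoid(sum(w * xi for w, xi in zip(weights, x)) + bias)
            error = pred - y
            for j, xi in enumerate(x):
                grad_w[j] += error * xi
            grad_b += error
        weights = [w - lr * g / n for w, g in zip(weights, grad_w)]
        bias -= lr * grad_b / n
    return weights, bias


def match_probability(x, weights, bias):
    """Estimated probability that a candidate pair is a true match."""
    return sigmoid(sum(w * xi for w, xi in zip(weights, x)) + bias)


# Hypothetical training pairs: [same_postcode, same_age_band, shared_ip_seen]
train_x = [
    [1, 1, 1], [1, 1, 0], [1, 0, 1], [0, 1, 1],
    [0, 0, 0], [0, 1, 0], [1, 0, 0], [0, 0, 1],
]
train_y = [1, 1, 1, 1, 0, 0, 0, 0]  # 1 = known match, 0 = known non-match

weights, bias = train_logistic(train_x, train_y)
print(match_probability([1, 1, 0], weights, bias))
print(match_probability([1, 0, 0], weights, bias))
```

The output is a probability per candidate pair, so a threshold (or a top-k rule) still has to be chosen to decide which links to keep.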

Graph-Based Matching:

In this approach, offline and online data are represented as nodes in a graph, and connections between them are established based on shared attributes or relationships.
Graph-based algorithms, such as random walks, community detection, or link prediction, are used to identify and score the likelihood of connections between offline and online data points.
This technique can leverage the relational nature of data to make more informed probabilistic matches.
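
One simple form of link prediction is to score a candidate pair by the Jaccard similarity of their neighbor sets. The sketch below assumes offline records and online IDs are connected through shared attribute nodes (IP address, postcode); the node names and edges are invented for illustration.

```python
from collections import defaultdict


def build_graph(edges):
    """Build an undirected adjacency map from (node, node) pairs."""
    graph = defaultdict(set)
    for a, b in edges:
        graph[a].add(b)
        graph[b].add(a)
    return graph


def jaccard_score(graph, u, v):
    """Jaccard similarity of the neighbor sets of u and v."""
    union = graph[u] | graph[v]
    if not union:
        return 0.0
    return len(graph[u] & graph[v]) / len(union)


# Hypothetical edges linking records to the attributes they were seen with.
edges = [
    ("offline:household-1", "ip:203.0.113.7"),
    ("offline:household-1", "zip:94107"),
    ("online:cookie-a", "ip:203.0.113.7"),
    ("online:cookie-a", "zip:94107"),
    ("online:cookie-b", "zip:94107"),
    ("online:cookie-b", "ip:198.51.100.9"),
]

graph = build_graph(edges)
scores = {
    cookie: jaccard_score(graph, "offline:household-1", cookie)
    for cookie in ("online:cookie-a", "online:cookie-b")
}
print(scores)
```

Here cookie-a shares both attributes with the household while cookie-b shares only the postcode, so cookie-a scores higher. Random walks or community detection would use the same graph but look beyond direct neighbors.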

Artificial Intelligence and Machine Learning:

More advanced probabilistic matching techniques leverage artificial intelligence (AI) and machine learning (ML) models to learn complex patterns and relationships in the data.
This can include deep learning models, such as feed-forward or recurrent neural networks, to capture non-linear relationships and make more accurate probabilistic connections.
These AI/ML models are trained on large datasets of known offline-online matches to learn the underlying patterns and improve the accuracy of probabilistic linking.

The choice of algorithm for probabilistic matching depends on factors such as the available data sources, the quality and completeness of the data, the desired level of accuracy, and the computational resources available. Ad tech companies often combine multiple techniques and continuously refine their probabilistic matching approaches to enhance the effectiveness of their targeted advertising and personalization efforts.

11:46

Data pipeline

Ad tech companies, particularly Demand Side Platforms (DSPs), often have complex data pipelines to integrate and process data from various external sources. Here’s a typical data integration pipeline used in the ad tech industry:

Data Collection:

The first step is to collect data from different external sources, such as data marketplaces, direct integrations with data providers, or a company’s own first-party data.
This data can include user profiles, purchase behaviors, contextual information, location data, mobile device data, and more.

Data Ingestion:

The collected data is ingested into the ad tech company’s data infrastructure, often using batch or real-time data ingestion methods.
Common tools used for data ingestion include Apache Kafka, Amazon Kinesis, or cloud-based data integration services like AWS Glue or Google Cloud Dataflow.

Data Transformation and Enrichment:

The ingested data is then transformed, cleansed, and enriched to create a unified, consistent data model.
This may involve data normalization, deduplication, entity resolution, and the addition of derived features or attributes.
Tools like Apache Spark, Hadoop, or cloud-based data transformation services (e.g., AWS Glue, Google Cloud Dataproc) are often used for this data processing step.
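
The normalization and deduplication steps can be sketched in plain Python as follows. The record fields (email, phone, segment) and the merge rule are assumptions for this example; at scale the same logic would typically run in Spark or a similar engine.

```python
def normalize_record(record: dict) -> dict:
    """Lowercase and trim the email; keep only digits in the phone number."""
    return {
        "email": record.get("email", "").strip().lower(),
        "phone": "".join(ch for ch in record.get("phone", "") if ch.isdigit()),
        "segment": record.get("segment", "").strip(),
    }


def deduplicate(records):
    """Merge records sharing the same normalized (email, phone) key."""
    merged = {}
    for raw in records:
        rec = normalize_record(raw)
        key = (rec["email"], rec["phone"])
        if key not in merged:
            merged[key] = {
                "email": rec["email"],
                "phone": rec["phone"],
                "segments": set(),
            }
        if rec["segment"]:
            merged[key]["segments"].add(rec["segment"])
    return list(merged.values())


# Hypothetical raw records from two providers.
records = [
    {"email": "Alice@Example.com ", "phone": "(555) 010-0001", "segment": "auto-intender"},
    {"email": "alice@example.com", "phone": "555-010-0001", "segment": "frequent-traveler"},
    {"email": "bob@example.com", "phone": "555 010 0002", "segment": "auto-intender"},
]

unified = deduplicate(records)
print(unified)
```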

Data Storage:

The transformed and enriched data is then stored in a scalable data storage layer, such as a data lake (e.g., Amazon S3, Google Cloud Storage), a data warehouse (e.g., Amazon Redshift, Google BigQuery), or a combination of both.
These data stores provide a centralized and accessible repository for the integrated data.

Data Indexing and Querying:

To enable efficient querying and access to the integrated data, ad tech companies often build indexing and caching layers.
This may involve the use of search technologies like Elasticsearch, or in-memory databases like Redis or Aerospike, to provide low-latency access to user profiles, audience segments, and other critical data.

Data Activation and Targeting:

The integrated and processed data is then used to power the ad tech company’s targeting and optimization capabilities.
This may include creating audience segments, building predictive models, and enabling real-time decisioning for ad serving and bidding.
The data is integrated with the ad tech platform’s core functionality, such as a DSP’s ad buying and optimization algorithms.
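
Audience segment creation can be as simple as applying rules to unified profiles. The sketch below assumes rule-based segments; the profile fields, segment names, and thresholds are hypothetical, and real platforms often combine such rules with model scores.

```python
def build_segments(profiles, rules):
    """Return {segment_name: set of user IDs whose profile satisfies the rule}."""
    segments = {name: set() for name in rules}
    for user_id, profile in profiles.items():
        for name, predicate in rules.items():
            if predicate(profile):
                segments[name].add(user_id)
    return segments


# Hypothetical unified profiles keyed by user ID.
profiles = {
    "u1": {"age": 34, "visited_auto_pages": 5},
    "u2": {"age": 52, "visited_auto_pages": 0},
    "u3": {"age": 41, "visited_auto_pages": 2},
}

# Hypothetical segment rules.
rules = {
    "auto_intenders": lambda p: p["visited_auto_pages"] >= 2,
    "age_25_34": lambda p: 25 <= p["age"] <= 34,
}

segments = build_segments(profiles, rules)
print(segments)
```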

Monitoring and Governance:

Throughout the data integration pipeline, ad tech companies implement monitoring, logging, and governance processes to ensure data quality, security, and compliance.
This may involve the use of data lineage tools, data quality monitoring, and access control mechanisms.
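
A data quality check on an ingested batch might look like the sketch below, which reports the missing rate per required field and the number of repeated IDs. The field names and thresholds are assumptions for illustration, not values taken from any particular tool.

```python
def quality_report(records, required_fields, id_field="user_id"):
    """Compute per-field missing rates and count repeated IDs in a batch."""
    total = len(records)
    missing = {field: 0 for field in required_fields}
    seen_ids = set()
    duplicate_ids = 0
    for rec in records:
        for field in required_fields:
            if rec.get(field) in (None, ""):
                missing[field] += 1
        rec_id = rec.get(id_field)
        if rec_id in seen_ids:
            duplicate_ids += 1
        else:
            seen_ids.add(rec_id)
    missing_rate = {
        field: (count / total if total else 0.0)
        for field, count in missing.items()
    }
    return {"total": total, "missing_rate": missing_rate, "duplicate_ids": duplicate_ids}


def passes(report, max_missing_rate=0.3, max_duplicates=1):
    """Hypothetical thresholds; real limits depend on the data contract."""
    rates_ok = all(rate <= max_missing_rate for rate in report["missing_rate"].values())
    return rates_ok and report["duplicate_ids"] <= max_duplicates


# Hypothetical batch with one empty country and one repeated user_id.
batch = [
    {"user_id": "u1", "country": "US"},
    {"user_id": "u2", "country": ""},
    {"user_id": "u2", "country": "DE"},
    {"user_id": "u3", "country": "JP"},
]

report = quality_report(batch, required_fields=["user_id", "country"])
print(report, passes(report))
```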

The complexity and scale of these data integration pipelines are a key competitive advantage for ad tech companies, as they enable more accurate targeting, personalization, and optimization of digital advertising campaigns.

11:42

Data integration in Ad Tech

Ad tech companies, especially Demand Side Platforms (DSPs), often need to integrate data from various external sources to enhance their ad targeting and optimization capabilities. Here are some common ways that ad tech companies integrate external data:

Data Marketplaces: Ad tech companies can access third-party data from centralized data marketplaces, such as LiveRamp, Acxiom, or Datalogix (since acquired by Oracle). These marketplaces provide access to a wide range of user-level data, including demographic information, purchase behavior, interests, and more.
Direct Integrations: DSPs may establish direct integrations with data providers, such as large publishers, e-commerce platforms, or data aggregators. These integrations allow the DSP to access and ingest first-party audience data, transaction data, or other proprietary datasets.
Data Onboarding: Ad tech companies can onboard their own first-party customer data, such as email lists or CRM data, and match it to online user profiles. This allows them to create targeted audience segments and enhance their targeting capabilities.
Probabilistic Matching: When direct user-level data is not available, ad tech companies may use probabilistic matching techniques to connect offline data (e.g., household data, purchase data) to online user profiles. This helps expand the available data for targeting and optimization.
Contextual and Behavioral Data: DSPs often integrate data that provides insights into the content, context, and behavior of users, such as website visitation data, app usage data, or browsing history. This data can be used for contextual targeting and creating audience segments.
Location Data: Location data from sources like GPS, Wi-Fi, or cellular network signals can be integrated to enable location-based targeting and attribution for ad campaigns.
Mobile Device Data: DSPs may access mobile device-level data, such as device IDs, app usage, and location, to build comprehensive user profiles and enable more precise targeting on mobile devices.
Offline Data Integrations: Some ad tech companies integrate offline data sources, such as point-of-sale data, loyalty program data, or automotive data, to better understand user behavior and interests across online and offline channels.

The integration of these diverse data sources allows ad tech companies, especially DSPs, to create more comprehensive user profiles, build targeted audience segments, and optimize ad campaigns for better performance and return on investment.

11:40

tags: Youtube

LLM models on Jetson Xavier NX

11:39

tags: Youtube

YOLOv8

14:07 1st Apr 2022

A.M. Turing Award 2021 for Jack Dongarra! Jack was one of my Ph.D. dissertation committee members. He was also my supervisor when I was a visiting researcher at his lab 20 years ago. Jack also announced his retirement this year. Congratulations! It is an honour to have worked with him.

08:20 20th Dec 2021

My kid’s drawing met Meta’s AI. And I edited it.

23:15 3rd Oct 2021

tags: gadget

My new mechanical keyboard. TEX’s Shinobi (red switch)

12:48 16th May 2019

Added a second key. I have been using my first YubiKey since 2014. The new key supports NFC! More secure and convenient. They also made it more durable with a metal keyhole.

00:12 22nd Apr 2019

My new toy! NVIDIA Jetson Nano.