Apache Kafka

A distributed messaging system: Apache Kafka is publish-subscribe messaging rethought as a distributed commit log.

Kafka is a distributed, partitioned, replicated commit log service. It provides the functionality of a messaging system, but with a unique design.

• Kafka maintains feeds of messages in categories called topics.
• We'll call processes that publish messages to a Kafka topic producers.
• We'll call processes that subscribe to topics and process the feed of published messages consumers.
• Kafka is run as a cluster comprised of one or more servers, each of which is called a broker.
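
To make the topic/partition/replica vocabulary concrete, here is a minimal sketch that creates a topic programmatically. It uses the Kafka AdminClient, which comes from newer client versions than the ZooKeeper-era example later in this document; the broker address and topic name are assumptions for illustration.

import java.util.{Collections, Properties}
import org.apache.kafka.clients.admin.{AdminClient, AdminClientConfig, NewTopic}

object CreateTopicSketch {
  def main(args: Array[String]): Unit = {
    val props = new Properties()
    // Assumed broker address; point this at any broker in the cluster.
    props.put(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092")
    val admin = AdminClient.create(props)

    // A topic split into 3 partitions, each kept on 1 broker (replication factor 1).
    val topic = new NewTopic("my-topic", 3, 1.toShort)
    admin.createTopics(Collections.singleton(topic)).all().get()

    admin.close()
  }
}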
Producers
Producers publish data to the topics of their choice. The producer is responsible for choosing
which message to assign to which partition within the topic. This can be done in a round-robin
fashion simply to balance load or it can be done according to some semantic partition function
(say based on some key in the message). More on the use of partitioning in a second.
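
A minimal sketch of both partitioning strategies, assuming a broker at localhost:9092 and a topic named my-topic (both hypothetical):

import java.util.Properties
import org.apache.kafka.clients.producer.{KafkaProducer, ProducerConfig, ProducerRecord}

object PartitioningSketch {
  def main(args: Array[String]): Unit = {
    val props = new Properties()
    props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092") // assumed broker
    props.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG,
      "org.apache.kafka.common.serialization.StringSerializer")
    props.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG,
      "org.apache.kafka.common.serialization.StringSerializer")
    val producer = new KafkaProducer[String, String](props)

    // Null key: the producer spreads records across partitions to balance load.
    producer.send(new ProducerRecord[String, String]("my-topic", "unkeyed record"))

    // Non-null key: records are hashed by key, so every record for "user-42"
    // lands in the same partition, a simple semantic partition function.
    producer.send(new ProducerRecord[String, String]("my-topic", "user-42", "keyed record"))

    producer.close()
  }
}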

Consumers
Messaging traditionally has two models: queuing and publish-subscribe. In a queue, a pool of consumers may read from a server and each message goes to one of them; in publish-subscribe, the message is broadcast to all consumers. Kafka offers a single consumer abstraction that generalizes both of these: the consumer group.
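
A minimal consumer sketch, assuming the newer Java client API (poll(Duration) needs clients 2.0+), a broker at localhost:9092, and a topic my-topic. Every instance started with the same group.id splits the topic's partitions among themselves (queuing); instances with different group.ids each see every message (publish-subscribe).

import java.time.Duration
import java.util.{Arrays, Properties}
import org.apache.kafka.clients.consumer.{ConsumerConfig, KafkaConsumer}

object ConsumerGroupSketch {
  def main(args: Array[String]): Unit = {
    val props = new Properties()
    props.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092") // assumed broker
    props.put(ConsumerConfig.GROUP_ID_CONFIG, "my-consumer-group")       // assumed group name
    props.put(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG,
      "org.apache.kafka.common.serialization.StringDeserializer")
    props.put(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG,
      "org.apache.kafka.common.serialization.StringDeserializer")
    val consumer = new KafkaConsumer[String, String](props)
    consumer.subscribe(Arrays.asList("my-topic"))

    // Poll forever, printing where each record came from.
    while (true) {
      val records = consumer.poll(Duration.ofMillis(500)).iterator()
      while (records.hasNext) {
        val r = records.next()
        println(s"partition=${r.partition} offset=${r.offset} value=${r.value}")
      }
    }
  }
}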
Use Case: Integration with Spark Streaming
import java.util.HashMap
import org.apache.kafka.clients.producer.{KafkaProducer, ProducerConfig, ProducerRecord}
import org.apache.spark.SparkConf
import org.apache.spark.streaming._
import org.apache.spark.streaming.kafka._
/**
 * Consumes messages from one or more topics in Kafka and does wordcount.
 * Usage: KafkaWordCount <zkQuorum> <group> <topics> <numThreads>
 *   <zkQuorum> is a list of one or more ZooKeeper servers that make up the quorum
 *   <group> is the name of the Kafka consumer group
 *   <topics> is a list of one or more Kafka topics to consume from
 *   <numThreads> is the number of threads the Kafka consumer should use
 *
 * Example:
 *   `$ bin/run-example \
 *     org.apache.spark.examples.streaming.KafkaWordCount zoo01,zoo02,zoo03 \
 *     my-consumer-group topic1,topic2 1`
 */
object KafkaWordCount {
  def main(args: Array[String]) {
    if (args.length < 4) {
      System.err.println("Usage: KafkaWordCount <zkQuorum> <group> <topics> <numThreads>")
      System.exit(1)
    }

    // StreamingExamples is a logging helper bundled with the Spark examples package.
    StreamingExamples.setStreamingLogLevels()

    val Array(zkQuorum, group, topics, numThreads) = args
    val sparkConf = new SparkConf().setAppName("KafkaWordCount")
    // 2-second batch interval; checkpointing is required by the stateful
    // reduceByKeyAndWindow below.
    val ssc = new StreamingContext(sparkConf, Seconds(2))
    ssc.checkpoint("checkpoint")

    // Map each topic to the number of consumer threads reading it.
    val topicMap = topics.split(",").map((_, numThreads.toInt)).toMap
    val lines = KafkaUtils.createStream(ssc, zkQuorum, group, topicMap).map(_._2)
    val words = lines.flatMap(_.split(" "))
    // Count words over a 10-minute sliding window, updated every 2 seconds.
    val wordCounts = words.map(x => (x, 1L))
      .reduceByKeyAndWindow(_ + _, _ - _, Minutes(10), Seconds(2), 2)
    wordCounts.print()

    ssc.start()
    ssc.awaitTermination()
  }
}
// Produces messages of random digits (0 to 9), <wordsPerMessage> digits per message.
object KafkaWordCountProducer {
  def main(args: Array[String]) {
    if (args.length < 4) {
      System.err.println("Usage: KafkaWordCountProducer <metadataBrokerList> <topic> " +
        "<messagesPerSec> <wordsPerMessage>")
      System.exit(1)
    }

    val Array(brokers, topic, messagesPerSec, wordsPerMessage) = args

    // Kafka producer connection properties (broker list, not ZooKeeper)
    val props = new HashMap[String, Object]()
    props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, brokers)
    props.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG,
      "org.apache.kafka.common.serialization.StringSerializer")
    props.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG,
      "org.apache.kafka.common.serialization.StringSerializer")
    val producer = new KafkaProducer[String, String](props)

    // Send messagesPerSec messages every second, forever.
    while (true) {
      (1 to messagesPerSec.toInt).foreach { messageNum =>
        val str = (1 to wordsPerMessage.toInt)
          .map(x => scala.util.Random.nextInt(10).toString)
          .mkString(" ")
        // Null key, so the producer balances records across partitions.
        val message = new ProducerRecord[String, String](topic, null, str)
        producer.send(message)
      }
      Thread.sleep(1000)
    }
  }
}
