Getting Started with Storm
Format: PDF / Kindle (mobi) / ePub
Even as big data is turning the world upside down, the next phase of the revolution is already taking shape: real-time data analysis. This hands-on guide introduces you to Storm, a distributed, JVM-based system for processing streaming data. Through simple tutorials, sample Java code, and a complete real-world scenario, you’ll learn how to build fast, fault-tolerant solutions that process results as soon as the data arrives.
Discover how easy it is to set up Storm clusters for solving various problems, including continuous data computation, distributed remote procedure calls, and data stream processing.
- Learn how to program Storm components: spouts for data input and bolts for data transformation
- Discover how data is exchanged between spouts and bolts in a Storm topology
- Make spouts fault-tolerant with several commonly used design strategies
- Explore bolts—their life cycle, strategies for design, and ways to implement them
- Scale your solution by defining each component’s level of parallelism
- Study a real-time web analytics system built with Node.js, a Redis server, and a Storm topology
reliable. It’s important to define spout communication based on the problem you are working on. No single architecture fits all topologies. If you know the sources or can control them, you can use a direct connection; if you need the capacity to add unknown sources or receive messages from a variety of sources, it’s better to use a queued connection. If you need an online process, you will need to use DRPCSpouts or implement something similar. Although you have
relations should be incremented. Take a look at the source code. The bolt keeps a set of the products navigated by each user. Note that the set contains product:category pairs rather than just products. That’s because you’ll need the category information in future calls, and it will perform better if you don’t need to get it from the database each time. This is possible because each product has only one category, and it won’t change during the product’s lifetime. After reading the set of the
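The per-user set of product:category pairs described above can be sketched as follows. This is a minimal in-memory illustration, not the book's bolt: the class name `UserNavigationSketch`, its methods, and the use of a `HashMap` instead of an external store are all assumptions made for the example. It also assumes category names contain no colon.

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

// Hypothetical sketch: keeps, for each user, the set of "product:category" entries.
class UserNavigationSketch {

    // user id -> set of "product:category" strings (in-memory stand-in for a real store)
    private final Map<String, Set<String>> navigated = new HashMap<>();

    // Records a visit; returns true only if this user had not navigated the product before.
    boolean record(String user, String product, String category) {
        return navigated
                .computeIfAbsent(user, u -> new HashSet<>())
                .add(product + ":" + category);
    }

    // Returns the stored entries for a user, or an empty set for an unknown user.
    Set<String> entriesFor(String user) {
        return navigated.getOrDefault(user, Collections.<String>emptySet());
    }

    // Reads the category straight from a stored entry, so no database lookup is needed.
    // Assumes the category itself contains no ':' character.
    static String categoryOf(String entry) {
        return entry.substring(entry.lastIndexOf(':') + 1);
    }

    public static void main(String[] args) {
        UserNavigationSketch sketch = new UserNavigationSketch();
        sketch.record("user-1", "p10", "books");
        sketch.record("user-1", "p11", "music");
        for (String entry : sketch.entriesFor("user-1")) {
            System.out.println(entry + " -> " + categoryOf(entry));
        }
    }
}
```

Storing the category next to the product trades a little memory for one fewer lookup per call, which is safe only because a product's category never changes.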
Repository where we can find the Storm dependencies -->
You’ll create the topology using a TopologyBuilder, which tells Storm how the nodes are arranged and how they exchange data.

    TopologyBuilder builder = new TopologyBuilder();
    builder.setSpout("word-reader", new WordReader());
    builder.setBolt("word-normalizer", new WordNormalizer()).shuffleGrouping("word-reader");
    builder.setBolt("word-counter", new WordCounter()).shuffleGrouping("word-normalizer");

The spout and the bolts are connected using shuffleGroupings. This type of grouping tells Storm to
change the level of parallelism (in real life, of course, each instance would run on a separate machine). But there seems to be a problem: the words “is” and “great” have been counted once in each instance of WordCounter. Why? When you use shuffleGrouping, you are telling Storm to send each message to an instance of your bolt in a randomly distributed fashion. In this example, it’d be ideal to always send the same word to the same WordCounter. To do so, you can change shuffleGrouping("word-normalizer")
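To see why random routing splits a word's count across instances, here is a plain-Java simulation. It does not use the Storm API: the class `GroupingSketch` and its methods are hypothetical, the random choice is a simplification of what shuffle grouping does, and routing by a hash of the word is only an approximation of the idea of grouping on a field value.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Random;

// Hypothetical simulation of two routing strategies; no Storm classes are involved.
class GroupingSketch {

    // Shuffle-style routing: any instance may receive the word.
    static int shuffleTarget(Random random, int instances) {
        return random.nextInt(instances);
    }

    // Field-style routing: the same word always maps to the same instance.
    static int fieldsTarget(String word, int instances) {
        return Math.floorMod(word.hashCode(), instances);
    }

    // Feeds the words to `instances` independent counters (one map per simulated bolt instance).
    static List<Map<String, Integer>> count(List<String> words, int instances,
                                            boolean byField, long seed) {
        Random random = new Random(seed);
        List<Map<String, Integer>> counters = new ArrayList<>();
        for (int i = 0; i < instances; i++) {
            counters.add(new HashMap<>());
        }
        for (String word : words) {
            int target = byField
                    ? fieldsTarget(word, instances)
                    : shuffleTarget(random, instances);
            counters.get(target).merge(word, 1, Integer::sum);
        }
        return counters;
    }

    public static void main(String[] args) {
        List<String> words = Arrays.asList("storm", "is", "great", "is", "great", "storm");
        // With random routing, a word may show up in both counters with partial counts.
        System.out.println("shuffle: " + count(words, 2, false, 42L));
        // With hash routing, each word appears in exactly one counter with its full count.
        System.out.println("by word: " + count(words, 2, true, 42L));
    }
}
```

With hash routing, every occurrence of a word reaches the same counter, so each counter holds complete totals for the words it owns.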