Why MapReduce algorithms

In fact, these algorithms share an important characteristic: they compute frequent itemsets using a small number of MapReduce phases and are therefore free, or almost free, of the level-wise process.

We will discuss this in more detail later. Its Map step generates all possible itemsets in one step via the powerset, instead of following the traditional level-wise approach that obtains frequent itemsets of one size at a time. The first phase produces the frequent itemsets of size one, which are then used to generate the remaining frequent itemsets in the second phase.

The MapReduce-based Apriori of Imran and Ranjan [16] uses the powerset as well and thus inherits the same characteristic. However, their approach also employs a vertical layout and set intersection in additional MapReduce phases to reduce the number of scans and the overhead. A vertical layout associates each item with the set of transactions containing it, whereas a horizontal layout associates each transaction with the items it contains. The paper focuses on improving performance over an existing similar approach [13] that uses only the vertical layout and set intersection.
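
To make the two layouts concrete, the following minimal sketch (illustrative Python with made-up transaction data, not code from either paper) converts a horizontal database into a vertical one; intersecting two items' transaction sets then gives the support count of a 2-itemset directly.

```python
# Hypothetical example data: horizontal layout maps each transaction id to its items.
horizontal = {
    1: {"bread", "milk"},
    2: {"bread", "beer"},
    3: {"milk", "beer", "bread"},
}

# Build the vertical layout: each item maps to the set of transactions containing it.
vertical = {}
for tid, items in horizontal.items():
    for item in items:
        vertical.setdefault(item, set()).add(tid)

# With the vertical layout, the support count of {bread, milk} is a set intersection.
support_count = len(vertical["bread"] & vertical["milk"])
print(vertical)
print(support_count)  # 2 transactions contain both bread and milk
```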

In this paper, we propose a simple MapReduce-based Apriori algorithm that requires only one MapReduce phase in a general MapReduce environment, without using a Combiner or other special features. Table 1 summarizes this last group of MapReduce-based Apriori algorithms. The core MapReduce phases are those used to compute frequent itemsets, as opposed to pre-processing steps such as data partitioning.

As shown in Table 1, AprioriPMR is the closest to our implementation in that it does not require additional features, focusing instead on using the MapReduce environment in the most general sense. In addition, AprioriPMR has been empirically shown in [28] to outperform the traditional parallel Apriori. For these reasons, we focus our comparison studies on AprioriPMR in the following sections. The terms and notations introduced in this section will be used throughout the rest of the paper.

Given two sets of items, or itemsets, X and Y with no overlapping items (i.e., X ∩ Y = ∅), an association rule is an implication of the form X ⇒ Y. We define the support of X, sup(X), to be the ratio of the number of database transactions that contain X to the total number of transactions in the database D. Specifically, algorithms for association rule mining have two processes: (1) finding, from the database D, all possible frequent itemsets, i.e., itemsets whose support is at least a given minimum support threshold, and (2) generating association rules from these frequent itemsets.
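
As a small illustration of the support definition (a sketch with made-up data, not the paper's code), the following computes sup(X) over a toy transaction database:

```python
# Hypothetical toy database D: a list of transactions, each a set of items.
D = [
    {"bread", "milk"},
    {"bread", "beer", "eggs"},
    {"milk", "beer", "bread"},
    {"milk", "eggs"},
]

def support(itemset, transactions):
    """Fraction of transactions that contain every item of `itemset`."""
    contains = sum(1 for t in transactions if itemset <= t)
    return contains / len(transactions)

print(support({"bread", "milk"}, D))  # 2 of 4 transactions -> 0.5
```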

The Apriori algorithm [3] is a core process of association rule mining: it finds frequent itemsets in a given database of transactions of items.

Apriori uses a level-wise approach to find frequent itemsets of different sizes, from itemsets of size one (level one) up to itemsets of size m, the maximum itemset size (level m). From this point on, we will use the term n-itemset to refer to an itemset of size n. Apriori relies on the pruning principle that every subset of a frequent itemset must be frequent; similarly, a superset of an infrequent itemset must be infrequent.

The algorithm has two main steps: (1) generating candidate itemsets, and (2) testing the candidates for frequent itemsets. In Step 1, to improve efficiency, the candidate set C_k is generated by applying a self-joining operation, as in relational databases, to L_(k-1), i.e., the set of frequent (k-1)-itemsets. Apriori is an iterative process that continues this level-wise generate-and-test until there are no frequent itemsets from the previous level left to be processed.
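
The following sketch (illustrative Python, not the paper's pseudocode) shows the generate step: candidates C_k are built by self-joining L_(k-1) and pruning any candidate that has an infrequent (k-1)-subset.

```python
from itertools import combinations

def generate_candidates(prev_frequent, k):
    """Self-join frequent (k-1)-itemsets, then prune by the Apriori property."""
    prev = set(prev_frequent)
    candidates = set()
    for a in prev:
        for b in prev:
            union = a | b
            if len(union) == k:
                # Keep the candidate only if every (k-1)-subset is frequent.
                if all(frozenset(s) in prev for s in combinations(union, k - 1)):
                    candidates.add(union)
    return candidates

# Hypothetical frequent 2-itemsets from an earlier level.
L2 = [frozenset(p) for p in ({"a", "b"}, {"a", "c"}, {"b", "c"}, {"b", "d"})]
print(generate_candidates(L2, 3))  # {frozenset({'a', 'b', 'c'})}
```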

In the worst case, it reaches level m, the maximum itemset size possible. Apriori has been extended in many variations that minimize the number of database scans, and numerous techniques have been proposed to enhance its performance and scalability [35, 39].

MapReduce [12] is a highly scalable programming paradigm that enables processing of massive volumes of data by means of parallel execution on large clusters. It is a simplified programming model, since parallelization, communication, load balancing and fault tolerance are all handled automatically by the MapReduce system [12, 35]. Inspired by primitives of functional programming languages such as Lisp, MapReduce uses two main user-defined functions: Map and Reduce.

Both the input and the output of these functions must be in the form of key, value pairs. The reason for this restriction will become clear later. Figure 1 shows the typical form of the Map and Reduce functions.
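
As a concrete illustration of this key, value form (a minimal Python sketch that simulates the framework rather than using a real MapReduce system), a word-count job can be written as a Map that emits (word, 1) pairs and a Reduce that sums the grouped values:

```python
from collections import defaultdict

def map_fn(_, line):
    """Map: a (key, value) pair in -> a list of intermediate (key, value) pairs out."""
    return [(word, 1) for word in line.split()]

def reduce_fn(key, values):
    """Reduce: an intermediate key and its list of values in -> a merged (key, value) out."""
    return (key, sum(values))

# Simulate the framework: map every input pair, then group by intermediate key.
inputs = [(0, "to be or not to be"), (1, "to do")]
grouped = defaultdict(list)
for k, v in inputs:
    for ik, iv in map_fn(k, v):
        grouped[ik].append(iv)

results = [reduce_fn(k, vs) for k, vs in sorted(grouped.items())]
print(results)  # [('be', 2), ('do', 1), ('not', 1), ('or', 1), ('to', 3)]
```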

A format of basic Map and Reduce functions, showing the basic data format of each. At the initial setup, the MapReduce system [12] splits the data into pieces of manageable size, starts up copies of the program on cluster nodes, and assigns each idle node a Map or Reduce task. We will refer to a node assigned a Map task as a Map node, where the Map function is executed; a Reduce node is defined similarly. Intermediate key, value pairs produced by the Map nodes are buffered and written to local disks, and the locations of these pairs are passed on to the Reduce nodes.

Reduce nodes receive these locations and use remote procedure calls to read the buffered data. All intermediate pairs are sorted, and those sharing the same intermediate key are grouped together as key, list of values. The Reduce function takes an intermediate key and its corresponding list of values and merges these values to form a smaller set of values. This continues until there are no intermediate pairs left to be processed.

An iterator supplies the intermediate values to Reduce, which makes it possible to handle lists of values that are too large to fit in memory. The output of the Reduce function is appended to the output file.

Parallelizing recursive algorithms may require multiple MapReduce phases. In each MapReduce phase, a computing node may be assigned several different Map or Reduce tasks before the overall Reduce step is completely done. This is why the restricted key, value pair format needs to be maintained for the outputs of both the Map and the Reduce functions.

Like many distributed systems, MapReduce was designed to reduce the amount of data sent between computing nodes. To save network bandwidth, the intermediate pair data are read from local disks, and only a single copy of the data is written to local disk. Processing data where it resides helps reduce the overhead of moving data. Additionally, reliability is achieved by re-assigning the task of a failed node to another available node.

MapReduce enables Map and Reduce tasks to be executed on different machines in parallel, but the final result is obtained only after all Reduce tasks have completed [35]. Thus, nodes that have finished their Reduce tasks have to wait for the other Reduce nodes to complete before they can be released to take on other tasks.

This also prevents the invocation of the next MapReduce phase and can result in poor performance.

This section contrasts two types of MapReduce-based Apriori algorithms: naive and non-naive. The naive type refers to traditional parallel Apriori solutions that apply the MapReduce programming paradigm in a straightforward manner, without modifying the problem-solving concept.

Instead, they mimic the sequential solution. The non-naive type refers to alternative solutions that do not follow the original sequential process. Naive MapReduce-based Apriori algorithms typically require multiple MapReduce phases, each of which finds the frequent itemsets of one set size.

Some discard infrequent items at each level [11]. Figure 2 shows the pseudocode of a traditional naive MapReduce-based Apriori. There are other variations of the implementation; however, they all share the core characteristic of iteratively finding solutions level by level.

Each of the Map tasks takes the input transactions in its corresponding split, or block, as input and executes the Map function on them opportunistically. Basic naive MapReduce-based Apriori.

An intermediate step sorts the pairs and collects, for each key, the corresponding values as a list of values. The Reduce function sums the values until there are no more key, value pairs to be processed.

Another intermediate step (not shown) prunes the infrequent key, value pairs that do not meet the support criterion and produces the frequent 1-itemsets, F_1. This completes the Reduce step of MapReduce phase 1. MapReduce phase 2 starts only after the Reduce step of phase 1 finishes. In phase 2, the Map step takes an input transaction and produces all possible key, value pairs where the key is a 2-itemset of the transaction all of whose proper non-empty subsets are frequent.

This is done to reduce the number of candidate itemsets at the current level by using the pruning principle that all subsets of a frequent itemset must be frequent. One way to do this is to determine whether all 1-itemsets of each potential candidate 2-itemset S of the transaction are frequent, i.e., belong to F_1. The rest of this MapReduce phase proceeds in the same way as phase 1. Note that the sequential Apriori in [3] generates its candidate set by self-joining frequent itemsets, so that each potential candidate is a superset of frequent itemsets only; by the pruning principle, a superset of an infrequent itemset is infrequent, and generating such a candidate would not be fruitful.
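
To make the two phases concrete, the following is a minimal sketch (illustrative Python simulating the framework; names such as run_phase and map_phase2 are my own, not the paper's) of the naive approach: phase 1 counts items and filters by a minimum support count, and phase 2 emits only candidate 2-itemsets whose 1-item subsets are all frequent.

```python
from itertools import combinations
from collections import defaultdict

def run_phase(transactions, map_fn):
    """Simulate one MapReduce phase: map every transaction, group by key, sum the values."""
    grouped = defaultdict(list)
    for t in transactions:
        for key, value in map_fn(t):
            grouped[key].append(value)
    return {key: sum(values) for key, values in grouped.items()}

# Hypothetical data and support threshold.
transactions = [{"a", "b", "c"}, {"a", "b"}, {"a", "c"}, {"b", "c"}, {"a", "b", "c"}]
min_count = 3

# Phase 1: Map emits (item, 1); Reduce sums; infrequent keys are pruned to give F1.
counts1 = run_phase(transactions, lambda t: [(frozenset([i]), 1) for i in t])
F1 = {k for k, c in counts1.items() if c >= min_count}

# Phase 2: Map emits (2-itemset, 1) only when both 1-item subsets are in F1.
def map_phase2(t):
    return [(frozenset(pair), 1) for pair in combinations(sorted(t), 2)
            if all(frozenset([i]) in F1 for i in pair)]

counts2 = run_phase(transactions, map_phase2)
F2 = {k for k, c in counts2.items() if c >= min_count}
print(F1, F2)
```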

However, our focus is not on improving a naive Apriori with another naive Apriori. The bottom part of Fig. 2 shows the subsequent MapReduce phases for higher levels; the concepts are the same. The Map function checks whether all subsets of a potential candidate at the current level are frequent, using the frequent itemsets from the previous level. The Reduce functions of all levels are coded the same, while the key itemset at each level has a different size, implicitly controlled by the Map of that phase.

The first phase produces the set of items whose frequency of occurrence satisfies the support criterion to be frequent. The results of the first phase are used to generate the rest of the frequent itemsets in the second phase. Basic idea of AprioriPMR. The concept is the same as that of the first phase of the traditional Apriori, but the format of the key is slightly different for simplicity. As shown in the Map function of phase 2 in the figure, each transaction T is first stripped of its infrequent items, and the candidate itemsets are then generated from the remaining frequent items of T.

This is crucial to ensure that no generated subset of T is a superset of an infrequent itemset, since a superset of an infrequent itemset must itself be infrequent. The Reduce function works the same way as in the other algorithms. Experiments comparing the performance of AprioriPMR with the naive MapReduce-based solution, referred to as the traditional parallel Apriori algorithm, have shown significant improvement [28].
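
The following is a minimal sketch of how such a phase-2 Map step might look (illustrative Python under my reading of the description above; the function and variable names are my own, not the paper's): each transaction is intersected with the frequent items from phase 1, and all itemsets of size two or more are emitted from the reduced transaction.

```python
from itertools import combinations

def map_phase2_powerset(transaction, frequent_items):
    """Emit (itemset, 1) for every subset of size >= 2 built from the transaction's
    frequent items only (the powerset minus singletons and the empty set)."""
    kept = sorted(set(transaction) & frequent_items)  # drop infrequent items first
    pairs = []
    for size in range(2, len(kept) + 1):
        for subset in combinations(kept, size):
            pairs.append((frozenset(subset), 1))
    return pairs

# Hypothetical frequent 1-items from phase 1 and a sample transaction.
F1_items = {"a", "b", "c"}
print(map_phase2_powerset({"a", "b", "c", "x"}, F1_items))
# -> {a,b}, {a,c}, {b,c}, {a,b,c}; the infrequent item "x" is ignored
```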

However, the authors did not provide an explanation or insight into which characteristic contributes to this improvement, other than their proposed mechanism. Our hypothesis is that the number of iterations, or MapReduce phases, impacts performance, because each phase has to wait for the previous phase to finish before it can start.

Later sections will show that AprioriS is not just simple but also powerful in its performance on large clusters. Figure 4 shows the pseudocode of our simple non-naive AprioriS.
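
The pseudocode of Fig. 4 is not reproduced here. As a rough illustration of what a one-phase MapReduce Apriori can look like, the following is my own Python sketch, assuming the Map step emits every non-empty subset of each transaction and a single Reduce step counts and filters them; this matches the one-phase, powerset-based description above but is not necessarily the paper's exact algorithm.

```python
from itertools import combinations
from collections import defaultdict

def map_all_subsets(transaction):
    """Map: emit (itemset, 1) for every non-empty subset of the transaction."""
    items = sorted(transaction)
    return [(frozenset(s), 1) for size in range(1, len(items) + 1)
            for s in combinations(items, size)]

def reduce_count(itemset, values, min_count):
    """Reduce: sum the counts and keep the itemset only if it is frequent."""
    total = sum(values)
    return (itemset, total) if total >= min_count else None

# Simulate a single MapReduce phase on hypothetical data.
transactions = [{"a", "b"}, {"a", "b", "c"}, {"a", "c"}, {"b", "c"}]
grouped = defaultdict(list)
for t in transactions:
    for key, value in map_all_subsets(t):
        grouped[key].append(value)

frequent = [r for k, vs in grouped.items() if (r := reduce_count(k, vs, 2))]
print(sorted(frequent, key=lambda kv: (len(kv[0]), sorted(kv[0]))))
```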

MapReduce Algorithm. Summary. The MapReduce algorithm works by breaking the process into phases. The Map phase performs two sub-steps: Splitting, which takes the input dataset from the source and divides it into smaller sub-datasets, and Mapping, which takes the smaller sub-datasets as input and performs the required action or computation on each sub-dataset.

Sorting takes the output of the merging step and sorts all key-value pairs by their keys. The Mapper class takes the input, tokenizes it, maps it and sorts it. The output of the Mapper class is used as input by the Reducer class, which in turn searches for matching pairs and reduces them. MapReduce implements various mathematical algorithms to divide a task into small parts and assign them to multiple systems.

Sorting is one of the basic MapReduce algorithms used to process and analyze data. MapReduce implements a sorting algorithm to automatically sort the output key-value pairs from the mapper by their keys. In the Shuffle and Sort phase, after tokenizing the values in the mapper class, the Context class (a user-defined class) collects the matching-key values as a collection.

To collect similar key-value pairs (intermediate keys), the Mapper class takes the help of the RawComparator class to sort the key-value pairs. Searching plays an important role in the MapReduce algorithm.
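
As a small illustration of this shuffle-and-sort behaviour (a plain Python sketch that simulates the framework; it does not use Hadoop's RawComparator API):

```python
from itertools import groupby

# Hypothetical intermediate (key, value) pairs emitted by several mappers.
intermediate = [("cat", 1), ("ant", 1), ("cat", 1), ("bee", 1), ("ant", 1)]

# Shuffle and sort: order the pairs by key, then group the values of equal keys.
intermediate.sort(key=lambda kv: kv[0])
grouped = [(key, [v for _, v in pairs])
           for key, pairs in groupby(intermediate, key=lambda kv: kv[0])]
print(grouped)  # [('ant', [1, 1]), ('bee', [1]), ('cat', [1, 1])]
```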

Searching helps in the (optional) combiner phase and in the Reducer phase, which produces a summary of the complete data set.

MapReduce is an application used for the processing of huge datasets.

These datasets can be processed in parallel. MapReduce can scale to very large data sets and a large number of nodes. These large data sets are stored on HDFS, which makes the analysis of data easier. It can process any kind of data: structured, unstructured or semi-structured. MapReduce is growing rapidly and supports parallel computing. It helps in determining prices for products and yielding the highest profits, and it also supports predictive and recommendation analysis.

It allows programmers to run models over different data sets, using advanced statistical and machine learning techniques that help in making predictions from data. It filters and distributes the data to different nodes within the cluster, processing it according to the mapper and reducer functions.

Hadoop skills are among the most in-demand these days, and the opportunities in this field are growing very fast. IT professionals who work in Java have an advantage, as they are particularly sought after, and developers, data architects, data warehouse and BI professionals can command high salaries by learning this technology. MapReduce is the basis of the Hadoop framework.

By learning it you can enter the data analytics market, understand how large sets of data are processed, and see how this technology is changing the way data is processed and stored.


