Please use this identifier to cite or link to this item: http://hdl.handle.net/1843/ESBF-AEDNUF
Type: Master's Dissertation
Title: Escalonamento baseado em localidade no ambiente Watershed (Locality-based scheduling in the Watershed environment)
Authors: Bruno Cerqueira Hott
First Advisor: Dorgival Olavo Guedes Neto
First Referee: Italo Fernando Scota Cunha
Second Referee: Renato Antonio Celso Ferreira
Abstract: The growth in the volume of data available for processing in many scenarios, together with the emergence of storage and processing platforms such as Hadoop, has enabled new applications but also created new challenges. With very large volumes of data distributed across many machines, the problem arises of moving applications close to the data in order to reduce communication costs within the system. However, there is still little understanding of how data locality affects the performance of these frameworks. This work evaluates this problem in the context of the Watershed environment. For this analysis, we integrated Watershed into the Hadoop ecosystem and implemented a scheduler for Watershed applications based on the locality information provided by the system. The results confirm the advantages of taking data placement into account when scheduling applications of this kind.
Abstract: Increased connectivity and bandwidth on the Internet, combined with the falling cost of electronic equipment in general, have caused an explosion in the volume of data traveling over the network. At the same time, the resources available to store these data have been growing, which led to the appearance of systems developed specifically to process them. An early example is Google's MapReduce model, which was followed by several open source implementations such as Hadoop and by new models such as Spark. A solution was also needed for storing these huge data sets, and distributed file systems such as HDFS and Tachyon emerged. Because the data are now very large in volume and distributed over multiple machines in a cluster, the problem arises of moving applications close to the data effectively. If this is not done, the cost of moving the data through the system can be very high and can impair the final performance of the application. Depending on where the data reside, an application may access them directly from the disk of the local machine, from local memory through caching, or from another machine in the cluster over the network. The various trade-offs in storage capacity, access time, and computational cost make the placement decision nontrivial. This work implements scheduling based on data locality in the Watershed processing environment. For this analysis, Watershed was integrated with the Hadoop ecosystem, creating communication channels with the HDFS and Tachyon distributed file systems. Based on the location information provided by these systems, we implemented a locality-based process scheduler for Watershed applications running on those file systems. Finally, experiments were conducted to compare the various ways of handling files: through the local file system, through a distributed file system, or in memory.
The results show the advantages of taking data placement into account when scheduling such applications.
Subject: Computing
Big data
Distributed systems
Language: Portuguese
Publisher: Universidade Federal de Minas Gerais
Publisher Initials: UFMG
Rights: Open Access
URI: http://hdl.handle.net/1843/ESBF-AEDNUF
Issue Date: 15-Jul-2016
Appears in Collections: Master's Dissertations

Files in This Item:
File: brunohott.pdf
Size: 1.35 MB
Format: Adobe PDF


Items in DSpace are protected by copyright, with all rights reserved, unless otherwise indicated.