Locality-based scheduling in the Watershed environment

Bruno Cerqueira Hott

Use this identifier to cite or link to this item: http://hdl.handle.net/1843/ESBF-AEDNUF

Type:	Master's Dissertation
Title:	Escalonamento baseado em localidade no ambiente Watershed (Locality-based scheduling in the Watershed environment)
Author(s):	Bruno Cerqueira Hott
First advisor:	Dorgival Olavo Guedes Neto
First committee member:	Italo Fernando Scota Cunha
Second committee member:	Renato Antonio Celso Ferreira
Summary:	The growth in the volume of data available for processing in many scenarios, together with the emergence of storage and processing platforms such as Hadoop, has enabled new applications but has also created new challenges. With a very large volume of data distributed across many machines, the problem arises of moving applications close to the data in order to reduce communication costs within the system. However, there is still little understanding of how data locality affects the performance of these frameworks. This work evaluates that problem in the context of the Watershed environment. For this analysis, we integrated Watershed into the Hadoop ecosystem and implemented a scheduler for Watershed applications based on the locality information provided by the system. The results obtained confirm the advantages of taking data placement into account when scheduling applications of this kind.
Abstract:	Increased connectivity and bandwidth on the Internet, combined with the reduced cost of electronic equipment in general, have caused an explosion in the volume of data traveling over the network. At the same time, the resources available to store these data have been growing, which led to the appearance of systems developed specifically to process them. An early example is Google's MapReduce model, which was followed by several open source implementations such as Hadoop and by new models such as Spark. A solution was also needed for storing these huge data sets, and distributed file systems such as HDFS and Tachyon emerged. Because the data now form a very large volume and are distributed over multiple machines in a cluster, the problem arises of placing applications close to the data effectively. If this is not done, the cost of moving the data through the system can be very high and can impair the final performance of the application. Depending on its location, data may be accessed directly from the disk of the local machine, from local memory via in-memory caching, or from another machine in the cluster over the network. The various trade-offs in storage capacity, access time, and computational cost make the placement decision nontrivial. This work implements scheduling based on data locality in the Watershed processing environment. For this analysis, Watershed was integrated into the Hadoop ecosystem by creating communication channels with the distributed file systems HDFS and Tachyon. Based on the location information provided by these systems, we implemented a locality-based process scheduler for Watershed applications running on those file systems. Finally, experiments were conducted to compare the various means of accessing files: through the local file system, a distributed file system, or memory. The results show the advantages of taking data placement into account when scheduling such applications.
Subject:	Computing; Big data; Distributed systems
Language:	Portuguese
Publisher:	Universidade Federal de Minas Gerais
Institution acronym:	UFMG
Access type:	Open Access
URI:	http://hdl.handle.net/1843/ESBF-AEDNUF
Document date:	15-Jul-2016
Appears in collections:	Dissertações de Mestrado (Master's Dissertations)

Files associated with this item:

File	Description	Size	Format
brunohott.pdf		1.35 MB	Adobe PDF
