REDUZINDO O IMPACTO DE PERDAS DE CHUNK EM SISTEMAS P2P DE LIVE STREAMING

JOÃO FERREIRA D´ARAÚJO E OLIVEIRA

Tese apresentada ao Programa de Pós-Graduação em Ciência da Computação do Instituto de Ciências Exatas da Universidade Federal de Minas Gerais como requisito parcial para a obtenção do grau de Doutor em Ciência da Computação.

Orientador: Sérgio Vale Aguiar Campos

Belo Horizonte
Dezembro de 2015

LOWERING THE IMPACT OF CHUNK LOSSES IN P2P LIVE STREAMING SYSTEMS

Thesis presented to the Graduate Program in Computer Science of the Federal University of Minas Gerais in partial fulfillment of the requirements for the degree of Doctor in Computer Science.

Advisor: Sérgio Vale Aguiar Campos

Belo Horizonte
December 2015

© 2015, João Ferreira D´Araújo e Oliveira. Todos os direitos reservados.

Oliveira, João Ferreira D´Araújo e
O48L  Lowering the Impact of Chunk Losses in P2P Live Streaming Systems / João Ferreira D´Araújo e Oliveira. — Belo Horizonte, 2015
      xxxi, 103 f. : il. ; 29cm

      Tese (doutorado) — Universidade Federal de Minas Gerais, Departamento de Ciência da Computação
      Orientador: Sérgio Vale Aguiar Campos

      1. Computação - Teses. 2. Sistemas de transmissão de dados - Teses. 3. Arquitetura de redes de computador - Teses. 4. Comutação de pacotes (Transmissão de dados) - Teses. I. Título.

CDU 519.6*21(043)

Aos meus pais, João e Telma.

Resumo

Sistemas de transmissão de mídia contínua ao vivo estão ficando mais populares a cada dia. Esses sistemas atraem um crescente número de usuários e, inclusive, alguns importantes canais de TV já disponibilizam alguns conteúdos ao vivo pela Internet. Sistemas par-a-par de transmissão de vídeo em tempo real podem prover uma solução de baixo custo ao problema de transmitir esse conteúdo para um grande número de espectadores.
No entanto, estes sistemas não são inteiramente confiáveis e não implementam mecanismos de garantia de qualidade, o que pode levar a uma degradação do desempenho do sistema experimentado pelos pares. Essa tese estuda a minimização da perda de pedaços em sistemas P2P de transmissão de vídeo em tempo real, entendendo a razão por trás das perdas e propondo novas formas de reagir a cada uma dessas razões, mantendo os custos baixos. Nós desenvolvemos nosso próprio sistema P2P de transmissão de mídia contínua ao vivo e o usamos para caracterizar perda de pedaços. Nós descobrimos uma forma de minimizar a perda de pedaços em cenários com free riders ao evitar requisitar pedaços de pares não cooperativos. Nós propomos um mecanismo de suporte para recuperação de pedaços "próximos a serem perdidos", que tem alto potencial para reduzir perdas; ele encaminha requisições emergenciais para fontes fora da vizinhança do par. Nós mapeamos esse mecanismo em um sistema CDN-P2P e desenvolvemos um algoritmo para minimizar dinamicamente o número de fluxos de vídeo sendo plantados na rede P2P ao mesmo tempo em que garante uma disseminação eficiente e evita requisições emergenciais. Nós avaliamos o algoritmo sob diversas condições, nas quais ele chegou a conservar 30% da banda de upload previamente usada com requisições emergenciais sem que a qualidade do vídeo fosse afetada.

Palavras-chave: Par-a-Par, Live Streaming, Qualidade de Serviço, Perda de Pedaços.

Abstract

Live streaming systems are becoming increasingly popular. These systems attract a growing number of users, and some important TV channels already broadcast their live content on the Internet. Peer-to-Peer live streaming systems may provide a low cost solution to the problem of transmitting live content to a large number of viewers. However, these systems are not fully reliable and do not implement quality assurance mechanisms, which can lead to a degradation of the system performance experienced by peers.
This thesis studies chunk loss minimization in P2P live streaming systems, understanding the reasons behind losses and proposing new ways to respond to each of them while keeping costs low. We have developed our own fully capable P2P live streaming system and have used it to characterize chunk loss. We found a way to minimize chunk losses in free-riding scenarios by avoiding requesting chunks from uncooperative peers. We proposed a support mechanism to recover "about to be lost" chunks, with high potential to reduce losses; it forwards emergency requests to sources outside of a peer's neighborhood. We mapped this mechanism to a CDN-P2P system and developed an algorithm to dynamically minimize the number of video streams seeded to the P2P overlay while guaranteeing efficient dissemination and avoiding emergency requests. We have evaluated it under diverse conditions, in which it saved up to 30% of the upload bandwidth previously used by emergency requests without reducing stream quality.

Keywords: Peer-To-Peer, Live Streaming, Quality of Service, Chunk Loss.

Resumo Estendido

Sistemas de transmissão de mídia contínua ao vivo estão ficando mais populares a cada dia. Esses sistemas atraem um crescente número de usuários e, inclusive, alguns importantes canais de TV já disponibilizam alguns conteúdos ao vivo pela Internet. Sistemas par-a-par de transmissão de vídeo em tempo real podem prover uma solução de baixo custo ao problema de transmitir esse conteúdo para um grande número de espectadores. À medida que o potencial para esses sistemas se consolida, provedores como PPLive afirmam ter uma base de usuários de centenas de milhões de telespectadores e permitir que dezenas de milhões de pessoas assistam seus canais simultaneamente. Transmissão P2P de mídia contínua ao vivo se baseia em dividir a mídia em pedaços e distribuí-los aos participantes da rede P2P.
Quando um pedaço não é recebido pelo par dentro do seu prazo de exibição, o tocador de mídia tem duas opções: esperar até que o pedaço chegue, atrasando o par em relação aos outros, ou pular a exibição do pedaço. Para o espectador, ambos os cenários são indesejáveis e indicam má qualidade do sistema. Atualmente a maioria dos sistemas par-a-par comerciais não é inteiramente confiável e não implementa mecanismos de garantia de qualidade, o que pode levar a uma degradação do desempenho do sistema experimentado pelos pares. Essa tese estuda a minimização da perda de pedaços em sistemas P2P de transmissão de vídeo em tempo real, entendendo a razão por trás das perdas e propondo novas formas de reagir a cada uma dessas razões, mantendo os custos baixos. A ideia é não utilizar os mecanismos padrão do próprio sistema para realizar a recuperação, mas criar novas estratégias para a regeneração do sistema antes mesmo de o problema ocorrer, tornando a experiência do usuário perfeita e atraindo empresas de radiodifusão a adotar a tecnologia P2P.

Trabalhos Realizados

Nós desenvolvemos o TVPP, um sistema P2P de transmissão de mídia contínua ao vivo. Seu desenvolvimento foi motivado por restrições ao rodar experimentos em sistemas comerciais, que geralmente resultam em dados imprecisos e incompletos. O TVPP pode ser usado para testar uma vasta gama de cenários, alterando parâmetros e algoritmos utilizados pelo sistema. Nós comparamos o TVPP com o SopCast, um sistema comercial bastante conhecido.

Nós utilizamos o TVPP para caracterizar perdas de pedaços. O estudo foi dividido em dois cenários: irrestrito e com restrição de largura de banda. Descobriu-se que mesmo em um cenário irrestrito de largura de banda alguns pares apresentaram perdas significativas. A perda de pedaços no sistema como um todo foi estável ao longo do tempo. Apesar das perdas terem ocorrido, em muitos momentos houve pares que não tiveram nenhuma requisição de pedaços feita a eles.
Na maioria das vezes as requisições foram respondidas na primeira tentativa, pelo primeiro par a anunciar que tinha o pedaço. Pedaços que foram perdidos estavam disponíveis em menos candidatos do que pedaços que foram recebidos. Finalmente, perdas de pedaços não são causadas por uma distribuição ruim de um pedaço específico na rede, nem por grandes sequências de perdas consecutivas. Estes resultados sugerem duas soluções: uma melhoria no método de organização da rede sobreposta e, em última instância, um mecanismo de requisição emergencial para recuperar, fora da vizinhança normal do par, os pedaços que estão próximos de serem perdidos.

As perdas pioram, embora ainda se mantenham sustentáveis, uma vez que começamos a restringir a largura de banda através da introdução de free riders. Observando free riders conscientes, notamos que a perda de pedaços foi qualitativamente similar para um percentual de 50% de free riders compondo a rede, enquanto a latência média foi acrescida de apenas 1 segundo. Entretanto, a mesma fração de free riders não conscientes causa uma degradação significativamente maior, em grande parte por conta do crescimento no número de tentativas de realizar uma requisição. Tais tentativas consecutivas desperdiçam banda e aumentam a latência média por pedaço. Para evitar isso, um algoritmo foi desenvolvido para complementar o escalonador de requisições, identificando pares não cooperativos dentro de poucas interações. Obtivemos assim resultados de latência e perda de pedaços similares para as mesmas quantidades de free riders conscientes e não conscientes.

Finalmente, nós trabalhamos com o Hive, uma solução comercial CDN-P2P para transmissão de mídia contínua ao vivo. No Hive, um par que está prestes a perder o prazo de um pedaço solicita-o emergencialmente ao servidor da CDN.
Observamos que tais requisições garantem qualidade de serviço, mas entregam pedaços perto do seu prazo de exibição, o que deixa pouco tempo para que eles sejam retransmitidos na rede P2P. A solução evita a perda de pedaços, mas é ineficiente em resolver um eventual problema de disponibilidade do dado na rede P2P. Nós desenvolvemos um mecanismo que dinamicamente adequa a quantidade de dados alimentada à rede P2P pela CDN, de forma a garantir uma disseminação de pedaços eficaz e reduzir o número de requisições emergenciais. Nós avaliamos a solução sob diversas condições e encontramos potencial para reduzir o consumo de banda da CDN em 30% mantendo a qualidade da mídia.

Em resumo, nós descobrimos que a maioria das perdas acontece por razões muito específicas para serem pontuadas e tratadas individualmente. Todavia, uma importante razão identificada foi a requisição de pedaços a pares não cooperativos. Nós propomos um método simples para identificá-los e evitar tais requisições. Para as demais razões, nós propomos um serviço onde pedaços que estão prestes a ser perdidos ganham uma última chance de serem recuperados fora da vizinhança do par. Essa solução tem potencial para garantir uma entrega de pedaços próxima a 100%, dependendo da forma como for implementada. Apesar disso, tal método pode incorrer em custos adicionais de infraestrutura. Embora isso pareça inevitável, propusemos um método para minimizar esses custos. Finalmente, com este trabalho obtivemos uma melhor compreensão de por que os pedaços são perdidos na transmissão P2P de mídia contínua ao vivo e de como minimizar essas perdas. Um sistema de transmissão usando estas técnicas será significativamente mais eficiente, transmitindo fluxos sem aumentar custos na medida do possível.

Extended Abstract

Live streaming systems are becoming more popular each day. These systems attract a growing number of users, and some important TV channels already broadcast their live content on the Internet.
P2P live streaming systems may provide a low cost solution to the problem of transmitting live content to a large number of viewers. As the potential for P2P live streaming consolidates itself, providers such as PPLive claim to have a user base of hundreds of millions of viewers and to allow tens of millions to watch their channels simultaneously.

P2P live streaming relies on splitting the stream into chunks and spreading them to the overlay participants. Whenever a chunk misses its playback deadline, either the player will have to wait until it arrives, delaying the peer in relation to others, or it will skip it. For the viewer, both scenarios are undesirable and annoying, and indicate poor system quality. Yet today most commercial P2P systems are unreliable, as quality assurance mechanisms are still at a developing stage, or even completely ignored, and several scenarios can cause degradation of the system performance experienced by peers.

This thesis studies chunk loss minimization in P2P live streaming systems, understanding the reasons behind losses and proposing new ways to respond to each of them while keeping costs low. The principle is not to rely on the recovery mechanisms systems already provide, but rather to create new strategies so systems can preemptively identify problems and regenerate, making the user experience flawless and attracting broadcasters to embrace P2P technology.

Work Accomplished

We have developed TVPP, a P2P live streaming system. Its development has been motivated by the difficulty of running experiments with commercial systems and tools, which often yield incomplete or imprecise data. TVPP can be used to test scenarios under a wide set of parameters and algorithms. We have compared TVPP with SopCast, a well-known commercial system.

We have used TVPP to characterize chunk losses. The study has been performed in two scenarios: resourceful and bandwidth-constrained.
It has been found that even in the resourceful scenario a few peers exhibit significant losses. System-wide average chunk loss has been stable over time. While losses do occur, there are many moments where peers had no requests made to them. Mostly, requests receive replies at the first attempt, from the first peer to announce that it has the chunk. Chunks that have been lost had fewer candidates to request from than chunks that have been received. Finally, chunk losses are not driven by a bad spread of a specific chunk or by long loss sequences. These results suggest two solutions: an improved overlay organization method and, ultimately, an emergency request mechanism to request "about to be lost" chunks outside of a peer's partner set.

Losses get worse, yet are still sustainable, once we start to reduce bandwidth through the introduction of free riders. Observing conscious free riders, we have noted that chunk losses have been qualitatively similar with less than 50% free riders, while the average latency increased by 1 s at most. However, the same fraction of oblivious free riders causes significantly worse degradation, mostly triggered by an increase in retries. Consecutive retries waste bandwidth and increase average chunk latency. To avoid this, an algorithm has been developed to complement the request scheduler, identifying uncooperative peers within a few interactions. Using it we have obtained similar chunk loss and latency results for the same amounts of conscious and oblivious free riders.

Finally, we have worked with Hive, a CDN-P2P commercial live streaming solution. On Hive, a peer that is about to miss a chunk playback deadline issues an emergency request to the CDN. We have found that emergency requests guarantee quality of service, but delivering a chunk too close to its playback deadline leaves little to no time for redistribution in the P2P overlay. Such requests avoid chunk losses but are inefficient at solving P2P data availability problems.
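The scheduler complement described above, which identifies uncooperative peers within a few interactions, could be sketched as a simple per-partner counter of unanswered requests. This is an illustrative sketch only, not TVPP's actual algorithm; the class name and the threshold value are hypothetical:

```python
class UncooperativePeerFilter:
    """Illustrative sketch: skip partners whose chunk requests go
    unanswered. Tracks, per partner, how many requests are pending
    without a reply; a partner exceeding the (hypothetical) threshold
    is excluded from the scheduler's candidate set."""

    def __init__(self, max_unanswered=3):
        self.max_unanswered = max_unanswered
        self.unanswered = {}  # partner id -> unanswered requests since last reply

    def record_request(self, peer):
        # called when the scheduler sends a chunk request to `peer`
        self.unanswered[peer] = self.unanswered.get(peer, 0) + 1

    def record_reply(self, peer):
        # any reply resets the counter: the partner is cooperating
        self.unanswered[peer] = 0

    def is_cooperative(self, peer):
        return self.unanswered.get(peer, 0) < self.max_unanswered

    def candidates(self, peers):
        # restrict the candidate set to partners believed cooperative
        return [p for p in peers if self.is_cooperative(p)]


f = UncooperativePeerFilter()
for _ in range(3):           # three requests to "p1" go unanswered
    f.record_request("p1")
f.record_request("p2")
f.record_reply("p2")         # "p2" answers
print(f.candidates(["p1", "p2"]))  # → ['p2']
```

The design intent matches the text: within a few interactions a silent partner stops receiving requests, avoiding the consecutive retries that waste bandwidth and increase chunk latency.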
We have developed a mechanism that dynamically minimizes the number of video streams seeded to the P2P overlay while guaranteeing efficient dissemination and avoiding emergency requests. We have evaluated it under diverse conditions, in which it saved up to 30% of the upload bandwidth previously used by emergency requests without reducing stream quality.

In summary, we have found that most losses happen for reasons too specific to be individually identified and treated. One important reason that has been identified, however, was requesting chunks from uncooperative peers. We have proposed a simple method to track this behavior and avoid it. For other reasons, we have also proposed a service where chunks that would be lost get a last chance to be retrieved outside of the peer's neighborhood. This solution has the potential to guarantee close to 100% delivery, depending on how it is implemented. Nevertheless, such a method to neutralize losses might incur additional infrastructure costs. While this may be unavoidable, we have proposed a method to minimize these costs. Finally, in this work we have obtained a better understanding of why chunks are lost in the transmission of live P2P streams, and of how to minimize these losses. A transmission system using these techniques will be significantly more efficient, transmitting streams with as little added cost as possible.

List of Figures

2.1 PDF of packet size for each system.
2.2 CDF of download and upload rates for each peer, for each second, for each system. R is ≈400 Kbps.
2.3 Network size in each snapshot.
2.4 CDF of the degree of each peer in all snapshots. TVPP was limited to 50 partners, but even if a peer has reached that limit, it can still receive partnership requests.
2.5 CDF of shortest path lengths from each peer to the server in all snapshots.
3.1 Cumulative distribution of chunk loss reported for each peer in a set of runs.
3.2 Scatter plot of requests received versus requests not responded to over time, normalized by stream rate.
3.3 Distribution of the number of candidates that could deliver a chunk in each attempt, for chunks that have been received.
3.4 Distribution of the number of candidates that could deliver a chunk in each attempt, for chunks that have been missed.
3.5 Distribution of the number of candidates that could deliver a chunk in each attempt, for chunks that have been missed and only requested from the same candidate at all attempts.
3.6 Frequency distribution of bursts of missed chunks, weighted by their size.
3.7 Cumulative distribution of chunks that are not received by part of the network.
3.8 Chunk playback deadline miss rate (chunk loss) and latency for conscious (3.8(a) and 3.8(b)) and oblivious (3.8(c) and 3.8(d)) free riders. Curves are scenarios with the indicated free rider ratio.
3.9 Chunk loss and latency comparison over the results of conscious, oblivious and SURE scenarios using a 50% free rider ratio.
3.10 Load distribution for several free rider ratios (free riders, C = 0; uncooperative, 0 < C ≤ 1; cooperative, 1 < C ≤ 5; altruistic, C > 5).
4.1 CDN-P2P chunk flow overview.
4.2 Peer upload bandwidth distributions.
4.3 Savings for different upload bandwidth distributions.
4.4 Partnership establishment attempts over time for different upload bandwidth distributions.
4.5 Distribution of peer upload bandwidth utilization.
4.6 Comparison of overall savings for the BASE scenario with and without AERO.
4.7 Comparison of overall savings for the 50F scenario with and without AERO.
4.8 Comparison of overall savings for the 75F scenario with and without AERO.
4.9 Comparison of overall savings for the DIV4 scenario with and without AERO. |OS|/|P| removed from key for clarity.
4.10 Savings for br and se peer upload bandwidth distributions.
4.11 Savings of different overlay construction configurations, grouped by peer upload bandwidth distribution.
4.12 Savings under peer churn.
4.13 Savings over time for periodic flash crowd events.
4.14 Savings for the div4 scenario with AERO over longer experiments.
A.1 Chunk loss and latency comparison over the results of conscious, oblivious and SURE scenarios using a 0% free rider ratio.
A.2 Chunk loss and latency comparison over the results of conscious, oblivious and SURE scenarios using a 10% free rider ratio.
A.3 Chunk loss and latency comparison over the results of conscious, oblivious and SURE scenarios using a 30% free rider ratio.
A.4 Chunk loss and latency comparison over the results of conscious, oblivious and SURE scenarios using a 50% free rider ratio.
A.5 Chunk loss and latency comparison over the results of conscious, oblivious and SURE scenarios using a 70% free rider ratio.
B.1 Peer upload bandwidth distributions.
B.2 (BR) Comparison of overall savings for the BASE scenario with and without AERO.
B.3 (BR) Comparison of overall savings for the 50F scenario with and without AERO.
B.4 (BR) Comparison of overall savings for the 75F scenario with and without AERO.
B.5 (BR) Comparison of overall savings for the DIV4 scenario with and without AERO. |OS|/|P| removed from key for clarity.
B.6 (BR) Savings of different overlay construction configurations, grouped by peer upload bandwidth distribution.
B.7 (BR) Savings under peer churn.
B.8 (BR) Savings over time for periodic flash crowd events.
B.9 (BR) Savings for the div4 scenario with AERO over longer experiments.
B.10 (US) Comparison of overall savings for the BASE scenario with and without AERO.
B.11 (US) Comparison of overall savings for the 50F scenario with and without AERO.
B.12 (US) Comparison of overall savings for the 75F scenario with and without AERO.
B.13 (US) Comparison of overall savings for the DIV4 scenario with and without AERO. |OS|/|P| removed from key for clarity.
B.14 (US) Savings of different overlay construction configurations, grouped by peer upload bandwidth distribution.
B.15 (US) Savings under peer churn.
B.16 (US) Savings over time for periodic flash crowd events.
B.17 (US) Savings for the div4 scenario with AERO over longer experiments.
B.18 (SE) Comparison of overall savings for the BASE scenario with and without AERO.
B.19 (SE) Comparison of overall savings for the 50F scenario with and without AERO.
B.20 (SE) Comparison of overall savings for the 75F scenario with and without AERO.
B.21 (SE) Comparison of overall savings for the DIV4 scenario with and without AERO. |OS|/|P| removed from key for clarity.
B.22 (SE) Savings of different overlay construction configurations, grouped by peer upload bandwidth distribution.
B.23 (SE) Savings under peer churn.
B.24 (SE) Savings over time for periodic flash crowd events.
B.25 (SE) Savings for the div4 scenario with AERO over longer experiments.

List of Tables

2.1 Definitions and notation.
2.2 Traffic-related metrics.
2.3 Network-related metrics.
3.1 Distribution of the number of attempts taken for a received chunk to be received.
3.2 Distribution of unique candidates used per request.
4.1 Definitions and notation.
4.2 Origin of video chunks received.
4.3 Average number of hops traversed by a video chunk that enters the P2P overlay, clustered by origin.
4.4 Definitions and notation used by AERO.
4.5 Origin of received video chunks when using AERO (data copied from Table 4.2 to aid comparison).
4.6 Savings for different overlay construction limits.
4.7 Savings for different neighbor selection policies.
4.8 Savings for different overlay sizes.
Contents

Resumo
Abstract
Resumo Estendido
Extended Abstract
List of Figures
List of Tables

1 Introduction
  1.1 Problem
  1.2 Objectives and Contributions

2 Peer-To-Peer Live Streaming: The TVPP Design
  2.1 Introduction
  2.2 Overlay Construction and Maintenance
    2.2.1 Tree-based topology
    2.2.2 Mesh-based topology
    2.2.3 Hybrid topology
    2.2.4 Peer Sampling
    2.2.5 TVPP implementation
  2.3 Chunk Scheduling
  2.4 Emergency Requesting
  2.5 Logging
  2.6 New Modules/Algorithms
  2.7 Parameters
  2.8 Comparison with SopCast
    2.8.1 Experimental Setup
    2.8.2 Traffic Analysis
    2.8.3 Network Analysis
  2.9 Conclusions

3 Chunk Loss Characterization
  3.1 Introduction
  3.2 Parameters and Scenarios
    3.2.1 Default Setup
    3.2.2 Free riding
    3.2.3 PlanetLab
  3.3 Analysis of Resourceful Scenario
    3.3.1 System Behavior Analysis
    3.3.2 Peer Behavior Analysis
    3.3.3 Chunk Behavior Analysis
  3.4 Analysis of Bandwidth Constraint Scenario
    3.4.1 Conscious Free Riders
    3.4.2 Oblivious Free Riders
    3.4.3 Simple Unanswered Request Eliminator
    3.4.4 Workload Distribution Induced by Uncooperative Peers

4 AERO: Adaptive Emergency Request Optimization
  4.1 Introduction
  4.2 CDN-P2P Live Streaming
  4.3 Simulation Setup
  4.4 P2P Distribution Efficiency
  4.5 Current Efficiency Improvements
  4.6 Adaptive Emergency Request Optimization
  4.7 Evaluation
    4.7.1 Peer upload bandwidth
    4.7.2 Overlay properties and overlay size
    4.7.3 Peer churn and flash crowds
    4.7.4 Local minima
    4.7.5 Summary
  4.8 AERO at TVPP's Emergency Request Service
  4.9 Conclusion

5 Final Remarks

Bibliography

Appendix A SURE Comparison Figures

Appendix B AERO Evaluation Figures
  B.1 Brazilian Bandwidth Distribution
  B.2 North American Bandwidth Distribution
  B.3 Swedish Bandwidth Distribution

Chapter 1

Introduction

In the last decade the Internet has grown significantly through the expansion of audiovisual content sharing. We have seen the Internet achieve its success evolving in size and applications through Web 1.0 (text and static pictures), instant messaging, file sharing and the mp3 musical revolution, real time voice transmission (as if we were using a phone), Web 2.0 and video on demand. Services such as Napster and YouTube were iconic examples that boosted Internet usage and popularity. Nowadays, video transmission is the Internet flagship; its traffic is still growing as video services offer new resolutions, such as 4K UHD [Amazon, 2014]. Globally, consumer Internet video traffic will be 80 percent of all consumer Internet traffic in 2019, up from 64 percent in 2014 [Cisco, 2015]. Yet, the next imaginable evolution, video transmission in real time (or live streaming), has not yet emerged as a great success. There is no iconic live streaming application that stands out worldwide.

Audiovisual content transmission in real time over the Internet has yet to become popular. Television broadcast is the technology most similar to live streaming, and what can be extracted from the television scenario is that 1) ordinary viewers do not have a habit of sharing real time videos or producing their own shows, and 2) broadcasters are not interested in abandoning their current reliable and scalable Ultra High Frequency (UHF) transmission structure.
There is a business opportunity, however, given that the Internet faces no geographical boundaries. During the FIFA 2014 World Cup in Brazil, the Globo TV network live streamed all games, achieving impressive numbers. During the stream of a single game, Globo reached five hundred thousand simultaneous viewers. The accumulated duration of all streams during the event was approximately 1600 years in total [Moreira, 2015].

Moreover, individual producers of specific content have a way to reach interested viewers without having to fit into a predetermined program schedule like on TV. Services such as TwitchTV [Twitch TV, 2015] and Hitbox [Hitbox, 2015] specialize in broadcasting live content related exclusively to video games. Some championships reach hundreds of thousands of simultaneous viewers. Another highlight is Periscope, a live streaming application which was released in late March, 2015. The developing startup was acquired for US$100 million by Twitter even before reaching the public. One million users joined Periscope in its first 10 days on the iPhone app store. Yet, the number of viewers watching each stream is still relatively low; CNN Royal Correspondent Max Foster, who covered the birth of the Duchess of Cambridge's second child live through the application, reported a maximum of one and a half thousand viewers during his streams [CNN, 2015].

The traditional model used for streaming content on the Internet is the client-server model. A server opens one connection and performs transmission for each user interested in the content. With live streaming, the server continuously transmits the video to each user at the same rate it is produced. In scenarios where many users are interested in dense content, such as high quality video, the client-server model stands out negatively for its low scalability. Content servers may eventually become bottlenecks that affect response time.
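The scalability limit of the client-server model is easy to quantify: server upload bandwidth grows linearly with the audience. A back-of-the-envelope check (the 400 kbps stream rate matches the rate used in the TVPP experiments; the viewer counts are illustrative):

```python
# Client-server live streaming: server upload = viewers * stream rate.
# 1 Gbps = 1,000,000 kbps.
stream_rate_kbps = 400  # stream rate from the TVPP experiments (R ~ 400 Kbps)

for viewers in (1_000, 100_000, 500_000):
    total_gbps = viewers * stream_rate_kbps / 1_000_000
    print(f"{viewers:>7} viewers -> {total_gbps:g} Gbps of server upload")
```

At Globo's reported peak of five hundred thousand simultaneous viewers, a single 400 kbps stream already demands 200 Gbps of server-side upload, which is the scale of investment that motivates CDN and P2P alternatives.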
This model is reliable and with some planning can be structured to comply with quality of service (QoS) requirements. One can replicate and distribute content across multiple well-located servers over the Internet and ensure increased availability and performance. That solution is called a content delivery network [Dilley et al., 2002], or CDN; however, its structure demands costly investments. Alternatively, if each user interested in the content would initially consume it – as a client –, store it and then supply it – as a server –, it would be possible to distribute the total load among these "partners". From that idea arises the term that defines this type of transmission: peer-to-peer (P2P), which became popular from 1999 with systems such as Napster [Shirky, 2001], Gnutella [Ripeanu, 2001], Emule [Kulbak and Bickson, 2005], BitTorrent [Cohen, 2003] and Skype [Guha et al., 2006]. In particular, this work studies P2P live streaming systems.
P2P live streaming has gained some notoriety over the last decade for its growth and numbers. Several systems hold millions of users connected daily [Hei et al., 2007a; Wu et al., 2007; Huang et al., 2008; Sentinelli et al., 2007]. This indicates that users are shifting their desire for live broadcast content to the Internet, a natural technological exchange. Some see more aggregated value in this service, such as greater choice of content, potential for interactivity and mobility at lower prices [CNET, 2010], and others belong to a new class of users who see the possibility of creating and producing content by themselves with low-cost distribution [Sentinelli et al., 2007; Silva, 2009]. P2P live streaming fits these demands.
System scalability as the number of users grows, load distribution and transfer rates achievable through P2P technology are fundamental to ease the burden and infrastructure costs related to dedicated servers. Yet, there are open problems in the P2P live streaming scenario.
P2P brings deficiencies to live streaming that, if not treated properly, may hinder the service's popularity. Some of the main challenges are related to management and performance, turnover of partners (churn) and lack of guarantees on quality of service. Refinement of these latent issues is essential to improve P2P live streaming. Thus, the client-server model is still preferred because 1) P2P live streaming systems lack maturity, and 2) client-server provides more guarantees, as servers can be pressured to offer an acceptable quality if the investment is high enough.
Commercial success of any live streaming service depends on the quality provided to end users. Nowadays, P2P live streaming viewers watch streams with acceptable quality. However, this is not true at all times, nor for all users. Most commercial systems are unreliable, as quality assurance mechanisms are still in a developing stage, or even completely ignored, and several scenarios can cause degradation of the system performance experienced by users [Moltchanov, 2011].
Yet, quality is a broad term. There is a special class of videos, such as soccer matches, where stream latency is very important. Those videos cannot afford to be delayed by dozens of seconds or more. But for most content, continuity and perfect stream decoding are more important, and those are achieved by controlling losses of chunks of streamed content. Whenever a chunk misses its playback deadline, either the player will have to wait until it arrives – delaying the peer in relation to others or eventually freezing exhibition – or it will skip the chunk – exhibiting badly decoded frames, depending on system design. For the viewer, watching a show that keeps stuttering or shows broken frames is undesirable, annoying and indicates the system's bad quality.
Chunk loss may depend on a number of system design peculiarities including network topology, scheduling, incentive mechanism, etc. [Moltchanov, 2011].
Such systems are also subject to scenarios that increase chunk loss, such as peers that do not share (free riders), local neighborhood bandwidth constraints or the unavailability of a chunk in the neighborhood within its deadline. Moreover, chunk loss scales with the relation between network bandwidth distribution and video resolution. Although several studies try to improve or solve these problems by working on chunk scheduling policies [Zhao et al., 2009; Birke et al., 2011] or peer selection mechanisms [Traverso et al., 2014; Simoni et al., 2014; Roverso et al., 2013], few studies focus on ensuring chunk delivery, especially when facing particularly bad scenarios. Most current systems are not well prepared to act in the face of chunk loss.

1.1 Problem

Acting upon chunk losses is essential. A single loss may disrupt the stream decoding for seconds. The momentary displayed error caused by a loss depends on the decoder's error concealment capabilities. High quality motion codecs such as H.264 [ITU-T, 2015] are designed to take advantage of redundancies between video frames. If a loss affects a key frame (I-frame), the effect will propagate through every frame until the next key frame arrives. Besides, loss of consecutive chunks may generate a signal interruption for several seconds. Stream disruption and chunk loss are considered direct and measurable impacts. Indirectly, these failures lead to 1) decreased confidence of users in the system, 2) the eventual abandonment of these users and 3) a system collapse.
In this work we study the problem of chunk loss and methods to avoid or eliminate it. We give extra focus to bandwidth-constrained scenarios where chunk loss problems become massive.
Finally, although some methods are costless, final methods to neutralize losses incur increased infrastructure costs and, since P2P architectures are attractive for their low costs, it is also important to track and minimize the tradeoff of ensuring chunk delivery.

1.2 Objectives and Contributions

The main goal of this thesis is to study chunk losses in P2P networks used for live streaming and propose new methods to avoid or eliminate them. We employ harsh bandwidth conditions not only to create these methods but also to improve them. Below we briefly list our contributions and further detail them in the following paragraphs.

• Development of a P2P live streaming system with academic purposes and design similar to commercial systems;
• Characterization of chunk loss on a generic P2P mesh-pull live streaming network, extracting failure-related information – which peers, for what reasons, how much impact – with and without bandwidth constraints;
• Analysis and development of a non-punitive algorithm to reduce the impact of free riding – as a bandwidth constraint;
• Proposal of a solution to assist and complement traditional P2P transmission mechanisms aimed at ensuring chunk delivery to all peers while obeying time restriction requirements;
• Analysis and development of an algorithm to reduce the costs associated with ensuring chunk delivery.

In practice, no live streaming system has 100% delivery guarantees, especially P2P ones. There are a number of reasons behind chunk losses in P2P live streaming systems. Chunk availability bottlenecks may occur, external factors may reduce a peer's upload capacity, churn makes everything less reliable, among other problems. Ultimately, the underlying Internet architecture is best-effort. Underlay connection problems may arise or network cables can accidentally be disconnected, which makes delivery infeasible.
Nevertheless, there are chunks which could have been delivered but are lost because of system design inefficiency. There are dozens of studies that deal with chunk scheduling mechanisms, neighborhood management, or server placement aimed at optimizing and solving these problems [Birke et al., 2011; Traverso et al., 2014; Moltchanov, 2011; Simoni et al., 2014; Roverso et al., 2013; Zhao et al., 2009]. However, even these optimizations can fail. These algorithmic optimizations tend to improve overall quality but do not treat specific client problems. In P2P systems each peer strives to perform its part, but that does not necessarily generate a collective guarantee. How can a system step further and increase the quality assurance for individual peers beyond those solutions? We address this feasible delivery, aiming to eliminate chunk loss.
We start this work by describing TVPP [Oliveira et al., 2013b] (Chapter 2). TVPP is an academic P2P live streaming system designed to provide a service similar to popular proprietary commercial systems but with a few advantages for researchers. TVPP's local development has been motivated by the need for a P2P live streaming system in which one could run many sorts of experiments, with freedom to modify the source code, and easier access to experimental result logs. The tool mimics the behavior of commercial systems such as SopCast. Through the description of TVPP we introduce the major mechanics and theory behind peer-to-peer live streaming down to their details. Moreover, Section 2.4 presents an emergency requesting mechanism proposal for "pure" P2P live streaming systems which has the potential to drastically reduce chunk losses. Peers monitor their own missing chunks and contact the Emergency Request Service once they understand that a specific chunk has only one more opportunity to be requested before it misses its playback deadline.
The Emergency Request Service forwards the received requests to handlers, which are peers selected from within the overlay. Handlers respond to the original emergency request sources with the missing chunk.
After that, we characterize chunk loss on TVPP under different bandwidth constraints (Chapter 3). In particular, it is important to understand the characteristics of lost chunks and of the peers that lose them. We have identified a set of reasons behind chunk losses, the relative likelihood of each, and what solutions can be applied. We start from a generic and resourceful environment, and we add constraints, modifying the parameters as follow-up questions arise. We decrease the amount of available bandwidth through the rise of free riders – peers that refuse to upload data. It can be observed that system-wide chunk loss highly depends on network conditions and system design. Finally, we discuss solutions to avoid chunk losses and propose the Simple Unanswered Request Eliminator – SURE (Section 3.4.3). It is a modification to the chunk scheduling in which a simple and local evaluation identifies unresponsive neighbors and avoids sending requests to them.
At last, we study an alternative to reduce chunk losses that uses a CDN architecture. We evaluate chunk loss elimination at Hive Streaming [Hive, 2015] (Chapter 4). Hive Streaming is a commercial solution for media distribution based on SmoothCache [Roverso et al., 2015], a CDN peer-assisted live streaming system. Started in 2007 as a spin-off from the Swedish Institute of Computer Science and the Royal Institute of Technology in Stockholm, the company maintains a strong focus on research and development. We were given access to the SmoothCache code and have studied chunk loss in several scenarios.
The benefit of experimenting with this tool is that SmoothCache is a commercial application with several implemented optimizations, including a concept that ensures chunk delivery – typical of CDN peer-assisted systems. The concept states that all data is acquirable from the CDN; those who are interested in the content form a support P2P overlay, which should be queried first for data. Thus, SmoothCache peers know the CDN servers' addresses and, ultimately, will request chunks from the CDN if a chunk cannot be found or a request fails to be answered by its partners. We mapped this behavior as an equivalent of an emergency request. Therefore, we stop evaluating chunk loss as an overlay distribution efficiency metric and instead analyze how much data has been received through the source, emergency requests, or peers.
We have analyzed the emergency requesting tradeoff, i.e., the costs or amount of effort expended to acquire the missing chunks. Although emergency requests allow the retrieval of nearly missed chunks, they deliver a chunk close to its deadline, which leaves no time for dissemination through the P2P overlay. This may create a negative feedback loop, where emergency requests compromise P2P dissemination, lead to more emergency requests, and further compromise P2P dissemination. We have developed the Adaptive Emergency Request Optimization – AERO (Section 4.6), a mechanism that dynamically optimizes the number of video streams that servers should seed to the P2P overlay – adapting servers' out-degree – while guaranteeing efficient dissemination and avoiding emergency requests. This is especially important on SmoothCache because CDN servers respond as both stream source and emergency request handler, and AERO manages to reduce servers' load by seeding a bit more in order to reduce the large number of emergency requests. We evaluate AERO under diverse overlay conditions and show that it maintains or reduces servers' upload traffic.
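To make the adaptation idea concrete, the sketch below shows a simplified control loop in the spirit of AERO: seed more streams when emergency requests are frequent, seed fewer when dissemination is healthy. This is an illustration only; all names, thresholds and the single-step adjustment are assumptions, not the actual AERO algorithm of Section 4.6.

```python
# Hypothetical sketch of an adaptive out-degree controller inspired by AERO.
# Thresholds and step sizes are illustrative, not taken from the thesis.

class AeroController:
    def __init__(self, min_degree=1, max_degree=50,
                 high_emergency=0.05, low_emergency=0.01):
        self.out_degree = min_degree      # streams currently seeded by the server
        self.min_degree = min_degree
        self.max_degree = max_degree
        self.high = high_emergency        # emergency fraction that triggers more seeding
        self.low = low_emergency          # fraction low enough to try seeding less

    def adjust(self, emergency_chunks, total_chunks):
        """Called once per evaluation period with that period's counters."""
        rate = emergency_chunks / total_chunks if total_chunks else 0.0
        if rate > self.high and self.out_degree < self.max_degree:
            self.out_degree += 1          # seeding more is cheaper than emergencies
        elif rate < self.low and self.out_degree > self.min_degree:
            self.out_degree -= 1          # dissemination is healthy: save upload
        return self.out_degree
```

The key design point this sketch captures is the tradeoff the chapter discusses: a small increase in proactive seeding can displace a much larger volume of last-moment emergency traffic.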
Although developed using SmoothCache, which is a CDN peer-assisted system, AERO can be applied to "pure" P2P overlays, reducing traffic on emergency request handlers by adjusting the handlers' or the source's out-degree.
In summary, we have found that most losses happen for reasons too specific to be singled out and treated. One important reason that has been identified, however, was requesting chunks from uncooperative peers. We have proposed a simple method to track this behavior and avoid it. For the other reasons, we have proposed a service in which chunks that would otherwise be lost have a last chance to be retrieved from outside the peer's neighborhood. This solution has the potential to guarantee close to 100% delivery depending on how it is implemented. Nevertheless, such a method to neutralize losses might incur additional infrastructure costs, as an external entity will have to handle that chunk delivery. While this may be unavoidable, we have proposed a method to minimize these costs.
Finally, in this work we have obtained a better understanding of why chunks are lost in the transmission of live P2P streams, and of how to minimize these losses. A transmission system using these techniques will transmit streams significantly more efficiently while keeping cost increases as low as possible.
The remainder of this thesis is organized as follows. Chapter 2 explains peer-to-peer live streaming systems through TVPP. Chapter 3 addresses the problem of chunk loss and the characterization performed. Chapter 4 discusses the reduction of costs caused by chunk loss recovery through emergency requesting. Chapter 5 summarizes the results and describes possible future work.

Chapter 2
Peer-To-Peer Live Streaming: The TVPP Design

In this chapter we present TVPP [Oliveira et al., 2013b], an academic P2P live streaming system designed to be similar to popular proprietary commercial systems, such as SopCast, but with a few advantages for researchers.
Through the description of TVPP we introduce the major mechanics and theory behind peer-to-peer live streaming down to their details. We also provide an experimental comparison between TVPP and SopCast. TVPP has been a key tool for the development of this thesis as an experimental platform.

2.1 Introduction

TVPP is a research-oriented P2P live streaming system. We have built TVPP from the need to simplify, control and get more detailed data from experiments. It has features that are useful for researchers, such as:

• easy data acquisition: no need for additional network traffic analyzers
• configurability: several key parameters to experiment with
• modularity: simply plug in and test new algorithms
• collection of arbitrary data: one can change the code to output whatever is needed

The most popular live systems, such as TvAnts (tvants.en.softronic.com), UUSee (www.uusee.com), SopCast (www.sopcast.org), PPLive (www.synacat.com) and PPStream (www.ppstream.com), are commercial applications with no publicly available source code, which makes it hard for researchers to gather useful data or log files. Most studies targeting these systems deal with a black box and must rely on educated guesses with regard to system architecture, protocols and internals. Crawling the network, analyzing traffic, and recreating partial network graphs are examples of methods frequently used to infer system structure and behavior [Vieira et al., 2009; Horvath et al., 2008; Silverston et al., 2009; Ali et al., 2006; Tang et al., 2009]. Moreover, in such a scenario where systems do not facilitate third party testing or reverse engineering, control of key protocol parameters, such as partnership or bandwidth limits, media buffer size, chunk scheduling and partner maintenance strategies, which would allow deeper experimentation and analysis, is absent.
The TVPP design uses mechanisms similar to those of the most popular platforms [Sentinelli et al., 2007]. It is a mesh-pull P2P live streaming system. In order to better understand this kind of system, some key mechanisms are particularly interesting to explore in further detail, such as overlay maintenance, chunk scheduling, emergency requesting, and logging (Sections 2.2 to 2.5). Another important feature is that TVPP has been designed to be expandable, as described in Section 2.6, since it aims to experiment with a wide range of questions, and therefore must be able to include additional algorithms, mechanisms, monitoring metrics or experimental parameters. Section 2.7 explores the full list of configurable parameters present in TVPP. Section 2.8 describes an experimental comparison of traffic and network-related metrics between TVPP and SopCast.

2.2 Overlay Construction and Maintenance

In mesh-pull applications, peers are organized in a mesh-like network, without hierarchical roles. Peers have partners from which they request data or to which they send data. This design flexibility has the advantage of being scalable and fault resilient [Fodor and Dan, 2007; Hei et al., 2008] when compared to tree-based approaches. Besides topology concerns, peers need to know which other peers exist in the overlay, a feature known as peer sampling. We describe tree-based, mesh-based and hybrid topologies, peer sampling methods, and then the TVPP design in detail.

2.2.1 Tree-based topology

In tree-based topologies [Deshpande et al., 2002; Tran et al., 2004; Castro et al., 2003; Kostić et al., 2003], peers form a hierarchical structure similar to a tree or a multi-tree graph. Content is transmitted from the root – the data source – to its children, and so on until it reaches the leaves.
The benefits are that, since data is being pushed down the tree during chunk exchange, traffic flows are predictable and delay is proportional to the number of traversed hops from the source. The limitations are that the quality delivered to each branch is limited by the upload capacity of its individual root, the distribution potential of many leaves might be wasted [Picconi and Massoulié, 2008; Magharei and Rejaie, 2006], and attacks or even churn can disrupt flow availability, partition the overlay, and overload the network with overlay repair requests. Some examples of tree-based systems are Overcast [Jannotti et al., 2000], Climber [Park et al., 2008], ZigZag [Tran et al., 2003], NICE [Banerjee et al., 2002], SplitStream [Castro et al., 2003] and Orchard [Mol et al., 2007].

2.2.2 Mesh-based topology

The mesh-based topology tries to overcome the tree topology's constraints [Hefeeda et al., 2003; Zhang et al., 2005; Magharei and Rejaie, 2006]. It incorporates swarm techniques for content distribution, inspired by mechanisms such as BitTorrent [Cohen, 2003], allowing most peers to actively contribute their upload bandwidth. The mesh network's basic principle is that participants form an overlay through random connections, generating non-deterministic topologies. The lack of hierarchy and of tight structure allows a peer to receive chunks from several sources, which makes media distribution unpredictable.
The benefits are that overlay construction and maintenance are less complex and expensive, and greater connectivity reduces the likelihood of performance and bandwidth bottlenecks while increasing churn resilience [Moltchanov, 2011].
As for drawbacks, the control traffic generated is usually significant, given that peers need to frequently exchange status information to decide which are the best peers to download chunks from, and each data chunk is treated as a separate delivery unit that might be received through a different route, i.e., paths and delivery times are not predictable and are highly variable [Picconi and Massoulié, 2008]. Yet, Magharei et al. [2007] show that a mesh-based structure has superior performance to a structure based on multiple trees. Some examples of mesh-based systems are SopCast, DONet/Coolstreaming [Zhang et al., 2005], Chainsaw [Pai et al., 2005], BiToS [Vlavianos et al., 2006] and PULSE [Pianese et al., 2007].

2.2.3 Hybrid topology

Tree-based and mesh-based approaches can be combined to obtain the benefits of each and to better utilize bandwidth. Hybrid topologies construct trees with peers logically closer to the source, and meshes connecting these with the rest of the network. They get the tree's predictability at the early hops while keeping the mesh's churn resilience for the rest of the overlay. CliqueStream [Asaduzzaman et al., 2008] creates clusters of peers using delay and locality metrics. From each cluster, one or more peers are elected to form a tree connecting the clusters to each other and to the source. mTreebone [Wang et al., 2010] elects a subset of stable peers to form its tree, and all peers request chunks from the tree. Upon failure to receive data from the tree, peers use auxiliary mesh connections. The limitation of this approach is that the few stable peers might get congested while others do not contribute their upload bandwidth.

2.2.4 Peer Sampling

Peer sampling is the act of supplying peers with a restricted sample of overlay members which they may connect to, and it can be done in a centralized or distributed fashion.
The centralized method relies on a service which can be contacted for peer samples, while the distributed method relies on gossiping [Voulgaris et al., 2005; Dowling and Payberah, 2012; Roverso et al., 2013], which draws peer samples from within the peer-to-peer overlay. Heuristics can be used while selecting a sample to improve its relative quality, e.g., sampling peers geographically closer to the requester.
The centralized method's mechanics are straightforward: peers request samples directly from a centralized service and receive a subset of all overlay members. Benefits include a view over all members at all times, the view's freshness, and immediate convergence of sample selection heuristics. However, as with any centralized method, it is a single point of failure and could become a bottleneck.
The distributed approach to peer sampling is gossiping. Peers keep a fixed-size restricted view of the overlay, periodically select a target node from that view, and exchange a number of items from the view with the target [Jelasity et al., 2007]. The quality of such protocols is associated with the heuristics applied for node selection and for node replacement in the target's view, and with whether peers just push views or trade them (enabling partial view swapping). Benefits include solving the centralized service's issues, but at the expense of freshness and convergence time. Moreover, bad implementations might lead to overlay partitioning. Some examples are Cyclon [Voulgaris et al., 2005], Croupier [Dowling and Payberah, 2012], and WPSS [Roverso et al., 2013].

2.2.5 TVPP implementation

Stream distribution starts with a channel creation. The channel concept is similar to television channels. The channel is the system entry point for users. When a peer decides to watch a stream, it searches for other peers that are watching the same content in order to associate itself with them.
This entry point is managed by a special node called the bootstrap. In TVPP, as in SopCast, mesh construction and maintenance rely on a centralized bootstrap server. For each channel c, the bootstrap stores a list of connected peers, Pc, the source peer address, and the last generated chunk ID. In order to keep Pc up to date, peers send a ping packet to the bootstrap periodically. A channel source, or server peer, also sends ping messages, which update the last generated chunk ID value for that channel. This value is used later to initialize or reset peers. Peers are removed from their respective Pc if they fail to report their existence to the bootstrap for a configurable period, typically a few seconds.
A channel's source peer (one that splits a stream into chunks and starts their distribution) announces to the bootstrap its intent to create a channel. Other peers join the overlay by sending the bootstrap a peer request message asking for peers that are watching an existing channel. After receiving a peer request message, the bootstrap selects a subset of peers from Pc and sends it back to the requester. This message also contains the last generated chunk ID so that new peers know from which chunk to start asking. Upon receiving the requested subset from the bootstrap, peers store it in a candidate peer list. To keep the candidate list fresh, peers send new peer request messages to the bootstrap at regular intervals.
Until a peer reaches its partnership size limit, it will periodically select candidates to try to establish a partnership with, turning those candidates into partners. TVPP considers directed partnership links. Each peer p has a set of input partners, denoted Ip, and output partners, denoted Op. It receives data from the input partners and sends data to the output partners. Each peer tries to fill its own input partner set; consequently, its output partner set is filled by other peers doing the same.
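The bootstrap bookkeeping described above can be sketched as follows. This is a minimal illustration with hypothetical names, parameters and a random sampling strategy; it is not TVPP's actual code.

```python
# Sketch of the per-channel state a TVPP-like bootstrap keeps: connected peers
# with their last ping time, the last generated chunk ID, and sample replies.
import random
import time

class Bootstrap:
    def __init__(self, timeout=5.0, sample_size=10):
        self.channels = {}        # channel id -> {peer addr -> last ping time}
        self.last_chunk = {}      # channel id -> last generated chunk ID
        self.timeout = timeout    # seconds a peer may stay silent before removal
        self.sample_size = sample_size

    def ping(self, channel, peer, chunk_id=None):
        """Record a peer's liveness; the source also reports the newest chunk ID."""
        self.channels.setdefault(channel, {})[peer] = time.time()
        if chunk_id is not None:
            self.last_chunk[channel] = chunk_id

    def peer_request(self, channel, requester):
        """Answer a peer request with a random subset of alive peers."""
        now = time.time()
        peers = self.channels.get(channel, {})
        alive = [p for p, t in peers.items()
                 if now - t <= self.timeout and p != requester]
        sample = random.sample(alive, min(self.sample_size, len(alive)))
        return sample, self.last_chunk.get(channel)
```

Stale peers are filtered out lazily at request time here; an implementation could equally run a periodic sweep over Pc.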
Table 2.1 summarizes the notation.

Table 2.1. Definitions and notation.
var.  definition
Pc    Set of peers connected to channel c.
Ip    Set of input partners of peer p.
Op    Set of output partners of peer p.
R     Video streaming rate.

Partnership links are maintained through pings between partners which signal that each end is still alive. Further details about the extra load on pings are discussed in the next section, given its importance. Partners may face connection issues or leave the network. In these cases they can be turned back into candidates or be dropped. Another feature is that, periodically, partnerships can be selected to be undone.
Selection strategies are an important part of a system [Traverso et al., 2014]. Peer selection strategies are configurable, and new strategies can be implemented at any time. They are used to select candidates to become partners or to select partnerships to be undone. A few criteria for peer selection are physical distance, logical distance (round-trip delay time), upload capacity or random choice.

2.3 Chunk Scheduling

The chunk scheduling mechanism of TVPP is inspired by the Coolstreaming data-driven model [Zhang et al., 2005]. According to this model, peers only request data from partners that actually have the data. This is possible by making peers periodically broadcast to their output partners, within the ping messages, which chunks they have. This design generates more control overhead, but it prevents request/transmission redundancy. The broadcast ping contains a buffer map, which has an integer indicating the newest chunk of media present in the peer's own buffer and a bit map where each position determines the presence or absence of a previous chunk. The nth position of the map represents the ID of the newest chunk minus n. Thus buffer maps cover a dynamic range of chunk IDs. Knowing which partners have which chunks is only part of the solution.
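The buffer map just described – an integer for the newest chunk ID plus a bit map over the preceding IDs – can be encoded compactly. The sketch below is an illustration of that idea; the window size and exact bit layout are assumptions, not TVPP's wire format.

```python
# Illustrative buffer map codec: newest chunk ID + bit map, where bit n set
# means chunk (newest - n) is present. Window size is a hypothetical parameter.

def encode_buffer_map(have, window=64):
    """Encode a set of held chunk IDs as (newest ID, bit map integer)."""
    newest = max(have)
    bits = 0
    for n in range(1, window):
        if newest - n in have:
            bits |= 1 << n            # bit n marks chunk (newest - n)
    return newest, bits

def decode_buffer_map(newest, bits, window=64):
    """Recover the set of chunk IDs covered by the map."""
    have = {newest}
    for n in range(1, window):
        if bits & (1 << n):
            have.add(newest - n)
    return have
```

Because positions are relative to the newest ID, the same fixed-size map slides forward as new chunks arrive, which is what lets buffer maps cover a dynamic range of chunk IDs.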
There are three common approaches to chunk exchange: pull, push and hybrid. In the pull approach, data is sent as a response to explicit requests made by a destination peer to a source peer. In the push approach, peers forward data based on their partners' needs, without explicit requests. The push approach has less control overhead, but has great potential for redundant traffic since more than one source can push the same content to a given destination. Typically, tree-based topologies use the push approach while mesh-based topologies use the pull approach. It is unclear which technique is the best, and commercial systems use both. Moreover, these can be mixed into a hybrid approach in which peers sometimes push and at other times pull chunks.
As previously mentioned, TVPP is a mesh-pull system. The channel source will naturally be the first peer to have any chunk available. It splits the stream into chunks and associates a chunk ID with each of them. The source announces new chunks to its partners through buffer map messages. Buffer map messages spread data availability over the entire network. Once any peer announces the presence of a chunk, its partners might try to request it, if they do not have it.
Every few milliseconds each peer compares its own buffer map with its partners' maps looking for the newest chunk to request. The peer then creates a request for that chunk, selects a partner to serve this request using a configurable selection strategy, and adds the request to the request list, signaling its intent to ask that partner for that particular chunk. After receiving a chunk, the respective request is removed from the request list, the chunk is stored, and the buffer map marks it as present. There are many approaches to choosing which chunk to request first and whom to ask. TVPP's request scheduling follows the earliest deadline first (EDF) rule, in which chunks closer to their deadline are requested first.
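As an illustration of the EDF rule, the sketch below picks the missing chunk closest to its playback deadline and a partner that holds it. Names are hypothetical, the partner choice is a stub for the configurable selection strategy, and it assumes lower chunk IDs are closer to their deadline.

```python
# Toy EDF request selection: among chunks announced by partners but not yet
# held or pending, pick the one closest to its deadline (lowest ID here).

def pick_request(my_map, partner_maps, pending):
    """Return (chunk_id, partner) for the next request, or None.

    my_map:       set of chunk IDs this peer already holds
    partner_maps: partner address -> set of chunk IDs that partner announced
    pending:      chunk IDs already in the request list
    """
    wanted = set()
    for chunks in partner_maps.values():
        wanted |= chunks
    wanted -= my_map | pending          # only chunks we miss and haven't asked for
    if not wanted:
        return None
    chunk = min(wanted)                 # EDF: lowest ID = closest to deadline
    holders = [p for p, chunks in partner_maps.items() if chunk in chunks]
    return chunk, holders[0]            # stub peer selection: first holder
```

A real scheduler would also attach the expiration timer and retry counter described next, and swap `holders[0]` for one of the selection strategies from Section 2.2.5.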
Other options are requesting the rarest chunk first, considering partners' buffer maps, or the chunk that more peers need. Jian [2009] discusses the benefits of each and how they fit each streaming scenario. As for whom to ask for the chunk, the same peer selection strategies described before can be used over the reduced set of peers that have the chunk. Requests have a configurable expiration timer with a default value of half a second. After a request timer expires, a retry – a new selection of a serving peer – is performed. The number of retries for each request is also configurable. If all tries fail, the request is removed from the request list.

2.4 Emergency Requesting

In this section we discuss the Emergency Request Service, an envisioned design that allows the retrieval of nearly missed chunks. This design has not been implemented in TVPP and figures as a proposal. Whenever a peer identifies that it is about to miss a media chunk's playback deadline, it issues an emergency request to a reliable source. The design is inspired by fundamental premises of CDN peer-assisted systems, which are mainly client-server architectures with P2P support [Roverso et al., 2012, 2015; Zhao et al., 2013; Yin et al., 2009; Mansy and Ammar, 2011; Lu et al., 2012]. In CDN peer-assisted systems all data is acquirable from CDN servers, but a support P2P overlay exists between those who are interested in the data, and it should be queried first. Once a chunk cannot be found on the overlay or a request fails to be answered by other peers, the request is made to the CDN servers. CDNs are well suited for such a solution as they are built to handle a variable load and can guarantee quality of service up to the contracted capacity. We propose a solution for "pure" P2P live streaming systems which has the potential to drastically reduce chunk losses using resources available over the P2P overlay.
Received chunks are fed to a player for exhibition. The player maintains an exhibition buffer to store these chunks. A gap at the player buffer may occur once a requested chunk is not received before its due time to be played. Several reasons might lead to that, such as network failures, peer failures, churn, bandwidth bottlenecks or data unavailability over the partnerships. Timing constraints of live transmission make this a particularly serious problem. A peer might issue several requests to the P2P network; however, at a given moment it will only have enough time to request once more. At this point, requesting from a more reliable source, or just outside the established partnerships, might help. We introduce the design of the Emergency Request Service describing its possible implementation on TVPP. The service address is published to peers through the peer list messages provided by the bootstrap. A new kind of request message is necessary: the emergency request message. Its function is to behave as a normal request that has to be answered not to its source but to a secondary address indicated in the request header. A peer that is about to miss a chunk playback deadline issues an emergency request to the service with its own address added to the message header. As the emergency request service receives emergency requests, it must forward them to emergency handlers – peers that are capable of responding. An emergency handler that receives a forwarded emergency request checks its buffer for the chunk and sends it directly to the original source of the emergency request message. The response with a chunk is treated as any other normal response. Many implementations can define which peers will be emergency handlers; a few are discussed below.
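The forwarding behavior just described can be sketched as follows. The field and function names are assumptions for illustration only, since the design has not been implemented and no wire format is defined:

```python
from dataclasses import dataclass

@dataclass
class EmergencyRequest:
    """Sketch of the proposed emergency request message: unlike a normal
    request, the response must go not to the message source (the service)
    but to a secondary address carried in the header -- the peer about to
    miss the chunk."""
    chunk_id: int
    respond_to: tuple  # (host, port) of the original requester

def handle_forwarded(request, handler_buffer, send):
    """An emergency handler checks its buffer and, if it has the chunk,
    sends it directly to the emergency request's original source."""
    if request.chunk_id in handler_buffer:
        send(request.respond_to, request.chunk_id)
        return True
    return False
```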
• The channel source might be the only one that receives the forwarded requests, similar to CDN peer-assisted systems;

• A peer might initialize itself pointing out its willingness to be part of the emergency network, in which case the emergency request service would have a peer list composed of such peers;

• The emergency request service could elect a number of peers to be listed as emergency handlers based on those peers' features, such as upload capacity or data availability.

Selection strategies are important in implementations with multiple handlers to select from. Strategies can be as simple as random or round-robin. The emergency request service does not even need to know the handlers' buffer maps. Success is then determined by how buffers are configured and how synchronized peers' playbacks are. Finally, emergency requests are to be used as a support mechanism, a last resort, avoided as much as possible. They aim at solving specific problems, but an emergency request is a costly operation that delivers a chunk close to its playback deadline. Upon arrival at the peer, this chunk will have less opportunity to be disseminated compared to chunks obtained through normal requests. In Chapter 4 we discuss these problems in detail using a CDN peer-assisted system, by mapping requests made to the CDN as emergency requests. 2.5 Logging Ping messages that are sent to the bootstrap can carry peer performance data. Using the bootstrap as the logging server increases the traffic at the bootstrap, but it has the advantage that there is no need for a special server to store log data. These special pings – or log messages – include the following performance data: the number of chunks generated, sent and received per second, requests sent and received per second, duplicated and missed chunks since the last log message, average hop and tries counts for each chunk received, and input and output partnership sizes.
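The performance fields listed above could be grouped as in the sketch below. The names and types are assumptions for illustration, not TVPP's actual message layout:

```python
from dataclasses import dataclass

@dataclass
class LogMessage:
    """Illustrative grouping of the performance data carried by TVPP's
    special pings (log messages), as enumerated above."""
    chunks_generated_per_s: float = 0.0
    chunks_sent_per_s: float = 0.0
    chunks_received_per_s: float = 0.0
    requests_sent_per_s: float = 0.0
    requests_received_per_s: float = 0.0
    duplicated_chunks: int = 0       # since the last log message
    missed_chunks: int = 0           # since the last log message
    avg_hop_count: float = 0.0       # averaged over received chunks
    avg_tries_count: float = 0.0
    in_partners: int = 0
    out_partners: int = 0
```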
The list can be easily extended, as new measures are envisioned and required. More detailed chunk performance data is also sent in each message. This chunk data relates to the last sample chunk received before sending the log message. For simplicity and comparison between peers, we consider one sample chunk every buffer-size chunks received. Its data comprises the chunk ID, hop and tries counts, and timestamps indicating when it was generated and consumed. All that data is used to get useful insights about latency, playback deadline miss rate, and the distance that a chunk travels before being delivered. These are especially interesting because they are very hard (impossible in some cases) to capture from commercial systems. Moreover, there is a system option to enable client-side logs. These logs contain information about every chunk expected by the client. They report whether the chunk was received normally, missed, received late or received more than once. There are also details about whom that chunk was requested from at each try and whom it was received from (in case of reception), and when each request/reception happened. Another feature is that each peer periodically sends to the bootstrap a list of which other peers it is connected to. This is done through a different message with two lists of peer IDs (input and output partners) and a timestamp added upon receipt at the bootstrap. With those reports, one may track the evolution of the overlay connections over time. 2.6 New Modules/Algorithms TVPP's object-oriented design includes generic interfaces for many modules. These interfaces provide flexibility to implement new features or algorithms. For instance, it is easy to create a new kind of message to be exchanged between peers, since the send/receive methods receive a Message object as parameter and all messages inherit from Message.
So, to create a new kind of message, one must only create a class that extends Message, add a new entry to the group of message kinds, describe the structure of the new kind of message through an abstract method inherited from Message, and introduce a method to handle the reception of that message. Other mechanisms that are useful to alter are the scheduler policy, the bootstrap peer selection and the connect/disconnect partner selection algorithms. All of them have been implemented using the strategy design pattern [Gamma et al., 1994]. To extend these mechanisms, one can develop an algorithm as an extension of a strategy interface and patch the new strategy in with a few lines of code on headers and at the parameter handler. These selection algorithms mainly sort objects by a specific key value, and many more strategies can be implemented. Adding a periodical event is possible by extending the Temporizable class. One can create an event, set its period, and push it to the Temporizable list, which triggers it periodically. A few examples of features currently implemented this way are: the peer list request to the bootstrap, the partnership connect and disconnect mechanism, and the upload limiter. Implementing a new peer performance metric and logging it is also simple. Since the log message is basically a wrapper, one can add the new metric to the log message and to the chain that constructs it, and make sure that it will be unwrapped and written to the log file on the other side, in this case the bootstrap. 2.7 Parameters Some mechanisms described throughout the chapter are essential to understand the basics of P2P live streaming. However, there are additional details that impact system performance and, consequently, quality of service.
Input and output partnership sizes are factors that impact the overlay connectivity and churn resilience, the local chunk availability, the amount of control messages and the probability of duplicated receptions (depending on scheduling choices). When combined with the total network size, they affect the diameter of the network and, thus, latency, since chunk storage, processing, and forwarding periods are significant. Several system internal timers are also influential, such as the timers between buffer map exchanges and request retries, and the partnership link removal timeout. It is also important to balance peer buffer sizes between avoiding big buffer map messages and keeping a desirable storage capacity at peers, so they can offer chunks to their partners. Peer upload capacity is another relevant aspect, because it might introduce bottlenecks, and the overlay average upload capacity must be sufficient to spread the content through the entire network. Finally, TVPP provides some configurable parameters.
Currently, one can fix the following set of parameters:

• mode, defines the peer main behaviour (e.g., source or viewer) or can be used to create special kinds of peers (e.g., a free rider or an emergency handler);

• buffer size, affects buffer map message sizes and the peer capacity to hold chunks;

• maximum number of input/output partners, relates to the overlay: more partners means more chunk sources and a more connected network, but it also increases buffer map message throughput, since buffer maps are periodically broadcast to all output partners;

• request limit, defines the amount of chunks that can be simultaneously requested;

• request timeout, defines how many milliseconds a request must wait before being retried;

• retry limit, defines the number of times that a chunk can be requested if previous requests fail;

• unresponsive input/output partnership timeouts, define how many seconds a peer must wait before removing an unresponsive input/output partner from its lists;

• upload and download limits, restrict a peer to send or receive a maximum number of bytes using a leaky bucket;

• bootstrap's peer subset selection algorithm, the subset of peers returned by the bootstrap can be random, or oriented by IP distance or RTT (round-trip time) between peers, but custom selection strategies can also be created;

• partnership candidate selection algorithms, for connecting to or disconnecting from a partner; these algorithms are currently the same as above, since they are also peer selection strategies;

• chunk scheduler algorithm, another peer selection strategy, but applied during chunk scheduling.

2.8 Comparison with SopCast We have compared TVPP with SopCast. The latter is a well-known commercial system which has been targeted by several studies [Vieira et al., 2009; Horvath et al., 2008; Ali et al., 2006; Tang et al., 2009].
As in previous studies, we have crawled the network, analyzed traffic, and recreated network graphs to obtain SopCast data. Our results indicate that both systems behave similarly which, in turn, shows that our system is as efficient and that it can produce reliable, accurate and more comprehensive analyses of P2P live streaming systems. We have compared traffic and network metrics such as server and peer loads, control overhead rates, network size and diameter, partnership sizes, and average shortest paths. 2.8.1 Experimental Setup Firstly, we have conducted a 60-minute SopCast experiment using approximately 500 unreliable, geographically-dispersed peers. Peers have been partially synchronized and, using cron, we have made all peers join and leave the overlay simultaneously. We have streamed a 100-minute video, which looped continuously, with an average stream encoding rate, R, of 400Kbps. As a commercial system, SopCast does not allow further setup. Then we have conducted the same experiment over TVPP, trying to emulate SopCast behavior. We have limited TVPP server output partners to 10 peers based on Tang et al. [2009]. Limits for Ip and Op are 50 partners for each peer p, based on SopCast experiment observations, as shown in Section 2.8.3. Selection and scheduling algorithms have been set to the random policy. To record traffic for each peer we have used tshark, a console version of Wireshark6, which is a network protocol analyzer based on tcpdump. We have chosen to use these logs from both systems while comparing them, to avoid biasing results by using different data acquisition methods. However, TVPP performance data has been used to validate its logs. Through traffic logs, we have analyzed both traffic and network behavior. We have dropped the initial and final 300 seconds of each log to avoid any interference from peers joining/leaving the overlay.
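The log-trimming step just described can be sketched as follows. This is a simplified illustration, not the actual measurement tooling; `records` is assumed to be a time-sorted list of (timestamp, packet) tuples:

```python
def trim_log(records, warmup=300, cooldown=300):
    """Drop the initial and final 300 seconds of a time-sorted traffic
    log, avoiding interference from peers joining/leaving the overlay."""
    if not records:
        return []
    start = records[0][0] + warmup
    end = records[-1][0] - cooldown
    return [r for r in records if start <= r[0] <= end]
```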
We have calculated peers' download and upload rates for both control and data packets. We have considered a conservative threshold of 200 bytes to classify control packets on SopCast, in accordance with Tang et al. [2009]. On TVPP we have set this threshold to 1300 bytes, since no control packet is larger than this. Section 2.8.2 presents a histogram of packet sizes, a comparison table with server overhead and load, and cumulative distribution functions of upload and download rates for each peer for each second. Network metrics cannot be directly extracted, since they are calculated over overlay graphs. Peer traffic logs have been merged into a time-sorted log. To represent the partnership graph through time, we have extracted graph snapshots from the sorted log. Vertices and edges are representations of peers and partnerships, respectively. The interarrival time of keep-alive messages on SopCast is less than 2.5 seconds in more than 95% of the cases [Tang et al., 2009]. That assumption has also been made valid for TVPP through buffer map message exchanges every second and a 3-second partnership timeout setup. For both systems, we have considered that a partnership starts after any message is exchanged between peers, and it ends after they stop exchanging packets for 3 seconds or more. Thus, we have analyzed overlapping snapshots within 3 seconds for every observed second, i.e., snapshot t is formed by any peer interaction between time t and t + 2, ∀t ∈ [300, 3298] seconds. Section 2.8.3 presents network size, diameter and average shortest path, and cumulative distribution functions of peers' degrees (partnership size) and peers' shortest paths to the server. These metrics are further detailed below. Network size is the number of vertices present in a snapshot. The degree of a vertex v is the number of edges which touch v [Skiena, 1991].
A shortest path, or distance, d(u, v) between two vertices u and v of a finite graph is the minimum length of the paths connecting them [Diestel, 1997]. The diameter is the longest shortest path (i.e., the longest geodesic) between any two vertices (u, v) of a graph, max_{u,v} d(u, v) [Skiena, 1991]. 6http://www.wireshark.org/

Figure 2.1. Pdf of packet size for each system.

Table 2.2. Traffic-related Metrics

          Download   Upload     Server
          Overhead   Overhead   Contribution
SopCast   14.71%     18.47%     3.05%
TVPP      21.32%     24.45%     2.30%

2.8.2 Traffic Analysis Figure 2.1 shows the packet size distribution for SopCast (SC) and TVPP. We have grouped data in 50-byte buckets. Each system has a different distribution, which indicates that these systems use different packet types. SopCast shows a very high probability of packets below 150 bytes, a smaller probability of packets above 1350 bytes and minor probabilities of packets with size between 150 and 1350 bytes. The first two types might be control and full media packets [Tang et al., 2009; Ali et al., 2006]; however, there is no consensus about the nature of the others. On TVPP three types of packets occur: small control packets (communication with the bootstrap, partnership agreement and chunk requests) below 100 bytes, medium control packets near 250 bytes and data packets with more than 1350 bytes. Medium control packets are buffer maps, and their actual size will vary according to the buffer size setup. The overhead caused by these control messages also scales with the number of output partners. According to Table 2.2 and Figure 2.2(a), TVPP control packets have been responsible for a greater overhead compared to SopCast.
Lower overhead could be achieved in three ways: increasing the period between buffer map messages proportionally to the number of output partners, which could compromise the scheduling algorithm; grouping chunks so that a buffer map index would represent n chunks instead of one, reducing the map's size; or reducing the output partners limit, reducing overlay connectivity.

Figure 2.2. Cdf of download and upload rates for each peer for each second for each system ((a) download grouped by control and data packets; (b) upload restricted to data packets). R is ≈400Kbps.

Figure 2.2 shows a cumulative distribution function of download and upload rates for each peer for each second. It can be observed that both systems have similar distributions of control and data rates. Figure 2.2(a) shows that TVPP and SopCast peers have similar behavior while downloading, and they both tend to have a download rate near R. Moreover, the extremes of the data curves show that peers in both systems eventually receive nothing or download in bursts. Figure 2.2(b) shows upload rate distributions. In both systems, Wireshark logs show that there is a considerable number of peers that download what they need but do not contribute by uploading to the P2P system in a similar fashion. This is called free riding. At the other extreme, a few peers contribute a lot, supporting almost the entire network through uploading. As Figure 2.2(b) shows, TVPP peers generally have higher upload rates than SopCast's. However, at extreme upload rates (on the right tail, not shown) SopCast's curve surpasses TVPP's. Peak upload rates for SopCast and TVPP have been as high as 150 and 120 times R, respectively.
This indicates that fewer peers share the upload burden on SopCast compared to TVPP. Finally, Table 2.2 shows the server upload contribution. The SopCast server has been responsible for serving 3.05% of all traffic, while the TVPP server has contributed with almost 2.30%. As peers join the overlay, increasing its size, they help to reduce the burden of uploading data imposed on the server. However, Figure 2.3 shows that TVPP had almost 10% fewer peers than SopCast. Since the server had the same partners limit (around 10) in both runs, TVPP server partners have directly requested less data from the server. 2.8.3 Network Analysis

Figure 2.3. Network size in each snapshot.

Figure 2.3 shows the number of peers present in each experiment through time. The TVPP overlay has had fewer peers than SopCast's because experiments were conducted on different days, and a different subset of active peers has been available on each day. The SopCast experiment has started with nearly 520 peers and ended with approximately 500, while TVPP's went from 475 to 450. Size reduction for both systems has been similar. Curve noise results from the snapshot creation methodology, meaning that any peer might eventually not be present in a snapshot despite still being connected to the overlay.

Figure 2.4. Cdf of degree of each peer in all snapshots. TVPP was limited to 50 partners, but even if a peer has reached that limit, it can still receive partnership requests.

Figure 2.5. Cdf of shortest path lengths from each peer to the server in all snapshots.
The partnership limit parameter choice has direct implications on graph connectivity and impacts several network metrics. Having more partners increases the probability that a peer reaches any other peer with a smaller number of hops, or that more partners are connected among themselves. However, there is a trade-off between this parameter and overhead, as already explained. Also, if all peers were connected among themselves (making a fully connected graph), certainly an even smaller group of peers would concentrate the upload task. Figure 2.4 shows distributions of peer degrees, which approximate partnership size. SopCast peers mostly had around 30 to 80 partners. This has motivated us to set up TVPP Ip and Op to accept a maximum of 50 partners. Even though we fix the parameter, a few peers had less than 40 or more than 60 partners through snapshots. These errors are inherent to the data acquisition methodology used for both systems, which was Wireshark log concatenation. Although a peer has been limited to 50 partners, any other peer could have sent a message trying to connect to a "full" peer, which would be identified as a control message between two peers and, consequently, considered a link between them, thus creating the top right tail on TVPP's curve. Furthermore, a few traffic log files could have been corrupted or missing, causing edges or vertices to be lost. Figure 2.5 shows a cumulative distribution function of the lengths of shortest paths between each peer and the server. Assuming that all connections are free of errors, with no bandwidth limit or link delay, shortest paths indicate the best path to take, as fewer hops from server to peer lead to lower latency. Peers closer to the server receive chunks first, announce this data availability early, and are more likely to be the ones forwarding chunks to their partners. The intuitive trade-off is that, in general, a peer closer to the server receives more requests.
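The shortest path and diameter metrics above can be computed over a reconstructed snapshot with a plain breadth-first search. A minimal sketch, assuming snapshots are represented as adjacency dictionaries (not the thesis tooling):

```python
from collections import deque

def bfs_distances(adj, source):
    """Hop distances d(source, v) over an unweighted snapshot graph,
    e.g., shortest path lengths from the server to every peer."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        u = queue.popleft()
        for v in adj.get(u, ()):
            if v not in dist:
                dist[v] = dist[u] + 1
                queue.append(v)
    return dist

def diameter(adj):
    """Longest shortest path, max over (u, v) of d(u, v), as defined above."""
    return max(max(bfs_distances(adj, u).values()) for u in adj)
```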
Systems should be concerned with fairness, aiming at an egalitarian load distribution among participants. This helps to explain the upload distribution seen before (Figure 2.2(b)). Nevertheless, connections cannot be treated as ideal. Traffic can easily surpass bandwidth and physical links can become bottlenecks. Thus, shortest path lengths are an imprecise estimate of how many hops data travels until it arrives at peers.

Table 2.3. Network-related Metrics

          Diameter           Average Shortest Path
          Mean     σ         Mean     σ
SopCast   3.692    ±0.022    1.952    ±0.004
TVPP      4.361    ±0.023    2.318    ±0.054

Table 2.3 presents the average diameter and average shortest path for all snapshots of each experiment, with a confidence interval of 99%. A smaller diameter and average shortest path are better in ideal scenarios, but do not necessarily represent the best network organization or paths to take in the real world. Despite the small difference between the values for each system, both metrics are slightly higher for TVPP than for SopCast, which might lead to higher latencies. Finally, we have observed that under traffic measurement TVPP and SopCast performed similarly. Thus, TVPP is capable of behaving much like SopCast regarding chunk delivery. Some metrics above have shown that the network graphs formed by each system are statistically different. Unfortunately, usual traffic analysis and offline graph reconstruction methods lack the depth to fully understand the implications - such as latency impacts - of these differences. Also, these methods might provide incomplete or imprecise information, which makes controlling system verbosity an advantage. Such control on TVPP is responsible for delivering extra statistics, such as chunk hop count, chunk loss and latency, which could not be extracted from SopCast for comparison.
2.9 Conclusions We have presented TVPP, an academic, research-oriented P2P live streaming system designed to provide a service similar to popular proprietary commercial systems, such as SopCast. We have exposed several details of TVPP design and architecture in the light of overlay construction and chunk scheduling theory. We have also described the design of an emergency request feature. We have shown what TVPP is currently able to log, and some ways in which it can be expanded for further research. We have introduced the current parameter set with the options that can be fixed at runtime. Finally, we have highlighted the benefits of TVPP usage through an experimental comparison with SopCast regarding traffic and network related metrics. Chapter 3 Chunk Loss Characterization In this chapter we evaluate chunk loss in several different scenarios. Characterization of chunk loss plays a key role in our objectives. It allows us to understand 1) what the reasons for failure are, 2) whether the reasons are protocol, system or network related, 3) what the probable frequency or impact of each fault is, and 4) what approaches can be used to solve them. We draw many insights into chunk losses both in resourceful and in bandwidth-constrained networks. From these, we highlight SURE, a mechanism that significantly reduces performance issues associated with overlooking unresponsive peers. Results obtained with the characterization make it easier to propose and develop other solutions to prevent losses. 3.1 Introduction A chunk loss happens at a peer whenever the given chunk misses its playback deadline. Once it occurs, either the player will have to wait until that chunk arrives – delaying the peer in relation to others or eventually freezing exhibition – or it will skip it – exhibiting badly decoded frames, depending on the system design and video encoding algorithm.
For the viewer, watching a show that keeps stuttering or showing broken frames is undesirable, annoying and indicates the system's bad quality. These failures lead to decreased viewer confidence in the system, viewer abandonment and system collapse. Chunk loss may depend on a number of system design peculiarities, including network topology, scheduling, incentive mechanisms, etc. [Moltchanov, 2011]. Furthermore, systems are also subject to dynamic runtime conditions, such as peers that do not share content (free riders), local neighborhood bandwidth constraints or the unavailability of a chunk in the neighborhood within its deadline. We have done a chunk loss characterization for experiments run on TVPP. The goal was to identify common reasons and patterns in delivery and failure situations. We have addressed questions such as "Do chunk losses occur?", "How frequently?" and "Why?", and we expand those to grasp the influence of available bandwidth, load at the peers, neighborhood, and time over chunk losses. We explore different features of chunk loss as we investigate results for both resourceful and bandwidth-constrained scenarios. Section 3.2 describes default parameters and scenarios used throughout those experiments. In the resourceful scenario (Section 3.3), we have investigated chunk loss in three different scopes: system, peer and chunk. We have drawn conclusions about system-wide losses, partnership bottlenecks, clustering of losses in time and in peer groups, and more. In the bandwidth-constrained scenario (Section 3.4), we reduce available bandwidth through the increase of free riding in the overlay. We set up two types of free riders (Section 3.2.2) – conscious and oblivious – and we vary their fraction from 0% to 95% of the peer population. We evaluate system-wide metrics to focus on chunk losses, and also latency, as a function of incremental constraint and free rider influence.
As more peers become free riders, the significant cost of overlooking unresponsive peers stands out among performance issues. This has led to the proposal of the Simple Unanswered Request Eliminator – SURE (described in Section 3.4.3). It has been observed that system-wide chunk loss highly depends on network conditions and system design. Nevertheless, our findings suggest that P2P live streaming can support uncooperative peers up to a certain point with nearly the same quality as in a resourceful scenario. Finally, we conclude by discussing solutions to avoid chunk losses for each of our findings. 3.2 Parameters and Scenarios Experiments have been run on TVPP. As mentioned in Chapter 2, TVPP is a mesh-pull system with very flexible parameter changing capabilities. TVPP's default setup is explained in Section 3.2.1. As we introduce bandwidth constraints to the default setup, TVPP has been modified to allow the free riding client behaviors described in Section 3.2.2. Finally, our evaluation relies on experiments deployed on PlanetLab, a realistic testbed for distributed systems. A description of PlanetLab is given in Section 3.2.3. 3.2.1 Default Setup Mesh construction relies on the Random candidate selection policy to select and connect peers, thus creating random topologies in each run. Partnership sizes, Ip and Op, in every peer are set to 20, channel source included. Pings between peers and to the bootstrap are exchanged every second. Peers request a new peer list from the bootstrap every 30 seconds and try to fill their input partner set. Partnerships are undone after 3 seconds without contact from a partner. The bootstrap removes a peer from the channel list after 10 seconds without contact. Regarding the exchange of chunks, scheduling is done by requesting the earliest-deadline-first (EDF) chunk from a random valid candidate peer.
A valid candidate is a partner that has announced, through its buffer map messages, that it possesses the requested chunk. Each chunk carries around 1500 bytes, the Ethernet MTU. Peer buffer sizes are set to 1600 chunks. Consequently, buffer map messages, which represent buffers through bit maps, are restrained to around 200 bytes. Peers can have any number of simultaneous active requests. Requests time out after 0.5 seconds. Upon failure, requests can be retried 2 more times. We have configured the video and bootstrap servers in our university's network and have used between 486 and 489 available PlanetLab nodes as peers. The video server streams a ≈ 420 kbps variable bitrate video. This translates to about 40 chunks per second and a 40-second buffer. PlanetLab nodes may be running multiple simultaneous experiments, so each host has variable CPU and bandwidth constraints at each time. By default, client-side upload bandwidth restrictions are not used. Restrictions are given by the local conditions of the set of nodes used in each run. It is important to note that TVPP does not know or measure node bandwidth. All peers join the live transmission during an initial period of 60 seconds, with joining times chosen randomly following a uniform distribution. Each experiment lasts 8 minutes, and we discard data from the first and last 90 seconds (i.e., the warm-up and cool-down periods). All results are based on 10 runs for each experiment. The amount of data obtained with these repetitions is enough to produce statistically solid results, with low coefficients of variation for the reported averages. We quantify the quality of P2P live streaming mostly using the chunk playback deadline miss rate, or chunk loss. Eventually, chunk latency is also analyzed, in order not to be deceived by good chunk loss results that are inconsistent with a "good" latency. We define the playback deadline miss rate as the fraction of chunks that are not received before their playback deadline.
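The metric just defined can be sketched as follows. This is an illustrative computation, assuming per-chunk deadlines and arrival timestamps as inputs:

```python
def miss_rate(deadlines, arrivals):
    """Playback deadline miss rate: the fraction of expected chunks not
    received before their playback deadline. `deadlines` maps chunk ID ->
    deadline; `arrivals` maps chunk ID -> arrival time (absent if the
    chunk was never received)."""
    if not deadlines:
        return 0.0
    missed = sum(1 for cid, deadline in deadlines.items()
                 if cid not in arrivals or arrivals[cid] > deadline)
    return missed / len(deadlines)
```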
Missed chunks cause flickering or interruption, especially when many chunks are lost in sequence. Chunk latency, also called diffusion latency, is the delay between the creation of a chunk – at the video server – and its reception by a peer. High latency causes undesirable conditions for viewers, as peers will play outdated content (e.g., a neighbor cheering a goal that you will watch a few seconds from now). 3.2.2 Free riding Free riders are peers that do not contribute with upload capacity. The replacement of peers by free riders restricts the initially available bandwidth and the upload distribution. We have observed how these restrictions affect chunk loss and latency by linearly increasing the fraction of free riders. We could not find any studies where real P2P live streaming systems were analyzed for the presence of free riders and how they commonly present themselves. The closest paper to report the fraction of free riders in a system discusses the Gnutella case, a file sharing system, where 63% of peers have never answered a file search query [Adar and Huberman, 2000]. Some results presented in the following sections have already been published [Oliveira et al., 2013a]. We define two types of free riders, namely conscious and oblivious. Conscious free riders inform their partners that they are unwilling or unable to upload data. This behavior may be coded in the software or chosen by users. In our system, conscious free riders request chunks as normal but always advertise empty buffer maps. They have their mode parameter set to free rider. As a consequence, no peer ever wastes time and bandwidth sending requests to conscious free riders. Oblivious free riders do not inform their partners that they are unwilling or unable to upload data. This behavior may happen if the software does not make provisions for free riders, if the user misconfigures the client, or due to malicious intent.
In our system, oblivious free riders request chunks normally and advertise buffer maps with the chunks they have, but never upload data (i.e., never answer requests). They have their upload limit parameter set to zero. Oblivious free riders may receive requests and degrade system performance, as their partners will have to retransmit chunk requests after waiting for answers that never arrive. We have run experiments varying the behavior and fraction of free riders. For each type of free rider behavior, we vary the fraction of free riders from 0% to 95% in steps of 5%. To maximize the impact of free riding, we do not use any mechanism to choose or abandon partnerships (e.g., reputation systems). In other words, peers do not drop or punish uncooperative partners.

3.2.3 PlanetLab

PlanetLab¹ is a worldwide consortium of research institutions that maintains a global environment for the development and testing of distributed applications. Each institution has one or more nodes on the Internet that operate as virtual machine servers. PlanetLab access accounts are called slices. If an institution maintains at least one node in operation, it is granted the right to create slices, and every slice has the power to control a set of virtual machines on network nodes. A user with access to a slice is able to create a virtual machine on each node with an initial minimalist set of applications and restricted space available. These resources must be configured and managed individually. Any activity on the nodes should consider that their network and hardware characteristics are heterogeneous, that a node is not always available, and that it may fail. However, PlanetLab usage brings many data acquisition advantages. The most obvious one is the possibility of using over one thousand nodes, which helps in recovering more representative and reliable data.
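The two free rider behaviors can be contrasted in a small sketch; the class and method names are illustrative assumptions, not TVPP's implementation:

```python
# Hedged sketch of the two free-rider behaviors described above; the class
# and method names are illustrative, not TVPP's actual code.
class Peer:
    def __init__(self, mode="normal"):
        self.mode = mode     # "normal", "conscious", or "oblivious"
        self.buffer = set()  # chunk ids currently held

    def advertised_buffer_map(self):
        # Conscious free riders always advertise an empty buffer map,
        # so partners never waste requests on them.
        if self.mode == "conscious":
            return set()
        return set(self.buffer)

    def answer_request(self, chunk_id):
        # Free riders never answer (upload limit zero); requests sent to
        # an oblivious free rider simply time out on the requester's side.
        if self.mode in ("conscious", "oblivious"):
            return None
        return chunk_id if chunk_id in self.buffer else None

normal = Peer(); normal.buffer = {1, 2}
oblivious = Peer("oblivious"); oblivious.buffer = {1, 2}
print(oblivious.advertised_buffer_map())  # {1, 2}: looks cooperative
print(oblivious.answer_request(1))        # None: the request will time out
```

The key asymmetry is visible here: a conscious free rider is never asked, while an oblivious one attracts requests it will never answer.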
Another advantage is that PlanetLab nodes are dispersed both geographically and across diverse networks, thus avoiding that location aspects mask network behavior. Finally, PlanetLab nodes possess real IPs and do not suffer packet filtering. These characteristics avoid the need to address the Network Address Translation (NAT) problem [Bellovin, 2002], in which applications on nodes behind NAT may be difficult to communicate with.

¹ http://www.planet-lab.org/

3.3 Analysis of Resourceful Scenario

In this section we evaluate experiments executed under the scenario with no enforced bandwidth restriction. The scenario uses the default parameter setup presented in Section 3.2.1, such as an overlay size of ≈ 500 peers, a duration of 8 minutes, and 10 repetitions. Bandwidth distribution is given by the PlanetLab hosts' momentary setup, and it can vary between experiments. Still, the environment is considered to be resourceful, as results presented throughout the section are similar to a scenario in which the average peer upload capacity is limited to ten times the stream bitrate. We explore chunk loss by examining three different scopes: system, peer and chunk. We start with a broad view through a system-wide analysis until we reach individual chunk failure analysis. We show that, although rare, losses do occur in resourceful scenarios, to many peers and steadily over time. Yet, there is some underutilized upload capacity while peers do not receive requests. We show neighborhood data availability for each request attempt that hits or misses and compare them.

Figure 3.1. Cumulative distribution of chunk loss reported for each peer in a set of runs.
Finally, we study the loss spread of a specific chunk and the frequency of consecutive losses, which we refer to as the spatial and temporal locality of chunk loss.

3.3.1 System Behavior Analysis

Even while considering a resourceful scenario, chunk loss can still be observed system-wide. The overall results from this scenario show that:

• 60% of the peers have lost 0% of the chunks;
• 35% of the peers have lost between 0 and 3% of the chunks;
• 5% of the peers have lost more than 3% of the chunks;
• The losses from these last 5% of peers represent 86% of the total chunk loss.

The visual disruption caused by chunk loss strongly depends on the stream coding algorithm used. Given the highly structured organization of video streams, the degradation of the received video quality typically becomes noticeable for loss values higher than 1%, while a loss probability of a few percent (3-4%) significantly impairs the user quality of experience [Traverso et al., 2014]. Thus, we have chosen to cluster the above analysis using the 3% reference (the lower limit that impairs the stream). Figure 3.1 shows the distribution of chunk losses for each peer in all experiments. The smaller figure zooms in on the top left part of the curve. We can observe that the knee of the curve occurs at around 92% of the peers, with a loss rate of around 0.8%.

Figure 3.2. Scatter plot between requests received and requests not responded over time, normalized by the stream rate.

This result can be considered very good, but beyond this knee there are many peers that have trouble watching the stream. If we evaluate the average chunk loss for all peers over time, we can observe that chunk losses have remained stable through time for all experiments. The average chunk loss for all peers in each interval has ranged mostly between 0.5% and 1%.
3.3.2 Peer Behavior Analysis

Do more losses occur at peers that are serving more chunks? Were peers overloaded with chunk requests? Is it a bottleneck issue? In this section, we focus on individual peers and use chunk request metrics as an abstraction to investigate the correlation between chunk losses and peer load or contribution in the network. Figure 3.2 shows a scatter plot between the amount of requests received and the amount of requests not responded for each peer for each second. Since the system has a request retry feature, requests that have not been responded do not necessarily result in a chunk loss, but they are a good approximation of load-related issues. We expected requests received and requests not responded to be related; a peer would fail to respond to requests proportionally to the amount of requests received. As Figure 3.2 illustrates, failures could not be associated with peer contribution through a linear dependence. Numerically, a Pearson correlation coefficient of 0.11 indicates that there has been no linear relationship between the two variables. From the data, we have extracted that peers received no chunk requests in 19% of the samples, peers responded to all chunk requests in 75% of the samples, and peers failed to respond to all chunk requests in only 0.4% of the samples. Thus, we can observe two things: a peer that contributes more does not necessarily fail to respond more, and, although chunk losses occur, there are many moments when peers are not receiving requests, possibly wasting their upload capacity. Unfortunately, lacking instant bandwidth measurement, we cannot establish a relation between peer upload capacity and failures.

3.3.3 Chunk Behavior Analysis

We have analyzed the characteristics of each individual chunk request from all peers. We establish that a chunk hit happens whenever a request has been responded in time within three attempts.
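The correlation test described above can be reproduced with a plain Pearson computation. The per-second samples below are hypothetical and only illustrate the shape of the data; the thesis reports a coefficient of 0.11 over the real logs:

```python
# Minimal Pearson correlation between per-second requests received and
# requests not responded. The sample data is invented for illustration.
import math

def pearson(x, y):
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

received      = [0, 5, 10, 15, 20, 3, 8]  # hypothetical per-second samples
not_responded = [0, 1,  0,  2,  1, 3, 0]
r = pearson(received, not_responded)
```

A coefficient near 0 (as the measured 0.11) indicates no linear dependence between load and failures, matching the scatter in Figure 3.2.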
Conversely, a chunk miss (loss) happens whenever a chunk has not been received in time for playback. Chunk losses represent only around 0.5% of the whole chunk transfer log; however, this small amount still results in the loss distribution seen in Figure 3.1. We have studied the efficiency of retries, neighborhood data availability for chunk hits and misses, and the temporal and spatial locality of misses.

Table 3.1. Distribution of the number of attempts needed for a chunk to be received.

  Attempts   |   1    |   2   |   3
  Frequency  | 98.7%  | 1.1%  | 0.2%

Chunks that have been received. Table 3.1 and Figure 3.3 report results for chunk hits. Table 3.1 shows that 98.7% of the hits are from requests that were responded in the first attempt; 1.1% are from the second attempt, and the rest from the third. This suggests that allowing more attempts through the common chunk scheduler will probably yield little improvement in chunk loss. Figure 3.3 illustrates neighborhood data availability. It presents the number of available candidates whenever a chunk hit happens on the first, second or third attempt. There are few candidates whenever a request is responded within the first attempt. Usually the first attempt is made to the first peer to announce that it has the chunk, a greedy characteristic of TVPP's earliest deadline first (EDF) chunk scheduler policy. Gradually, the chunk spreads to more peers, increasing data availability. Thus the subsequent attempts have more candidates to choose from. Table 3.2 shows how many unique candidates have been used whenever a chunk hit happens on the first, second or third attempt, or when a chunk miss happens. The second and third attempts have generally been made to distinct candidates for chunks that have been received. This result is expected, as TVPP selects a random candidate
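The request-attempt behavior described above – a greedy first attempt to the earliest announcer, then random candidates on retries – can be sketched as follows; the function name and data layout are assumptions, not TVPP code:

```python
# Illustrative sketch of the candidate-selection behavior described above:
# the first attempt goes to the first partner that announced the chunk
# (greedy, EDF-style), and retries pick a random announced candidate.
import random

def pick_candidate(attempt, candidates):
    """candidates: partners that announced the chunk, in announcement order."""
    if not candidates:
        return None           # no candidate: the request cannot be issued
    if attempt == 1:
        return candidates[0]  # greedy: first peer to announce the chunk
    return random.choice(candidates)

announcers = ["p3", "p7", "p1"]
first = pick_candidate(1, announcers)   # always "p3", the first announcer
retry = pick_candidate(2, announcers)   # any of the three announcers
```

This also illustrates why later attempts spread over distinct candidates (Table 3.2): retries draw at random from a growing candidate set.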
Figure 3.3. Distribution of the number of candidates that could deliver a chunk in each attempt, for chunks that have been received.

Table 3.2. Distribution of unique candidates used per request.

  Unique candidates  |    1    |    2    |    3
  1st Attempt Hit    | 100.0%  |         |
  2nd Attempt Hit    |   3.4%  |  96.6%  |
  3rd Attempt Hit    |   0.5%  |  56.9%  | 42.7%
  Miss               |  25.3%  |  42.0%  | 32.7%

to provide the chunk, and there is usually more than one candidate to choose from. Results for chunks that have been missed are discussed below. Chunks that have been missed. Of all chunk misses, 15% happened because there were no candidates for the chunk. For the remaining 85%, which had candidates, we present Figure 3.4. It is important to draw a parallel between this figure and Figure 3.3. We have observed that missed chunks have had fewer candidates in the subsequent attempts when compared to chunks that hit. This indicates a slower spread progression for these chunks, suggesting eventual issues in the neighborhood, such as data unavailability and peer bandwidth saturation. The smaller number of candidates raises another question. According to Table 3.2, 25% of the requests that miss are attempted to the same candidate every time. Why so many? The candidate might be experiencing a transient problem to respond. The candidate distribution for chunks requested from the same candidate at every attempt is particularly different from the previous distributions. Figure 3.5 shows that candidate progression in this scenario is even worse than the general candidate distribution for missed chunks in Figure 3.4. In this case, even after a few attempts, peers requesting a chunk still have very few candidates to request from. So, the lack of balance in spread
may result in this miss scenario.

Figure 3.4. Distribution of the number of candidates that could deliver a chunk in each attempt, for chunks that have been missed.

Figure 3.5. Distribution of the number of candidates that could deliver a chunk in each attempt, for chunks that have been missed and requested from the same candidate at all attempts.

Temporal locality. We have also investigated bursts of chunk misses, i.e., occurrences of consecutive chunk misses. Losing many chunks in a row is undesirable because it accentuates video disruption. Figure 3.6 gathers details about the frequency and size of these bursts. The inner figure zooms in on bursts with fewer than 10 misses. The outer figure shows an extended view limited to 100 misses. In the figure, each frequency is weighted by the amount of chunks in the burst. That helps to identify relevant events, such as a peer with a huge miss sequence. If we look at the data for frequency alone, instead of frequency multiplied by size, only 1% of the bursts are beyond 10 misses, and just 0.2% beyond 100. These results show that the majority of chunk misses lack temporal locality, i.e., if we look at a peer, there is a tendency for its chunk misses to occur individually instead of grouped. Still, a few peers may occasionally have a continuous issue that impairs their quality of experience (not shown).

Figure 3.6. Frequency distribution of bursts of missed chunks, weighted by their size.

Spatial locality.
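Under the assumption that a burst is a maximal run of consecutively missed chunks, the size-weighted frequency plotted in Figure 3.6 can be computed as in this sketch (function names and sample data are illustrative):

```python
# Sketch of the burst statistic described above: bursts are maximal runs of
# consecutive misses, and each burst size's frequency is weighted by the
# number of chunks in bursts of that size.
from collections import Counter

def burst_sizes(missed_flags):
    """missed_flags: per-chunk booleans (True = missed), in playback order."""
    sizes, run = [], 0
    for missed in missed_flags:
        if missed:
            run += 1
        elif run:
            sizes.append(run)
            run = 0
    if run:
        sizes.append(run)
    return sizes

def weighted_frequency(sizes):
    """Fraction of all missed chunks accounted for by each burst size."""
    total_missed = sum(sizes)
    counts = Counter(sizes)
    return {s: s * c / total_missed for s, c in counts.items()}

sizes = burst_sizes([0, 1, 1, 0, 1, 0, 0, 1, 1, 1])  # bursts of 2, 1 and 3
weights = weighted_frequency(sizes)
```

Weighting by size is what makes a single long burst stand out even when long bursts are rare by count, as discussed for Figure 3.6.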
Finally, the last result conveys the spread of chunk misses, i.e., we investigate whether a given chunk is lost by several peers in the network or whether losses have little spatial locality. If a given chunk is lost by several peers, that means that the distribution tree for that specific chunk has had issues, probably in the early hops of the tree. In Figure 3.7 we present the cumulative distribution of chunks by the percentage of peers that miss a chunk. We have observed that no chunk has been lost by more than 3% of the total amount of peers. From the levels in the figure we can extract that 10% of the chunks were received by all peers, 22% were received by all but one, 25% by all but two, 21% by all but three, 10% by all but four, and 7.5% by all but five, leaving 5% of the chunks with losses that reach more than five peers. The majority of chunk misses lack spatial locality, i.e., if we look at a chunk miss, there is a tendency for it to happen at a small number of peers.

Summary. The main results of this fine-grained analysis for a resourceful scenario are that 1) retrying had low impact on recovery; 2) chunks that have been missed were less available, so peers had fewer candidates to request from; 3) chunk misses generally do not happen consecutively, in long bursts; and 4) any given chunk has been missed by only a few peers throughout the overlay. These findings help in a better understanding of losses. Chunk misses have been rare and fairly independent of each other, suggesting that occasional availability issues are their causes, not their consequences.

Figure 3.7. Cumulative distribution of chunks that are not received by part of the network.
3.4 Analysis of Bandwidth Constraint Scenario

In this section we evaluate the impact of increasing bandwidth constraints – introducing free riders – on chunk loss and latency. As expected, results get worse once we start to reduce the available bandwidth. We show that conscious free riders have a minor impact on both evaluation metrics, but that oblivious free riders seriously degrade them. We then introduce the Simple Unanswered Request Eliminator (SURE) to mitigate the impact of oblivious free riders. Finally, we show that the increased workload caused by free riders is evenly balanced among cooperative peers and is manageable.

3.4.1 Conscious Free Riders

Figure 3.8(a) shows the distribution of the chunk playback deadline miss rate (chunk loss) for each peer in a set of runs. We plot multiple curves varying the fraction of conscious free riders. From what has been observed previously (Figure 3.1), we note that some chunks may miss their playback deadline for factors other than uncooperativeness. For instance, our results include PlanetLab nodes that may be overloaded, lacking enough network bandwidth or CPU to download and process chunks. The chunk miss rate is qualitatively similar for fractions of free riders below 50%. With less than 50% of free riders, around 65% of peers receive all chunks before their playback deadline and 92% of peers experience chunk miss rates lower than 3%. Chunk losses increase significantly with 70% of free riders, making it unnecessary to show higher free rider ratios. Peers have fewer partners that can provide chunks, and they may fail to receive a chunk before its playback deadline. If a content provider intends to
sustain a system with 70% (or more) of free riders, it may need a hybrid architecture with well-provisioned support peers to cover the missing resources. Figure 3.8(b) shows the distribution of the average chunk latency for each peer in a set of runs. Again, we plot multiple curves varying the fraction of conscious free riders. The average latency is representative of a peer's chunk latencies: the standard deviation of chunk latencies is less than 1 second for 93% of peers. Figure 3.8(b) shows that latency stays stable if the number of conscious free riders is less than 10%. As the fraction of free riders increases, the fraction of a peer's partners willing to contribute decreases. This shortage of contributing partners causes chunk forwarding paths to become longer, meaning a chunk needs to traverse more hops on average to reach all peers. The median chunk forwarding path length is 3.5 for 10% of free riders, but increases to 4.1 for 50% of free riders.

Figure 3.8. Chunk playback deadline miss rate (chunk loss) and latency, for conscious (3.8(a) and 3.8(b)) and oblivious (3.8(c) and 3.8(d)) free riders. Curves are scenarios with the indicated free rider ratio.
Longer forwarding paths increase latency, since there are delays associated with buffer map advertisements, chunk requests, and the data transfer itself.

3.4.2 Oblivious Free Riders

When a peer requests a chunk from an oblivious free rider, its request will time out and will have to be retried later. In particular, the chance that retries are (repeatedly) sent to an oblivious free rider is proportional to the fraction of free riders in the overlay. We compare the number of request retries in scenarios with conscious and oblivious free riders. Retries are rarely needed when free riders are conscious, occurring only in corner cases (e.g., when a peer removes a chunk from its local buffer while a request is in transit). Thus, peers usually make a single request per chunk, with a median of 1.007 tries per hop in the chunk forwarding path. When free riders are oblivious, the number of retries needed to obtain a chunk increases significantly. For instance, with 50% of oblivious free riders, the median number of tries per hop on the chunk forwarding path is 1.4. Consecutive retries waste bandwidth and time, thus increasing the average chunk latency. Moreover, consecutive retries may increase chunk losses if no retry succeeds before the chunk playback deadline. Figure 3.8(d) shows the distribution of the average chunk latency for each peer in a set of runs. Even 10% of oblivious free riders increase the median of the distribution of average chunk latency to 3.28 seconds, a 15% increase compared to the scenario without free riders. This is significantly worse than having 10% of conscious free riders (Figure 3.8(b)), which has almost no impact on latency. System performance gets even worse as the fraction of oblivious free riders increases. The median average chunk latency is 6.49 seconds if the overlay has 50% of free riders, a 127% increase compared to the scenario without free riders. As for chunk losses (Figure 3.8(c)), there are three important things to notice.
First, the number of peers that receive 100% of the chunks is greatly reduced. In this scenario, requesting chunks from a peer that will not respond is a serious problem because it wastes the limited number of tries. Once the maximum number of tries is reached, peers give up on the requested chunk, which results in chunk loss. Second, the visual degradation in the chunk loss figure is not as bad as in latency, because peers eventually find chunks after a few tries, but latency delays accumulate between hops. Yet, even a small degradation in chunk loss plays a large role in the QoS impact. Finally, if we had established a maximum acceptable latency threshold, such as 8 seconds, many received chunks with high latencies would be dropped, translating into even more chunk losses than what is presented. Chunk request retries are the major difference between scenarios with conscious and oblivious free riders. If we could avoid chunk requests to peers that are unable to respond – e.g., oblivious free riders – in the first place, we would limit their negative impact on system performance.

3.4.3 Simple Unanswered Request Eliminator

We introduce and evaluate SURE, a modification to our system's request scheduler that avoids excessive retries caused by oblivious free riders. Our goal is to show that even simple solutions can significantly reduce the impact of uncooperativeness on P2P live streaming systems. SURE maintains a counter of pending requests for each partner. Whenever a peer sends a chunk request to a partner, it increments that partner's pending requests counter. Whenever a peer receives a chunk from a partner, it decrements that partner's counter. The idea is that counters for oblivious free riders will increase rapidly, while counters for contributing peers will remain low. When sending a chunk request, peers choose the partner with the smallest counter among the partners that have that chunk.
If multiple partners have the same amount of pending requests, SURE picks one at random.

Figure 3.9. Chunk loss and latency comparison over the results of the conscious, oblivious and SURE scenarios using a 50% free rider ratio.

Figure 3.9(b) shows the distribution of the average chunk latency for each peer for 50% of conscious and oblivious free riders (curves "conscious" and "oblivious" in Figure 3.9(b) are the same as "50%" in Figures 3.8(b) and 3.8(d)). We also plot the chunk latency for a scenario with 50% of oblivious free riders when using SURE. With our modification we are able to reduce the number of retries. The chunk latency is almost equivalent to the baseline scenario of conscious free riders; the horizontal difference between these curves is less than 0.3 seconds. Results for different fractions of free riders are qualitatively similar (Appendix A). Similarly, Figure 3.9(a) shows the distribution of the chunk playback deadline miss rate for 50% of conscious and oblivious free riders, as well as the miss rate for oblivious free riders when using SURE. Again, this technique reduces chunk losses to levels equivalent to those of the baseline scenario with conscious free riders. In particular, SURE halves the fraction of peers with chunk losses higher than 3%. SURE works because it identifies uncooperative peers within few interactions. When a peer joins the channel, all its partners have zeroed pending requests counters and it may send requests to uncooperative partners. However, uncooperative partners' pending requests counters increase quickly and they are avoided until the end of the partnership.
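The scheduler modification described above can be sketched as follows; this is a minimal illustration of SURE's counter logic under assumed names, not the system's actual code:

```python
# Minimal sketch of SURE: per-partner pending-request counters, choosing
# the candidate with the fewest pending requests and breaking ties at random.
import random
from collections import defaultdict

class SureScheduler:
    def __init__(self):
        self.pending = defaultdict(int)  # partner -> pending requests

    def choose(self, candidates):
        """Pick the candidate with the smallest pending-request counter."""
        best = min(self.pending[p] for p in candidates)
        tied = [p for p in candidates if self.pending[p] == best]
        return random.choice(tied)       # tie-break at random

    def on_request_sent(self, partner):
        self.pending[partner] += 1       # increment on every request

    def on_chunk_received(self, partner):
        self.pending[partner] -= 1       # decrement on every answer

sched = SureScheduler()
# An oblivious free rider ("fr") never answers, so its counter only grows;
# a cooperative partner ("coop") answers and its counter returns to zero.
for _ in range(3):
    sched.on_request_sent("fr")
sched.on_request_sent("coop")
sched.on_chunk_received("coop")
print(sched.choose(["fr", "coop"]))  # "coop": fewest pending requests
```

Note how the tie-break doubles as load balancing: among equally loaded partners, requests alternate because each sent request raises a counter.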
Another advantage of SURE is that it balances the load among peers. Consider two partners that have many desirable chunks and the same value on their pending requests counters. When SURE issues a request to one of the partners and increments its pending requests counter, it will prefer the other partner for the next request, as it will have a smaller counter. Finally, SURE depends only on local and individual information, eliminating the need for measurements between the peer and its partners. SURE shows that even simple solutions allow P2P streaming systems to mitigate most of the impact of uncooperative peers on system performance. We leave the evaluation of recovery mechanisms (e.g., slowly decrementing counters over time, optimistic unchoke) as future work. Recovery mechanisms may improve SURE's performance for long stream sessions or in scenarios where peers change behavior over time. Finally, we note that other alternatives to eliminate unanswered requests are possible, such as using round trip time instead of pending requests as a decision metric.

3.4.4 Workload Distribution Induced by Uncooperative Peers

In previous sections we have discussed how free riding increases chunk loss but is manageable up to a certain point. Yet, a natural side question is what happens to the workload distribution when peers do not contribute to the system's aggregate upload capacity; there must be nodes bearing the workload. We categorize peers using their cooperation level C, i.e., the ratio of a peer's average upload rate to the video stream bitrate throughout the experiment. Peers are classified as free riders if C = 0, uncooperative if 0 < C ≤ 1, cooperative if 1 < C ≤ 5, and altruistic if C > 5. To understand the workload distribution, Figure 3.10 presents the fraction of peers in each category for an increasing fraction of free riders.
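The classification by cooperation level C can be written directly from the thresholds above; the function name and sample upload rates are illustrative:

```python
# Classification by cooperation level C, the ratio of a peer's average
# upload rate to the stream bitrate, using the thresholds defined above.
def classify(avg_upload_rate, stream_bitrate):
    c = avg_upload_rate / stream_bitrate
    if c == 0:
        return "free rider"     # C = 0
    if c <= 1:
        return "uncooperative"  # 0 < C <= 1
    if c <= 5:
        return "cooperative"    # 1 < C <= 5
    return "altruistic"         # C > 5

BITRATE = 420  # kbps, the stream rate used in the experiments
print(classify(0, BITRATE))     # free rider
print(classify(300, BITRATE))   # uncooperative (C ~ 0.7)
print(classify(840, BITRATE))   # cooperative (C = 2, near the reported average)
print(classify(3360, BITRATE))  # altruistic (C = 8, near the reported average)
```

The sample rates of 840 and 3360 kbps correspond to the ~2 and ~8 stream rates reported below as the stable average workloads of the cooperative and altruistic categories.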
For instance, checking the y axis while looking at 30% induced free riders (on the x axis), we get 41% uncooperative, 25% cooperative and 4% altruistic peers. We extract data from experiments using SURE with oblivious free riders, but results with conscious free riders are quantitatively similar. Even when there are no free riders in the system, most peers are classified as uncooperative. A large fraction of the network load is carried by a small number of cooperative and altruistic peers.

Figure 3.10. Load distribution for several free rider ratios. (free riders, C = 0; uncooperative, 0 < C ≤ 1; cooperative, 1 < C ≤ 5; altruistic, C > 5)

This unbalanced load distribution happens because 1) TVPP does not use any strong load balancing scheme and 2) peers closer to the server receive chunks earlier and have more time to redistribute them than peers far from the server. Uncooperative peers occur due to intrinsic protocol mechanisms, as also observed in SopCast [Oliveira, 2010]. As the fraction of free riders increases, the workload of cooperative peers increases and they shift to the altruistic category. However, the average workload per peer in the cooperative and altruistic categories remains stable, close to 2 and 8 stream rates, respectively, if there are less than 70% of free riders. The maximum neighborhood size has been set to 20 peers in our system configuration, which creates an upper bound on each peer's contribution. Figure 3.10 shows that 70% of peers are uncooperative even when there are no free riders. Peers remain fairly stable in their categories, suffering significant changes beyond a 50% free rider ratio.
Previously cooperative peers turn into free riders in scenarios with more than 70% of free riders, resulting in a significant workload increase on the remaining cooperative and altruistic peers. After this turning point, system behavior converges quickly to a client-server model, where most peers contribute nothing and a few well-provisioned peers sustain all the workload. Finally, we note that although 70% of free riders is the upper limit before system collapse, performance degradation starts after 50% of free riders, as discussed above through Figure 3.8. With less than 50% of peers as free riders, the lack of contribution incurred by uncooperativeness is manageable; the workload is evenly balanced among contributing peers, and the impacts on chunk loss and latency are tolerable. But even when there are no free riders, several peers behave uncooperatively although they probably have spare bandwidth while chunk losses are happening around the overlay. If we had a mechanism beyond overlay maintenance (such as the Emergency Request Service, Section 2.4) to link uncooperative peers with spare bandwidth to peers that are experiencing problems requesting specific chunks, we might reduce chunk losses and exploit this spare resource. Finally, we argue that denying service to uncooperative peers may not be the best long-term approach; while content providers would probably benefit from having a higher number of viewers, our findings suggest that P2P live streaming can support uncooperative peers and possibly even explore their resources.

Conclusions

Chunk loss is an important issue present in several P2P live streaming scenarios, and it can reduce the quality of experience for users. This chapter describes a set of experiments run on TVPP, mostly aimed at chunk loss characterization. We describe the parameters and scenarios generally used throughout the experiments.
Then we evaluate both a resourceful and a bandwidth constrained scenario. We have found that in a resourceful scenario most peers do not have chunk loss problems, while a few peers present significant losses. The system-wide average chunk loss has been stable over time. While losses do occur, there are many moments when peers had no requests made to them, which suggests that the overlay organization might be improved. Mostly, requests are responded in the first attempt, by the first peer to announce that it has the chunk. Chunks that have been missed have had a worse spread – having fewer candidates to request from – than chunks that have been received. Finally, chunk loss lacks spatial and temporal locality. For this scenario, results lead us to believe that losses are somewhat erratic, and thus an emergency request mechanism (described in Section 2.4) would be effective to overcome virtually all these losses. Losses get worse, yet are still sustainable, once we start to reduce bandwidth through the introduction of free riders. Observing conscious free riders, we noted that chunk loss has been qualitatively similar for fractions of free riders below 50%, while latency has increased by 1 s. However, the same fraction of oblivious free riders causes a significantly worse degradation, mostly triggered by the increase in retries. Consecutive retries waste bandwidth and increase the average chunk latency. We have introduced SURE, a modification to our system's request scheduler that avoids excessive retries caused by oblivious free riders by identifying uncooperative peers within few interactions. Using SURE we have obtained similar results for the same amount of conscious and oblivious free riders. While studying chunk loss, we have shown that P2P live streaming systems can sustain 50% of free riders with negligible degradation. We have also found that the workload incurred by free riders is evenly balanced among contributing peers and is manageable.
We argue that denying service to uncooperative peers may not be the best long-term approach; our findings suggest that P2P live streaming can support uncooperative peers.

Chapter 4

AERO: Adaptive Emergency Request Optimization

In this chapter we analyse the chunk delivery mechanism present in CDN peer-assisted systems. Live streaming platforms employ advanced mechanisms to guarantee continuous and scalable video playback to large user bases. One such mechanism is CDN-P2P streaming, where servers are hosted on content distribution networks and clients help disseminate content over a peer-to-peer overlay. Hybrid CDN-P2P platforms are slightly different from pure P2P and, therefore, results in this chapter were obtained with SmoothCache [Roverso et al., 2012, 2015] instead of TVPP. In CDN-P2P streaming, all data chunks can be obtained from the CDN. Those who are interested in the content form a supporting P2P overlay, which should be queried first for data. If a chunk cannot be found or a request fails to be answered by its partners, all peers know the CDN servers' addresses and will request chunks directly from them. We mapped this behavior as an equivalent of TVPP's proposed emergency request service (Section 2.4). Emergency requests allow the retrieval of nearly missed pieces and avoid chunk losses, guaranteeing continuous playback. We show that emergency requests deliver chunks close to their playback deadlines, leaving little to no time for them to be disseminated over the peer-to-peer overlay. We present AERO, a mechanism that dynamically adjusts the rate at which CDN-hosted servers seed content pieces into the peer-to-peer overlay as a function of network conditions. Our evaluation of AERO under diverse conditions shows it reduces emergency requests, guarantees efficient peer-to-peer dissemination, and provides significant server upload bandwidth savings.
4.1 Introduction

Peer-to-peer (P2P) content distribution improves the scalability of live streaming platforms and reduces costs. One disadvantage, however, is that P2P distribution is best-effort and cannot guarantee quality of service (QoS). This is especially impactful in real-time live streaming [Hei et al., 2007b; Lu et al., 2009]. Modern streaming platforms complement P2P distribution with reliable high-capacity servers hosted on content delivery networks (CDNs) [Roverso et al., 2015, 2012; Lu et al., 2012; Zhao et al., 2013; Yin et al., 2009; Mansy and Ammar, 2011]. To guarantee QoS, platforms allow peers to issue 'emergency' requests to servers whenever the P2P overlay fails to distribute a video chunk before its playback deadline [Roverso et al., 2012, 2015; Payberah et al., 2012b; Kreitz and Niemela, 2010; Zhang, 2005; Jin et al., 2013]. CDNs are well suited for such a solution as they are built to handle variable load and can guarantee QoS up to the contracted capacity.

Streaming cost in hybrid CDN-P2P streaming platforms is proportional to bandwidth utilization at the CDN-hosted servers. Server bandwidth utilization is composed of two elements: 1) the number of peers of the P2P overlay which are being seeded by servers, and 2) the rate of emergency requests. Increasing the number of seeded peers decreases both the dependence on P2P distribution and the number of emergency requests. The challenge is to balance the number of seeded peers and emergency requests to minimize cost. Current platforms compute the number of seeded peers as a function of peer upload bandwidth [Roverso et al., 2012, 2015; Simoni et al., 2014; Yin et al., 2009] or peer arrival and departure rates [Tewari and Menon, 2009; Yin et al., 2009; Mansy and Ammar, 2011]. Unfortunately, we show that such allocations may become suboptimal depending on network conditions (Section 4.4).
We also show that emergency requests may undermine P2P distribution efficiency. In these cases, peers receive responses to emergency requests close to playback deadlines and do not have time to widely redistribute these chunks through the P2P overlay (Section 4.4).

In this chapter we present Adaptive Emergency Request Optimization, AERO, a mechanism to minimize streaming cost (Section 4.6). AERO dynamically computes the number of peers of the P2P overlay which are being seeded by servers to reduce total bandwidth consumption on servers. When the P2P overlay is distributing video chunks efficiently and peers have unused upload bandwidth, AERO decreases the number of seeded peers. The rate of emergency requests does not increase, and total bandwidth utilization at servers decreases as peers start using their spare bandwidth to take over the distribution previously driven by the peer that is no longer seeded by CDN servers. When the P2P overlay is not distributing video chunks efficiently but peers have unused upload bandwidth, AERO increases the number of seeded peers. The bandwidth consumption from the additional seeded peers is compensated by a greater reduction in the rate of emergency requests, as seeded peers using spare upload bandwidth possibly start new chunk redistribution chains in the P2P overlay.

Our evaluation shows that the rate of emergency requests is a good estimator of P2P distribution efficiency (Section 4.7). AERO compares the observed and expected rates of requests made to servers to adapt the number of seeded peers to variable overlay conditions. We evaluate AERO across different scenarios varying peer upload bandwidth, fraction of free-riding peers, peer churn (up to flash crowds), overlay sizes, and overlay maintenance policies. Our results show that AERO increases server upload bandwidth savings by up to 30% compared to static configuration of the number of seeded peers.
AERO is flexible and can be used in combination with previous mechanisms for CDN-P2P live streaming. AERO is complementary to P2P overlay organization mechanisms [Castro et al., 2003; Zhang et al., 2005; Li et al., 2008; Tang et al., 2009; Traverso et al., 2014]; in particular, AERO does not impose any structure on the P2P overlay and can be applied to both tree-like and mesh overlays. AERO is also complementary to algorithms for peer sampling [Roverso et al., 2013] and chunk scheduling [Zhao et al., 2009]. As increasing broadband penetration drives up media consumption, content providers need efficient distribution mechanisms. We believe AERO is useful to large-scale content providers with global overlays that have variable and unpredictable performance, as well as to small-scale content providers who have fewer resources and are proportionately more heavily impacted by operational costs.

4.2 CDN-P2P Live Streaming

CDN peer-assisted (CDN-P2P) live streaming systems [Roverso et al., 2015, 2012; Lu et al., 2012; Zhao et al., 2013; Yin et al., 2009; Mansy and Ammar, 2011; Payberah et al., 2012b] share many mechanics with pure P2P systems (such as TVPP, Chapter 2). CDN clients form a peer-to-peer overlay between themselves and, whenever content cannot be found over P2P, they resort to CDN servers. In this section we present a broad view of a CDN-P2P live streaming system considering bootstrapping, chunk delivery from CDN to peers, chunk scheduling, overlay maintenance, and others. Some of these features are similar to those presented in Chapter 2 while others are particular to the studied platform.

Figure 4.1. CDN-P2P chunk flow overview

Live streaming systems have a back-end that captures, encodes, and splits a video stream into chunks. Chunks have fixed size but may contain multiple video frames under variable bit rate encoding.
In hybrid CDN-P2P systems, the back-end distributes video chunks to a set of well-provisioned servers hosted on CDNs or on cloud providers. CDN clients become peers, organizing themselves in a peer-to-peer overlay (Figure 4.1). Servers seed, i.e., immediately redistribute, video chunks received from the back-end to a subset of these peers. Peers that are being seeded by servers are tagged as prefetchers. Peers collaborate with each other to receive and redistribute video chunks in a P2P overlay. Peers have a local buffer to store video chunks downloaded before their playback, to synchronize playback at a constant delay relative to the video's capture, and to allow P2P distribution.

Servers also answer emergency requests that peers issue when the P2P overlay does not deliver a video chunk before its playback deadline. Unless underlying connectivity issues prevent timely data transmission (e.g., congestion), emergency requests ensure all chunks are delivered and guarantee continuous playback. In CDN-P2P streaming, the total cost to the content provider is dominated by the aggregate upload bandwidth across all servers.

New peers enter the CDN-P2P system via a bootstrap that maintains membership information for each streaming channel, i.e., which peers are receiving the stream and participating in its P2P overlay. Peers register with the bootstrap and query membership information to discover potential neighbors and join the P2P overlay. Moreover, peers periodically report neighborhood information and performance metrics to the bootstrap. The bootstrap aggregates reports to build snapshots of the P2P overlay.

The bootstrap also stores metadata about the latest chunk available at the servers. Prefetchers receive the latest chunk available as soon as servers have it; the bootstrap, however, delays the metadata update, pushing prefetchers slightly ahead of the others.
Peers joining the overlay are tricked into considering an older chunk as the latest one, thus giving prefetchers a time advantage to spread chunks through P2P [Roverso et al., 2012]. For example, at time t the back-end produces chunk ct and pushes it to the CDN servers, which immediately let prefetchers request it. Prefetchers announce ct and start its distribution before other peers need it. The bootstrap's 'latest chunk available' metadata, however, is updated to announce chunk ct−1, so peers believe they need to look for ct−1. Since the spread of ct−1 started at t−1, they might already have it in their buffers. The time interval between t−1 and t is used to distribute ct−1.

Each peer p joins the streaming channel in initialization mode, in which p requests chunks directly from CDN servers. This initialization reduces the time to fill peer p's buffer and ensures data availability. Initialization mode ends when peer p fills its buffer and establishes neighborhood partnerships with a subset of the peers it received from the bootstrap.

Each peer p maintains two sets of neighbors. A peer p receives chunks from a set of input neighbors, denoted Ip, and sends chunks to a set of output neighbors, denoted Op. We call the number of input and output neighbors, i.e., |Ip| and |Op|, a peer's in- and out-degree, respectively. Table 4.1 summarizes notation. If a peer p is not receiving all the chunks it needs from its input neighbors (i.e., it is issuing emergency requests), it removes from Ip the neighbor with the highest number of unanswered chunk requests. Previous research on overlay construction algorithms shows that keeping peers with more upload bandwidth closer to the server improves distribution efficiency [Felber and Biersack, 2005; Payberah et al., 2012a; Traverso et al., 2014]. A peer p will try to establish input partnerships with known peers that have equivalent or more upload bandwidth than itself.
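The neighborhood policy just described can be captured in a short sketch. This is an illustrative reconstruction under our own naming (`input_candidates`, `drop_worst_input`), not code from the studied platform:

```python
def input_candidates(my_bw, known_peers):
    """Input-partnership policy: a peer only tries to establish input
    partnerships with known peers whose upload bandwidth is equal to
    or greater than its own."""
    return [peer for peer, bw in known_peers.items() if bw >= my_bw]

def drop_worst_input(unanswered):
    """When a peer is issuing emergency requests, it removes from Ip
    the input neighbor with the most unanswered chunk requests."""
    return max(unanswered, key=unanswered.get)
```

Note that a free rider (bandwidth zero) sees every known peer as a candidate, while a high-bandwidth peer is left with few options; this asymmetry matters for the efficiency analysis in Section 4.4.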
A peer p limits its out-degree, |Op|, to the number of streams it can upload (i.e., the ratio of its upload bandwidth, Bp, to the video streaming rate, R) and denies partnership requests if Op is full. A peer p periodically refreshes its list of known peers using two peer sampling (Section 2.2.4) methods: by contacting the bootstrap directly, and by exchanging membership information with other known peers through gossiping [Roverso et al., 2013].

Table 4.1. Definitions and notation.

var.   definition
P      Set of all peers.
Ip     Set of input neighbors of peer p.
|Ip|   In-degree of peer p.
Op     Set of output neighbors of peer p.
|Op|   Out-degree of peer p.
S      Server's aggregated seeding ratio.
OS     Set of prefetchers.
|OS|   Number of prefetchers.
Bp     Upload bandwidth of peer p.
R      Video streaming rate.
Up     Peer p's upload rate (to neighbors in Op).

Peers periodically send buffer maps to output neighbors in Op announcing the chunks stored in their buffer. Peers generate chunk requests sequentially, i.e., earliest deadline first, and serve requests in order of arrival. When multiple neighbors can provide video chunks, a peer requests a number of chunks proportional to each neighbor's upload bandwidth.

The benefit to P2P overlay distribution obtained from seeding the stream to a peer is proportional to that peer's upload bandwidth [Felber and Biersack, 2005; Payberah et al., 2012a; Traverso et al., 2014]. As a result, servers choose as prefetchers the peers with the highest upload bandwidth and compute the number of prefetchers as a function of the obtained benefit. Servers seed streams to a set of peers denoted OS. The number of prefetchers, |OS|, is defined as a function of the seeding ratio S, the ratio of the prefetchers' aggregate upload bandwidth to all peers' aggregate upload bandwidth.
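Operationally, the smallest such set can be found greedily: scan peers in decreasing order of upload bandwidth and stop once the accumulated bandwidth reaches the fraction S of the total, which is what Eq. 4.1 formalizes. A minimal sketch, with the function name ours:

```python
def num_prefetchers(bandwidths, seeding_ratio):
    """Size of the smallest set of peers whose summed upload bandwidth
    is at least seeding_ratio times the total upload bandwidth.
    Scanning in decreasing bandwidth order yields the smallest set."""
    target = seeding_ratio * sum(bandwidths)
    accumulated, count = 0.0, 0
    for bw in sorted(bandwidths, reverse=True):
        if accumulated >= target:
            break
        accumulated += bw
        count += 1
    return count
```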
More formally, if P is the set of all peers, F is a candidate set of prefetchers, and Bp is the upload bandwidth of peer p, then the number of prefetchers is the size of the smallest set of peers whose summed bandwidth is greater than or equal to a fraction S of all peers' summed bandwidth, or:

|OS| = min { |F| : F ⊆ P ∧ Σ_{p∈F} Bp ≥ S · Σ_{q∈P} Bq }.   (4.1)

4.3 Simulation Setup

In this section we describe the simulation environment and experiment setup we use to evaluate distribution efficiency in Sections 4.4 and 4.7. We evaluate CDN-P2P streaming costs using a simulator which shares most of its code with SmoothCache, a real implementation of the CDN-P2P system described in Section 4.2 [Roverso et al., 2012, 2015]. The simulator substitutes the system's networking code with a network emulator that allows control of a number of characteristics such as link bandwidths, transmission delays, queueing, and NATs. We repeat each simulation five times and report averages. We simulate for 20 or 60 minutes depending on the scenario. Numerical results ignore data from the first three minutes of all simulations to avoid the overlay's warm-up period.

Figure 4.2. Peer upload bandwidth distributions (cumulative fraction of peers vs. upload bandwidth in units of the 331Kbps stream rate; US, BR, and SE curves).

Underlay configuration. We configure the network emulator without any underlay bandwidth constraints (but we do limit peer upload bandwidth below) and set the transmission error rate to zero. We choose the one-way latency between peers uniformly distributed between 10ms and 50ms. Although the underlay never drops video chunks, we note the P2P overlay may still fail to disseminate a video chunk due to lack of peer upload bandwidth and inefficient neighborhoods leading to long, high-latency dissemination paths.
Peer upload bandwidth distributions. We evaluate CDN-P2P streaming costs using the three different peer upload bandwidth distributions shown in Figure 4.2. The upload bandwidth distributions for North American ('US' line) and Brazilian ('BR' line) hosts were collected from TestMy.net, a website where users can test their Internet speed. TestMy.net offers reports of the last 10000 upload and download bandwidth measurements performed on its servers from each country. We filtered out upload bandwidths greater than 10Mbps to avoid distortions caused by potentially non-domestic connections. The last distribution ('SE' line) was collected from upload bandwidths observed in an operational test deployment with thousands of peers over a Swedish corporate network [Roverso et al., 2012]. Unless stated otherwise, we use the bandwidth distribution of North American hosts as a baseline and refer to it as base (most results for the other bandwidth distributions can be found in Appendix B).

Restrictions to bandwidth distributions. From each distribution, we generate four resource-constrained scenarios. We consider two scenarios where we divide peer bandwidths by two and four, denoted div2 and div4, respectively. This is equivalent to multiplying the streaming rate by two and four, e.g., as needed for streaming HQ videos. We also consider scenarios where 50% and 75% of peers are free riders, i.e., do not upload video chunks, denoted 50f and 75f, respectively. To better illustrate the difference between these, imagine a hypothetical distribution in which all peers have the same bandwidth x. In its div2 version, all peers have bandwidth x/2, while in its 50f version half of the peers have bandwidth zero and the other half bandwidth x. These distributions make the system perform differently despite having the same average bandwidth. The same idea applies to div4 and 75f.
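The four restricted scenarios can be derived mechanically from a bandwidth sample; a sketch follows, where the function name and the use of a seeded RNG for reproducibility are our own choices:

```python
import random

def restrict(bandwidths, scenario, seed=1):
    """Derive a resource-constrained scenario from a list of peer upload
    bandwidths: div2/div4 scale every peer down, while 50f/75f zero out
    a random fraction of peers (free riders). For a uniform input both
    kinds keep the same average bandwidth but distribute it differently."""
    if scenario in ("div2", "div4"):
        divisor = {"div2": 2, "div4": 4}[scenario]
        return [bw / divisor for bw in bandwidths]
    fraction = {"50f": 0.5, "75f": 0.75}[scenario]
    rng = random.Random(seed)
    free_riders = set(rng.sample(range(len(bandwidths)),
                                 int(fraction * len(bandwidths))))
    return [0 if i in free_riders else bw
            for i, bw in enumerate(bandwidths)]
```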
Overlay sizes. We evaluate channel populations varying from 100 to 2000 simultaneous peers. Peers join the overlay at a constant rate over the first 100 seconds of the simulation. Unless stated otherwise, we run simulations with 500 peers.

Overlay configuration. We assume servers have bandwidth to upload the number of streams required to achieve the configured seeding ratio S, to support peers in initialization mode, and to answer all emergency requests. We set the seeding ratio S = 2.5%. This seeding ratio has been observed to yield positive results in previous work [Roverso et al., 2012] and is small enough to allow cost savings. We vary peer in-degrees |Ip| between 2 and 5, and minimum peer out-degrees |Op|min between 0 and 10. Unless stated otherwise, we set |Ip| = 3, |Op|min = 0 and |Op|max = min(⌊Bp/R⌋, 10), where Bp is peer p's upload bandwidth and R is the streaming rate. The out-degree is bounded by the number of streams a peer can upload, to avoid receiving more requests than can be served, and by a fixed threshold, to limit abuse of peer upload capacity.

Peer behavior. To further stress the P2P overlay with dynamic peer behaviors, we emulate both peer churn and flash crowd scenarios. For the peer churn scenario, we have considered a severe case where we remove and insert 3% of peers in the overlay at random every ten seconds; severe because this churn rate is at least three times higher than previously found in other studies [Simoni et al., 2014; Payberah et al., 2012b]. For the flash crowd scenario, we have considered an overlay starting with 100 peers and alternating between adding and removing 1000 peers every ten minutes.

4.4 P2P Distribution Efficiency

Our goal is to evaluate P2P distribution efficiency for different network configurations. We show that distribution efficiency can degrade significantly in bandwidth-constrained scenarios even though peers have spare upload bandwidth.
Video chunks are either distributed by the servers via seeding, peer initialization, and emergency requests, or disseminated by the P2P overlay. We note that peers that do not receive a video chunk from the P2P overlay issue an emergency request for that chunk. There are no "missed" chunks; servers cover all costs and distribute all chunks that the P2P overlay did not. We measure P2P distribution efficiency as the fraction of chunks that servers did not have to distribute thanks to the P2P overlay, which we refer to as savings. More precisely, savings is the ratio between the number of chunks distributed by the P2P overlay and the total number of distributed chunks.

Figure 4.3. Savings over time for different upload bandwidth distributions (base, 50f, 75f, div4).

Figure 4.4. Partnership establishment attempts (% of total) over time for different upload bandwidth distributions (base, 50f, 75f, div4).

Figure 4.3 shows savings over time for four different peer upload bandwidth scenarios. Results for div2 (omitted) are quantitatively similar to results for 50f. Savings for base and 50f stabilize at 96% and 95% approximately three minutes after the start of the experiment. In these scenarios, server workload (seeding, peer initialization, and emergency requests) accounts for 4–5% of all distributed chunks. Savings increase sharply in the beginning as peers optimize the overlay, self-organizing away from the servers in decreasing order of upload bandwidth. Figure 4.3 shows that savings for the 75f and div4 scenarios stabilize substantially lower, at 77% and 50%, respectively. This means that servers are responsible for distributing 23% and 50% of all chunks, workloads that are 5 and 10 times higher than in the base scenario. One may think
that savings is low because peer upload bandwidth is fully utilized and insufficient to distribute chunks, but we find this is not the case. We identify several reasons for inefficient P2P distribution.

Peers cannot establish input partnerships. Recall from Section 4.2 that a peer p limits its out-degree, |Op|, to the number of streams it can upload, and that the system employs a bandwidth-aware overlay construction policy. Low upload bandwidths lead to low out-degrees. Low out-degrees across the overlay lead to competition for output partnerships. Moreover, as each peer tries to establish input partnerships with peers that have equivalent or more upload bandwidth than itself, high-bandwidth peers have fewer options for input peers. For example, a free rider can establish input partnerships with any peer, but a high-bandwidth peer can establish input partnerships with only a few peers.

Figure 4.5. Distribution of peer upload bandwidth utilization (cumulative fraction of peers vs. effective upload bandwidth utilization, %; base, 50f, 75f, div4).

Figure 4.4 shows the ratio between the number of partnership establishment attempts and the number of possible partnerships, i.e., min(Σp |Ip|, Σp |Op|), over time. The spike at the beginning is due to new peers establishing their initial partnerships and changing partnerships to optimize the P2P overlay. We observe that base and 50f build an efficient overlay and partnership establishment attempts stabilize close to zero. Under more severe bandwidth constraints such as in 75f and div4, peers keep attempting to establish new input partnerships, either because their previous attempts failed or because they are not obtaining sufficient data from the overlay. The issue could be eased by topology adjustments (e.g., different |Ip| and |Op| limits, or other overlay construction policies).
However, these configurations are scenario dependent, and more constrained scenarios need proportionally more fine-tuned configurations.

Peers underutilize upload bandwidth. Figure 4.5 shows the distribution of effective upload bandwidth utilization over time for different upload bandwidth distributions. Effective upload bandwidth utilization is the rate at which a peer p uploaded chunks to its output neighbors, Up, divided by the maximum rate at which that peer could upload chunks, min(Bp, |Op|R). From the data, the fraction of peers that are free riders or have negligible bandwidth and thus zero output neighbors (i.e., |Op| = 0) is 10%, 55%, 77%, and 39% in base, 50f, 75f, and div4, respectively. These values are visually correlated with the y-values where effective upload bandwidth utilization equals zero. Peers have average utilizations of 17% and 32% in the base and 50f scenarios, respectively, which is enough to redistribute most chunks and achieve high savings (Figure 4.3). Peers in the 75f and div4 scenarios also have spare upload bandwidth capacity; savings, however, remain low.

Table 4.2. Origin of video chunks received.

chunk origin        base    50f     75f     div4
Seeded by servers   1.4%    0.8%    0.3%    0.7%
Emergency request   3.0%    4.9%    25.9%   49.5%
P2P overlay         95.6%   94.3%   73.8%   49.8%

Table 4.3. Average number of traversed hops for a video chunk that enters the P2P overlay, clustered by origin.

chunk origin        base    50f     75f     div4
Seeded by servers   3.65    4.07    4.59    1.84
Emergency request   1.01    1.30    1.47    1.43

Figures 4.4 and 4.5 show that peers have spare upload bandwidth but cannot establish input partnerships, relying instead on emergency requests.
Manual inspection of logs from peers that issue many emergency requests shows that (i) these peers may only know peers with less upload bandwidth than themselves; or that (ii) they may know peers with more upload bandwidth but fail to establish input partnerships (i.e., the peers they know cannot establish additional output partnerships); or that (iii) they may have input neighbors that seldom transmit video chunks. When case (iii) happens, it is usually the case that upstream peers in the partnership chain are in cases (i) or (ii), also issuing many emergency requests.

Emergency requests are not P2P-friendly. Chunks obtained through emergency requests have short lifespans in the P2P overlay and cannot be widely distributed. Conversely, seeded chunks are transmitted early, stay longer in the P2P overlay, and have more opportunities to be distributed. Table 4.3 shows, for different upload bandwidth distributions, the average number of traversed hops for chunks that originated as seeds from CDN servers or as answers to emergency requests. We note that chunks obtained from seeded requests traverse more hops than chunks from emergency requests, a direct consequence of the previous claims. Table 4.2 shows, for different upload bandwidth distributions, the average fraction of chunks seeded by servers, transmitted as answers to emergency requests, and redistributed by the P2P overlay. We observe that in bandwidth-constrained scenarios low savings result from significantly higher fractions of emergency requests than in base. Peers that issue emergency requests may fail to redistribute chunks, leading to their output neighbors also issuing emergency requests, degrading overall P2P distribution efficiency. This gets more evident as resources get more constrained.

Summary. These findings show that static configuration of the number of prefetchers or seeding ratios may perform poorly depending on P2P overlay characteristics.
System performance, however, may be improved with specialized or more elaborate overlay maintenance mechanisms [Simoni et al., 2014; Payberah et al., 2012a; Fortuna et al., 2010; Felber and Biersack, 2005; Wichtlhuber et al., 2014; Tang et al., 2009; Traverso et al., 2014]. We next discuss those possible improvements and then propose a practical and complementary approach that adapts to variable overlay conditions.

4.5 Current Efficiency Improvements

A large body of work has been dedicated to building and evaluating algorithms and mechanisms for efficient P2P live streaming. Contributions cover techniques to optimize P2P overlay topologies [Simoni et al., 2014; Payberah et al., 2012a; Fortuna et al., 2010; Felber and Biersack, 2005], schedule chunk requests and transmissions [Fortuna et al., 2010; Carlsson and Eager, 2007; Zhao et al., 2009], adapt topologies according to overlay and network conditions [Wichtlhuber et al., 2014], increase peer cooperation and avoid free riding [Guerraoui et al., 2010; Piatek et al., 2010; Gonçalves et al., 2014], mitigate collusion attacks [Piatek et al., 2010], and handle peer churn and flash crowds [Kumar et al., 2007; Liu et al., 2012]. We note our CDN-P2P streaming system incorporates findings from previous work, e.g., it organizes peers away from servers in decreasing order of upload bandwidth [Felber and Biersack, 2005; Payberah et al., 2012a; Traverso et al., 2014]. These works improve a P2P overlay and may be applicable to CDN-P2P systems. However, they do not consider emergency requests, a central point in our proposed design (Section 4.6).

Regarding hybrid CDN-P2P live streaming, researchers have studied how to use system parameters and performance metrics (such as user churn rate, stream rate, average user upload bandwidth, and maximum tolerable chunk distribution delay) to build resource allocation models.
These allocation models can be used, e.g., to provision upload bandwidth at streaming servers [Tewari and Menon, 2009], to decide where to connect new peers joining the stream (e.g., CDN servers or the P2P overlay) [Yin et al., 2009], or to switch between CDN and P2P operation [Mansy and Ammar, 2011]. Another similar resource allocation solution comes from CLive [Payberah et al., 2012b], a hybrid cloud-P2P solution which dynamically adds and removes cloud-hosted servers as seeding supernodes in the P2P overlay. CLive considers overlay size, estimated peer upload bandwidth distribution, estimated chunk diffusion trees, and maximum acceptable chunk distribution delay to adjust cloud server demand. These solutions, again, do not consider emergency requests. Yet they share a similar concept with our proposed design, as adapting the number of prefetchers is a form of resource allocation. Our design does not require detailed information about system parameters, performance metrics, or network conditions. Compared to other solutions, ours requires less monitoring overhead and can be more easily deployed, as it makes decisions based only on server bandwidth consumption, which is trivial to obtain. It is a simpler alternative that can efficiently allocate resources on demand in diverse scenarios.

4.6 Adaptive Emergency Request Optimization

We propose AERO, the Adaptive Emergency Request Optimization mechanism. AERO's goal is to minimize server bandwidth consumption by increasing P2P distribution efficiency.

Table 4.4. Definitions and notation used by AERO.

var.    definition
R       Video streaming rate.
S       Server's aggregated seeding ratio.
|OS|t   Number of prefetchers at round t.
Ct      Server upload bandwidth consumption at round t.
Et      Seeding ratio error at round t (Ct − R|OS|t−1).
δ       AERO's seeding ratio scaling factor.
∆       AERO's seeding ratio scaling factor upper bound.
Our main idea is to configure the seeding ratio S dynamically, and, consequently, the number of prefetchers |OS|. AERO increases the seeding ratio when P2P distribution efficiency is low, seeding chunks to peers that can then use their upload bandwidth more effectively and possibly start new chunk redistribution chains in the P2P overlay. This improvement in P2P distribution efficiency is worth its cost because it reduces emergency requests. AERO decreases the seeding ratio when P2P distribution efficiency stabilizes, in an attempt to maximize savings. AERO tracks a practical seeding ratio as network conditions and P2P overlays change. Table 4.4 summarizes the notation used by AERO.

AERO operates in rounds. In each round, AERO decides whether to increase or decrease the seeding ratio S depending on historical seeding ratio allocations and bandwidth consumption observations, as shown in Alg. 1.

AERO keeps a history of bandwidth consumption at CDN servers, Ct, to compute whether bandwidth consumption is stable or not. AERO considers that bandwidth consumption is increasing or decreasing if current bandwidth consumption is higher or lower than that observed in the last three rounds. If the current bandwidth consumption is similar to those observed in previous rounds or if a trend is unclear, AERO considers that bandwidth consumption is stable.

Algorithm 1: AERO's algorithm to update the seeding ratio S at each round
  input: history of bandwidth consumption, seeding ratio S
  input: seeding ratio scaling factor δ (upper bound ∆)
  input: seeding ratio errors E at rounds t and t−1
  if bandwidth consumption is stable then
      S ← 0.95 S
  else
      if bandwidth consumption jumps then
          δ ← ∆
      else if sign(Et) = sign(Et−1) then
          δ ← min(δ/0.75, ∆)
      else
          δ ← 0.75 δ
      end
      S ← S + sign(Et) · δ
  end
More precisely, the trend of bandwidth consumption is given by Eq. (4.2), where I(·) is an indicator function that returns 1 when its condition holds and 0 otherwise:

    trend = { increasing, if Σ_{i=1}^{3} I(0.95 Ct > Ct−i) ≥ 2,
            { decreasing, if Σ_{i=1}^{3} I(1.05 Ct < Ct−i) ≥ 2,          (4.2)
            { stable,     otherwise.

AERO configures the seeding ratio S as a function of the CDN servers' upload bandwidth consumption due to seeding and emergency requests. We define the seeding ratio error at round t, Et, as the difference between the total server upload bandwidth consumption, Ct, and the expected bandwidth consumption, R|OS|t−1 (the streaming rate multiplied by the number of prefetchers in the last round). The seeding ratio error is positive when the seeding ratio is low and servers receive a large number of emergency requests (underprovisioning). Conversely, the seeding ratio error is negative when the seeding ratio is high and prefetchers are exchanging chunks among themselves (overprovisioning). AERO does not consider the upload bandwidth used to support peers in the initialization mode because this bandwidth demand is fixed per peer and because peer arrival times are outside the control of servers.

AERO gradually modifies the seeding ratio S using a variable called the scaling factor, δ. The scaling factor works as a fine tuner determining how much S should be modified in each round. Its value decreases or increases as S gets, respectively, near to or far from an optimum seeding ratio. We set a maximum seeding ratio modifier, ∆, to limit δ.

Whenever bandwidth consumption is not stable, AERO updates the seeding ratio S according to the seeding ratio errors Et and Et−1 of the last two rounds. If sign(Et) = sign(Et−1), the last two rounds both overestimated or both underestimated the optimal seeding ratio. AERO concludes it is not close to the optimum seeding ratio and increases the scaling factor by a constant factor (shown as 0.75 in Alg.
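Eq. (4.2) and the seeding ratio error translate directly into code; below is a sketch under our own naming, where `history` holds the per-round consumption samples.

```python
def seeding_error(C_t, R, prefetchers_prev):
    """E_t = C_t - R * |OS|_{t-1}: positive means underprovisioning."""
    return C_t - R * prefetchers_prev

def trend(history):
    """Classify the bandwidth-consumption trend as in Eq. (4.2).

    history: consumption samples [..., C_{t-3}, C_{t-2}, C_{t-1}, C_t].
    Returns 'increasing', 'decreasing', or 'stable'.
    """
    C_t, past = history[-1], history[-4:-1]
    # At least 2 of the last 3 rounds must sit clearly below (above) the
    # current consumption, with a 5% tolerance band, to call a trend.
    if sum(0.95 * C_t > C_i for C_i in past) >= 2:
        return "increasing"
    if sum(1.05 * C_t < C_i for C_i in past) >= 2:
        return "decreasing"
    return "stable"
```

The 5% band is what makes small oscillations read as "stable", which in turn triggers the systematic 0.95 decay of Alg. 1.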
1) to improve convergence speed. If sign(Et) ≠ sign(Et−1), then one round underestimated and the other overestimated the optimal seeding ratio. AERO decreases the scaling factor by a constant factor (shown as 0.75 in Alg. 1) to converge closer to the optimal seeding ratio. The scaling factor is restored to its upper bound whenever a jump in bandwidth consumption is detected, i.e., whenever bandwidth consumption doubles or halves between rounds. At each round, AERO increases the seeding ratio by the scaling factor when the system is underprovisioned (Et > 0) and decreases the seeding ratio when the system is overprovisioned (Et < 0).

The process above iterates until the scaling factor decreases enough that server bandwidth consumption stabilizes. To avoid getting stuck at a local minimum, AERO systematically decreases the seeding ratio by a constant factor once bandwidth consumption stabilizes (shown as 0.95 in Alg. 1). This systematic reduction of the seeding ratio will either reduce server bandwidth consumption or cause fluctuations due to underprovisioning. When bandwidth fluctuates, AERO reconverges to an optimum seeding ratio (possibly the same one). Stability periods may end due to varying network and P2P overlay conditions, e.g., peer churn or flash crowds.

AERO's round duration must be long enough to allow AERO to observe the impact of seeding ratio adaptation decisions on bandwidth consumption. This property is necessary to compute the seeding ratio errors Et, and may negatively impact convergence times on systems with very large buffers. AERO makes no assumptions about overlay properties or peer behavior, and can be deployed in conjunction with existing overlay maintenance or chunk scheduling mechanisms.

4.7 Evaluation

In this section we evaluate AERO in different deployment scenarios, varying peer upload bandwidth distributions, overlay construction policies, overlay sizes, and peer behavior.
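The jump detection described above (consumption doubling or halving between rounds) is a one-line predicate; a sketch with our own naming:

```python
def consumption_jumped(C_t, C_prev):
    """True when consumption doubled or halved since the last round,
    which makes AERO restore the scaling factor to its upper bound."""
    return C_t >= 2 * C_prev or C_t <= 0.5 * C_prev
```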
As in Section 4.4, we consider four of the five base peer upload bandwidth distribution derivatives: base, 50f, 75f, and div4. We show that AERO increases P2P savings across all scenarios, trading bandwidth consumption spent on emergency requests for seeding peers, improving P2P distribution efficiency. Unless stated otherwise, we use the default parameters defined in Section 4.3 and configure AERO with 30-second rounds, an initial seeding ratio S = 2.5%, and a maximum seeding ratio modifier ∆ = 5%. This initial seeding ratio has been observed to yield positive results in previous work [Roverso et al., 2012].

Figure 4.6. Comparison of overall savings for BASE scenario with and without AERO.

4.7.1 Peer upload bandwidth

Figures 4.6, 4.7, 4.8, and 4.9 compare server bandwidth savings for different peer upload bandwidth distributions with and without AERO on streaming channels with 500 peers (left y-axes). The lines labeled base, 50f, 75f, and div4 in each figure were taken from Figure 4.3. Figures 4.6 and 4.7 show that AERO adapts to high-resource scenarios and even increases savings slightly. Savings reflects P2P distribution efficiency and provides a good basis for comparing gains between hybrid CDN-P2P and CDN-only models. Since AERO is an optimization over the current CDN-P2P model, we also measure the relative difference in server consumption between using and not using AERO. For example, in the base scenario, although the absolute difference is small, AERO reduces average server upload bandwidth consumption from 5.1% to 2.9%, a 43% economy for servers. Figures 4.8 and 4.9 show that AERO achieves even higher savings in resource-constrained scenarios (up to a 60% reduction of average server upload bandwidth consumption). We note that there are no "missed" chunks; both solutions (with and
without AERO) distribute the same number of chunks. AERO's savings come entirely from higher P2P distribution efficiency.

Figure 4.7. Comparison of overall savings for 50F scenario with and without AERO.

To explain this behavior, Figures 4.6, 4.7, 4.8, and 4.9 also show the fraction of prefetchers configured by AERO and the fraction of prefetchers statically configured when not using AERO (right y-axes, with range up to 30%). In high-resource scenarios (Figures 4.6 and 4.7), AERO reduces the fraction of prefetchers while maintaining P2P distribution efficiency. AERO achieves higher bandwidth savings when its fraction of prefetchers drops below that of the static configuration. In resource-constrained scenarios (Figures 4.8 and 4.9), we observe that AERO increases the number of prefetchers compared to the static configuration. This allows peers to distribute more chunks through the P2P overlay and helps further reduce the number of emergency requests. Finally, we note that AERO seeds a lower fraction of peers in scenario 75f compared to div4. On average, high-resource peers in 75f have four times more bandwidth than high-resource peers in div4, so less seeding is necessary to achieve efficient P2P distribution.

Table 4.5 shows the fraction of chunks that were distributed via seeding, emergency requests, and the P2P overlay across scenarios using AERO. We have copied the data from Table 4.2 to aid comparison. It shows that AERO significantly improves P2P distribution efficiency. We note, however, that the amount of emergency requests in constrained scenarios is still high. AERO aims at reducing emergency requests to virtually none, but currently does so by analysing observed and expected server bandwidth consumption. When the seeding ratio increases, the probability that prefetchers exchange chunks among themselves increases as well. Prefetchers trading chunks among themselves make fewer server requests than expected, thus leaving room for AERO to stabilize with emergency requests consuming the bandwidth expected to be used by prefetchers.

Figure 4.8. Comparison of overall savings for 75F scenario with and without AERO.

Table 4.5. Origin of received video chunks when using AERO (data from Table 4.2 copied below to aid comparison).

chunk origin        base aero   50f aero   75f aero   div4 aero
Seeded by servers   1.6%        1.6%       3.2%       6.6%
Emergency request   0.9%        1.6%       5.9%       13.4%
P2P overlay         97.5%       96.8%      90.9%      80.0%

chunk origin        base        50f        75f        div4
Seeded by servers   1.4%        0.8%       0.3%       0.7%
Emergency request   3.0%        4.9%       25.9%      49.5%
P2P overlay         95.6%       94.3%      73.8%      49.8%

Figure 4.10 shows savings over time when peer upload bandwidths are sampled from the BR and SE distributions (Fig. 4.2). Results for the SE upload bandwidth distribution are qualitatively similar to the high-bandwidth scenarios shown in Figures 4.6 and 4.7. Brazilian hosts have lower upload bandwidths (BR distribution), which results in a more resource-constrained scenario. In this scenario AERO has been able to achieve very high savings and reduce average server upload bandwidth consumption by 68%.

Figure 4.9. Comparison of overall savings for DIV4 scenario with and without AERO. |OS|/|P| removed from key for clarity.

Figure 4.10. Savings for br and se peer upload bandwidth distributions.
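The relative reductions quoted above follow directly from the absolute consumption figures; e.g., for the base scenario:

```python
# Average server upload consumption in the base scenario (from the text).
static_consumption = 0.051   # 5.1% without AERO
aero_consumption = 0.029     # 2.9% with AERO

# Relative reduction in server consumption.
relative_reduction = (static_consumption - aero_consumption) / static_consumption
# (0.051 - 0.029) / 0.051 ≈ 0.43, i.e., the 43% economy quoted for servers
```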
These two bandwidth distributions have been investigated in the same way as the North American one, and the results can be found in Appendix B. We point out that the Brazilian distribution is already quite constrained; all restrictions (50F, DIV2, 75F, DIV4) made the peers' average upload bandwidth (relative to the streaming rate) drop below one. This means that peers do not have enough capacity to fully distribute the stream to every overlay participant. In harsh scenarios such as BR-75F there is little that AERO can do to aid the system, yet it does not harm it either.

4.7.2 Overlay properties and overlay size

P2P distribution efficiency is highly dependent on overlay maintenance mechanisms and overlay properties. We choose six configurations varying peer in-degrees |Ip| ∈ {2, 3, 4, 5} and minimum out-degrees |Op| ∈ {0, 5, 10} to evaluate AERO under different overlay construction constraints. Figure 4.11 shows savings for different configurations grouped by peer upload bandwidth distribution. Each bar shows the minimum, quartiles, and maximum upload bandwidth savings from 30 experiments (5 for each in- and out-degree configuration).

Figure 4.11. Savings of different overlay construction configurations grouped by peer upload bandwidth distribution.

Table 4.6. Savings for different overlay construction limits.

      Scenario   I2O0    I3O0    I4O0    I5O0    I5O5    I5O10   Savings ± σ
      base       0.978   0.949   0.968   0.940   0.948   0.942   0.954 ± 0.015
      50f        0.980   0.946   0.964   0.908   0.958   0.952   0.951 ± 0.024
      75f        0.821   0.773   0.741   0.575   0.752   0.829   0.748 ± 0.092
      div4       0.493   0.504   0.515   0.491   0.522   0.447   0.496 ± 0.027
AERO  base       0.973   0.971   0.968   0.965   0.967   0.969   0.969 ± 0.003
      50f        0.980   0.965   0.970   0.965   0.945   0.951   0.963 ± 0.013
      75f        0.931   0.911   0.927   0.929   0.875   0.867   0.907 ± 0.029
      div4       0.797   0.696   0.694   0.726   0.819   0.738   0.745 ± 0.052
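The Savings ± σ column of Table 4.6 appears to be the sample mean and standard deviation across the six in-/out-degree configurations; checking the non-AERO base row reproduces the reported values:

```python
from statistics import mean, stdev

# Non-AERO 'base' row of Table 4.6: savings per in-/out-degree configuration.
base_row = [0.978, 0.949, 0.968, 0.940, 0.948, 0.942]

row_mean = round(mean(base_row), 3)   # 0.954, as reported
row_std = round(stdev(base_row), 3)   # 0.015 (sample std. dev.), as reported
```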
Some peer upload bandwidth scenarios are very dependent on overlay configurations (e.g., 75f), illustrating that P2P efficiency depends on overlay properties. AERO improves savings for all peer upload bandwidth distributions. Table 4.6 shows the average savings for each overlay construction limit. We have highlighted the default configuration. For any fixed in- and out-degree configuration, the average savings when using AERO is always higher than or equivalent to (i.e., lower by less than 1%) the average savings when not using AERO (static seeding ratio). We draw attention to the most constrained scenarios: the worst average savings with AERO is better than the best without it, e.g., worst 75f aero = 0.867 vs. best 75f = 0.829, and worst div4 aero = 0.694 vs. best div4 = 0.522. These results indicate that AERO could be used as an alternative or complement to P2P overlay maintenance mechanisms (e.g., [Simoni et al., 2014; Payberah et al., 2012a; Fortuna et al., 2010; Felber and Biersack, 2005; Wichtlhuber et al., 2014; Tang et al., 2009; Traverso et al., 2014]).

Table 4.7. Savings for different neighbor selection policies.

scenario    random   bw-aware   bw-relax
75f         64.0%    77.3%      85.9%
div4        52.8%    50.4%      50.3%
75f aero    88.9%    91.1%      92.7%
div4 aero   78.3%    79.6%      72.8%

Table 4.8. Savings for different overlay sizes.

            overlay size (peers)
scenario    100      500      1000     2000
base        96.4%    94.9%    97.2%    97.2%
50f         96.9%    94.6%    96.9%    97.2%
75f         91.3%    77.3%    78.0%    75.9%
div4        50.3%    50.4%    50.6%    48.6%
base aero   96.3%    97.1%    97.2%    97.4%
50f aero    96.3%    96.5%    98.1%    97.8%
75f aero    92.5%    91.1%    93.5%    92.9%
div4 aero   64.1%    79.6%    74.6%    78.1%
We have also evaluated savings under three different neighbor selection policies, where each peer tries to (i) establish partnerships with random known peers ('random'), (ii) establish partnerships with known peers that have equivalent or higher upload bandwidth (the default, 'BW-aware'), and (iii) establish partnerships with known peers that have at least 80% of its own upload bandwidth ('BW-relax'). Table 4.7 shows average upload bandwidth savings for bandwidth-constrained scenarios. We observe that AERO not only improves savings, but that adding AERO has a higher impact than changing the neighbor selection policy. Results for high-bandwidth scenarios show quantitatively similar savings across all neighbor selection policies (omitted).

Table 4.8 shows average savings for different overlay sizes in bandwidth-constrained scenarios. We find that varying the overlay size does not significantly impact savings (lines have similar values), except in a few cases of small overlays with highly variable performance (100 peers, 75f and div4). AERO provides equivalent or higher savings (up to 30%) and substantially reduces server upload bandwidth consumption (up to 70%) across all overlay sizes.

4.7.3 Peer churn and flash crowds

We also evaluate AERO under adverse peer behavior. Figure 4.12 shows savings for the base and div4 scenarios when we turn on peer churn at second 300. Every 10 seconds, 3% of peers are replaced (Sec. 4.3). Each new peer spends approximately 60 seconds requesting chunks from CDN servers during initialization. Thus the configured churn rate and peer initialization time imply a reduction of 18% in savings. Moreover, churn replaces peers that have complete neighborhoods with new peers that have empty buffers and zero partnerships, disrupting the P2P overlay and further degrading savings.

Figure 4.12. Savings under peer churn.
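The quoted 18% reduction follows directly from the churn parameters: with 3% of peers replaced every 10 seconds and a 60-second initialization, roughly 18% of peers are fetching from CDN servers at any given moment.

```python
churn_fraction = 0.03   # fraction of peers replaced per churn round
churn_period = 10       # seconds between churn rounds
init_time = 60          # seconds a new peer spends fetching from CDN servers

# Fraction of peers in initialization mode at any given moment:
initializing = churn_fraction * (init_time / churn_period)  # 0.03 * 6 = 0.18
```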
Figure 4.12 shows that AERO's dynamic seeding ratio configuration adapts to very dynamic overlays and provides equivalent or higher bandwidth savings compared to static seeding ratios (results for 75f are omitted but are qualitatively similar to div4). Churn can be especially harmful if it removes a significant fraction of prefetchers from the overlay at once. During our experiments we have verified the occurrence of one such case, in one simulation run for the base scenario, at around 2000 seconds. In Figure 4.12, the disruption in this specific run can be seen even after averaging all five simulation runs.

Figure 4.13 shows savings over time for the base and div4 scenarios when the streaming channel experiences periodic flash crowd events. We start with an overlay containing 100 peers, and alternate between adding and removing 1000 peers from the overlay every 10 minutes. Peers join the overlay at seconds 300, 1500, and 2700 (denoted 'J') in initialization mode and significantly degrade savings, as shown by the simultaneous valleys. AERO then quickly adjusts the seeding ratio and achieves high savings. Peers leave at seconds 900 and 2100 (denoted 'L'), severing the majority of partnerships and leaving a (possibly disconnected) P2P overlay with 100 peers. As in Tab. 4.8, we observe higher variation in savings for these small overlays. After both arrival and departure events, AERO achieves equivalent or higher savings compared to the static configuration of seeding ratios.

Figure 4.13. Savings over time for periodic flash crowd events.

Figure 4.14. Savings for div4 scenario with AERO over longer experiments.
4.7.4 Local minima

Figure 4.9 shows a drop in savings for the div4 scenario in the last few minutes (near 1200 seconds). AERO systematically decreases the seeding ratio to avoid getting stuck at local minima. Figure 4.14 presents the curve of one out of five runs of a longer div4 experiment; the other four curves are similar and are omitted for clarity. The drop observed in Figure 4.9 corresponds to the recurring downward ripples in Figure 4.14. Each ripple results from a burst of emergency requests due to underprovisioning |Os|, leading to a re-convergence process. Re-convergence happens more often in scenarios where the global minimum is high, compared with scenarios such as base, because of the multiplicative decay. The ripple can be eliminated by removing this multiplicative decay, or reduced by adjusting it to a lower value; we have tested with and without the 5% decay. The loss in savings caused by the ripple is, however, compensated by better convergence outside of local minima.

4.7.5 Summary

AERO dynamically computes the number of prefetchers. We have evaluated AERO for different peer upload bandwidths, under different overlay properties and sizes, and under adverse peer behavior. We have shown that AERO reduces emergency requests and consistently achieves higher P2P distribution efficiency and higher savings.

4.8 AERO at TVPP's Emergency Request Service

We have discussed AERO's implementation in SmoothCache, a hybrid CDN-P2P system. AERO's benefits stand out in hybrid CDN-P2P settings: it decreases CDN bandwidth utilization even while it may be increasing the number of seeded peers. The benefits are easy to understand since CDN servers answer both seeding and emergency requests. However, emergency requesting in pure P2P systems might be implemented differently, making an AERO implementation less obvious. Compared to the proposed Emergency Request Service for pure P2P systems, the CDN acts both as server and as emergency request handler.
There are a few ways to implement emergency requesting in pure P2P systems and, thus, a few possible AERO approaches. Recall that AERO is an effort to adjust chunk seeding costs to the minimum value that provides all chunks to all peers, by finding an optimal balance between seeding and answering emergency requests. Below we discuss one complete approach.

TVPP's Emergency Request Service proposal separates the roles of source and emergency request handlers. Handlers should be peers that have high and spare upload bandwidth to deal with emergency requests. In this approach we argue for employing a bandwidth-aware overlay maintenance policy on TVPP and a limited number of output partners. Handlers would then be elected from within the overlay, using spare bandwidth and upload capacity measurements, and placed near the source. Inclusion in the Emergency Request Service handler set would bring the benefit of being closer to the stream source, increasing chunk delivery odds and reducing latency. This would mimic hybrid CDN-P2P, considering the source's direct partners as the CDN layer from Figure 4.1. The major differences are that these peers' capacity is not dedicated to streaming, will not vary to adapt to the load, and they cannot be trusted to stay in the system indefinitely.

The Emergency Request Service can trivially account for emergency requests and adjust handlers' out-degrees. Since handlers are not dedicated, the Emergency Request Service has to identify whether the handler set has enough spare resources to serve the volume of emergency requests issued; otherwise, more handlers should be added. Such a mechanism should likewise reduce emergency requesting, thereby mitigating handler bandwidth utilization.

Other ideas derive from this approach. For example, the system could use the source itself as the handler, but that would demand that the source have enough capacity to scale its out-degree up if needed.
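The handler-set accounting described above can be sketched as follows. Since this service is not implemented, all names and parameters here are hypothetical; the sketch only illustrates growing the handler set when observed emergency-request load exceeds the set's aggregate spare capacity.

```python
def adjust_handler_set(handlers, emergency_load, per_handler_capacity, candidates):
    """Grow the handler set until aggregate spare capacity covers the
    observed emergency-request load (hypothetical sketch).

    handlers: current handler ids; emergency_load: observed requests/s;
    per_handler_capacity: spare requests/s each handler can absorb;
    candidates: peers ranked by spare upload bandwidth (best first).
    """
    capacity = len(handlers) * per_handler_capacity
    pool = [p for p in candidates if p not in handlers]
    while emergency_load > capacity and pool:
        handlers.append(pool.pop(0))   # elect the next best-provisioned peer
        capacity += per_handler_capacity
    return handlers
```

A real implementation would also demote idle handlers and measure per-handler spare capacity individually rather than assuming a uniform value.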
Further, handlers could be instantiated instead of elected; a content provider would thus have more control over delivery, adding/removing dedicated handlers according to the Emergency Request Service demand.

4.9 Conclusion

CDN-P2P live streaming systems allow clients to issue emergency requests and thereby guarantee QoS. In this chapter, we have shown that emergency requests deliver chunks close to their playback deadlines, which leaves little to no time for redistribution in the P2P overlay. This decreases P2P distribution efficiency, which causes even more emergency requests, which ultimately results in higher bandwidth costs for content providers.

We have presented AERO, a mechanism to minimize bandwidth consumption in CDN-P2P live streaming systems. AERO first increases the number of peers in the P2P overlay that are seeded by servers, to guarantee that the overlay effectively uses peer upload bandwidth to redistribute chunks, reducing emergency requests. AERO then attempts to reduce bandwidth consumption at servers by reducing the number of seeded peers without increasing the rate of emergency requests. Our evaluation of AERO under diverse conditions shows that it achieves up to 30% higher server upload bandwidth savings compared to a static allocation of the number of seeded peers.

AERO requires only readily available information: the number of peers in the P2P overlay that are seeded by servers, and server bandwidth consumption. AERO runs entirely on CDN servers and does not require modification of client software. AERO also does not impose any restriction on, and is compatible with, existing P2P overlay construction and chunk scheduling mechanisms. We believe AERO can be easily integrated into existing systems that use emergency requests or similar mechanisms (such as the one in Section 2.4) to guarantee quality of service, helping improve scalability and reducing operational costs.
As future work, we plan to investigate how to improve AERO's convergence delay by computing the constant scaling factors dynamically using historical information, and to better classify requests made to servers so that AERO would take the emergency request rate as input instead of estimating it through server bandwidth consumption.

Chapter 5

Final Remarks

Live streaming systems are becoming more popular each day. These systems attract a growing number of users, and some important TV channels already broadcast their live content on the Internet. This technology relies on splitting the stream into chunks and spreading them among the P2P overlay participants. Peers distribute each chunk among their partners, so that all peers receive all chunks. Whenever a chunk misses its playback deadline, either the player will have to wait until it arrives, delaying the peer in relation to others, or it will skip it. For the viewer, both scenarios are undesirable and indicative of bad system quality. Yet today most commercial P2P systems are unreliable, as quality assurance mechanisms are still in a developing stage, or even completely ignored, and several scenarios can cause degradation of the system performance experienced by peers.

This work has studied how to ensure quality of service through the minimization of chunk loss, understanding the reasons behind losses and proposing new ways to respond to each reason. We have developed a P2P live streaming system, characterized chunk losses using it, studied the effect of free riders, developed a non-punitive algorithm to reduce the impact of peer uncooperativeness, proposed an emergency request mechanism that aims to eliminate chunk losses, and developed an algorithm to reduce the costs associated with emergency requesting.
First, we have presented TVPP, an academic research-oriented P2P live streaming system designed to provide a service similar to popular proprietary commercial systems such as SopCast. We have exposed several details of TVPP's design and architecture in the light of overlay construction and chunk scheduling theory. We have also described the design of an emergency request feature. We have shown what TVPP is currently able to log, and some ways in which it can be expanded for other research. We have introduced the current parameter set that can be fixed at TVPP runtime. Finally, we have highlighted the benefits of using TVPP through an experimental comparison with SopCast on traffic and network related metrics.

Second, we have characterized chunk loss causes using TVPP. We have evaluated both a resourceful and a bandwidth-constrained scenario. We have found that in a resourceful scenario most peers do not have chunk loss problems, while a few peers present significant losses. System-wide average chunk loss has been stable over time. While losses do occur, there are many moments when peers receive no requests at all. Mostly, requests are answered at the first attempt, by the first peer to announce that it has the chunk. Chunks that have been missed have had a worse chunk spread, having fewer candidates to request from, than chunks that have been received. Finally, chunk loss lacks spatial and temporal locality. These results suggest that two solutions would be effective to overcome these losses: an improved overlay organization policy, and an emergency request mechanism (described in Section 2.4) to request "about to be lost" chunks outside of a peer's partner set.

Losses get worse, yet remain sustainable, once we reduce bandwidth through the introduction of free riders. We present the concepts of conscious and oblivious free riders.
Observing conscious free riders, we have noted that chunk loss has been qualitatively similar for fractions of free riders below 50%, while latency has increased by 1s. However, the same fraction of oblivious free riders has caused a significantly worse degradation, mostly triggered by the increase in retries. Consecutive retries waste bandwidth and increase average chunk latency. We have introduced the Simple Unanswered Request Eliminator (SURE), a modification to our system's request scheduler that avoids the excessive retries caused by oblivious free riders, because it identifies uncooperative peers within few interactions. Using SURE we have obtained similar results for the same amounts of conscious and oblivious free riders. While studying chunk loss, we have shown that P2P live streaming systems can sustain 50% of free riders with negligible degradation.

Third, we have studied an alternative to reduce chunk losses that uses a CDN architecture. We have worked with Hive Streaming, a modern commercial live streaming solution. Hive uses SmoothCache, a hybrid CDN-P2P live streaming system, as its distribution platform. SmoothCache includes a concept that ensures chunk delivery, typical of CDN peer-assisted systems. The concept states that all data is acquirable from the CDN; those who are interested in the content form a supporting P2P overlay, which should be queried first for data. SmoothCache peers know the CDN servers' addresses and, ultimately, will request chunks from the CDN if a chunk cannot be found or a request fails to be answered by its partners. We argue that this design can be mapped to the emergency request mechanism proposed for P2P systems in Section 2.4, and that results obtained on SmoothCache would be equivalent.

We have found that, although emergency requests guarantee quality of service, they deliver a chunk close to its playback deadline, which leaves little or no time for redistribution in the P2P overlay.
The benefit of recovering a chunk through an emergency request is hardly propagated to a peer's partners, which can cause a chain reaction. This avoids chunk losses but is inefficient at solving the P2P data availability problems that might be causing them.

We have developed Adaptive Emergency Request Optimization (AERO), a mechanism that dynamically minimizes the number of video streams seeded to the P2P overlay while guaranteeing efficient dissemination and avoiding emergency requests. We have shown that AERO significantly reduces bandwidth consumption at the streaming infrastructure (up to 30%) compared to a static configuration of the number of seeded streams. We have evaluated AERO under different scenarios, varying peer upload bandwidth, the fraction of free-riding peers, peer churn (up to flash crowds), overlay size, overlay construction and organization mechanisms, and underlying physical network conditions, with positive results in all cases. AERO does not impose any restriction on, and is compatible with, any peer-to-peer overlay construction and chunk scheduling mechanisms. Although developed using SmoothCache, which is a CDN peer-assisted system, AERO can be applied to "pure" P2P overlays. We believe AERO can be easily integrated into systems that use emergency requests or similar mechanisms to guarantee quality of service, helping to improve scalability and reducing operational costs by adjusting emergency request handlers' out-degrees.

In summary, we have found that most losses happen for reasons too specific to be singled out and treated. One important reason that has been identified, however, is requesting chunks from uncooperative peers. We have proposed a simple method to track this behavior and avoid it. For the other reasons, we have proposed a service through which chunks that would otherwise be lost get a last chance to be retrieved from outside the peer's neighborhood.
This solution has the potential to guarantee close to 100% delivery, depending on how it is implemented. Nevertheless, such a method to neutralize losses might incur additional infrastructure costs, as an external entity will have to handle that chunk delivery. While this may be unavoidable, we have proposed a method to minimize these costs.

Finally, in this work we have obtained a better understanding of why chunks are lost in the transmission of live P2P streams, and of how to minimize these losses. A transmission system using these techniques will be significantly more efficient at transmitting streams while keeping cost increases as low as possible.

Future work includes the implementation of the emergency request service in TVPP, as well as of its multiple conceptual approaches, resulting in tests, analyses, and comparisons to determine the potential of each approach. Some of these ideas were discussed in Section 2.4 through questions such as 1) "how to select emergency request handlers" or 2) "to which handler to forward an emergency request if there is more than one handler".

Another activity would be to further explore tweaks to the AERO algorithm, determining whether there is a better configuration for its parameters. AERO results from exploratory research; seventeen versions of the algorithm were created with slightly different ideas, such as 1) setting the seeding ratio directly to the consumption level that CDN servers were observing, 2) classifying requests made to the servers as prefetch or emergency ones, and using this to reduce emergency requests to zero, 3) using thresholds to identify whether the system is stable or should adapt, e.g., if emergency requests are more than 10% of all server upload or if prefetch requests are less than 90% of the expected server upload (which assumes that all prefetchers will request all chunks from servers), or 4) using thresholds to perform more frequent or bigger seeding ratio adjustments.
AERO achieved good results from the start, but the ideas explored across its many versions could be better organized and consolidated. Finally, many possibilities would arise from implementing AERO in TVPP, especially since handlers chosen among the peers are not dedicated to streaming alone and can leave the system at any time. The ideas presented in Section 4.8 are future work in themselves, as they depend on the implementation of TVPP's emergency request service.
Appendix A

SURE Comparison Figures

[Figure A.1. Chunk loss and latency comparison over the results of the conscious, oblivious and SURE scenarios using a 0% free rider ratio: (a) CDF of chunk miss rate; (b) CDF of average chunk latency (sec).]

[Figure A.2. Chunk loss and latency comparison over the results of the conscious, oblivious and SURE scenarios using a 10% free rider ratio.]

[Figure A.3. Chunk loss and latency comparison over the results of the conscious, oblivious and SURE scenarios using a 30% free rider ratio.]
[Figure A.4. Chunk loss and latency comparison over the results of the conscious, oblivious and SURE scenarios using a 50% free rider ratio.]

[Figure A.5. Chunk loss and latency comparison over the results of the conscious, oblivious and SURE scenarios using a 70% free rider ratio.]

Appendix B

AERO Evaluation Figures

[Figure B.1. Peer upload bandwidth distributions (US, BR, SE): CDF of peer upload bandwidth in units of the 331 Kbps stream rate.]

In each of the following sections, the curve labeled base is given by the distributions in Figure B.1 (the same as Figure 4.2). The 50f, 75f and div4 curves are restrictions over base, as explained in Section 4.3.

B.1 Brazilian Bandwidth Distribution

[Figure B.2. (BR) Comparison of overall savings over time for the BASE scenario with and without AERO, together with AERO's |Os|/|P| ratio.]

[Figure B.3. (BR) Comparison of overall savings for the 50F scenario with and without AERO.]
[Figure B.4. (BR) Comparison of overall savings for the 75F scenario with and without AERO.]

[Figure B.5. (BR) Comparison of overall savings for the DIV4 scenario with and without AERO; |Os|/|P| removed from the key for clarity.]

[Figure B.6. (BR) Savings of different overlay construction configurations grouped by peer upload bandwidth distribution.]

[Figure B.7. (BR) Savings under peer churn.]

[Figure B.8. (BR) Savings over time for periodic flash crowd events.]

[Figure B.9. (BR) Savings for the div4 scenario with AERO over longer experiments (runs #1 to #5).]

B.2 North American Bandwidth Distribution

[Figure B.10. (US) Comparison of overall savings for the BASE scenario with and without AERO.]

[Figure B.11. (US) Comparison of overall savings for the 50F scenario with and without AERO.]
[Figure B.12. (US) Comparison of overall savings for the 75F scenario with and without AERO.]

[Figure B.13. (US) Comparison of overall savings for the DIV4 scenario with and without AERO; |Os|/|P| removed from the key for clarity.]

[Figure B.14. (US) Savings of different overlay construction configurations grouped by peer upload bandwidth distribution.]

[Figure B.15. (US) Savings under peer churn.]

[Figure B.16. (US) Savings over time for periodic flash crowd events.]

[Figure B.17. (US) Savings for the div4 scenario with AERO over longer experiments (runs #1 to #5).]

B.3 Swedish Bandwidth Distribution

[Figure B.18. (SE) Comparison of overall savings for the BASE scenario with and without AERO.]

[Figure B.19. (SE) Comparison of overall savings for the 50F scenario with and without AERO.]
[Figure B.20. (SE) Comparison of overall savings for the 75F scenario with and without AERO.]

[Figure B.21. (SE) Comparison of overall savings for the DIV4 scenario with and without AERO; |Os|/|P| removed from the key for clarity.]

[Figure B.22. (SE) Savings of different overlay construction configurations grouped by peer upload bandwidth distribution.]

[Figure B.23. (SE) Savings under peer churn.]

[Figure B.24. (SE) Savings over time for periodic flash crowd events.]

[Figure B.25. (SE) Savings for the div4 scenario with AERO over longer experiments (runs #1 to #5).]