Statistical Mechanics for Big Data

acquisition, analysis and modeling

Description

Technological advances during the last fifteen years have boosted our capacity to generate and store data. Indeed, according to some estimates 90% of the world’s stored data has been generated in the last two years, and the availability of such large quantities of data is changing the way we face crisis response, social mobilization, marketing, and intelligence. In science, the hopes created by big data are also high, and analyses of large datasets are behind recent breakthroughs in areas such as astrophysics, genomics or particle physics.

Big data is a particularly relevant opportunity for the study of complex systems such as cells, societies, ecosystems or economies, which has traditionally been constrained by the limited information available on the different components (and layers of components) comprising these systems. The availability of unprecedented, highly detailed data on these systems opens the door to significantly advancing our understanding of their behavior, and of how this behavior evolves in time.

Our project is based on the premise that statistical mechanics tools and methods, when applied to large-scale data acquisition, analytics, and modeling, have the potential to successfully address the problem of transforming data into knowledge. The overarching goal of this project is, precisely, to develop and apply a comprehensive set of statistical mechanics methods for large-scale data analysis. To achieve this goal, we propose three main objectives: (M) to develop statistical mechanics tools for the analysis of large-scale data; (D) to develop crowd-sourced data acquisition and processing protocols; (A) to analyze and model, using the methodologies and/or data from objectives M and D, complex systems in three different areas: biochemical systems, techno-social systems, and economic systems.

The methods we propose to develop are aimed at network and non-network data that are, in general, heterogeneous, multidimensional and multilevel, and time-resolved. The development of such methods will allow the construction of predictive models from large quantities of heterogeneous data, and provide guidance and innovative recommendations to practitioners and stakeholders (from academia, industry, and government). Our project relies not only on the interdisciplinary nature of the methodologies of the participants (statistical mechanics, computer science, mathematics, statistics) but also on direct contact with experts in the fields of economics, finance, biology or chemistry. In this sense it is important to stress the quality of the members of the teams, their experience in different fields and the collaborations they maintain with internationally renowned experts, companies and institutions in many different areas.

Finally, our project deals with problems of large impact in technology and society. Given its comprehensive nature and the collaboration with companies, we expect our project to produce a large number of results with direct impact on our society and economy, not only in the big data business, but also in problems like urban planning, marketing, financial markets or biology.

Highlights

CSD: A Multi-User Similarity Metric for Community Recommendation in Online Social Networks

X. Han, L. Wang, R. Farahbakhsh, Á. Cuevas, R. Cuevas, N. Crespi

Communities are basic components in networks. As a promising social application, community recommendation selects a few items (e.g., movies and books) to recommend to a group of users. It usually achieves higher recommendation precision if the users share more interests; whereas, in plenty of com...

Journal

How Far is Facebook from Me? Facebook Network Infrastructure Analysis

R. Farahbakhsh, Á. Cuevas, A. Ortiz, X. Han, N. Crespi

Facebook is today the most popular social network with more than one billion subscribers worldwide. To provide good quality of service (e.g., low access delay) to their clients, FB relies on Akamai, which provides a worldwide content distribution network with a large number of edge servers that a...

Journal

Daily rhythms in mobile telephone communication

Talayeh Aledavood, Eduardo López, Sam GB Roberts, Felix Reed-Tsochas, Esteban Moro, Robin IM Dunbar, Jari Saramäki

Circadian rhythms are known to be important drivers of human activity and the recent availability of electronic records of human behaviour has provided fine-grained data of temporal patterns of activity on a large scale. Further, questionnaire studies have identified important individual differen...

Journal

People

Roger Guimerà

Universitat Rovira i Virgili - ICREA Research Professor

Contact

roger.guimera@urv.cat

@sees_lab

Site

Marta Sales-Pardo

Universitat Rovira i Virgili - Associate Professor

Contact

marta.sales@urv.cat

@sees_lab

Site

Esteban Moro

Universidad Carlos III de Madrid - Associate Professor

Contact

emoro@math.uc3m.es

@estebanmoro

Site

Ángel Cuevas

Universidad Carlos III de Madrid - Assistant Professor

Institut Mines-Telecom SudParis - Adjunct Professor

Contact

acrumin@it.uc3m.es

@acrumin

Site

Jaume Masoliver

Universitat de Barcelona - Professor

Contact

jaume.masoliver@ub.edu

Site

Miquel Montero

Universitat de Barcelona - Associate Professor

Contact

miquel.montero@ub.edu

Site

Josep Perelló

Universitat de Barcelona - Associate Professor

Contact

josep.perello@ub.edu

@josperello

Site

Jordi Duch

Universitat Rovira i Virgili - Associate Professor

Contact

jordi.duch@urv.cat

@tanisjones

Publications

Communities are basic components in networks. As a promising social application, community recommendation selects a few items (e.g., movies and books) to recommend to a group of users. It usually achieves higher recommendation precision if the users share more interests; whereas, in plenty of communities (e.g., families, work groups), the users often share few. With billions of communities in online social networks, quickly selecting the communities where the members are similar in interests is a prerequisite for community recommendation. To this end, we propose an easy-to-compute metric, Community Similarity Degree (CSD), to estimate the degree of interest similarity among multiple users in a community. Based on 3460 emulated Facebook communities, we conduct extensive empirical studies to reveal the characteristics of CSD and validate the effectiveness of CSD. In particular, we demonstrate that selecting communities with larger CSD can achieve higher recommendation precision. In addition, we verify the computation efficiency of CSD: it costs less than 1 hour to calculate CSD for over 1 million of communities. Finally, we draw insights about feasible extensions to the definition of CSD, and point out the practical uses of CSD in a variety of applications other than community recommendation.
Facebook is today the most popular social network with more than one billion subscribers worldwide. To provide good quality of service (e.g., low access delay) to their clients, FB relies on Akamai, which provides a worldwide content distribution network with a large number of edge servers that are much closer to FB subscribers. In this article we aim to depict a global picture of the current FB network infrastructure deployment taking into account both native FB servers and Akamai nodes. Toward this end, we have performed a measurement-based analysis during a period of two weeks using 463 PlanetLab nodes distributed across 41 countries. Based on the obtained data we compare the average access delay that nodes in different countries experience accessing both native FB servers and Akamai nodes. In addition, we obtain a wide view of the deployment of Akamai nodes serving FB users worldwide. Finally, we analyze the geographical coverage of those nodes, and demonstrate that in most of the cases Akamai nodes located in a particular country service not only local FB subscribers, but also FB users located in nearby countries.
Circadian rhythms are known to be important drivers of human activity and the recent availability of electronic records of human behaviour has provided fine-grained data of temporal patterns of activity on a large scale. Further, questionnaire studies have identified important individual differences in circadian rhythms, with people broadly categorised into morning-like or evening-like individuals. However, little is known about the social aspects of these circadian rhythms, or how they vary across individuals. In this study we use a unique 18-month dataset that combines mobile phone calls and questionnaire data to examine individual differences in the daily rhythms of mobile phone activity. We demonstrate clear individual differences in daily patterns of phone calls, and show that these individual differences are persistent despite a high degree of turnover in the individuals' social networks. Further, women's calls were longer than men's calls, especially during the evening and at night, and these calls were typically focused on a small number of emotionally intense relationships. These results demonstrate that individual differences in circadian rhythms are not just related to broad patterns of morningness and eveningness, but have a strong social component, in directing phone calls to specific individuals at specific times of day.
Big Data on electronic records of social interactions allow approaching human behaviour and sociality from a quantitative point of view with unforeseen statistical power. Mobile telephone Call Detail Records (CDRs), automatically collected by telecom operators for billing purposes, have proven especially fruitful for understanding one-to-one communication patterns as well as the dynamics of social networks that are reflected in such patterns. We present an overview of empirical results on the multi-scale dynamics of social dynamics and networks inferred from mobile telephone calls. We begin with the shortest timescales and fastest dynamics, such as burstiness of call sequences between individuals, and “zoom out” towards longer temporal and larger structural scales, from temporal motifs formed by correlated calls between multiple individuals to long-term dynamics of social groups. We conclude this overview with a future outlook.
In the era when Facebook and Twitter dominate the market for social media, Google has introduced Google+ (G+) and reported a significant growth in its size while others called it a ghost town. This begs the question of whether G+ can really attract a significant number of connected and active users despite the dominance of Facebook and Twitter. This paper presents a detailed longitudinal characterization of G+ based on large-scale measurements. We identify the main components of G+ structure and characterize the key feature of their users and their evolution over time. We then conduct detailed analysis on the evolution of connectivity and activity among users in the largest connected component (LCC) of G+ structure, and compare their characteristics to other major online social networks (OSNs). We show that despite the dramatic growth in the size of G+, the relative size of the LCC has been decreasing and its connectivity has become less clustered. While the aggregate user activity has gradually increased, only a very small fraction of users exhibit any type of activity, and an even smaller fraction of these users attracts any reaction. The identity of users with most followers and reactions reveal that most of them are related to high-tech industry. To our knowledge, this study offers the most comprehensive characterization of G+ based on the largest collected datasets.
Recent widespread adoption of electronic and pervasive technologies has enabled the study of human behavior at an unprecedented level, uncovering universal patterns underlying human activity, mobility, and interpersonal communication. In the present work, we investigate whether deviations from these universal patterns may reveal information about the socio-economical status of geographical regions. We quantify the extent to which deviations in diurnal rhythm, mobility patterns, and communication styles across regions relate to their unemployment incidence. For this we examine a country-scale publicly articulated social media dataset, where we quantify individual behavioral features from over 19 million geo-located messages distributed among more than 340 different Spanish economic regions, inferred by computing communities of cohesive mobility fluxes. We find that regions exhibiting more diverse mobility fluxes, earlier diurnal rhythms, and more correct grammatical styles display lower unemployment rates. As a result, we provide a simple model able to produce accurate, easily interpretable reconstruction of regional unemployment incidence from their social-media digital fingerprints alone. Our results show that cost-effective economical indicators can be built based on publicly-available social media datasets.
We analyze how to value future costs and benefits when they must be discounted relative to the present. We introduce the subject for the nonspecialist and take into account the randomness of the economic evolution by studying the discount function of three widely used processes for the dynamics of interest rates: Ornstein-Uhlenbeck, Feller, and log-normal. Besides obtaining exact expressions for the discount function and simple asymptotic approximations, we show that historical average interest rates overestimate long-run discount rates and that this effect can be large. In other words, long-run discount rates should be substantially less than the average rate observed in the past, otherwise any cost-benefit calculation would be biased in favor of the present and against interventions that may protect the future.
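The claim that long-run discount rates fall below the historical average rate can be illustrated with the Ornstein-Uhlenbeck case, where the discount function D(t) = E[exp(-∫ r ds)] has a standard closed form because the time integral of the rate is Gaussian. The sketch below is only an illustration: the parameter names (`m` for the long-run mean rate, `alpha` for the reversion speed, `k` for the noise amplitude, `r0` for the initial rate) and the numerical values are assumptions, not the notation or data of the paper.

```python
import math

def ou_log_discount(t, r0, m, alpha, k):
    """ln D(t) for the OU rate model dr = -alpha*(r - m) dt + k dW, r(0) = r0.

    The integral of r over [0, t] is Gaussian, so
    ln D(t) = -mean + variance / 2.
    """
    decay = 1.0 - math.exp(-alpha * t)
    mean = m * t + (r0 - m) * decay / alpha
    var = (k ** 2 / alpha ** 2) * (
        t - 2.0 * decay / alpha + (1.0 - math.exp(-2.0 * alpha * t)) / (2.0 * alpha)
    )
    return -mean + 0.5 * var

def ou_effective_rate(t, r0, m, alpha, k):
    """Effective discount rate over horizon t, defined as -ln D(t) / t."""
    return -ou_log_discount(t, r0, m, alpha, k) / t

def ou_long_run_rate(m, alpha, k):
    """Limit of the effective rate as t grows: m - k^2 / (2 alpha^2)."""
    return m - k ** 2 / (2.0 * alpha ** 2)
```

With the assumed values m = 0.04, alpha = 0.1 and k = 0.01, `ou_long_run_rate` gives 0.035, below the average rate of 0.04; the gap grows with the noise amplitude and shrinks with faster mean reversion.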
Information flow during catastrophic events is a critical aspect of disaster management. Modern communication platforms, in particular online social networks, provide an opportunity to study such flow and derive early-warning sensors, thus improving emergency preparedness and response. Performance of the social networks sensor method, based on topological and behavioral properties derived from the “friendship paradox”, is studied here for over 50 million Twitter messages posted before, during, and after Hurricane Sandy. We find that differences in users’ network centrality effectively translate into moderate awareness advantage (up to 26 hours); and that geo-location of users within or outside of the hurricane-affected area plays a significant role in determining the scale of such an advantage. Emotional response appears to be universal regardless of the position in the network topology, and displays characteristic, easily detectable patterns, opening a possibility to implement a simple “sentiment sensing” technique that can detect and locate disasters.
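The friendship paradox behind the sensor method says that a randomly chosen friend of a user tends to be better connected than the user. A minimal sketch of how a sensor group could be drawn from a control group is given below; it is not the pipeline used in the study, and the toy graph, the function names and the one-friend-per-user rule are all assumptions made for illustration.

```python
import random

def mean_degree(adj, nodes):
    """Average number of neighbours over the given nodes."""
    nodes = list(nodes)
    return sum(len(adj[v]) for v in nodes) / len(nodes)

def pick_sensors(adj, control, rng):
    """For each control user with at least one friend, pick one friend at random."""
    return [rng.choice(sorted(adj[u])) for u in control if adj[u]]

def toy_graph():
    """Hypothetical toy network: hub 0 linked to users 1..5, plus the edge 1-2."""
    edges = [(0, 1), (0, 2), (0, 3), (0, 4), (0, 5), (1, 2)]
    adj = {v: set() for v in range(6)}
    for a, b in edges:
        adj[a].add(b)
        adj[b].add(a)
    return adj
```

Calling `pick_sensors(toy_graph(), [1, 2, 3, 4, 5], random.Random(0))` returns a group dominated by the hub, so its mean degree exceeds that of the control group; in a real dataset the sensor group would then be monitored for earlier awareness of an event.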
In this paper, we present closed-form expressions for the wave function that governs the evolution of the discrete-time quantum walk on the line when the coin operator is arbitrary. The formulas were derived assuming that the walker can either remain put in the place or proceed in a fixed direction but never move backward, although they can be easily modified to describe the case in which the particle can travel in both directions. We use these expressions to explore properties of magnitudes associated to the process, as the probability mass function or the probability current, even though we also consider the asymptotic behavior of the exact solution. Within this approximation, we will estimate upper and lower bounds, examine the origins of an emerging approximate symmetry, and deduce the general form of the stationary probability density of the relative location of the walker.
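A numerical sketch of a walk of this kind, in which the walker either stays in place or advances one site, may help fix ideas. It does not reproduce the closed-form expressions of the paper; the ordering (coin first, then conditional shift), the choice of a Hadamard coin and the initial state are assumptions made for illustration.

```python
import math

def step(psi, coin):
    """Apply the coin at every site, then shift.

    psi maps site n to (amp_stay, amp_move); the first component stays at n,
    the second component moves to n + 1.
    """
    (u00, u01), (u10, u11) = coin
    new = {}
    for n, (a, b) in psi.items():
        stay = u00 * a + u01 * b
        move = u10 * a + u11 * b
        s0, m0 = new.get(n, (0j, 0j))
        new[n] = (s0 + stay, m0)
        s1, m1 = new.get(n + 1, (0j, 0j))
        new[n + 1] = (s1, m1 + move)
    return new

def walk(steps, coin, initial=(1 + 0j, 0j)):
    """Return the probability mass function over sites after the given number of steps."""
    psi = {0: initial}
    for _ in range(steps):
        psi = step(psi, coin)
    return {n: abs(a) ** 2 + abs(b) ** 2 for n, (a, b) in sorted(psi.items())}

# Hadamard coin, used here only as a familiar example of a unitary coin.
S = 1.0 / math.sqrt(2.0)
HADAMARD = ((S, S), (S, -S))
```

Because the coin is unitary and the shift only relabels basis states, the probabilities returned by `walk` sum to one at every step, and the walker never reaches a site below its starting point.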
Understanding how much two individuals are alike in their interests (i.e., interest similarity) has become virtually essential for many applications and services in Online Social Networks (OSNs). Since users do not always explicitly elaborate their interests in OSNs like Facebook, how to determine users' interest similarity without fully knowing their interests is a practical problem. In this paper, we investigate how users' interest similarity relates to various social features (e.g. geographic distance); and accordingly infer whether the interests of two users are alike or unalike where one of the users' interests are unknown. Relying on a large Facebook dataset, which contains 479,048 users and 5,263,351 user-generated interests, we present comprehensive empirical studies and verify the homophily of interest similarity across three interest domains (movies, music and TV shows). The homophily reveals that people tend to exhibit more similar tastes if they have similar demographic information (e.g., age, location), or if they are friends. It also shows that the individuals with a higher interest entropy usually share more interests with others. Based on these results, we provide a practical prediction model under a real OSN environment. For a given user with no interest information, this model can select some individuals who not only exhibit many interests but also probably achieve high interest similarities with the given user. Eventually, we illustrate a use case to demonstrate that the proposed prediction model could facilitate decision-making for OSN applications and services.
This paper unveils some features of a discrete-time quantum walk on the line whose coin depends on the temporal variable. After considering the most general form of the unitary coin operator, it focuses on the role played by the two phase factors that one can incorporate there, and shows how both terms influence the evolution of the system. A closer analysis reveals that the probabilistic properties of the motion of the walker remain unaltered when the update rule of these phases is chosen adequately. This invariance is based on a symmetry with consequences not yet fully explored.
Content discovery is a critical issue in unstructured Peer-to-Peer (P2P) networks as nodes maintain only local network information. However, similarly without global information about human networks, one can still find specific persons via their friends by using social information. Therefore, in this paper, we investigate the problem of how social information (i.e., friends and background information) could benefit content discovery in P2P networks. We collect social information of 384,494 user profiles from Facebook, and build a social P2P network model based on the empirical analysis. In this model, we enrich nodes in P2P networks with social information and link nodes via their friendships. Each node extracts two types of social features, Knowledge and Similarity, and assigns more weight to the friends that have higher similarity and more knowledge. Furthermore, we present a novel content discovery algorithm which can explore the latent relationships among a node’s friends. A node computes stable scores for all its friends regarding their weight and the latent relationships. It then selects the top friends with higher scores to query content. Extensive experiments validate performance of the proposed mechanism. In particular, for personal interests searching, the proposed mechanism can achieve 100% of Search Success Rate by selecting the top 20 friends within two-hop. It also achieves 6.5 Hits on average, which improves the performance of the compared methods by 8x.
We review the level-crossing problem which includes the first-passage and escape problems as well as the theory of extreme values (the maximum, the minimum, the maximum absolute value and the range or span). We set the definitions and general results and apply them to one-dimensional diffusion processes with explicit results for the Brownian motion and the Ornstein–Uhlenbeck (OU) process.
While human societies are extraordinarily cooperative in comparison with other social species, the question of why we cooperate with unrelated individuals remains open. Here we report results of a lab-in-the-field experiment with people of different ages in a social dilemma. We find that the average amount of cooperativeness is independent of age except for the elderly, who cooperate more, and a behavioural transition from reciprocal, but more volatile behaviour to more persistent actions towards the end of adolescence. Although all ages react to the cooperation received in the previous round, young teenagers mostly respond to what they see in their neighbourhood regardless of their previous actions. Decisions then become more predictable through midlife, when the act of cooperating or not is more likely to be repeated. Our results show that mechanisms such as reciprocity, which is based on reacting to previous actions, may promote cooperation in general, but its influence can be hindered by the fluctuating behaviour in the case of children.
The book contains review articles on recent advances in first-passage phenomena and applications contributed by leading international experts. It is intended for graduate students and researchers who are interested in learning about this intriguing and important topic.
We review the question of the extreme values attained by a random process. We relate it to level crossings to one boundary (first-passage problems) as well as to two boundaries (escape problems). The extremes studied are the maximum, the minimum, the maximum absolute value, and the range or span. We specialize in diffusion processes and present detailed results for the Wiener and Feller processes.
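The link between extremes and first-passage problems can be made concrete for the Wiener process. For standard Brownian motion started at zero and a level a > 0, the reflection principle gives P(max over [0, t] ≥ a) = erfc(a / sqrt(2t)), which is also the probability that the first passage to a has occurred by time t. The sketch below compares this formula with a simple Monte Carlo estimate; the step count, path count and seed are arbitrary assumptions, and the discretized paths slightly underestimate the true maximum.

```python
import math
import random

def prob_max_exceeds(a, t):
    """P(max of standard Brownian motion on [0, t] >= a), for a > 0."""
    return math.erfc(a / math.sqrt(2.0 * t))

def simulate_prob_max_exceeds(a, t, n_steps=500, n_paths=4000, seed=0):
    """Monte Carlo estimate of the same probability from discretized paths."""
    rng = random.Random(seed)
    sd = math.sqrt(t / n_steps)  # standard deviation of one increment
    hits = 0
    for _ in range(n_paths):
        w = 0.0
        for _ in range(n_steps):
            w += rng.gauss(0.0, sd)
            if w >= a:  # level crossed: first passage has occurred
                hits += 1
                break
    return hits / n_paths
```

For a = 1 and t = 1 the formula gives about 0.317, and the simulated estimate should agree with it up to sampling noise and the small discretization bias.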