Statistical Mechanics for Big Data

acquisition, analysis and modeling

Description

Technological advances during the last fifteen years have boosted our capacity to generate and store data. Indeed, according to some estimates 90% of the world’s stored data has been generated in the last two years, and the availability of such large quantities of data is changing the way we face crisis response, social mobilization, marketing, and intelligence. In science, the hopes created by big data are also high, and analyses of large datasets are behind recent breakthroughs in areas such as astrophysics, genomics or particle physics.

Big data is a particularly relevant opportunity for the study of complex systems such as cells, societies, ecosystems or economies, the study of which has traditionally been constrained by the limited information available from the different components (and layers of components) comprising the systems. The availability of unprecedented highly detailed data on these systems opens the door to significantly advance our understanding of their behavior, and of how this behavior evolves in time.

Our project is based on the premise that statistical mechanics tools and methods, when applied to large-scale data acquisition, analytics, and modeling, have the potential to successfully address the problem of transforming data into knowledge. The overarching goal of this project is, precisely, to develop and apply a comprehensive set of statistical mechanics methods for large-scale data analysis. To achieve this goal, we propose three main objectives: (M) to develop statistical mechanics tools for the analysis of large-scale data; (D) to develop crowd-sourced data acquisition and processing protocols; (A) to analyze and model, using the methodologies and/or data from objectives M and D, complex systems in three different areas: biochemical systems, techno-social systems, and economic systems.

The methods we propose to develop are aimed at network and non-network data that are, in general, heterogeneous, multidimensional and multilevel, and time-resolved. The development of such methods will allow the construction of predictive models from large quantities of heterogeneous data, and provide guidance and innovative recommendations to practitioners and stakeholders (from academia, industry, and government). Our project relies not only on the interdisciplinary nature of the methodologies of the participants (statistical mechanics, computer science, mathematics, statistics) but also on direct contact with experts in economics, finance, biology, and chemistry. In this sense it is important to stress the quality of the team members, their experience in different fields, and the collaborations they maintain with internationally renowned experts, companies, and institutions in many different areas.

Finally, our project deals with problems of large impact in technology and society. Given its comprehensive nature and the collaboration with companies, we expect our project to produce a large number of results with direct impact on our society and economy, not only in the big data business but also in areas such as urban planning, marketing, financial markets, and biology.

Highlights

Bone fusion in normal and pathological development is constrained by the network architecture of the human skull

Esteve-Altava, B, Valles-Catala, T, Guimera, R, Sales-Pardo, M, Rasskin-Gutman, D.

Craniosynostosis, the premature fusion of cranial bones, affects the correct development of the skull producing morphological malformations in newborns. To assess the susceptibility of each craniofacial articulation to close prematurely, we used a network model of the skull to quantify the link r...

Control of cell–cell forces and collective cell dynamics by the intercellular adhesome

Bazellieres, E, Conte, V, Elosegui-Artola, A, Serra-Picamal, X, Bintanel-Morcillo, M, Roca-Cusachs, P, Muñoz, JJ, Sales-Pardo, M, Guimera, R, Trepat, T.

Dynamics of epithelial tissues determine key processes in development, tissue healing and cancer invasion. These processes are critically influenced by cell–cell adhesion forces. However, the identity of the proteins that resist and transmit forces at cell–cell junctions remains unclear, and how ...

iMet: A network-based computational tool to assist in the annotation of metabolites from tandem mass spectra

Aguilar-Mogas, A, Sales-Pardo, M, Navarro, M, Guimerà, R, Yanes, O.

Structural annotation of metabolites relies mainly on tandem mass spectrometry (MS/MS) analysis. However, approximately 90% of the known metabolites reported in metabolomic databases do not have annotated spectral data from standards. This situation has fostered the development of computational t...

People

Roger Guimerà

Universitat Rovira i Virgili - ICREA Research Professor

Contact

roger.guimera@urv.cat

@sees_lab

Marta Sales-Pardo

Universitat Rovira i Virgili - Associate Professor

Contact

marta.sales@urv.cat

@sees_lab

Esteban Moro

Universidad Carlos III de Madrid - Associate Professor

Contact

emoro@math.uc3m.es

@estebanmoro

Ángel Cuevas

Universidad Carlos III de Madrid - Assistant Professor

Institut Mines-Telecom SudParis - Adjunct Professor

Contact

acrumin@it.uc3m.es

@acrumin

Jaume Masoliver

Universitat de Barcelona - Professor

Contact

jaume.masoliver@ub.edu

Miquel Montero

Universitat de Barcelona - Associate Professor

Contact

miquel.montero@ub.edu

Josep Perelló

Universitat de Barcelona - Associate Professor

Contact

josep.perello@ub.edu

@josperello

Jordi Duch

Universitat Rovira i Virgili - Associate Professor

Contact

jordi.duch@urv.cat

@tanisjones

Publications

Craniosynostosis, the premature fusion of cranial bones, affects the correct development of the skull producing morphological malformations in newborns. To assess the susceptibility of each craniofacial articulation to close prematurely, we used a network model of the skull to quantify the link reliability (an index based on stochastic block models and Bayesian inference) of each articulation. We show that, of the 93 human skull articulations at birth, the few articulations that are associated with non-syndromic craniosynostosis conditions have statistically significant lower reliability scores than the others. In a similar way, articulations that close during the normal postnatal development of the skull have also lower reliability scores than those articulations that persist through adult life. These results indicate a relationship between the architecture of the skull and the specific articulations that close during normal development as well as in pathological conditions. Our findings suggest that the topological arrangement of skull bones might act as a structural constraint, predisposing some articulations to closure, both in normal and pathological development, also affecting the long-term evolution of the skull.
Dynamics of epithelial tissues determine key processes in development, tissue healing and cancer invasion. These processes are critically influenced by cell–cell adhesion forces. However, the identity of the proteins that resist and transmit forces at cell–cell junctions remains unclear, and how these proteins control tissue dynamics is largely unknown. Here we provide a systematic study of the interplay between cell–cell adhesion proteins, intercellular forces and epithelial tissue dynamics. We show that collective cellular responses to selective perturbations of the intercellular adhesome conform to three mechanical phenotypes. These phenotypes are controlled by different molecular modules and characterized by distinct relationships between cellular kinematics and intercellular forces. We show that these forces and their rates can be predicted by the concentrations of cadherins and catenins. Unexpectedly, we identified different mechanical roles for P-cadherin and E-cadherin; whereas P-cadherin predicts levels of intercellular force, E-cadherin predicts the rate at which intercellular force builds up.
Structural annotation of metabolites relies mainly on tandem mass spectrometry (MS/MS) analysis. However, approximately 90% of the known metabolites reported in metabolomic databases do not have annotated spectral data from standards. This situation has fostered the development of computational tools that predict fragmentation patterns in silico and compare these to experimental MS/MS spectra. However, because such methods require the molecular structure of the detected compound to be available for the algorithm, the identification of novel metabolites in organisms relevant for biotechnological and medical applications remains a challenge. Here, we present iMet, a computational tool that facilitates structural annotation of metabolites not described in databases. iMet uses MS/MS spectra and the exact mass of an unknown metabolite to identify metabolites in a reference database that are structurally similar to the unknown metabolite. The algorithm also suggests the chemical transformation that converts the known metabolites into the unknown one. As a proxy for the structural annotation of novel metabolites, we tested 148 metabolites following a leave-one-out cross-validation procedure or by using MS/MS spectra experimentally obtained in our laboratory. We show that for 89% of the 148 metabolites at least one of the top four matches identified by iMet enables the proper annotation of the unknown metabolites. To further validate iMet, we tested 31 metabolites proposed in the 2012–16 CASMI challenges.
Recommendation systems are designed to predict users’ preferences and provide them with recommendations for items such as books or movies that suit their needs. Recent developments show that some probabilistic models for user preferences yield better predictions than latent feature models such as matrix factorization. However, it has not been possible to use them in real-world datasets because they are not computationally efficient. We have developed a rigorous probabilistic model that outperforms leading approaches for recommendation and whose parameters can be fitted efficiently with an algorithm whose running time scales linearly with the size of the dataset. This model and inference algorithm open the door to more approaches to recommendation and to other problems where matrix factorization is currently used.
In a complex system, perturbations propagate by following paths on the network of interactions among the system’s units. In contrast to what happens with the spreading of epidemics, observations of general perturbations are often very sparse in time (there is a single observation of the perturbed system) and in “space” (only a few perturbed and unperturbed units are observed). A major challenge in many areas, from biology to the social sciences, is to infer the propagation paths from observations of the effects of perturbation under these sparsity conditions. We address this problem and show that it is possible to go beyond the usual approach of using the shortest paths connecting the known perturbed nodes. Specifically, we show that a simple and general probabilistic model, which we solved using belief propagation, provides fast and accurate estimates of the probabilities of nodes being perturbed.
Communities are basic components in networks. As a promising social application, community recommendation selects a few items (e.g., movies and books) to recommend to a group of users. It usually achieves higher recommendation precision if the users share more interests; in many communities (e.g., families, work groups), however, the users often share few. With billions of communities in online social networks, quickly selecting the communities whose members have similar interests is a prerequisite for community recommendation. To this end, we propose an easy-to-compute metric, Community Similarity Degree (CSD), to estimate the degree of interest similarity among multiple users in a community. Based on 3460 emulated Facebook communities, we conduct extensive empirical studies to reveal the characteristics of CSD and validate its effectiveness. In particular, we demonstrate that selecting communities with larger CSD can achieve higher recommendation precision. In addition, we verify the computational efficiency of CSD: it takes less than 1 hour to calculate CSD for over 1 million communities. Finally, we draw insights about feasible extensions to the definition of CSD, and point out the practical uses of CSD in a variety of applications other than community recommendation.
In complex systems, the network of interactions we observe between systems components is the aggregate of the interactions that occur through different mechanisms or layers. Recent studies reveal that the existence of multiple interaction layers can have a dramatic impact in the dynamical processes occurring on these systems. However, these studies assume that the interactions between systems components in each one of the layers are known, while typically for real-world systems we do not have that information. Here, we address the issue of uncovering the different interaction layers from aggregate data by introducing multilayer stochastic block models (SBMs), a generalization of single-layer SBMs that considers different mechanisms of layer aggregation. First, we find the complete probabilistic solution to the problem of finding the optimal multilayer SBM for a given aggregate-observed network. Because this solution is computationally intractable, we propose an approximation that enables us to verify that multilayer SBMs are more predictive of network structure in real-world complex systems.
In social networks, individuals constantly drop ties and replace them by new ones in a highly unpredictable fashion. This highly dynamical nature of social ties has important implications for processes such as the spread of information or of epidemics. Several studies have demonstrated the influence of a number of factors on the intricate microscopic process of tie replacement, but the macroscopic long-term effects of such changes remain largely unexplored. Here we investigate whether, despite the inherent randomness at the microscopic level, there are macroscopic statistical regularities in the long-term evolution of social networks. In particular, we analyze the email network of a large organization with over 1,000 individuals throughout four consecutive years. We find that, although the evolution of individual ties is highly unpredictable, the macro-evolution of social communication networks follows well-defined statistical patterns, characterized by exponentially decaying log-variations of the weight of social ties and of individuals’ social strength. At the same time, we find that individuals have social signatures and communication strategies that are remarkably stable over the scale of several years.
Thrombus formation is a multiscale phenomenon triggered by platelet deposition over a prothrombotic surface (e.g., a ruptured atherosclerotic plaque). Despite the medical urgency for computational tools that aid in the early diagnosis of thrombotic events, the integration of computational models of thrombus formation at different scales requires a comprehensive understanding of the role and limitation of each modelling approach. We propose three different modelling approaches to predict platelet deposition. Specifically, we consider measurements of platelet deposition under blood flow conditions in a perfusion chamber for different time periods (3, 5, 10, 20 and 30 minutes) at shear rates of 212 1/s, 1390 1/s and 1690 1/s. Our modelling approaches are: i) a model based on the mass-transfer boundary layer theory; ii) a machine-learning approach; and iii) a phenomenological model. The results indicate that the three approaches on average have median errors of 21%, 20.7% and 14.2%, respectively. Our study demonstrates the feasibility of using an empirical data set as a proxy for a real-patient scenario in which practitioners have accumulated data on a given number of patients and want to obtain a diagnosis for a new patient about whom they only have the current observation of a certain number of variables.
Facebook is today the most popular social network with more than one billion subscribers worldwide. To provide good quality of service (e.g., low access delay) to their clients, FB relies on Akamai, which provides a worldwide content distribution network with a large number of edge servers that are much closer to FB subscribers. In this article we aim to depict a global picture of the current FB network infrastructure deployment taking into account both native FB servers and Akamai nodes. Toward this end, we have performed a measurement-based analysis during a period of two weeks using 463 PlanetLab nodes distributed across 41 countries. Based on the obtained data we compare the average access delay that nodes in different countries experience accessing both native FB servers and Akamai nodes. In addition, we obtain a wide view of the deployment of Akamai nodes serving FB users worldwide. Finally, we analyze the geographical coverage of those nodes, and demonstrate that in most of the cases Akamai nodes located in a particular country service not only local FB subscribers, but also FB users located in nearby countries.
Circadian rhythms are known to be important drivers of human activity and the recent availability of electronic records of human behaviour has provided fine-grained data of temporal patterns of activity on a large scale. Further, questionnaire studies have identified important individual differences in circadian rhythms, with people broadly categorised into morning-like or evening-like individuals. However, little is known about the social aspects of these circadian rhythms, or how they vary across individuals. In this study we use a unique 18-month dataset that combines mobile phone calls and questionnaire data to examine individual differences in the daily rhythms of mobile phone activity. We demonstrate clear individual differences in daily patterns of phone calls, and show that these individual differences are persistent despite a high degree of turnover in the individuals' social networks. Further, women's calls were longer than men's calls, especially during the evening and at night, and these calls were typically focused on a small number of emotionally intense relationships. These results demonstrate that individual differences in circadian rhythms are not just related to broad patterns of morningness and eveningness, but have a strong social component, in directing phone calls to specific individuals at specific times of day.
Big Data on electronic records of social interactions allow approaching human behaviour and sociality from a quantitative point of view with unforeseen statistical power. Mobile telephone Call Detail Records (CDRs), automatically collected by telecom operators for billing purposes, have proven especially fruitful for understanding one-to-one communication patterns as well as the dynamics of social networks that are reflected in such patterns. We present an overview of empirical results on the multi-scale dynamics of social dynamics and networks inferred from mobile telephone calls. We begin with the shortest timescales and fastest dynamics, such as burstiness of call sequences between individuals, and “zoom out” towards longer temporal and larger structural scales, from temporal motifs formed by correlated calls between multiple individuals to long-term dynamics of social groups. We conclude this overview with a future outlook.
High-throughput experimental techniques and bioinformatics tools make it possible to obtain reconstructions of the metabolism of microbial species. Combined with mathematical frameworks such as flux balance analysis, which assumes that nutrients are used so as to maximize growth, these reconstructions enable us to predict microbial growth. Although such predictions are generally accurate, these approaches do not give insights on how different nutrients are used to produce growth, and thus are difficult to generalize to new media or to different organisms. Here, we propose a systems-level phenomenological model of metabolism inspired by the virial expansion. Our model predicts biomass production given the nutrient uptakes and a reduced set of parameters, which can be easily determined experimentally. To validate our model, we test it against in silico simulations and experimental measurements of growth, and find good agreement. From a biological point of view, our model uncovers the impact that individual nutrients and the synergistic interaction between nutrient pairs have on growth, and suggests that we can understand the growth maximization principle as the optimization of nutrient synergies.
In the era when Facebook and Twitter dominate the market for social media, Google has introduced Google+ (G+) and reported a significant growth in its size while others called it a ghost town. This begs the question of whether G+ can really attract a significant number of connected and active users despite the dominance of Facebook and Twitter. This paper presents a detailed longitudinal characterization of G+ based on large-scale measurements. We identify the main components of G+ structure and characterize the key features of their users and their evolution over time. We then conduct detailed analysis on the evolution of connectivity and activity among users in the largest connected component (LCC) of G+ structure, and compare their characteristics to other major online social networks (OSNs). We show that despite the dramatic growth in the size of G+, the relative size of the LCC has been decreasing and its connectivity has become less clustered. While the aggregate user activity has gradually increased, only a very small fraction of users exhibit any type of activity, and an even smaller fraction of these users attracts any reaction. The identity of users with most followers and reactions reveals that most of them are related to high-tech industry. To our knowledge, this study offers the most comprehensive characterization of G+ based on the largest collected datasets.
Recent widespread adoption of electronic and pervasive technologies has enabled the study of human behavior at an unprecedented level, uncovering universal patterns underlying human activity, mobility, and interpersonal communication. In the present work, we investigate whether deviations from these universal patterns may reveal information about the socio-economical status of geographical regions. We quantify the extent to which deviations in diurnal rhythm, mobility patterns, and communication styles across regions relate to their unemployment incidence. For this we examine a country-scale publicly articulated social media dataset, where we quantify individual behavioral features from over 19 million geo-located messages distributed among more than 340 different Spanish economic regions, inferred by computing communities of cohesive mobility fluxes. We find that regions exhibiting more diverse mobility fluxes, earlier diurnal rhythms, and more correct grammatical styles display lower unemployment rates. As a result, we provide a simple model able to produce accurate, easily interpretable reconstruction of regional unemployment incidence from their social-media digital fingerprints alone. Our results show that cost-effective economical indicators can be built based on publicly-available social media datasets.
We analyze how to value future costs and benefits when they must be discounted relative to the present. We introduce the subject for the nonspecialist and take into account the randomness of the economic evolution by studying the discount function of three widely used processes for the dynamics of interest rates: Ornstein-Uhlenbeck, Feller, and log-normal. Besides obtaining exact expressions for the discount function and simple asymptotic approximations, we show that historical average interest rates overestimate long-run discount rates and that this effect can be large. In other words, long-run discount rates should be substantially less than the average rate observed in the past, otherwise any cost-benefit calculation would be biased in favor of the present and against interventions that may protect the future.
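To make the claim about long-run discount rates concrete, the following is a minimal sketch for the Ornstein-Uhlenbeck case only. It assumes the parametrization dr = -alpha (r - m) dt + k dW, where the names `alpha`, `m`, `k` and both function names are illustrative choices, not notation taken from the paper. Under that assumption the time integral of the rate is Gaussian, so the discount function has a closed form and the long-run rate is m - k^2 / (2 alpha^2), which lies below the historical average m.

```python
import math


def ou_discount(t, r0, m, alpha, k):
    """Discount function D(t) = E[exp(-integral_0^t r(s) ds)] for an OU rate.

    Assumed model (illustrative): dr = -alpha * (r - m) dt + k dW, r(0) = r0.
    The integrated rate is Gaussian, so D(t) = exp(-mean + variance / 2).
    """
    decay = (1.0 - math.exp(-alpha * t)) / alpha
    mean = m * t + (r0 - m) * decay
    variance = (k / alpha) ** 2 * (
        t - 2.0 * decay + (1.0 - math.exp(-2.0 * alpha * t)) / (2.0 * alpha)
    )
    return math.exp(-mean + 0.5 * variance)


def long_run_rate(m, alpha, k):
    """Limit of -ln D(t) / t as t grows, under the same assumed model."""
    return m - k ** 2 / (2.0 * alpha ** 2)


# Example with made-up parameters: historical mean rate of 5%.
m, alpha, k = 0.05, 0.5, 0.1
t = 2000.0
effective_rate = -math.log(ou_discount(t, m, m, alpha, k)) / t
```

With these made-up parameters the effective rate at large horizons approaches `long_run_rate(m, alpha, k)`, which is smaller than `m`; setting `k = 0` recovers ordinary exponential discounting at rate `m`.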
Information flow during catastrophic events is a critical aspect of disaster management. Modern communication platforms, in particular online social networks, provide an opportunity to study such flow and derive early-warning sensors, thus improving emergency preparedness and response. Performance of the social networks sensor method, based on topological and behavioral properties derived from the “friendship paradox”, is studied here for over 50 million Twitter messages posted before, during, and after Hurricane Sandy. We find that differences in users’ network centrality effectively translate into moderate awareness advantage (up to 26 hours); and that geo-location of users within or outside of the hurricane-affected area plays a significant role in determining the scale of such an advantage. Emotional response appears to be universal regardless of the position in the network topology, and displays characteristic, easily detectable patterns, opening a possibility to implement a simple “sentiment sensing” technique that can detect and locate disasters.
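The "friendship paradox" behind the sensor method can be illustrated on a toy graph. The sketch below is not the pipeline used in the study; the edge list and function names are hypothetical. It compares the mean degree of a randomly chosen node with the mean degree of a node reached by following a randomly chosen tie, which is never smaller and is what makes "friends" better-connected sensors on average.

```python
from collections import defaultdict


def degrees(edges):
    """Count how many ties each node has in an undirected edge list."""
    deg = defaultdict(int)
    for u, v in edges:
        deg[u] += 1
        deg[v] += 1
    return dict(deg)


def mean_degree(edges):
    """Average degree of a node picked uniformly at random."""
    deg = degrees(edges)
    return sum(deg.values()) / len(deg)


def mean_friend_degree(edges):
    """Average degree of the node found at a random end of a random tie."""
    deg = degrees(edges)
    total = sum(deg[u] + deg[v] for u, v in edges)
    return total / (2 * len(edges))


# Hypothetical toy network: hub "a" with four contacts, plus one extra tie.
edges = [("a", "b"), ("a", "c"), ("a", "d"), ("a", "e"), ("b", "c")]
print(mean_degree(edges), mean_friend_degree(edges))
```

In this toy network the second number is larger than the first because the hub is reached through many ties.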
In this paper, we present closed-form expressions for the wave function that governs the evolution of the discrete-time quantum walk on the line when the coin operator is arbitrary. The formulas were derived assuming that the walker can either remain in place or proceed in a fixed direction but never move backward, although they can be easily modified to describe the case in which the particle can travel in both directions. We use these expressions to explore properties of magnitudes associated with the process, such as the probability mass function or the probability current, and we also consider the asymptotic behavior of the exact solution. Within this approximation, we estimate upper and lower bounds, examine the origins of an emerging approximate symmetry, and deduce the general form of the stationary probability density of the relative location of the walker.
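The kind of walk described above can be simulated directly. The following is a minimal numerical sketch, not the closed-form solution of the paper: it assumes a two-component walker in which, after the coin is applied, component 0 stays in place and component 1 moves one site to the right. The function name, the coin layout, and the Hadamard example are illustrative assumptions.

```python
import math


def quantum_walk(steps, coin, start=(1.0 + 0j, 0j)):
    """Evolve a two-component walker on sites 0..steps.

    coin: 2x2 unitary given as ((c00, c01), (c10, c11)).
    After each coin toss, component 0 stays put and component 1
    moves one site to the right (the walker never moves backward).
    Returns the probability of finding the walker at each site.
    """
    n = steps + 1
    stay = [0j] * n
    move = [0j] * n
    stay[0], move[0] = start
    for _ in range(steps):
        new_stay = [0j] * n
        new_move = [0j] * n
        # The last site can only be reached on the final step,
        # so it never needs to be propagated further.
        for x in range(n - 1):
            s, m = stay[x], move[x]
            new_stay[x] += coin[0][0] * s + coin[0][1] * m
            new_move[x + 1] += coin[1][0] * s + coin[1][1] * m
        stay, move = new_stay, new_move
    return [abs(s) ** 2 + abs(m) ** 2 for s, m in zip(stay, move)]


h = 1.0 / math.sqrt(2.0)
hadamard = ((h, h), (h, -h))  # one possible unitary coin
probabilities = quantum_walk(50, hadamard)
```

Because the coin is unitary and the shift only relocates amplitudes, the returned probabilities sum to one up to floating-point error.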
Understanding how much two individuals are alike in their interests (i.e., interest similarity) has become virtually essential for many applications and services in Online Social Networks (OSNs). Since users do not always explicitly elaborate their interests in OSNs like Facebook, how to determine users' interest similarity without fully knowing their interests is a practical problem. In this paper, we investigate how users' interest similarity relates to various social features (e.g. geographic distance); and accordingly infer whether the interests of two users are alike or unalike where one of the users' interests are unknown. Relying on a large Facebook dataset, which contains 479,048 users and 5,263,351 user-generated interests, we present comprehensive empirical studies and verify the homophily of interest similarity across three interest domains (movies, music and TV shows). The homophily reveals that people tend to exhibit more similar tastes if they have similar demographic information (e.g., age, location), or if they are friends. It also shows that the individuals with a higher interest entropy usually share more interests with others. Based on these results, we provide a practical prediction model under a real OSN environment. For a given user with no interest information, this model can select some individuals who not only exhibit many interests but also probably achieve high interest similarities with the given user. Eventually, we illustrate a use case to demonstrate that the proposed prediction model could facilitate decision-making for OSN applications and services.
This paper unveils some features of a discrete-time quantum walk on the line whose coin depends on the temporal variable. After considering the most general form of the unitary coin operator, it focuses on the role played by the two phase factors that one can incorporate there, and shows how both terms influence the evolution of the system. A closer analysis reveals that the probabilistic properties of the motion of the walker remain unaltered when the update rule of these phases is chosen adequately. This invariance is based on a symmetry with consequences not yet fully explored.
Content discovery is a critical issue in unstructured Peer-to-Peer (P2P) networks as nodes maintain only local network information. However, similarly without global information about human networks, one can still find specific persons via one's friends by using social information. Therefore, in this paper, we investigate the problem of how social information (i.e., friends and background information) could benefit content discovery in P2P networks. We collect social information of 384,494 user profiles from Facebook, and build a social P2P network model based on the empirical analysis. In this model, we enrich nodes in P2P networks with social information and link nodes via their friendships. Each node extracts two types of social features, Knowledge and Similarity, and assigns more weight to the friends that have higher similarity and more knowledge. Furthermore, we present a novel content discovery algorithm which can explore the latent relationships among a node’s friends. A node computes stable scores for all its friends regarding their weight and the latent relationships. It then selects the top friends with higher scores to query content. Extensive experiments validate the performance of the proposed mechanism. In particular, for personal interests searching, the proposed mechanism can achieve 100% of Search Success Rate by selecting the top 20 friends within two-hop. It also achieves 6.5 Hits on average, which improves the performance of the compared methods by 8x.
We review the level-crossing problem which includes the first-passage and escape problems as well as the theory of extreme values (the maximum, the minimum, the maximum absolute value and the range or span). We set the definitions and general results and apply them to one-dimensional diffusion processes with explicit results for the Brownian motion and the Ornstein–Uhlenbeck (OU) process.
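As a worked instance of the first-passage problem for Brownian motion, the sketch below evaluates the standard textbook result for a single absorbing level, not any formula quoted from the review itself. It assumes free Brownian motion with mean squared displacement 2Dt; the function names and the `distance` and `diffusion` parameters are illustrative.

```python
import math


def first_passage_cdf(t, distance, diffusion):
    """Probability that the level has been reached by time t.

    Assumes free Brownian motion with <x^2> = 2 * diffusion * t and a
    level located `distance` away from the starting point.
    """
    if t <= 0.0:
        return 0.0
    return math.erfc(abs(distance) / math.sqrt(4.0 * diffusion * t))


def first_passage_pdf(t, distance, diffusion):
    """Density of the first-passage time (time derivative of the CDF)."""
    a = abs(distance)
    prefactor = a / math.sqrt(4.0 * math.pi * diffusion * t ** 3)
    return prefactor * math.exp(-a * a / (4.0 * diffusion * t))


# Standard Wiener process (diffusion = 1/2), level one unit away.
p = first_passage_cdf(1.0, 1.0, 0.5)
```

The density decays as t^(-3/2) at long times, so the level is reached with probability one but the mean first-passage time is infinite.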
While human societies are extraordinarily cooperative in comparison with other social species, the question of why we cooperate with unrelated individuals remains open. Here we report results of a lab-in-the-field experiment with people of different ages in a social dilemma. We find that the average amount of cooperativeness is independent of age except for the elderly, who cooperate more, and a behavioural transition from reciprocal, but more volatile behaviour to more persistent actions towards the end of adolescence. Although all ages react to the cooperation received in the previous round, young teenagers mostly respond to what they see in their neighbourhood regardless of their previous actions. Decisions then become more predictable through midlife, when the act of cooperating or not is more likely to be repeated. Our results show that mechanisms such as reciprocity, which is based on reacting to previous actions, may promote cooperation in general, but its influence can be hindered by the fluctuating behaviour in the case of children.
The book contains review articles on recent advances in first-passage phenomena and applications contributed by leading international experts. It is intended for graduate students and researchers who are interested in learning about this intriguing and important topic.
We review the question of the extreme values attained by a random process. We relate it to level crossings to one boundary (first-passage problems) as well as to two boundaries (escape problems). The extremes studied are the maximum, the minimum, the maximum absolute value, and the range or span. We specialize in diffusion processes and present detailed results for the Wiener and Feller processes.