The Parable of Google Flu: Traps in Big Data Analysis
In February 2013, Google Flu Trends (GFT) made headlines but not for a reason that Google executives or the creators of the flu tracking system would have hoped. Nature reported that GFT was predicting more than double the proportion of doctor visits for influenza-like illness (ILI) than the Centers for Disease Control and Prevention (CDC), which bases its estimates on surveillance reports from laboratories across the United States (1, 2). This happened despite the fact that GFT was built to predict CDC reports. Given that GFT is often held up as an exemplary use of big data (3, 4), what lessons can we draw from this error?
Sums of variables at the onset of chaos
We explain how specific dynamical properties give rise to the limit distribution of sums of deterministic variables at the transition to chaos via the period-doubling route. We study the sums of successive positions generated by an ensemble of initial conditions uniformly distributed in the entire phase space of a unimodal map as represented by the logistic map. We find that these sums acquire their salient, multiscale features from the repellor preimage structure that dominates the dynamics toward the attractors along the period-doubling cascade. And we explain how these properties transmit from the sums to their distribution. Specifically, we show how the stationary distribution of sums of positions at the Feigenbaum point is built up from those associated with the supercycle attractors forming a hierarchical structure with multifractal and discrete scale invariance properties.
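A minimal sketch of the quantity the abstract describes, not the authors' code: it sums successive positions of the logistic map for a uniform ensemble of initial conditions. The map form x -> r x (1 - x), the approximate accumulation-point value r ≈ 3.569945672, the ensemble size, the number of steps, and the function names are all assumptions chosen for illustration.

```python
# Approximate period-doubling accumulation (Feigenbaum) point for x -> r*x*(1-x).
R_FEIGENBAUM = 3.569945672


def sums_of_positions(r, n_steps, n_initial):
    """Return, for each initial condition, the sum of its first n_steps iterates.

    Initial conditions are spread uniformly over the open interval (0, 1).
    """
    sums = []
    for i in range(1, n_initial + 1):
        x = i / (n_initial + 1)
        total = 0.0
        for _ in range(n_steps):
            x = r * x * (1.0 - x)  # one step of the logistic map
            total += x
        sums.append(total)
    return sums


def histogram(values, n_bins):
    """Count how many values fall into each of n_bins equal-width bins."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / n_bins or 1.0  # guard against all values being equal
    counts = [0] * n_bins
    for v in values:
        idx = min(int((v - lo) / width), n_bins - 1)
        counts[idx] += 1
    return counts


sums = sums_of_positions(R_FEIGENBAUM, n_steps=1024, n_initial=200)
counts = histogram(sums, n_bins=20)
```

The histogram in `counts` is only a crude empirical view of the distribution of sums; resolving the multiscale structure the abstract refers to would need a much larger ensemble and finer binning.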
The Bursty Dynamics of the Twitter Information Network
In online social media systems users are not only posting, consuming, and resharing content, but also creating new and destroying existing connections in the underlying social network. While each of these two types of dynamics has individually been studied in the past, much less is known about the connection between the two. How does user information posting and seeking behavior interact with the evolution of the underlying social network structure?
Here, we study ways in which network structure reacts to users posting and sharing content. We examine the complete dynamics of the Twitter information network, where users post and reshare information while they also create and destroy connections. We find that the dynamics of network structure can be characterized by steady rates of change, interrupted by sudden bursts. Information diffusion in the form of cascades of post re-sharing often creates such sudden bursts of new connections, which significantly change users’ local network structure. These bursts transform users’ networks of followers to become structurally more cohesive as well as more homogenous in terms of follower interests. We also explore the effect of the information content on the dynamics of the network and find evidence that the appearance of new topics and real-world events can lead to significant changes in edge creations and deletions. Lastly, we develop a model that quantifies the dynamics of the network and the occurrence of these bursts as a function of the information spreading through the network. The model can successfully predict which information diffusion events will lead to bursts in network dynamics.
Geo-located Twitter as proxy for global mobility patterns
The pervasive presence of location-sharing services has made it possible for researchers to gain unprecedented access to direct records of human activity in space and time. This article analyses geo-located Twitter messages in order to uncover global patterns of human mobility. Based on a dataset of almost a billion tweets recorded in 2012, we estimate the volume of international travelers by country of residence. Mobility profiles of different nations were examined based on such characteristics as mobility rate, radius of gyration, diversity of destinations, and inflow-outflow balance. Temporal patterns disclose the universally valid seasons of increased international mobility and the particular character of international travels of different nations. Our analysis of the community structure of the Twitter mobility network reveals spatially cohesive regions that follow the regional division of the world. We validate our result using global tourism statistics and mobility models provided by other authors and argue that Twitter is exceptionally useful for understanding and quantifying global mobility patterns.
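One of the mobility measures named above, the radius of gyration, can be sketched as the root-mean-square distance of a user's recorded locations from their centroid. The code below is an illustration under stated assumptions, not the article's procedure: it assumes a spherical Earth of radius 6371 km, great-circle (haversine) distances, and a centroid obtained by averaging 3D unit vectors; all function names are hypothetical.

```python
import math

EARTH_RADIUS_KM = 6371.0  # assumed mean Earth radius


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in km between two (lat, lon) points in degrees."""
    p1, p2 = math.radians(lat1), math.radians(lat2)
    dphi = p2 - p1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(p1) * math.cos(p2) * math.sin(dlmb / 2) ** 2
    return 2 * EARTH_RADIUS_KM * math.asin(min(1.0, math.sqrt(a)))


def centroid(points):
    """Centroid of (lat, lon) points, found by averaging 3D unit vectors."""
    x = y = z = 0.0
    for lat, lon in points:
        phi, lmb = math.radians(lat), math.radians(lon)
        x += math.cos(phi) * math.cos(lmb)
        y += math.cos(phi) * math.sin(lmb)
        z += math.sin(phi)
    lat_c = math.atan2(z, math.hypot(x, y))
    lon_c = math.atan2(y, x)
    return math.degrees(lat_c), math.degrees(lon_c)


def radius_of_gyration_km(points):
    """Root-mean-square great-circle distance of the points from their centroid."""
    lat_c, lon_c = centroid(points)
    squared = [haversine_km(lat, lon, lat_c, lon_c) ** 2 for lat, lon in points]
    return math.sqrt(sum(squared) / len(squared))


# Two made-up locations on the equator, one degree either side of longitude 0.
example_rg = radius_of_gyration_km([(0.0, -1.0), (0.0, 1.0)])
```

Averaging unit vectors avoids the wrap-around problem at the antimeridian that a plain mean of longitudes would have, although the centroid is ill-defined for exactly antipodal point sets.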
Shock waves on complex networks
Power grids, road maps, and river streams are examples of infrastructural networks which are highly vulnerable to external perturbations. An abrupt local change of load (voltage, traffic density, or water level) might propagate in a cascading way and affect a significant fraction of the network. Almost discontinuous perturbations can be modeled by shock waves which can eventually interfere constructively and endanger the normal functionality of the infrastructure. We study their dynamics by solving the Burgers equation under random perturbations on several real and artificial directed graphs. Even for graphs with a narrow distribution of node properties (e.g., degree or betweenness), a steady state is reached exhibiting a heterogeneous load distribution, having a difference of one order of magnitude between the highest and average loads. Unexpectedly we find for the European power grid and for finite Watts-Strogatz networks a broad pronounced bimodal distribution for the loads. To identify the most vulnerable nodes, we introduce the concept of node-basin size, a purely topological property which we show to be strongly correlated to the average load of a node.
Netconomics: Novel Forecasting Techniques from the Combination of Big Data, Network Science and Economics
The combination of the network theoretic approach with recently available abundant economic data leads to the development of novel analytic and computational tools for modelling and forecasting key economic indicators. The main idea is to introduce a topological component into the analysis, taking into account consistently all higher-order interactions. We present three basic methodologies to demonstrate different approaches to harness the resulting network gain. First, a multiple linear regression optimisation algorithm is used to generate a relational network between individual components of national balance of payment accounts. This model describes annual statistics with a high accuracy and delivers good forecasts for the majority of indicators. Second, an early-warning mechanism for global financial crises is presented, which combines network measures with standard economic indicators. From the analysis of the cross-border portfolio investment network of long-term debt securities, the proliferation of a wide range of over-the-counter-traded financial derivative products, such as credit default swaps, can be described in terms of gross-market values and notional outstanding amounts, which are associated with increased levels of market interdependence and systemic risk. Third, considering the flow-network of goods traded between G-20 economies, network statistics provide better proxies for key economic measures than conventional indicators. For example, it is shown that a country’s gate-keeping potential, as a measure for local power, projects its annual change of GDP generally far better than the volume of its imports or exports.
Predicting Scientific Success Based on Coauthorship Networks
We address the question of to what extent the success of scientific articles is due to social influence. Analyzing a data set of over 100,000 publications from the field of Computer Science, we study how centrality in the coauthorship network differs between authors who have highly cited papers and those who do not. We further show that a machine learning classifier, based only on coauthorship network centrality measures at time of publication, is able to predict with high precision whether an article will be highly cited five years after publication. By this we provide quantitative insight into the social dimension of scientific publishing – challenging the perception of citations as an objective, socially unbiased measure of scientific success.
Correlation of automorphism group size and topological properties with program-size complexity evaluations of graphs and complex networks
We show that numerical approximations of Kolmogorov complexity (K) of graphs and networks capture some group-theoretic and topological properties of empirical networks, ranging from metabolic to social networks, and of small synthetic networks that we have produced. That K and the size of the group of automorphisms of a graph are correlated opens up interesting connections to problems in computational geometry, and thus connects several measures and concepts from complexity science. We derive these results via two different Kolmogorov complexity approximation methods applied to the adjacency matrices of the graphs and networks. The methods used are the traditional lossless compression approach to Kolmogorov complexity, and a normalized version of a Block Decomposition Method (BDM) based on algorithmic probability theory.
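The lossless compression approach mentioned above can be illustrated roughly as follows: flatten a graph's adjacency matrix to a string and take the compressed length as an upper-bound proxy for its Kolmogorov complexity. This is a sketch under assumptions, not the authors' pipeline: the choice of zlib, the row-by-row flattening, the graph sizes, and all function names are illustrative, and the Block Decomposition Method is not shown.

```python
import random
import zlib


def ring_graph(n):
    """Adjacency matrix of a cycle on n nodes (each node linked to its two neighbours)."""
    adj = [[0] * n for _ in range(n)]
    for i in range(n):
        adj[i][(i + 1) % n] = 1
        adj[(i + 1) % n][i] = 1
    return adj


def random_graph(n, p, seed):
    """Adjacency matrix of an undirected random graph with edge probability p."""
    rng = random.Random(seed)
    adj = [[0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            if rng.random() < p:
                adj[i][j] = adj[j][i] = 1
    return adj


def compressed_size(adj):
    """Length in bytes of the zlib-compressed, row-by-row flattened adjacency matrix."""
    flat = "".join(str(bit) for row in adj for bit in row)
    return len(zlib.compress(flat.encode("ascii"), 9))


ring_size = compressed_size(ring_graph(64))
random_size = compressed_size(random_graph(64, 0.5, seed=0))
```

The highly symmetric ring compresses to far fewer bytes than the random graph of the same size, which is the kind of contrast the abstract relates to automorphism group size. The compressed length depends on how the nodes are labelled, so a single value is only a rough estimate.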
– Gottfried Mayer, Founding Editor
– Carlos Gershenson, Editor-in-Chief
Complexity Digest Subscriptions