Forecasting with Massive Data in Real Time

Tools and Analytical Techniques

MICROSOFT TECHNOLOGY CENTER, TIMES SQUARE, NY

This workshop aims to showcase the efforts of industry and academia to cope with large amounts of data that must be analyzed in real time. Because many analytical solutions are designed with a specific infrastructure in mind, the meetings will discuss not only the algorithms but also their underlying implementations and their suitability for specific applications. The focus is on automated tools.

New (and future) infrastructures/languages/paradigms to tackle massive amounts of data
Forecasting techniques and related applications
- Early Warning Systems
- Pattern recognition
- Anomaly detection
- Event detection
- Nowcasting
- Large-scale networks
Applications performing as close to real-time as possible.
Accuracy-latency-size limitations
What will automated decision-making look like in the future, and how should we prepare for it?

INVITED SPEAKERS

Panos Toulis

Booth School of Business

Implicit Stochastic Gradient Descent for Robust Statistical Analysis with Massive Data Sets

SGD is the jackknife of modern statistical analysis with very large data sets--such as deep learning--but the standard procedures can be hard to tune, and cannot effectively combine numerical stability with statistical efficiency. We present an implicit procedure that combines fast computation with a solution to the stability issues, without sacrificing statistical efficiency. Extensive simulations and real-world data analysis are carried out through our “sgd” R package. Implicit methods are poised to become the workhorse of estimation with large data sets.
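
As a rough illustration of the stability point in the abstract above, the sketch below contrasts a standard (explicit) SGD step with an implicit one for a least-squares model. It is not taken from the “sgd” R package; the function names, the squared-loss model, and the learning-rate schedule `lr0 / i` are all assumptions chosen for illustration. For squared loss, the implicit update (in which the gradient is evaluated at the new iterate) has a closed form that shrinks the step by a factor of 1 / (1 + lr * ||x||^2).

```python
import random


def dot(a, b):
    return sum(ai * bi for ai, bi in zip(a, b))


def explicit_sgd_step(theta, x, y, lr):
    """Standard SGD step for the loss 0.5 * (y - x.theta)^2."""
    resid = y - dot(theta, x)
    return [t + lr * resid * xi for t, xi in zip(theta, x)]


def implicit_sgd_step(theta, x, y, lr):
    """Implicit SGD step: solves theta_new = theta + lr * (y - x.theta_new) * x.

    For squared loss this fixed-point equation has a closed-form solution,
    which rescales the explicit step by 1 / (1 + lr * ||x||^2).
    """
    resid = y - dot(theta, x)
    scale = lr / (1.0 + lr * dot(x, x))
    return [t + scale * resid * xi for t, xi in zip(theta, x)]


def run(step, lr0, n=2000, seed=0):
    """Fit a 3-parameter linear model on simulated data (illustrative setup)."""
    rng = random.Random(seed)
    true_theta = [1.0, -2.0, 0.5]
    theta = [0.0, 0.0, 0.0]
    for i in range(1, n + 1):
        x = [rng.gauss(0.0, 1.0) for _ in range(3)]
        y = dot(true_theta, x) + rng.gauss(0.0, 0.1)
        theta = step(theta, x, y, lr0 / i)  # assumed decaying learning rate
    return true_theta, theta
```

With a deliberately oversized initial learning rate, a single explicit step can overshoot and enlarge the residual, whereas the implicit step can only shrink it; calling `run(implicit_sgd_step, lr0=50.0)` is one way to try the implicit variant under such a setting.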

Mirco Mannucci

HoloMathics, LLC

Node Alertness - Monitoring change at the Local Level in Large Evolving Graphs

Graph mining is by now an established area of data mining, and detecting patterns of evolution in dynamic graphs has already been investigated in several quarters. However, continuously monitoring change in rapidly evolving big graphs is still a challenge. This talk argues for pushing the monitoring activity down to the node level, while a global knowledge merger integrates the individual discoveries into a global picture.

We shall discuss some preliminary implementation of this technique using Apache Spark GraphFrames, as well as applications and future directions.

(Dr. Mannucci, CEO of HoloMathics, is working as a Big Data Analytics Consultant at FINRA Technology.)

James Wright

Microsoft Research

Demand Forecasting from Massive Usage Logs

Accurately forecasting the demand for distinct service offerings is a crucial task, both for forecasting revenue and for planning capacity. This talk will describe a demand modeling exercise based on complete daily usage logs from a large online service provider whose services have a rich set of attributes with complex effects on both customer demand and capacity costs.

Michael Kane

Yale University

A Cointegration Approach to Identifying Systemic Risk in Markets

In domains such as financial markets, it can be exceedingly difficult to predict what will happen. For example, news events may occur at any time; they can affect markets in a variety of ways, and they are not amenable to predictive models. However, it is often possible to gauge systemic risk, that is, the extent to which an event can affect the market. This talk analyzes the use of cointegration in financial markets to assess systemic risk, using the 2010 Flash Crash as a case study. Based on this analysis, we will explore an alternative to the current single-stock circuit breaker/collar rules employed by FINRA to control market volatility.

Carlotta Domeniconi, George Mason University

Uday Kamath, BAE Systems Applied Intelligence

Finding Needles in Many Haystacks: A General-purpose Distributed Approach to Large-scale Learning

The need for mining massive data has become paramount in areas like security, education, web mining, social network analysis, and a variety of scientific pursuits. However, many traditional supervised and unsupervised learning algorithms break down when applied in big data scenarios: this is known as the “big data problem”. Among other concerns, big data presents serious scalability difficulties for these algorithms.

The two most common ways to get around the scalability issue are either to sample the data to reduce its size, or to customize the learning algorithm to improve its running time via parallelization. Both of these have problems. Sampling often fails because the discovery of useful patterns can require the analysis of the entire collection of data (we call this the needles in the haystacks problem). Techniques that customize individual algorithms typically do not generalize to other algorithms. Further, many standard parallelization methods used in this customization can be inefficient when used for the iterative computation that is so often a core part of machine learning algorithms.

In this talk, we discuss current approaches and tools in use to address these shortcomings. In particular, we introduce a method for distributed machine learning which directly tackles key problems posing challenges to successful and scalable mining of big data. We bring together ideas from stochastic optimization and ensemble learning to design a novel and general paradigm to achieve scalable machine learning. The method is general-purpose with regard to the machine learning algorithm and easily adaptable to a variety of heterogeneous grid or cloud computing scenarios. In a nutshell, the emergent behavior of a grid of learning algorithms makes possible the effective processing of large amounts of data, culminating in the discovery of that fraction of data that is crucial to the problem at hand. The emergent behavior only requires local interaction among the learners, resulting in a high speed-up under parallelism. The method does not sacrifice accuracy like sampling does, while at the same time it achieves a general scalable solution that doesn’t need to be tailored for each algorithm.

Jayant Kalagnanam

IBM Research

A Massive Data-Driven Platform for Manufacturing Analytics

Industry 4.0 is the use of IoT to enable real-time access to sensor data for the various assets and processes in the production value chain, and the use of this data to create a digital model (also referred to as a cyber-physical system) that accurately represents the physical world. The digital model is then used for situational awareness, anomaly detection, process monitoring, and advisory control for optimizing outcomes (defined by productivity and throughput). I will present ongoing work at IBM Research for Industry 4.0 that is experimenting with a large-scale data ingestion and analytics platform that leverages statistical and machine learning techniques to drive cost savings and operational efficiency across the factory value chain.

Domenico Giannone

Federal Reserve Bank of New York

Now-Casting and the Real-Time Data Flow (Preliminary title and abstract)

The term now-casting is a contraction of “now” and “forecasting”; it has long been used in meteorology and, more recently, in economics. In this presentation we survey recent developments in economic now-casting, with special focus on models that formalize key features of how market participants and policymakers read macroeconomic data releases in real time: monitoring many data series, forming expectations about them, and revising the assessment of the state of the economy whenever realizations diverge sizeably from those expectations.

Topics to discuss will include state space representations (factor model, model with daily data, mixed-frequency VAR), now-cast updates and news, and practical models (bridge and MIDAS-type equations).

Empirical applications will cover GDP now-casting and a daily index of the state of the economy. Current and future directions on the implementation of these approaches will also be described.

Keynote Speaker

Paul Cohen

DARPA

Beyond Big Data: Technology to Understand Complicated Systems

A difficult question for big data analytics is ‘why?’ Specific questions include the following: Why did this drug stop working? Why is there food insecurity in specific regions of the world? We have component-level, not system-level, understanding of the complicated systems on which we depend for survival. Two current Defense Advanced Research Projects Agency (DARPA) programs are addressing the need for causal models and quantitative analysis to answer the ‘why.’

The World Modelers program aims to develop technology to integrate qualitative causal analyses with quantitative models and relevant data to provide comprehensive understanding of complicated, dynamic national security questions. The goal is to develop approaches that can accommodate and integrate dozens of contributing models connected by thousands of pathways—orders of magnitude beyond what is possible today—to provide clearly parameterized, quantitative projections within weeks or even hours of processing, compared to the months or years it takes today to understand considerably simpler systems. The first use case of World Modelers is food insecurity resulting from interactions among climate, water, soil, markets, and physical security.

Big mechanisms are large, explanatory models of complicated systems in which interactions have important causal effects. The collection of big data is increasingly automated, but the creation of big mechanisms remains a human endeavor made increasingly difficult by the fragmentation and distribution of knowledge. The goal of the Big Mechanism program is for machines to help humans to model and analyze very complicated systems by reading fragmented literatures and assembling the reasoning with models. The domain of the program is cancer biology with an emphasis on signaling pathways. Although the domain of the Big Mechanism program is cancer biology, the overarching goal of the program is to develop technologies for a new kind of science in which research is integrated more or less immediately—automatically or semi-automatically—into causal, explanatory models of unprecedented completeness and consistency. To the extent that the construction of big mechanisms can be automated, it could change how science is done.

CALL FOR PAPERS

SCIENTIFIC COMMITTEE

Claudio Antonini

Michael Kane, Yale

George Monokroussos, Amazon

Selected papers will be published in a special section of the International Journal of Forecasting. Authors of selected papers or presentations will be invited to submit full papers to the IJF.

The Call for Papers is closed.

DATES

Abstract submission deadline 28 February 2017

Abstract acceptance 10 March 2017

Early registration deadline 17 March 2017

International Symposium on Forecasting 25-28 June 2017
