Market Reconstruction 2.0: Visualization at Scale - FIS


Market Reconstruction 2.0: Visualization at Scale
Neil Palmer, Justine Chen, Sam Sinha, Filipe Araujo, Michael Grinthal, Fawad Rafi - FIS
{neil.palmer; justine.chen; sam.sinha; filipe.araujo; michael.grinthal; fawad.rafi}@fisglobal.com

Abstract This paper describes the front-end architecture for a large-scale securities transaction surveillance and forensics platform capable of supporting the ingestion, linkage and analysis of granular U.S. equity and equity options market events spanning multi-year periods. One component of this platform is a regulatory user interface (UI) that facilitates the navigation and visualization of the entire universe of U.S. market events, expected to grow to a size of 35 petabytes (PB) in seven years. Various aspects of the front end’s construction, architectural design and UI and user experience (UX) approaches are detailed, including the key benefits and drawbacks of the chosen architecture and technology stack.

1 Introduction FIS™ developed the Market Reconstruction Platform (MRP) 1 using Google Cloud Dataflow and Google Cloud Bigtable as tools in processing and indexing highly granular market data events, ultimately publishing the results into a Google BigQuery dataset for visualization and analytics. 2 This last step afforded the UI team an opportunity to design an interface with analytical capabilities that will allow the audience of the platform to visualize the correlation of nearly 35 PB of U.S. equity and equity options data. Visualization of this dataset is effectively the last mile – the culmination point of vigorous number crunching by 3,500 nodes in Google’s Cloud, all working together to link disparate events on the order of billions. 3 Ultimately, this is where the responsibility lies for providing regulators access to an ocean of trade data in a manner that is simple, intuitive and responsive. To design such an interface, it was crucial to understand the inherent challenges of presenting data at these scales, as well as the prevailing limitations to applying advanced visualization techniques within modern web browsers. In the following sections, the two key goals of the web application are discussed, as well as our initial approaches to achieving them. We will dissect the associated challenges, the synthesized data used to simulate the requisite scale demanded of the platform, and the design decisions that were made, from both usability and architecture perspectives.

2 A window into 35 petabytes of time-series data 2.1 The challenge The principal challenge of the interface can be summarized as providing regulators with a way to discover events of interest in a timely and intuitive manner – from a dataset of market events expected to grow to 35 PB. The primary responsibilities of the web application, therefore, are to provide regulators with the ability to navigate the dataset, visualizing the results in a meaningful way and ensuring that all security measures for access control are seamless. To provide a sense of the scale of information that regulators must absorb, interpret and act upon via the platform, the primary U.S. options market data feed, SIAC OPRA, in April of 2016 observed a peak rate of over 10,000,000 messages per second, 4 with the average size of each message being approximately 66 bytes. 5 This adds up to approximately 5.2 Gbps of information. The productive navigation of arbitrarily event-dense windows of time, spanning multiple execution venues and instruments, and delivered exclusively through a modern browser interface, is the crux of the challenge with which we are faced.
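The bandwidth figure quoted above follows from simple arithmetic on the two cited numbers. A minimal sketch of that calculation, in which the function name is illustrative only and decimal units (1 Gbps = 10^9 bits per second) are assumed:

```javascript
// Convert a message rate and an average message size into gigabits per second.
// Assumes decimal units: 1 Gbps = 1e9 bits per second.
function feedRateGbps(messagesPerSecond, bytesPerMessage) {
  const bitsPerSecond = messagesPerSecond * bytesPerMessage * 8;
  return bitsPerSecond / 1e9;
}

// Peak OPRA message rate and average message size cited in the text.
const peakGbps = feedRateGbps(10000000, 66);
console.log(peakGbps.toFixed(2) + " Gbps");
```

With 10,000,000 messages per second at 66 bytes each, the product is 5.28 × 10^9 bits per second, consistent with the "approximately 5.2 Gbps" stated above.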

2.2 A scatterplot of events Palmer (2016) extensively details the ingestion and indexing process for the universe of synthetic market events visualized herein. One particularly noteworthy characteristic of the MRP’s overall system architecture is the absence of a traditional middleware API tier. As the dataset must physically reside remotely from the edge browser performing the visualizations, the volume and frequency of communication between the browser and the ultimate repository is a function of the volume of data narrowed by the user’s search criteria. While a standard, REST-style API design for such a system need not be complex or intricate, the ability to eschew traditional middleware tiers brings a welcome reduction to the net operational burden of the solution. In lieu of a dedicated middleware tier, the front-end application communicates with BigQuery directly, sending SQL constructed by the front-end over HTTPS to BigQuery’s Google Cloud Platform API endpoint.

1 see “Solving the Biggest Big Data Challenge in Capital Markets” at http://www.fisglobal.com/Solutions/Institutional-and-Wholesale/Broker-Dealer/Market-Reconstructions-and-Visualization
2 N. Palmer, S. Sferrazza, S. Just, A. Najman (2016) “Market Visualization 2.0: A Financial Services Application of Google Cloud Bigtable and Google Cloud Dataflow”
3 Ibid.
4 Financial Information Forum (2016) “FIF April 2016 Market Data Capacity Statistics”
5 see Nanex (2012) “Nanex Compression on an OPRA Direct Feed” at http://www.nanex.net/opracompression.html


We found the lack of a dedicated middleware tier to be a key enabler for the rapid development of the front-end application. However, this placed a stronger emphasis on the format of the schema that the front-end would be querying directly. In essence, the published dataset’s schema was the primary back-end API against which the visualizations were developed. For both performance and usability factors, the schema was denormalized in order to align the query patterns of the front-end to the storage of the underlying events. Figure 1 describes the attributes of the market event data available to the front-end for analysis and visualization. 6
Figure 1: Market event data attributes.
+-----------------+--------------------------------+
| Last modified   | Schema                         |
+-----------------+--------------------------------+
| 24 Mar 11:52:33 | |- id: string                  |
|                 | |- Parent: string              |
|                 | |- GrandParent: string         |
|                 | |- timestamp: string           |
|                 | |- EventId: string             |
|                 | |- ReporterId: string          |
|                 | |- Child: string               |
|                 | |- CustomerId: string          |
|                 | |- CustomerIdValid: string     |
|                 | |- side: string                |
|                 | |- CHILD_side: string          |
|                 | |- symbol: string              |
|                 | |- CHILD_symbol: string        |
|                 | |- CHILD_ordType: string       |
|                 | |- eventType: string           |
|                 | |- brokenTag: string           |
|                 | |- ReporterIdValid: boolean    |
|                 | |- SideMatch: boolean          |
|                 | |- SymbolMatch: boolean        |
|                 | |- SymbolValid: boolean        |
|                 | |- OrderTypeMatch: boolean     |
|                 | |- splitCount: integer         |
|                 | |- timeInForce: integer        |
|                 | |- size: float                 |
|                 | |- avgPrice: float             |
+-----------------+--------------------------------+

From here, it was a logical step to identify a few key pieces of information for display, and then offer a way to drill down for further details about each event and its associated relationships. This being a historical time-series dataset, plotting events on a timeline quite naturally affords an intuitive perspective to their visualization. Another dimension we chose to plot is the order size (denoting the volume of shares) of the event. Finally, a third identifier, the event type, is represented using color. With order sizes plotted against the time dimension and colored by event type, these three attributes provide end users with fundamental, event-specific information at a single glance. The next step adds interactivity to the flat X-Y event chart, in the form of a tooltip that appears as the cursor hovers over an event. The tooltip contains additional information about the targeted event such as its side (BUY or SELL) and reporter ID (i.e. the originating market participant). In addition, panning and zooming capabilities provide a further layer of interactivity to pinpoint events of interest in a sea of trades that may have taken place for a security within only a matter of hours.

6 Palmer et al., supra


Figure 2: MRP provides a multi-dimensional visualization of events on a timeline.

2.2.1 Querying for events The scatter plot is generated by selecting query parameters from the Search page with inputs such as:
● Stock Symbol
● Side (Buy or Sell)
● Reporter ID
● Event Type (New, Routed, Filled, etc.)
● Start and End Dates
● Start and End Times
● Min and Max Prices
● Min and Max Order Sizes

Figure 3.

The user-specified parameters from the query page are subsequently transformed into a query by the Events API. The query is then executed on a BigQuery dataset containing event data. The JSON response is subsequently converted into a JavaScript array and then fed into the Polymer scatter plot web component via data binding. A Polymer observer function on the array object kicks off the Data-Driven Documents (D3) scatter plot generation process as soon as the array changes, which may be triggered by the user searching for a different symbol.
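As an illustration of how search parameters might be turned into SQL on the front end, the sketch below builds a query string from a parameter object. It is not the Events API itself: the table name `mrp.events`, the function names, and the parameter object's property names are hypothetical, and the string comparison on `timestamp` assumes timestamps are stored in a lexically sortable format. Column names are taken from the schema in Figure 1.

```javascript
// Hypothetical table name; the actual dataset name is not given in the text.
const EVENTS_TABLE = "mrp.events";

// Quote a string literal for SQL, escaping backslashes and single quotes.
function sqlString(value) {
  return "'" + String(value).replace(/\\/g, "\\\\").replace(/'/g, "\\'") + "'";
}

// Reject anything that is not a finite number before it reaches the SQL text.
function sqlNumber(value) {
  const n = Number(value);
  if (!Number.isFinite(n)) throw new Error("Not a number: " + value);
  return String(n);
}

// Build the events query from the Search page inputs. Only `symbol` is required.
function buildEventsQuery(params) {
  const where = ["symbol = " + sqlString(params.symbol)];
  if (params.side) where.push("side = " + sqlString(params.side));
  if (params.reporterId) where.push("ReporterId = " + sqlString(params.reporterId));
  if (params.eventType) where.push("eventType = " + sqlString(params.eventType));
  // Assumes timestamps are strings that sort lexically (e.g. ISO 8601).
  if (params.start) where.push("timestamp >= " + sqlString(params.start));
  if (params.end) where.push("timestamp < " + sqlString(params.end));
  if (params.minSize != null) where.push("size >= " + sqlNumber(params.minSize));
  if (params.maxSize != null) where.push("size <= " + sqlNumber(params.maxSize));
  if (params.minPrice != null) where.push("avgPrice >= " + sqlNumber(params.minPrice));
  if (params.maxPrice != null) where.push("avgPrice <= " + sqlNumber(params.maxPrice));
  return "SELECT EventId, timestamp, size, side, eventType, ReporterId" +
    " FROM " + EVENTS_TABLE +
    " WHERE " + where.join(" AND ") +
    " ORDER BY timestamp";
}
```

For example, `buildEventsQuery({ symbol: "GOOG", side: "BUY", minSize: 100 })` yields a query filtered on symbol, side and minimum order size. Because the browser composes the SQL, escaping every user-supplied value (or using the query service's parameter binding, where available) matters more here than it would behind a middleware tier.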

2.2.2 Implementing with D3 The scatter plot is built exclusively with functions available in the D3.js library. 7 Apart from some initial challenges on how to place groups of elements on the SVG so that they maintain their interactive properties, we found D3 to be an excellent library for displaying the type and volume of events data with which the system will work. Some experimentation was warranted to ensure that panning and zooming capabilities were not in conflict with each other and, more importantly, did not shield the dots from mouseover, mouseout and click events – the latter being critical to identifying an interesting event and subsequently navigating to its lifecycle. The net result was a series of steps to create the chart, add interactive capabilities, plug in the data, and let D3 do the heavy lifting. The fundamental algorithm for this chart rendering can be broken down, as illustrated in Figure 4.
Figure 4: A breakdown of the chart’s algorithm.
● Initialization
  ○ Date format manipulation and sorting
  ○ Get chart dimensions with this.getBoundingClientRect(), and derive the chart height and width
● Generate scales and axes from the data
  ○ Create the x and y scales with D3 functions
    ▪ d3.time.scale() for the time (x) axis
    ▪ d3.scale.linear() for the order size (y) axis
  ○ Create the x and y axes and pass the respective scales
    ▪ d3.svg.axis()
● Create the chart SVG with the generated axes
● Append the chart SVG to the parent container in the page
● Add legend and info card to the chart SVG
● Add capabilities and plug in the data
  ○ Use D3's zoom behavior and pass it the x scale for zooming on the x axis, and direct it to redraw the event dots upon zooming
  ○ Define and add a clip-path so that circles outside the range disappear instead of bleeding out of the graph
  ○ Append the x and y axes to the main chart
  ○ Bind the data, append circles to the clip-path, and add necessary attributes (such as event ID) to identify them
  ○ Add mouseover and mouseout events to show/hide the tooltip - the tooltip picks up event details from attributes added to the dots
  ○ Add click event to pick up the eventId of the dot and direct the user to the lifecycle page for the event
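The steps in Figure 4 can be sketched roughly as follows, using the D3 version 3 API that the function names above belong to. This is an illustrative outline rather than the platform's code: the function name `buildScatterPlot`, the `onSelect` callback, the margin values and the shape of each event object (`eventId`, `time`, `size`, `eventType`) are assumptions, and the legend, info card and tooltip are omitted for brevity.

```javascript
// Sketch against the D3 v3 API. `d3` is passed in so the function has no
// global dependency. Each event is assumed to look like
// { eventId: String, time: Date, size: Number, eventType: String }.
function buildScatterPlot(d3, container, events, onSelect) {
  var margin = { top: 20, right: 20, bottom: 30, left: 50 };
  var bounds = container.getBoundingClientRect();
  var width = bounds.width - margin.left - margin.right;
  var height = bounds.height - margin.top - margin.bottom;

  // Scales and axes derived from the data.
  var x = d3.time.scale()
    .domain(d3.extent(events, function (d) { return d.time; }))
    .range([0, width]);
  var y = d3.scale.linear()
    .domain([0, d3.max(events, function (d) { return d.size; })])
    .range([height, 0]);
  var xAxis = d3.svg.axis().scale(x).orient("bottom");
  var yAxis = d3.svg.axis().scale(y).orient("left");
  var color = d3.scale.category10(); // one color per event type

  // Zoom on the x scale only; the behavior is attached to the chart group so
  // gestures that start on a dot bubble up to it and dots keep their events.
  var zoom = d3.behavior.zoom().x(x).on("zoom", redraw);
  var chart = d3.select(container).append("svg")
    .attr("width", bounds.width)
    .attr("height", bounds.height)
    .append("g")
    .attr("transform", "translate(" + margin.left + "," + margin.top + ")")
    .call(zoom);

  // Invisible rectangle so empty plot space also receives pan/zoom gestures.
  chart.append("rect")
    .attr("width", width).attr("height", height)
    .style("fill", "none").style("pointer-events", "all");

  // Clip path so dots outside the plot area are hidden.
  chart.append("clipPath").attr("id", "plot-clip")
    .append("rect").attr("width", width).attr("height", height);

  var gx = chart.append("g").attr("class", "x axis")
    .attr("transform", "translate(0," + height + ")").call(xAxis);
  chart.append("g").attr("class", "y axis").call(yAxis);

  var dots = chart.append("g").attr("clip-path", "url(#plot-clip)")
    .selectAll("circle").data(events).enter().append("circle")
    .attr("r", 3)
    .attr("cx", function (d) { return x(d.time); })
    .attr("cy", function (d) { return y(d.size); })
    .attr("data-event-id", function (d) { return d.eventId; })
    .style("fill", function (d) { return color(d.eventType); })
    .on("click", function (d) { onSelect(d.eventId); });

  // The zoom behavior has already updated the x scale; reposition axis and dots.
  function redraw() {
    gx.call(xAxis);
    dots.attr("cx", function (d) { return x(d.time); });
  }
  return chart;
}
```

Attaching the zoom behavior to the enclosing group, rather than to an overlay drawn above the dots, is one common way to keep pan and zoom from shielding the dots from mouse events.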

2.3 Limitations of the scatter plot So far, the combination of a Polymer web component housing the D3.js scatter plot gives us a great way to display event details on a timeline. However, it does not solve the more difficult part of the problem – scale. A search for events within a large enough time window can return a set that is simply too large for the browser to effectively handle. While in our test runs we have seen the interface handle up to 3,500 events gracefully, choppiness is evident at event volumes over this threshold. Interactive features such as panning and zooming, which require the browser to calculate placement of dots and redraw the SVG, can bring the animation frame rate down to single digits. Additionally, browsers running in less than ideal conditions, such as on workstations with older hardware, must also be accommodated.
Figure 5: Steady increases in events tested well, yet the animation frame rate declined past a certain threshold.

7 see “D3” at https://d3js.org/

Figure 5 illustrates a fairly linear increase in both BigQuery response times (in blue) and D3 render times (in green), as the number of events steadily increases from 1,000 to 10,000. However, it does not account for the panning and zooming animations. From the chart, we can see that D3 takes 827 milliseconds to render 10,000 events – a substantial part of which is used to calculate dot placement on the SVG. Each time a zoom or pan event is triggered, D3 uses a translation vector to redraw the dots, which is unsustainable for a large number of events. On the other hand, removing pan and zoom capabilities results in the likelihood of highly cluttered or illegible plots. Since the ultimate goal is to facilitate navigation of the massive dataset, UI capabilities are a first-class feature. Illegibility is not an option. Furthermore, constraining the time inputs of the search page to a smaller maximum window is also not an ideal solution to the problem. There is always the possibility that an inordinate number of events occurs within a time window that was previously deemed acceptable. It was necessary to develop a solution that would be agnostic to the actual granularity of the time windows under analysis. To accomplish this, we realized the need to continually aggregate events that lie past a certain optimal threshold (largely constrained by device interface characteristics) before providing the results to D3 for rendering of the particular time frame.

2.4 Aggregation as a tactic for managing dataset navigation 2.4.1 Slicing time-series data In the scenario in which the user submits a query spanning a large enough time window, such that the response cannot be shown on a scatter plot, some form of aggregation is necessary. The ultimate goal for the UI is to allow the user to gracefully and fluidly zoom in step by step until they reach a granularity where the UI can display the scatter plot without any visual or user-experience degradation.


With that in mind, we found that Binned Aggregation is a good strategy for the type of data with which we are working. Binning is a way to aggregate data by counting the number of points falling within a predefined “bin” – a logical set of dimensions exhibited by the data. 8 Numerical variables are used to define bins as adjacent intervals over a continuous range. Since this data is represented as a time series, we chose hours as the lowest granularity of aggregation. This means that each bin will contain an hour's worth of event counts for the particular symbol in question, as illustrated in Figure 6. Figure 6: An example of Binned Aggregation using event counts for symbols in hour blocks.

The advantage to this approach is that we can easily sum up the number of events in any combination of hour blocks desired, as illustrated in Figure 7.

8 Z. Liu, B. Jiang, J. Heer (2013) “imMens: Real-time Visual Querying of Big Data”


Figure 7: Binned Aggregation provides the flexibility to sum up events in any combination of hour blocks.
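The idea behind Figures 6 and 7 can be shown with a small in-memory sketch: events are counted into hour bins, and coarser totals are obtained by summing the bins that share a day or month prefix. In the platform this aggregation is performed by BigQuery rather than in the browser; the function names, the UTC-based bin key and the event object shape below are assumptions made for illustration.

```javascript
// Build an hour bin key such as "2016-03-24T10" (UTC) from an event timestamp.
function hourKey(timestamp) {
  return new Date(timestamp).toISOString().slice(0, 13);
}

// Count events per hour bin, broken down by event type.
function binByHour(events) {
  const bins = new Map();
  for (const e of events) {
    const key = hourKey(e.timestamp);
    const bin = bins.get(key) || {};
    bin[e.eventType] = (bin[e.eventType] || 0) + 1;
    bins.set(key, bin);
  }
  return bins;
}

// Sum every hour bin whose key starts with `prefix`,
// e.g. "2016-03-24" for one day or "2016-03" for one month.
function rollUp(bins, prefix) {
  const totals = {};
  for (const [key, bin] of bins) {
    if (!key.startsWith(prefix)) continue;
    for (const type of Object.keys(bin)) {
      totals[type] = (totals[type] || 0) + bin[type];
    }
  }
  return totals;
}
```

Because each bin holds only counts, any combination of hour blocks can be summed without revisiting the underlying events, which is what makes the coarser zoom levels cheap to derive.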

To aggregate events in this manner, it is necessary to extract the hour, day and month attributes of each event’s time stamp into their own distinct columns, and to group events by those values. In a traditional RDBMS, for performance reasons, this might warrant saving intermediate query results into a separate temporary table on which to run further aggregations. However, as illustrated in Figure 8, BigQuery accommodates aggregation queries that group by multiple columns with ease.
Figure 8: Response times were not materially affected by sharp increases in datasets.


In one performance test iteration, an aggregation query was executed using a group-by clause on the hour, day, month and event type, and run in serial against datasets exhibiting exponentially increasing overall sizes. We took the average response time from 25 executions of this particular query and, as indicated by the slope of the trend line, there was only an increase in response time of a single minute, despite the increasingly sharp growth of the datasets. For datasets containing from 150,000 rows to 75,000,000 rows, we concluded that there was no material change in response times.
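An aggregation query of the kind described above, grouped by hour, day, month and event type, might be composed as in the following sketch. The exact query used in the test is not given in the text, so this is an assumption throughout: it targets BigQuery standard SQL, the table name is supplied by the caller, the `eventCount` and `totalSize` aliases are invented, and the `timestamp` column (a string in the published schema) is assumed to be parseable by `TIMESTAMP()`. It has not been run against a live project.

```javascript
// Compose an hourly aggregation query for one symbol.
// Assumes BigQuery standard SQL; `table` and `symbol` are validated against
// conservative patterns because they are concatenated into the SQL text.
function buildHourlyAggregationQuery(table, symbol) {
  if (!/^[A-Za-z0-9_.]+$/.test(table)) throw new Error("Unexpected table name");
  if (!/^[A-Z0-9.]+$/.test(symbol)) throw new Error("Unexpected symbol");
  return [
    "SELECT",
    "  EXTRACT(MONTH FROM TIMESTAMP(`timestamp`)) AS month,",
    "  EXTRACT(DAY FROM TIMESTAMP(`timestamp`)) AS day,",
    "  EXTRACT(HOUR FROM TIMESTAMP(`timestamp`)) AS hour,",
    "  eventType,",
    "  COUNT(*) AS eventCount,",
    "  SUM(size) AS totalSize",
    "FROM " + table,
    "WHERE symbol = '" + symbol + "'",
    "GROUP BY month, day, hour, eventType",
    "ORDER BY month, day, hour, eventType"
  ].join("\n");
}
```

Each result row corresponds to one section of a stacked bar: a count and a total order size for a single event type within a single hour bin.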

2.4.2 A conversation between the browser and BigQuery Building upon the aggregation technique above, a play-by-play interaction between the browser and BigQuery was developed. The interaction begins with the user entering the query parameters in a form and then submitting the form. Figure 9 highlights the interaction during a search of all market events in symbol GOOG for the entire first quarter of 2016. Figure 9: In this simple interactive chart created in D3, a tooltip helps users drill down into more detail.


Figure 10.

The stacked column chart shown in Figure 9 is a simple interactive chart created in D3. When the user moves the pointer over a particular section of a stacked bar, a tooltip is displayed with relevant details, such as the number of total events for that section or the total order size of that event type. This helps end users determine how to drill down further. Clicking on a bar will launch another query to retrieve further breakdowns as shown in Figures 11-15.
Figure 11.


Figure 12.

Figure 13.


Figure 14.

Figure 15.

At every zoom level, all the generated bar charts persist within the interface, allowing the user to zoom in and out at any point by selecting a different bar. The illustration in Figure 16 depicts the zoom-in process.
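One way to express the drill-down decision made when a bar is clicked is sketched below. The level names, the function name and the use of the 3,500-event figure from the earlier test runs as a switch-over threshold are assumptions for illustration; the text does not state the exact rule the application applies.

```javascript
// Zoom levels from coarsest to finest; "scatter" means raw events are plotted.
const LEVELS = ["month", "day", "hour", "scatter"];

// Assumed threshold, based on the test runs in which about 3,500 events
// rendered smoothly in the scatter plot.
const SCATTER_THRESHOLD = 3500;

// Decide which view to request after the user clicks a bar at `currentLevel`
// that represents `eventCountInBar` events.
function nextView(currentLevel, eventCountInBar) {
  const i = LEVELS.indexOf(currentLevel);
  if (i < 0 || currentLevel === "scatter") {
    throw new Error("Not an aggregate level: " + currentLevel);
  }
  // Few enough events: go straight to the scatter plot.
  if (eventCountInBar <= SCATTER_THRESHOLD) return "scatter";
  // Otherwise step to the next finer aggregate. Hours are the finest bin, so
  // an over-full hour stays aggregated until other filters narrow it further.
  return LEVELS[Math.min(i + 1, LEVELS.indexOf("hour"))];
}
```

For instance, `nextView("month", 50000)` returns `"day"`, while `nextView("day", 2000)` returns `"scatter"`.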


Figure 16: Zoom-in process.

2.5 Future enhancements 2.5.1 Graceful rendering of overlapping events The granularity of the data being fed into D3 impacts the way events are displayed in the scatter plot. Due to the high volume of events taking place in a short period of time, the probability of overlapping events is high – and even higher when the time stamp unit is in seconds rather than milliseconds. Zooming to enough detail will separate events that are very close; however, events with identical order size and time stamp (especially when rolled up into seconds) cannot be separated in the graph, as shown in Figure 17. The UI should ultimately account for this.


Figure 17.

To get around this limitation, an improvement under consideration is to display a pop-up tooltip that separates multiple overlapped events with click-handlers for each individual event, facilitating navigation to their respective, individual lifecycles.
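A first step toward such a tooltip would be to detect which events land on exactly the same point. The sketch below groups events by identical time stamp and order size; the function name and the event object shape are assumptions, and how the groups are then rendered is left open.

```javascript
// Group events that would be drawn at exactly the same position on the plot:
// identical time stamp and identical order size.
function groupOverlapping(events) {
  const groups = new Map();
  for (const e of events) {
    const key = e.timestamp + "|" + e.size;
    if (!groups.has(key)) groups.set(key, []);
    groups.get(key).push(e);
  }
  // Each entry is one dot; entries with length > 1 need a multi-event tooltip.
  return Array.from(groups.values());
}
```

A group with more than one member could then be drawn as a single dot whose tooltip lists a clickable entry per event ID.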

2.5.2 Improving user interaction with a Voronoi Grid Voronoi Tessellation is a method of dissecting a plane containing multiple points into regions represented by convex polygons. Each polygon contains exactly one generating point, and every location within a given polygon is closer to the generating point of the polygon than any other point. The resulting diagram is also known as a Voronoi diagram. Each edge therefore lies where the distances to its two nearest generating points are equal. The D3 library contains a d3.geom.voronoi() function which, when fed the X and Y scales and the data representing the events, generates a Voronoi grid overlay. 9 Figure 18 shows a comparison of two identical plots both with and without the Voronoi grid.
Figure 18: Two identical plots – one with the Voronoi grid and one without.

9 see “Using a D3 Voronoi grid to improve a chart's interactive experience” at http://www.visualcinnamon.com/2015/07/voronoi.html


D3 offers us the ability to style the grid in any way we choose. In the above illustration, we applied color to make it visible, but ultimately the grid should be invisible. Once we have a Voronoi grid overlay on the graph, we switch the mouseover event target from the dots over to the polygons surrounding the dots – thus sparing the user the trouble of having to move the cursor directly over to the dot to generate the tooltip. Not only does this result in a smoother, more pleasant user experience, it also makes the application more accessible for users with vision or motor skill difficulties.
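The effect of the overlay can be understood without the grid itself: the Voronoi cell that contains the cursor is, by definition, the cell of the nearest dot. The sketch below finds that dot with a linear scan. It is a stand-in used to illustrate the behavior, not a substitute for d3.geom.voronoi(), which precomputes the polygons so that no per-movement search is needed; the function name and point shape are assumptions.

```javascript
// Return the point nearest to the cursor. By the definition of a Voronoi
// diagram, this is the generating point of the cell containing the cursor.
// Points are assumed to carry pixel coordinates as { x, y }.
function nearestPoint(points, cursorX, cursorY) {
  let best = null;
  let bestDistSq = Infinity;
  for (const p of points) {
    const dx = p.x - cursorX;
    const dy = p.y - cursorY;
    const distSq = dx * dx + dy * dy; // squared distance is enough to compare
    if (distSq < bestDistSq) {
      bestDistSq = distSq;
      best = p;
    }
  }
  return best; // null when `points` is empty
}
```

With the overlay in place the browser's own hit testing on the polygons performs this lookup, so the tooltip can follow the cursor anywhere on the plot.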

2.5.3 Filtering and visual cues Two other features that can greatly enhance the user experience when looking through a large number of events are filtering and sizing event dots.
• Filtering: A filtering bar at the top of the graph enables the user to quickly suppress the display of specific event types or of order sizes outside a parameterizable threshold.
• Sizing event dots in proportion to the value of a specific data dimension: An easy way to communicate the relative values of events within the scatter plot is to draw the event dots proportionally to the size (or another arbitrary dimension) of the event, such that larger order sizes would yield larger circle radii when rendered. A simple visual cue such as this can go a long way in helping the user quickly identify events of interest on the plot.

3 Visualizing relationships among market events 3.1 The challenge While the first challenge is navigating a massive number of distinct events to ultimately identify events of interest, the second challenge is to present the user with a view of the end-to-end chain to which the selected event belongs. This gives the user the ability to trace the entire lifecycle of an order event starting from the initial order down to the ultimate fulfillment or cancellation of each child order. Clicking on an event in the scatter plot takes the user to the lifecycle tab of the UI, where D3 is used to generate the lifecycle graph.

3.2 Event lifecycle attributes A lifecycle is a directed graph comprising events as nodes. When a beneficiary order to buy or sell listed equities or options is placed, a new order is created by the participant in receipt of the order (typically a broker-dealer). To minimize transaction costs or speed up the order execution, the participant may route all or part of that order to another exchange or alternative trading system (ATS). The broker-dealer may choose to fulfill all or part of that order from its own inventory. 10 Alternatively, a participant may split the order up and route it to many different exchanges. 11 The order may go through a series of these routes before finally being fulfilled. At any time, the entire order may be cancelled or replaced as well. A typical lifecycle may contain the following distinct events:
• New order
• Routed order
• Intermediary or representative order (i.e. a routed order that has split into multiple orders)
• Filled order (i.e. the completion of an order)
• Cancelled order
For each event, the following attributes are among those captured:
• Time stamp
• Order size (# of shares)
• Symbol
• Market Participant Identifier (MPID)
• Side (e.g. buy or sell)
• Price (if specified, e.g. for a limit order)
• Account (i.e. beneficial party of the order instruction)

10 see “Internalization” at https://www.sec.gov/answers/internalization.htm
11 see “Order Routing” at http://www.tradingacademy.com/resources/financial-education-center/order-routing.aspx

3.3 Architecture of lifecycle compilation and presentation In order to render the lifecycle graph, the application was decomposed into four principal modules:
• Lifecycle API
• Market data
• Force graph
• Lifecycle data table
Figure 19 details the relationships between these four modules within the application architecture.

3.3.1 Lifecycle API Events are stored as a nested BigQuery dataset. This dataset contains the aforementioned details of an individual event. Lifecycle metadata is stored within a separate BigQuery dataset that contains only the event IDs comprising each lifecycle. In order to combine the two datasets, a third dataset was created that “walks” the lifecycle table. It essentially identifies each event ID on the path and joins it against the event data. The lifecycle API component is responsible for querying lifecycle data from the combined BigQuery dataset. It receives a unique event denoted as $EventID and returns all lifecycles that contain that event. It ultimately derives the following SQL to execute via the BigQuery API, as shown in Figure 20.


Figure 20: Lifecycle SQL.
SELECT
  EventId AS eventId,
  ReporterId AS reporterId,
  CustomerId AS customerId,
  eventType AS type,
  id, timestamp, symbol, size, side, avgPrice, parent, child
FROM [lifecycles]
WHERE GrandParent = $EventID
GROUP BY eventId, reporterId, customerId, type, id, timestamp, symbol, size, side, avgPrice, parent, child
ORDER BY timestamp

The dataset returned by this query is subsequently converted into a JavaScript array. The force graph, market data and lifecycle tabular data components all employ Polymer data binding to receive and process this array.
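The conversion step might look like the sketch below. It relies on the general shape of a BigQuery query response, in which column names and types are listed under `schema.fields` and each row carries its cells in an `f` array of `{ v: value }` objects, with scalar values encoded as strings. The function names are illustrative, and nested or repeated fields are not handled.

```javascript
// Convert one cell value using the column type reported in the schema.
function convertCell(value, type) {
  if (value === null) return null;
  switch (type) {
    case "INTEGER":
    case "FLOAT":
      return Number(value);
    case "BOOLEAN":
      return value === "true";
    default:
      return value; // STRING and other types are passed through unchanged
  }
}

// Turn a BigQuery query response into an array of plain objects keyed by
// column name, suitable for handing to the components via data binding.
function rowsToObjects(response) {
  const fields = response.schema.fields;
  return (response.rows || []).map(function (row) {
    const obj = {};
    row.f.forEach(function (cell, i) {
      obj[fields[i].name] = convertCell(cell.v, fields[i].type);
    });
    return obj;
  });
}
```

A response with no matching rows omits the `rows` property, which is why the sketch falls back to an empty array.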

3.3.2 Market data This component is responsible for rendering the last trade execution for a particular symbol, as well as the prevailing best bid and ask prices on the trade’s execution venue for a given time frame. In the example in Figure 21, data for ticker symbol GOOG is depicted for the period of 10:31:38.509 through 11:06:06.958 on March 24, 2016.


Figure 21: Market data.

This presentation assists regulatory end users with determining whether the transactions in a particular lifecycle reflect the prevailing price and size of trades contemporaneously executed in other markets. The data for this component is stored within a BigQuery dataset containing public market data that has been optimized for this pattern of queries. It relies on the lifecycle API component to identify the start and end times for the requested time window, based on the user’s input.

3.3.3 Force graph This UI component represents a directed graph comprising the events in a lifecycle, as illustrated in Figure 22.


Figure 22: Force graph.

To display the various event types and transitions, the force graph component was created using the D3 JavaScript library, with the nodes color-coded to denote the various event types. The synthetic event data for the above scenario represents a new order from participant GLII to buy 726 shares of symbol CSCO at 9:33:17 AM on March 24, 2016. Hovering over an event node reveals additional details about that event, such as ticker symbol, time, order size and market participant, which are displayed in a panel above the graph. The side of the trade (i.e. whether this was an order to sell or an order to buy) is also indicated, as shown in Figure 23.


Figure 23: Additional details about an event are revealed.

3.3.4 Lifecycle tabular data This is a two-dimensional representation of lifecycle data received from the lifecycle API component, an example of which is shown in Figure 24.
Figure 24: Lifecycle data.

Each row in this table represents a single event in the lifecycle. When a user clicks a row, it expands to display event details. Users can copy this data to a spreadsheet and filter or sort it.

3.4 Lifecycle implementation To represent the lifecycle as a force graph, we used the D3 library, which contains a force graph component that affords most of the features necessary for creating a lifecycle. The code snippet shown in Figure 25 creates a basic force graph.
Figure 25.
var force = d3.layout.force()
    .nodes(d3.values(this.nodes))
    .links(this.links)
    .size([width, height])
    .linkDistance(15)
    .charge(-0.40 * height)
    .on('tick', tick)
    .start();


The D3 force graph relies on SVG paths and circles 12 and expects five distinct data elements:
1. A collection of nodes
2. A collection of links between the nodes
3. Dimensions of the chart area
4. A charge property that determines the distance between nodes. This is usually specified as a negative number. A charge value of -100 will push the nodes farther apart than a value of -1.
5. A tick function that determines the path between nodes. One uses SVG notation to define whether this path will be an arc or a straight line.

It is necessary to pre-compute the list of nodes and links using lifecycle data, which is performed in a process out-of-band from the web application. The prevailing assumption is that the relevant universe of securities transactions already exists behind BigQuery’s façade. 13 There are two ways to create arrows on paths between nodes. An SVG path can be defined that includes the outline of an arrowhead and then filled with black color. 14 This approach affords more flexibility and control over the size and color of the arrowhead, depending on the pair of nodes being connected. But it also entails a computation overhead that can affect performance as the dataset increases. A more scalable approach is to add a custom cascading style sheets (CSS) class to each path and then use CSS styles to create a fixed-size arrowhead for all paths. This latter approach is the one that was used.
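The pre-computation of nodes and links might resemble the following sketch. It assumes each lifecycle row carries `eventId`, `type`, `size` and a `parent` field holding the ID of the preceding event (empty for the root); the text does not spell out the exact encoding of the parent and child columns, so this reading, like the function name, is an assumption. The output matches what the Figure 25 snippet consumes: an object of nodes keyed by ID and an array of links whose `source` and `target` are node objects.

```javascript
// Build the node map and link list for the force layout from lifecycle rows.
// Assumes row.parent is the event ID of the preceding event, or empty/null
// for the first event in the lifecycle.
function buildGraph(rows) {
  const nodes = {};
  const links = [];

  // First pass: one node per event, keyed by event ID.
  rows.forEach(function (row) {
    nodes[row.eventId] = { id: row.eventId, type: row.type, size: row.size };
  });

  // Second pass: a directed link from each parent to its child. Rows whose
  // parent is not part of this lifecycle are left unlinked.
  rows.forEach(function (row) {
    if (row.parent && nodes[row.parent]) {
      links.push({ source: nodes[row.parent], target: nodes[row.eventId] });
    }
  });

  return { nodes: nodes, links: links };
}
```

Two passes are used so that a child row may appear before its parent in the input without losing the link.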

3.5 Implementation challenges 3.5.1 Dynamic sizing of nodes One of the requirements that presented a challenge was how to grow or shrink the nodes based on the number of shares (i.e. order size) in a particular event. If an event was split into two child events with half the number of shares, we should reduce the radius of the children to half the size. As such, this requires a number of calculations. The first stage is an iteration through the lifecycle data received from the BigQuery API to determine the maximum and minimum event size. This establishes the multiplication factor for all nodes that lie somewhere between the maximum and minimum event size. The minimum circle radius was kept at nine pixels, since anything less than that would be too hard to distinguish. Also, the maximum radius had to be no more than twice the minimum. Otherwise, if an unbounded maximum size were used, an inordinately large circle would run the risk of obscuring adjacent circles. Next, the radius of each node is multiplied by the ratio of its event size to the minimum event size. A much bigger challenge was adjusting the length of the path between nodes. When nodes exhibit variable size, and each node is filled with opaque color, the node can obscure arrowheads on the path. The path is drawn from the center of the parent circle to the center of the child circle. The only way to display the arrowhead is to adjust the length of the line without changing the angle between the two centers. To do this, basic trigonometry was used to determine the cosine and sine of the angle. Using the cosine and sine values, 15 the offset for both height and width was computed, as depicted in Figure 26.

12 see “An A to Z of Extra Features for the D3 force layout” at http://www.coppelia.io/2014/07/an-a-to-z-of-extra-features-for-the-d3-force-layout/
13 Palmer et al., supra
14 see “W3C Specification. SVG Paths” at https://www.w3.org/TR/SVG/paths.html#PathData
15 see Wikipedia “Trigonometric Functions” at https://en.wikipedia.org/wiki/Trigonometric_functions


Figure 26.

Computations for sine, cosine and offset were as follows, where (x1, y1) and (x2, y2) are the centers of the parent and child circles, z is the distance between the two centers, and r2 is the radius of the destination (child) circle:
cos(θ) = (x2 – x1) ÷ z
sin(θ) = (y2 – y1) ÷ z
Δx = r2 × cos(θ)
Δy = r2 × sin(θ)
x3 = x2 – Δx
y3 = y2 – Δy
Finally, the offset was subtracted from the height and width (that is, from y2 and x2) to determine the destination coordinates (x3, y3). By doing this, the arrowhead appears at the edge of the destination circle, instead of the center. Performing these computations for every node and path is computationally expensive.
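The sizing rule and the offset computation can be sketched together as follows. The radius function is one reading of the description in Section 3.5.1 (scale by the ratio to the minimum event size, then cap at twice the nine-pixel minimum); the function names and the handling of coincident centers are assumptions.

```javascript
const MIN_RADIUS = 9;              // pixels, the minimum stated in the text
const MAX_RADIUS = 2 * MIN_RADIUS; // "no more than twice the minimum"

// Radius for a node: the minimum radius scaled by the ratio of the event size
// to the smallest event size in the lifecycle, clamped to the allowed range.
function nodeRadius(eventSize, minEventSize) {
  const ratio = eventSize / minEventSize;
  return Math.min(MAX_RADIUS, Math.max(MIN_RADIUS, MIN_RADIUS * ratio));
}

// Point (x3, y3) on the edge of the destination circle, on the line from the
// parent center (x1, y1) to the child center (x2, y2), where r2 is the
// child's radius. The direction of the line is preserved.
function arrowTarget(x1, y1, x2, y2, r2) {
  const z = Math.hypot(x2 - x1, y2 - y1); // distance between the centers
  if (z === 0) return { x: x2, y: y2 };   // coincident centers: no direction
  const cos = (x2 - x1) / z;
  const sin = (y2 - y1) / z;
  return { x: x2 - r2 * cos, y: y2 - r2 * sin };
}
```

As a check, for a parent at (0, 0), a child at (30, 40) and a child radius of 10, the centers are 50 apart, the cosine and sine are 0.6 and 0.8, and the arrow ends at (24, 32).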

3.5.2 Performance

With more than 55 events in a lifecycle, there is a palpable 10- to 15-second delay in rendering the graph. This performance issue is partly due to the instantiation of the individual nodes and links from the underlying lifecycle data.


Another significant performance factor is the use of SVG shapes in the D3 graph. Because these shapes preserve their scale when the browser window is resized, a large number of vectors must be stored for each shape. Figure 27 shows how the performance of a web UI degrades substantially as the number of objects rendered with SVG grows. 16

Figure 27: Render times, varying number of objects.

Performance improves substantially when switching to a two-dimensional canvas. 17 The idea of dynamically sizing nodes was also abandoned for performance reasons. For each node and path, the graph was computing cosines, sines and path offsets. Even the radius of each circle was being computed at runtime from the maximum and minimum event sizes. The result was a poorly performing UI.
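To illustrate the canvas alternative mentioned above, the sketch below draws every node onto one 2D drawing context rather than creating one SVG element per node. It uses only standard canvas context methods (`clearRect`, `beginPath`, `arc`, `fill`). The node shape `{x, y, color}`, the function name `drawNodes` and the fixed nine-pixel radius are assumptions made for illustration.

```javascript
// Constant radius, assuming dynamic sizing has been dropped as described.
const FIXED_RADIUS = 9;

// Draws all nodes onto a single 2D canvas context. `nodes` is a hypothetical
// array of {x, y, color} objects; `ctx` is a CanvasRenderingContext2D.
function drawNodes(ctx, nodes, width, height) {
  ctx.clearRect(0, 0, width, height); // wipe the previous frame
  for (const node of nodes) {
    ctx.beginPath();
    ctx.arc(node.x, node.y, FIXED_RADIUS, 0, 2 * Math.PI); // full circle
    ctx.fillStyle = node.color;
    ctx.fill();
  }
}
```

In a browser, the context would typically come from calling `getContext("2d")` on a canvas element. The trade-off is that canvas pixels are not individually addressable elements, so hit testing for clicks and hovers has to be handled by the application.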

3.6 Future enhancements

A force graph by definition does not have a time axis. The nodes are arranged in no particular order, and precedence is indicated only by the direction of each link. This means that one cannot tell which events occurred earlier simply by looking at the chart. Displaying the earliest event at the left-most edge of the graph and subsequent events to its right would facilitate an at-a-glance assessment of the general chronology of events.

Also, if the lifecycle contains only a small number of events, these should be distributed more widely. The graph could shrink or expand depending on the number of component nodes. This functionality would require a dynamic link distance and charge function.

Another improvement would be the ability to zoom in or out to specific events based on a specific time range. Having a time axis would help in this endeavor. The user could then define a time window and redraw the graph to include only the events in that window.

An alternative visualization for a single lifecycle could be a tree that is oriented left to right, as shown in Figure 28.
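The time-axis and time-window ideas in this section could be prototyped with a few small functions. The sketch below is illustrative only: the `timestamp` field, the function names and the linear mapping are all assumptions, and it does not use the D3 force layout.

```javascript
// Maps a timestamp onto a horizontal pixel position so that the start of the
// window sits at the left edge and the end of the window at the right edge.
function timeToX(timestamp, start, end, width) {
  if (end === start) return width / 2; // zero-length window: center the node
  return ((timestamp - start) / (end - start)) * width;
}

// Keeps only the events that fall inside the user-defined window.
// `timestamp` is a hypothetical numeric field (e.g., epoch milliseconds).
function eventsInWindow(events, start, end) {
  return events.filter((e) => e.timestamp >= start && e.timestamp <= end);
}

// Returns the visible events, each with an x position on the time axis.
function layoutByTime(events, start, end, width) {
  return eventsInWindow(events, start, end).map((e) => ({
    ...e,
    x: timeToX(e.timestamp, start, end, width),
  }));
}

const sample = [{ id: "a", timestamp: 0 }, { id: "b", timestamp: 50 }, { id: "c", timestamp: 200 }];
console.log(layoutByTime(sample, 0, 100, 1000));
```

Pinning only the horizontal position in this way would leave the vertical position free for the layout to resolve overlaps.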

16 see “Why is SVG so slow?” at http://frozeman.de/blog/2013/08/why-is-svg-so-slow/
17 see “Speeding Up D3.js: A Checklist” at https://www.safaribooksonline.com/blog/2014/02/20/speeding-d3-js-checklist/


Figure 28.

This is essentially a collapsible tree with the parent being the left-most node. 18 Whenever an order is split, the tree splits into multiple branches. The user can click a node to collapse (hide) its children and click it again to expand (show) them. Figure 29 shows the same lifecycle with some nodes collapsed.

Figure 29.

A user may double-click a node to zoom into it, as shown in Figure 30.

18 see “D3.js Drag and Drop, Zoomable, Panning, Collapsible Tree with auto-sizing” at http://bl.ocks.org/robschmuecker/7880033


Figure 30.

With this approach, the display of arrows on the lines themselves is superfluous since the left-to-right orientation visually establishes the temporal hierarchy.
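The collapsible tree described in this section needs two pieces of logic: nesting a flat list of lifecycle events under their parents, and hiding or showing a node's children on click. The sketch below assumes hypothetical `id` and `parentId` fields on each event. Moving children between `children` and `_children` is a convention commonly seen in D3 collapsible tree examples; it is used here for illustration and is not confirmed as the prototype's approach.

```javascript
// Builds a nested tree from a flat list of events. The field names `id` and
// `parentId` are hypothetical; the root event has parentId === null.
function buildTree(events) {
  const byId = new Map();
  for (const e of events) byId.set(e.id, { ...e, children: [] });

  let root = null;
  for (const node of byId.values()) {
    if (node.parentId === null) {
      root = node;
    } else {
      const parent = byId.get(node.parentId);
      if (parent) parent.children.push(node); // orphans are skipped
    }
  }
  return root;
}

// Collapses an expanded node or expands a collapsed one by moving its
// children between a visible slot and a hidden slot.
function toggle(node) {
  if (node.children) {
    node._children = node.children;
    node.children = null;
  } else {
    node.children = node._children;
    node._children = null;
  }
}

// Counts the nodes that would currently be drawn.
function visibleCount(node) {
  const kids = node.children || [];
  return 1 + kids.reduce((sum, child) => sum + visibleCount(child), 0);
}
```

A click handler would call `toggle` on the clicked node and then redraw the tree from the root.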

4 Prototype architecture

The technology stack we chose, as shown in Figure 31, reflects the accelerated timeline in which the prototype was developed. The freedom and flexibility of Polymer web components enabled us to build a complete client-side application, leveraging Google’s facilities for access and identity management and client-side routing with Page.js. 19 The lack of a distinct middleware tier to broker authentication and queries speaks to the flexibility and ease of use of the BigQuery API and Google Web Components (which are Polymer components built by Google to access APIs for their services). This essentially amounts to a serverless architecture. 20 From an operational maintenance standpoint, this is a particularly desirable modality. 21

As shown in the architecture diagram, each page in the presentation layer is essentially a Polymer component built from many child components. The scatter plot and lifecycle components house the respective D3 graphs. They take events data and lifecycle data as input and generate the graphs.

The API layer consists of three components that handle querying and return data for the scatter plot, the lifecycle graph and the market data table. They also share a Polymer Behavior component that contains common functions such as loading the BigQuery API, making the request, transforming the query and response, caching the response, and so on.

Finally, the database layer consists of BigQuery. There are two separate datasets to which the application needs access. The initial query is run against the lifecycle and events dataset to generate the scatter plot. When an event of interest is selected, a second query is run against the same dataset to get the lifecycle data for that event. Once the lifecycle is generated, the general time window of the lifecycle is used to query the market data dataset to get the prevailing market prices clustered around each trade execution event in the lifecycle.
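Two of the shared responsibilities described above, caching responses and deriving a lifecycle's time window for the follow-up market data query, can be sketched independently of any particular API. In the sketch below, `runQuery` is a hypothetical function supplied by the caller and is not a BigQuery API call. The `timestamp` field and all function names are likewise assumptions.

```javascript
// Wraps a query function with a cache keyed by the query text, so repeating
// a query does not trigger a second request. `runQuery` is hypothetical and
// is injected by the caller.
function withCache(runQuery) {
  const cache = new Map();
  return function cachedQuery(sql) {
    if (!cache.has(sql)) cache.set(sql, runQuery(sql));
    return cache.get(sql); // may be a Promise if runQuery is asynchronous
  };
}

// Derives the overall time window of a lifecycle from its events. The window
// can then bound the query against the market data dataset.
// `timestamp` is a hypothetical numeric field on each event.
function lifecycleWindow(events) {
  if (events.length === 0) return null;
  const times = events.map((e) => e.timestamp);
  return { start: Math.min(...times), end: Math.max(...times) };
}

// Example with a stand-in query function that just echoes its input.
const query = withCache((sql) => ({ sql, rows: [] }));
query("SELECT 1");
query("SELECT 1"); // served from the cache
console.log(lifecycleWindow([{ timestamp: 30 }, { timestamp: 10 }, { timestamp: 20 }]));
```

Keying the cache on the query text is the simplest choice. A production version would also need an eviction or expiry policy, which this sketch omits.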

19 see “page.js” at https://visionmedia.github.io/page.js/
20 see “What *is* Serverless Architecture?” at https://medium.com/@PaulDJohnston/what-is-serverless-architecture-43b9ea4babca
21 Palmer et al., supra


Figure 31: The chosen technology stack accelerated the prototype’s development.

5 Discussion

5.1 Why Polymer and D3

Choosing technologies for this web application took careful thought. We needed to be able to bootstrap quickly and iterate easily on our initial design. Because the scope of the application was relatively small, we wanted to avoid a heavyweight, opinionated framework. With these considerations in mind, we ultimately landed on a technology stack that fit our needs.

Polymer is a web component library that allowed us to separate the concerns of our application and develop different features in parallel without affecting other work. 22 We were free from the syntactic and framework constraints imposed by other solutions such as AngularJS. With Polymer and a custom Polymer starter kit, 23 we were able to quickly build out a simple, single-page application (SPA) to demonstrate our concept and its feasibility with both realistic volumes and formats of data.

Finding the best way to display our data was central to our application’s success. The visualization library needed to be fast and customizable to handle the massive amounts of data it would process. D3 fit our needs and came with excellent documentation for getting up to speed with its (sometimes abstruse) syntax. D3 supports a wide array of graph types, with customization options for each one. We were able to tweak the standard graphs provided with D3 to create meaningful visualizations of entire market transaction lifecycles.

22 see “Polymer Project” at https://www.polymer-project.org/1.0/
23 see “Polymer BootLicker” at https://github.com/filaraujo/polymer-bootlicker


5.2 Looking beyond the prototype

Some desirable features to consider for future iterations include:

• User settings and favorites: The prototype was built using the Google Cloud Platform’s native Identity and Access Management (IAM) functionality. 24 Since it is a serverless set-up, there was no middleware built to save user data. For the production application, however, the user profiling facilities should be enhanced. This will give us the ability to save user data such as application settings, favorite symbols, queries and UI themes.
• Sharing with other users: The ability to share graphs, queries and particular lifecycles or events of interest with other authorized users in the system.
• Exporting data and graphs: The ability to export data in CSV or JSON format, and offer downloadable graph snapshots in PDF or JPG format.
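As an illustration of the export feature listed above, the sketch below serializes an array of flat row objects to CSV or JSON entirely on the client, which would fit the serverless set-up. The function names and the row shape are hypothetical. The quoting follows the common CSV convention of wrapping fields that contain commas, double quotes or line breaks, and doubling embedded double quotes.

```javascript
// Quotes a single value when it contains a comma, a double quote or a line
// break, doubling any embedded double quotes.
function csvField(value) {
  const text = value === null || value === undefined ? "" : String(value);
  return /[",\r\n]/.test(text) ? '"' + text.replace(/"/g, '""') + '"' : text;
}

// Serializes an array of flat objects using the given column order.
// The first line of the output is the header row.
function toCsv(rows, columns) {
  const header = columns.map(csvField).join(",");
  const lines = rows.map((row) =>
    columns.map((column) => csvField(row[column])).join(",")
  );
  return [header, ...lines].join("\r\n");
}

// JSON export needs no custom logic beyond the built-in serializer.
function toJson(rows) {
  return JSON.stringify(rows, null, 2);
}

console.log(toCsv([{ symbol: "ABC", shares: 100 }], ["symbol", "shares"]));
```

In a browser, the resulting string could be offered as a download by wrapping it in a `Blob`. Graph snapshots in PDF or JPG format would need a separate rendering step that this sketch does not cover.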

6 Conclusion

Without a means of harvesting behavioral and regulatory insight from a rapidly growing universe of data, an organization’s time and effort spent establishing big data capabilities may be squandered. It is paramount not only to empower analysts with tools to conduct analytics independently, but also for developers to stay relentlessly focused on the business context in question. 25 Developers across all tiers of the application stack should be encouraged to pursue a deeper understanding of the underlying problems their constituents must solve, as this yields improved alignment of any particular solution to those problems.

From an engineering standpoint, one positive outcome of this exercise was the realization that modern front-end technologies do offer a wealth of capabilities to assist in navigating and interpreting massive amounts of data. Nonetheless, development teams must still do important work to refine the universe of possibilities and map them optimally to the domain problem at hand. For the use case of securities transaction analysis and market reconstruction, a nimble combination of just a handful of tools, together with a robust and straightforward back-end architecture and sound engineering methodologies, enabled us to build a highly navigable interface that complements the semantics of the underlying dataset without compromising the front end’s ability to scale to arbitrarily large event repositories.

7 Acknowledgements

We would like to thank the following people for providing their help, support, and technical and domain expertise in building this next-generation prototype: Adam Najman, Benjamin Tortorelli, Seong Kim, Sebastian Just, Denine Guy, Ji Zhou, Salvatore Sferrazza, Emily Hsiung, Gary Mui, Beau Alexander, Mike Harrington, Bob Jackson, Mark Davis, Marc Mondragon, Chris Tancsa, Petra Kass, Steve Silberstein, Regina Brab and Marianne Brown.

24 see “Identity and Access Management” at https://cloud.google.com/iam/
25 see “Extracting Insights from Vast Stores of Data” at https://hbr.org/2016/08/extracting-insights-from-vast-stores-of-data
