[This post is the second in a series of technically focused articles exploring the challenges of using conceptual thinking to create written data analysis.]
Time is of the essence. This is true if you are in a hurry, but it’s also true if you are doing data analysis. Individual data values are almost always contextualized by looking at time-related events such as how much they have moved up or down, whether they are part of a trend, how high or low they are compared to past values, etc. Fortunately for human beings, manipulating time is something that comes easily to us. For example, if you know how to put together a monthly written report on a given topic, it would probably be quite obvious how to put together a weekly version of the same report.
For computers, however, this is not so easy, because they don’t think conceptually. Sure, software systems like Power BI and Tableau can show you the correct numbers for any given time period, but they don’t have the sophistication to think about any given time period, and this prevents them from giving you a time-flexible written synopsis. Sophisticated, conceptual thinking is required because there are many (often subtle) difficulties when calculating and writing about time. In this post, I’m going to walk through some of those difficulties, and also show how a Conceptual Automata System (CAS) has the ability to navigate those difficulties.
Challenge #1 - What time periods are relevant?
Before trying to figure out how to manipulate time periods, a CAS needs to understand what time periods would even be relevant to a reader. For instance, an HR coordinator might be interested in a yearly high/low for a given metric, whereas a stock trader might be interested in the (slightly different) 52-week high/low. A manufacturing VP, on the other hand, might care most about quarterly numbers, and so that would be set up as the default time period for their report. The best time period to contextualize information will almost always vary by use case.
Furthermore, the CAS has to take into account whether certain time periods are relevant to certain analytical events, such as streaks or outlier changes. For example, it might be interesting to note that a stock’s increase for the day was the highest in six months. It would not make sense to note that same story for a quarterly time period, since six months covers only two quarters. That said, it would be equally invalid to require a 180-period gap between noting outlier quarterly events. Therefore, you need to have a sliding scale in which longer time periods require fewer intervals between outlier events. Many other stories would need similar intelligence to customize their thresholds based on the time period (and use case).
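To make the sliding scale concrete, here is a minimal Python sketch. The threshold table and the function names (MIN_PERIODS_FOR_OUTLIER, periods_since_higher, outlier_high_story) are my own illustrative assumptions, not part of any actual CAS; the only point is that the number of periods required before a ‘highest in...’ story fires shrinks as the time period gets longer.

```python
# Hypothetical minimum look-back (in periods) before a "highest in ..." story
# is worth telling. The numbers are illustrative assumptions only.
MIN_PERIODS_FOR_OUTLIER = {"day": 120, "week": 26, "month": 12, "quarter": 8, "year": 5}


def periods_since_higher(values):
    """Count how many consecutive earlier periods the latest value exceeds.

    Walks backwards from the second-to-last value and stops at the first
    earlier value that is greater than or equal to the latest one.
    """
    latest = values[-1]
    count = 0
    for earlier in reversed(values[:-1]):
        if earlier >= latest:
            break
        count += 1
    return count


def outlier_high_story(values, granularity):
    """Return a short phrase if the latest value is a notable high, else None."""
    span = periods_since_higher(values)
    if span < MIN_PERIODS_FOR_OUTLIER[granularity]:
        return None  # not enough history beaten to be interesting at this scale
    return f"highest in {span} {granularity}s"
```

With these made-up thresholds, a daily value has to beat 120 earlier days before it is mentioned, while a quarterly value only has to beat 8 earlier quarters. A real system would presumably also tune the table per use case.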
Challenge #2 - What to compare to?
One of the most complicated time-related issues is determining what comparison to make when calculating how much a metric has changed. In the simplest situation, you have a metric (let’s say it’s a monthly figure) and you would compare it to the previous month. But let’s say you have a very seasonal metric. In that case, you might want to compare it to the same month of the previous year.
This requires adding additional intelligence to each of your time-based ‘event’ stories. For instance, if you have a story about a metric increasing, that story cannot simply look to the previous item, but rather needs to seek guidance from its parent object to figure out what to compare it to. You also need to impart intelligence to stories such as trends or streaks to make sure they also are running year-over-year comparisons. Of course, all these changes also require alterations to how you would write about or visualize the stories.
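One way to picture a story ‘seeking guidance from its parent object’ is sketched below in Python. The Metric class, its seasonal flag, and the change_versus_baseline function are hypothetical names I am using for illustration; the idea is simply that the story asks the metric how far back to look instead of hard-coding ‘previous item’.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class Metric:
    """Hypothetical parent object that knows how it should be compared."""
    name: str
    values: List[float]          # one value per period, oldest first
    seasonal: bool = False
    periods_per_year: int = 12   # 12 for monthly data, 4 for quarterly, etc.

    def comparison_offset(self) -> int:
        # Seasonal metrics compare to the same period one year back.
        return self.periods_per_year if self.seasonal else 1

    def comparison_label(self) -> str:
        return "same period last year" if self.seasonal else "previous period"


def change_versus_baseline(metric: Metric) -> Optional[Tuple[float, str]]:
    """Return (change, label) for the latest value, or None if history is too short."""
    offset = metric.comparison_offset()
    if len(metric.values) <= offset:
        return None
    current = metric.values[-1]
    baseline = metric.values[-1 - offset]
    return current - baseline, metric.comparison_label()
```

The same offset and label could be reused by trend or streak stories, so that the calculation and the wording stay consistent with each other.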
Challenge #3 - Pure time periods versus current time periods
If, on September 15th, somebody tells you that a given metric is ‘down for the month’, what exactly does that mean? It could either mean that the metric is down through the first 15 days of the month, or it could mean that it has been down since August 15th. Both forms of measurement have applications, and often different use cases will favor one or the other. Stock traders, for instance, would likely care much more about the change over the previous 30 days, whereas a manufacturer might care more about how the current month has been trending. To be truly flexible, a CAS needs to be able to handle both these use cases, or even switch between both of them within the same report.
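The two readings of ‘down for the month’ correspond to two different date windows. Here is a small Python sketch of both, using only the standard library; the function names are my own, and clamping to the last day of a shorter month (for example, March 31st looking back to February 28th) is an assumption about how such edge cases might be handled.

```python
import calendar
from datetime import date


def month_to_date_window(as_of):
    """From the first day of the current month through as_of."""
    return as_of.replace(day=1), as_of


def trailing_month_window(as_of):
    """From the same day of the previous month through as_of.

    If the previous month is shorter, the start is clamped to its last day.
    """
    if as_of.month > 1:
        year, month = as_of.year, as_of.month - 1
    else:
        year, month = as_of.year - 1, 12
    last_day = calendar.monthrange(year, month)[1]
    return date(year, month, min(as_of.day, last_day)), as_of


# Hypothetical per-use-case defaults, following the examples in the text.
DEFAULT_WINDOW = {
    "stock_trader": trailing_month_window,
    "manufacturer": month_to_date_window,
}
```

For September 15th, the first function yields September 1st through September 15th and the second yields August 15th through September 15th. Keeping both as interchangeable functions is what would let a report switch between them.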
Challenge #4 - Aggregated vs. simple time periods
There are two different ways that smaller time periods can relate to larger time periods: as waypoints or as components. Compare a month’s worth of stock data versus a month’s worth of retail data. To answer the question ‘how did Apple stock change this month’, we would compare the first day of the month to the last day of the month. If we were asking the same question of a set of retail data, we would need to add all the days in the month and then compare the aggregate of all those data points to the aggregate of the previous month.
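In code, the difference between the two is small but important. The following Python sketch uses hypothetical function names and made-up numbers purely to show the two calculations side by side.

```python
def waypoint_change(daily_values):
    """Waypoint metric (e.g. a stock price): compare the last day to the first day."""
    return daily_values[-1] - daily_values[0]


def component_change(current_days, previous_days):
    """Component metric (e.g. retail sales): compare period totals."""
    return sum(current_days) - sum(previous_days)


# Made-up sample data for illustration.
stock_closes = [150.0, 152.0, 149.0, 155.0]
sales_this_month = [10, 20, 30]
sales_last_month = [15, 15, 15]

stock_move = waypoint_change(stock_closes)                            # 155.0 - 150.0
sales_move = component_change(sales_this_month, sales_last_month)    # 60 - 45
```

Summing the stock closes, or comparing only the first and last day of retail sales, would produce numbers that look plausible but mean nothing, which is why the metric itself has to declare which kind it is.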
Dealing with aggregated objects is tricky because it is usually impractical to simply aggregate and sequence every possible time period in a data set, so you might start off by just creating, for example, a monthly aggregate of the overall information. But what if a user (or the CAS) wants to look at a subset of the data, such as a region in a set of retail data? The CAS must be able to generate an aggregated time object for each region. The only way to do that is to impart ‘self-assembly’ intelligence to each aggregated object, so that it knows how to take each component (such as each day of data) and merge them together to create a composite object that covers the entire time period.
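Here is a rough Python sketch of what ‘self-assembly’ could look like. DailyRecord, AggregatedPeriod, and the assemble method are names I have invented for illustration; a real aggregated object would carry far more than a single total, but the shape of the idea is the same: the object builds itself on demand from whichever components match the requested dates and subset.

```python
from dataclasses import dataclass
from datetime import date
from typing import Iterable, Optional


@dataclass(frozen=True)
class DailyRecord:
    """One component: a single day of data for a single region (hypothetical)."""
    day: date
    region: str
    sales: float


@dataclass
class AggregatedPeriod:
    """A composite object covering a whole time period."""
    start: date
    end: date
    region: Optional[str]   # None means all regions combined
    total: float
    days_covered: int

    @classmethod
    def assemble(cls, records: Iterable[DailyRecord], start: date, end: date,
                 region: Optional[str] = None) -> "AggregatedPeriod":
        """Build the aggregate from the components that fall in the window."""
        selected = [
            r for r in records
            if start <= r.day <= end and (region is None or r.region == region)
        ]
        total = sum(r.sales for r in selected)
        days_covered = len({r.day for r in selected})
        return cls(start, end, region, total, days_covered)
```

Because the object is assembled on request, there is no need to pre-compute every region and every period up front; a monthly aggregate for one region is created only when someone asks for it.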
Putting it all together
As complicated as some of these issues are, the real difficulty stems from the fact that the solutions to these problems need to work together. For example, let’s say you tasked the CAS with building a report on retail numbers for April 15th. The CAS would need to: (1) understand that a month-to-date report is more appropriate than a past-30-days report, (2) be able to automatically build an object composed of data from the first 15 days of April, (3) sequence that object along with all other ‘first 15 days’ objects (and their sub-components), (4) understand that the relevant comparison for this April’s partial month numbers is to the partial month of April of the previous year, and (5) be able to write and visualize the results of its analysis.
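A drastically simplified Python sketch of the April 15th example might look like the following. It hard-codes the decisions a CAS would have to reason its way to (month-to-date window, aggregation by summing, year-over-year comparison), and every name in it is hypothetical; it is meant only to show how the pieces chain together.

```python
from datetime import date


def month_to_date_total(daily, as_of):
    """Sum the daily values from the first of the month through as_of."""
    start = as_of.replace(day=1)
    return sum(value for day, value in daily.items() if start <= day <= as_of)


def same_day_last_year(as_of):
    """Same calendar day one year earlier (February 29th falls back to the 28th)."""
    try:
        return as_of.replace(year=as_of.year - 1)
    except ValueError:
        return as_of.replace(year=as_of.year - 1, day=28)


def partial_month_report(daily, as_of):
    """Write one sentence comparing month-to-date sales with the same days last year."""
    current = month_to_date_total(daily, as_of)
    prior = month_to_date_total(daily, same_day_last_year(as_of))
    if prior == 0:
        return None  # nothing to compare against
    change = (current - prior) / prior * 100
    direction = "up" if change > 0 else "down" if change < 0 else "flat"
    return (f"Month-to-date sales through {as_of.isoformat()} are {direction} "
            f"{abs(change):.1f}% versus the same days last year.")
```

Here `daily` is assumed to be a dictionary mapping each date to that day’s sales figure.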
Not only must all of the time-based intelligence work well together, but you can’t limp over the finish line with a bunch of spaghetti code to make it work. This is because the ‘time’ dimension is just one of a whole series of independent dimensions that you need to get working, such as different subjects (or groups of subjects), metrics, user preferences, content length, formatting, visualizing, and others.
The good news is that once you get all of this working, you’ve unlocked the ability to quickly understand exactly what you need to know about any time period within your data. The CAS can even do things like give you a ‘last 4 days’ report when you come back from vacation. It can also use its fluidity in moving up and down time periods to easily add context to a report, for example by contextualizing an annual report with how things have gone in the last month. True flexibility over different time periods is a data analysis superpower that previously only humans had, but no longer.