Dealing with Dates in Data Analytics

Table of Contents

    Those of us that do not have to deal with time zones, date formats, and leap years, and the like might wonder what the big deal is about working with dates. I mean, a date and time is a date and time, right? December 31, 2022, at midnight is a Friday. It’s New Year’s Eve. How hard is that?

    Well, quite hard, in fact. See, we live on a spherical object that goes around the sun, and when it’s midnight for some, it isn’t necessarily midnight for others. As someone once said, time is relative. But in our situation, it is not the speed of light that makes things so challenging, it is the differing standards that countries use that make things nearly impossible to deal with. Making things worse is the fact that not all years are alike, and a day does not consist of exactly 24 hours, either.

    Dealing with dates and times gets even more complex when your data is stored in a database in one part of the world, and you are using it in another time zone. The date/time stamp is critical to knowing when an event takes place, but if your data source doesn’t provide any information about the relative offset from your time zone, your metrics can be way off.

    Thus begins our journey into the complexity of dealing with dates and times in data analytics and our discussion of how ClicData deals with that complexity.

    Starting Simple

    One of the simplest ways we can all be in agreement is to represent dates in the same format. Why do some insist on using DD/MM/YYYY and others insist on MM/DD/YYYY when it would be so much easier for everyone to use YYYY/MM/DD? We don’t write out the time in different arrangements, do we? We would never write 12:45:59 as 59:12:45 or 45:59:12. It doesn’t make sense. Yet we do that regularly with dates.

    Yet everyone knows how to write a date starting with the year, followed by the month, and then, finally, the day. It makes comparing and sorting work so much easier. Take three dates: 2021/05/12, 2021/03/20, and 2020/12/20, for example. If you pull the slashes out, you immediately have numbers that represent the correct sequence of dates and can be sorted. The same is not true for DD/MM/YYYY or MM/DD/YYYY formats. Not only are they out of sequence, but for any date from 1 to 12, it’s impossible to determine which number is the month and which one is the day just by looking at them!

    Wait a Millisecond!

    Another interesting notation problem is the fact that there are 24 hours in one day, 60 minutes in one hour, 60 seconds in a minute, and 1000 milliseconds in a second. Makes a lot of sense, right?

    For most of us, this hardly makes a difference, but when we are recording events such as computer transactions, when things are measured in milliseconds, if not micro- or nanoseconds, then it becomes quite important. In the world of financial transactions and blockchain, the sequence by which one transaction relates to the other is critical.

    So let’s see what a true and proper date/time looks like:

    2021-12-12 12:45:59.1000

    picture1

    All that is missing is information to nail down where the event took place. Theoretically, we know the time up to the millisecond, but did it happen in Sydney or Toronto time?

    If two transactions are stamped with the same time, and they’re in two different time zones, it is important to know which was first. A mistake or error, either way, could mean the difference between purchasing stock or currency at a different rate and could result in thousands and even millions in gains or losses.

    Time Zone vs. Time Offset

    And this is where things get complicated. To better understand the inherent problem with calculating durations from dates, we need to recognize that a time zone is not the same as a time offset.  

    A time zone refers to the time that is observed by a place or geographical location. Because of this, a city or region is assigned or follows a time zone.

    So Eastern Time and Central European Time (CET) are time zones that certain countries and regions of the world follow. Sometimes regions choose to follow other time zones at certain times of the year. For example, New York observes Eastern Standard Time (EST) from November to April and Eastern Daylight Time (EDT) the other months of the year. Paris observes CET from a date in October until March and CEST the rest of the year.

    A time offset is how many hours or minutes from the Coordinated Universal Time (UTC), which is never adjusted for Daylight Savings Time.

    When it is observing EST, for example, New York is five hours behind UTC, and when it is observing EDT, it is six hours behind UTC. Depending on other time zones and their time offset, at certain times of the year, New York can sometimes seem like it’s ahead and sometimes behind other locations. The beauty of using UTC is that it provides one standard that does not change; at any location around the world, the time can be calculated relative to UTC, avoiding confusion with competing time zones.

    Using UTC might make booking meetings more laborious, but in data analysis, it is more complicated if a time offset is not stored with the date.

    Database Transactions

    A transaction is anything that you write to a file or a database that took place at a specific time.

    Back in the days when storage was expensive, developers would use just six digits to represent a date—something like “01/01/89.” Then, with the big Y2K event, dates were converted to eight digits, so the previous date would become “01/01/1989.” But that still wasn’t enough because 1 a.m. on 01/01/1989 in one place might not be 01/01/1989 somewhere else. It could easily be 31/12/1989.

    Too bad we did not learn from our past mistakes with Y2K; many developers and database architects continue to attempt to minimize storage and performance by omitting a time zone offset column. It is especially prevalent in small companies and countries that don’t span multiple time zones.

    But a business often grows or is acquired eventually. Or maybe their servers are moved to a cloud infrastructure provider that isn’t in the same time zone or that has multiple regions for backup and redundancy. Then things take a bad turn.

    Much like Y2K, developers will scramble to check all servers, applications, and code for calculations and comparisons between dates to ensure that time zones are taken into account and that a date represents the date it was meant to represent. But this entails a huge amount of work and cost for any company. Better to consider that it might be worth it to include a time zone with dates whenever it makes sense.

    Dates in ClicData

    It is important to understand that when you import or refresh data from an external source, ClicData will do its best to return the most exact date and time and therefore store it as a date/time offset. However, not all APIs and source databases provide that data type, so in those cases, the date and time will be stored simply as a date and time. It is up to you to understand from your provider what time zone a date and time refer to. Here are some hints and ideas about how to deal with dates and times with ClicData.

    Getting the Current Date and Time

    In views and transformations, there are many functions you can use to return the current date and time. Regardless of where your data is stored in the world, all ClicData database servers are in UTC time, so using these functions is always relative to UTC.

    Assuming that you executed the functions below at 13:14:53 UTC on February 5, 2021:

       SYSDATETIMEOFFSET()  2021-02-05 13:14:53.0013475 +00:00

       SYSDATETIME()        2021-02-05 13:14:53.0013475

       SYSUTCDATETIME()     2021-02-05 13:14:53.0013475

       GETDATE()            2021-02-05 13:14:53.01

       GETUTCDATE()         2021-02-05 13:14:53.01

    Notice that the last four functions do not give any indication of what time zone the process is taking place. As a result, they are imprecise and depend on the user knowing the time offset, in this case, UTC.

    Calculating Duration Between Two Dates

    As an example, say you have a date in a table that does not have a time offset. The date is January 1, 2021, and the time is 09:00.

    We would like to know the difference between that date and the current date and time, that is, for ease of calculation, February 1, 2021, at 09:00.

    Here are three possible scenarios:

    1. If the two dates take place in the same time zone (e.g., EST), there is a 744-hour difference between them.
    2. If the two dates are in the same time offset (-5), there is a 744-hour difference between them.
    3. But if the first date is in New York (EST, which is five hours less than UTC) and the second one is in UTC, we get that there are only 739 hours between them. To get the right number of hours between the two dates, we need to do one of two things:
    1. Either get the current date and time from the SQL server that has the same time offset as the dates in the data, or
    2. Convert the data’s date and time to a date/time offset and use the SYSDATETIMEOFFSET function.

    Let’s try it.

    Our Azure servers are all on UTC date/time. It is now 2021-FEB-05 18:30:00 in my time zone, which is also UTC.

    SELECT

    DATEDIFF(hour, ‘2021-JAN-05 18:30:00’, SYSDATETIME()),

    DATEDIFF(hour, ‘2021-JAN-05 18:30:00’, SYSDATETIMEOFFSET()),

    DATEDIFF(hour, ‘2021-JAN-05 18:30:00’, SYSUTCDATETIME())

    All of the above in our servers return 744 hours. This is correct because it assumes the imprecise dates passed are all in the same time zone of the server.

    Let’s say that I am in New York (EST or UTC -5).

    Right now, in New York, it is not 2021-FEB-05 18:30:00. It is 2021-FEB-05 13:30:00.

    SELECT

    DATEDIFF(hour, ‘2021-JAN-05 18:30:00’, SYSDATETIME()),

    DATEDIFF(hour, ‘2021-JAN-05 18:30:00’, SYSDATETIMEOFFSET()),

    DATEDIFF(hour, ‘2021-JAN-05 18:30:00’, SYSUTCDATETIME())

    I would expect the above query to return fewer than 744 hours since it’s not yet 18:30 in New York, and the month is not yet over. But if I run the above query, I still get 744 hours.

    • Option 1: Convert ‘2021-JAN-05 18:30:00’ to a UTC date time

    SELECT

    SWITCHOFFSET(CAST(‘2021-JAN-05 18:30:00’ AS datetimeoffset), ‘-05:00’),

    DATEDIFF(hour, SWITCHOFFSET(CAST(‘2021-JAN-05 18:30:00’ AS datetimeoffset), ‘-05:00’), SYSDATETIME()),

    DATEDIFF(hour, SWITCHOFFSET(CAST(‘2021-JAN-05 18:30:00’ AS datetimeoffset), ‘-05:00’), SYSDATETIMEOFFSET()),

    DATEDIFF(hour, SWITCHOFFSET(CAST(‘2021-JAN-05 18:30:00’ AS datetimeoffset), ‘-05:00’), SYSUTCDATETIME())

    This returns: 2021-01-05 13:30:00.0000000 -05:00

    and 744, 744, 744

    We convert the date/time column into a date/time offset and then use a SWITCHOFFSET to adjust to my time zone.

    • Option 2: Compare with the current date/time in my time zone/offset

    In my opinion, this is the more practical approach.

    SWITCHOFFSET(SYSDATETIMEOFFSET(),’-05:00’),SYSDATETIMEOFFSET(),

    DATEDIFF(hour, ‘2021-JAN-05 18:30:00’, SWITCHOFFSET(SYSDATETIME(),’-05:00’)),

    DATEDIFF(hour, ‘2021-JAN-05 18:30:00’, SWITCHOFFSET(SYSDATETIMEOFFSET(),’-05:00’)),

    DATEDIFF(hour, ‘2021-JAN-05 18:30:00’, SWITCHOFFSET(SYSUTCDATETIME(),’-05:00’))

    The above returns:

    2021-02-05 13:07:47.0324627 -05:00

    2021-02-05 18:07:47.0324627 +00:00

    and “744” for all of them.

    Notice that the first value converts the servers’ UTC date/time (shown in the next column) to New York’s time offset, once again equating the time offsets and making the duration calculation valid.

    That’s great. But how about Daylight Savings Time?

    Let’s say a user has a table that registers website visits. It is 01:50 a.m. on November 7, 2021, in New York. When a visitor enters the website, a log is generated, with no time zone indicated, as

    2021-NOV-07 01:50:00 – User123.

    If the same user comes back 20 minutes later, Daylight Saving Time has ended, so the clocks turn back an hour. The new entry is registered as 2021-NOV-07 01:10:00 – User123. If we now try to calculate the duration between the two events, we get minus 40 minutes! The user apparently traveled back in time!

    This is why it is important to use date time offset on data. If we had done so, the data would be registered as

    2021-NOV-07 01:50:00 -05:00 – User123

    2021-NOV-07 01:10:00 -04:00 – User123

    If we do the math now, the offset calculates properly because 1 a.m. is not the same “1 a.m.”

    So what to do if you get a date like this:

     2021-NOV-07 01:10:00

    You need to convert it to a date/time with offset. Here is the above example in ClicData SQL functions:

    CAST(‘2021-NOV-07 01:00:00’ AS DATETIME) AT TIME ZONE ‘Eastern Standard Time’,

    CAST(‘2021-NOV-07 03:00:00’ AS DATETIME) AT TIME ZONE ‘Eastern Standard Time’

    This returns:

    2021-11-07 01:00:00.000 -04:00

    2021-11-07 03:00:00.000 -05:00

    In other words, it found out for you if that specific day and time is in Daylight Savings Time or not and added the offset to ensure that it returned the accurate date.

    In Summary

    Even with the advancement of computer languages, faster processing, and the ability to store massive amounts of data with little to no loss of performance, the complexity of dealing with dates is still an ongoing problem. Here is a recent issue that I came across at the time of writing this blog:

    Even after this lengthy discussion, I barely touched on many of the issues that you might come across when dealing with dates. Hopefully, it has shed some light on the subject for you.

    Dealing with dates is a nightmare for any developer, even those that have been dealing with them for a long time. It can even be the subject of humor as this video clearly states the issues and complexities that we created for ourselves.

    Hopefully, as time progresses, we will be able to simplify the issue or at least come up with better alternatives to code to prevent calculations errors and leap year mistakes.