top of page
  • Writer's pictureHarry Tan

Harmonizing Date Havoc in Snowflake: Streamlining the Conversion of Diverse Date Strings into Standardized Formats

In today’s data-driven age, information pours in from countless sources — each with its unique format, structure, and style. Whether we are developing an application that consumes data from different users worldwide or integrating multiple systems with disparate date standards, we face the ubiquitous challenge of date formatting inconsistency. This may seem like a mundane problem to some, but the implications of not addressing it can be profound.


The Universality of the Date Formatting Challenge

Imagine the complexity of managing a platform that relies on data from diverse sources, both human and machine. Users might input custom fields, manually entering dates based on their personal or regional preferences. Concurrently, the platform might be integrating data from various APIs or legacy systems, each with its unique date formatting standards. Such a setup is a melting pot of date formats. When you have a date logged as ‘12/03/2022’ from one user and another system delivering ‘2022.03.12’, distinguishing between these formats without a standardized approach becomes an intricate puzzle. This can lead to potential data inconsistencies, errors in analysis, and a host of operational challenges.

Similarly, data migration projects, which often involve transferring vast amounts of data between systems, can run into significant roadblocks when date formats are not aligned. The time and resources spent cleaning, reformatting, and validating these dates can be overwhelming.

The Value of Date Uniformity

A consistent and reliable date format is not just a luxury — it’s a necessity. Here’s why:

  • Data Integrity: Without uniform dates, we risk corrupting the data, leading to erroneous analyses and potentially misguided business decisions.

  • Efficiency: Automation and data processing require predictable structures. A uniform date format ensures that processes run smoothly without manual intervention.

  • User Experience: Inconsistent date formats can confuse end-users, eroding their trust and satisfaction with the system or application.

Given the prevalence and importance of this challenge, there’s a compelling need for a function that can take a date string — no matter how oddly formatted — and transform it into a consistent, true date. In this article, we will delve into one such implementation, exploring its design and the nuances it addresses to ensure that our date data remains reliable, consistent, and efficient.

In Snowflake, the native function TRY_TO_DATE is designed to convert strings to dates. This function is a specialized variant of the standard TO_DATE and DATE functions, performing the same core operation — converting an input expression into a date. However, TRY_TO_DATE comes with an added advantage: error handling. If the conversion fails, it returns a NULL value instead of raising an error, making it a safer option for handling diverse and unpredictable data inputs. It’s important to note, though, that TRY_TO_DATE supports only two formats.

In my script, I’ve built upon the foundation laid by TRY_TO_DATE. The primary objective of this enhancement is to accommodate a broader array of date formats. This approach is based on reasonable assumptions, which I will delve into after presenting the source code and its outcomes.

First, let’s examine the conversion results.

Below is a comprehensive list of date strings. These were processed both by Snowflake’s original TRY_TO_DATE function, results shown in the snowflake_try column, and by my augmented function, with its output in the AC_try column. The DATE_STR column showcases a diverse range of date formats. While these formats are easily discernible to humans, Snowflake’s native TRY_TO_DATE function struggles, often yielding incorrect results or failing entirely.

Let’s explore these results in more detail.

Now that we’ve explored the output results, let’s dive into the heart of the solution: the script itself. To encapsulate the conversion logic, I’ve crafted a custom function. This approach not only streamlines the process but also ensures reusability for future applications. Here’s a look at how this function is constructed and how it extends Snowflake’s native capabilities

with test_dates as
(SELECT value::varchar as date_str
FROM TABLE(FLATTEN(input => parse_json('
[
\'2011-01-02\' ,\'1/2/2022\' ,\'2011/01/02\' ,\'01-18-2022\' ,\'18-01-2022\' ,\'1/20/2022\' ,\'19/1/2022\' ,\'20110102\' ,\'19980102\' ,\'23110102\' ,\'18980102\' ,\'99/01/02\' ,\'1/20/99\' ,\'1/20/23\' ,\'19/1/23\' ,\'01-18-22\' ,\'18-01-22\' ,\'99-01-02\' ,\'99-20-02\'
]
'))) f)
select
date_str,
TRY_TO_DATE(date_str) as snowflake_try,
case
    when date_str regexp '(19|20)\\\d{2}\\\d{1,2}\\\d{1,2}' then to_date(date_str, 'yyyymmdd')
    when date_str regexp '\\\d{1,2}/\\\d{1,2}/\\\d{1,2}' and try_to_date(date_str, 'mm/dd/yy') is not null then try_to_date(date_str, 'mm/dd/yy')
    when date_str regexp '\\\d{2}-\\\d{1,2}-\\\d{1,2}' and try_to_date(date_str, 'yy-mm-dd') is not null then try_to_date(date_str, 'yy-mm-dd')
    when date_str regexp '\\\d{1,2}-\\\d{1,2}-\\\d{1,2}' then try_to_date(date_str, 'mm-dd-yy')
    when TRY_TO_DATE(date_str) is not null then TRY_TO_DATE(date_str)
    when date_str regexp '\\\d{4}/\\\d{1,2}/\\\d{1,2}' then try_to_date(date_str, 'yyyy/mm/dd')
    when date_str regexp '\\\d{1,2}-\\\d{1,2}-\\\d{4}' and try_to_date(date_str, 'mm-dd-yyyy') is not null then try_to_date(date_str, 'mm-dd-yyyy')
    when date_str regexp '\\\d{2}/\\\d{1,2}/\\\d{1,2}' and try_to_date(date_str, 'yy/mm/dd') is not null then try_to_date(date_str, 'yy/mm/dd')
    when TRY_TO_DATE(date_str) is null and date_str regexp '\\\d{1,2}/\\\d{1,2}/\\\d{4}' then try_to_date(date_str, 'dd/mm/yyyy')
    when TRY_TO_DATE(date_str) is null and date_str regexp '\\\d{1,2}-\\\d{1,2}-\\\d{4}' then try_to_date(date_str, 'dd-mm-yyyy')
    else NULL end as AC_try
from test_dates;

Here are the assumptions I made. When it is ambiguous, a date string can be converted to two valid dates and even confuse humans, here are the rules:

mm/dd/yyyy > dd/mm/yyyy
mm-dd-yyyy > dd-mm-yyyy
mm/dd/yy> yy/mm/dd > dd/mm/yy
yy-mm-dd > dd-mm-yythere is never yy-dd-mm

That is everything I would like to cover in this article. My future work will be implementing the similar extensions to TRY_TO_TIME and TRY_TO_TIMESTAMP. I will add the implementation to this article once I have them.

0 views0 comments

Comentarios


bottom of page