ankit   July 10, 2018

The Role of Data in AI-powered Technologies

Since its conceptual beginnings in the 1940s and the coining of the term in 1956, Artificial Intelligence (AI) has advanced in fits and starts, and it is now widely described as a driving force of the fourth Industrial Revolution [1]. Companies across all industry sectors are actively pursuing AI-powered solutions. If you are incorporating AI in some form into your workflow, you will have come across the notion that data is the fuel for Artificial Intelligence. Along with computational power and machine learning algorithms, data forms the third and most essential component of any AI-engineered solution. The purpose of this article is to explain why this is so.

Any AI solution is, at its heart, a computer program. A program is a process that transforms an input (such as a list of reviews of the latest smartphone) into an output (such as a classification of those reviews as positive or negative). One thing to keep in mind is that AI programs differ distinctly from traditional computer programs. While whimsical in nature, the diagram in [2] is a popular illustration of this distinction.

Traditional computer programs are sequences of instructions, written by a programmer, that transform the input into the required output. Artificial Intelligence programs, in contrast, are not given the steps required to transform an input into an output. This is mostly because the problems addressed by AI are typically:

  •     Problems where the solution is too complex or diverse to break down logically into a sequence of instructions, such as identifying what is in a picture, translating between languages, or answering questions on any topic (too many rules)
  •     Problems where the solution is not universally applicable and needs to adapt, such as a recommendation engine suggesting similar books or movies for each person (rules change dynamically)

AI programs [3] learn the sequence of instructions for transforming an input into the required output by identifying patterns in examples of input and output data. Therefore, previously solved examples of input and output, in large quantities, are vital to designing AI programs. Broadly speaking, a computer program whose instructions are not expressly stated but learned in this way can be regarded as a form of artificial intelligence.
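To make the contrast concrete, here is a minimal Python sketch; it is not taken from any particular product. The function names, the single hand-written rule, and the four example reviews are all made up for illustration. The first function hard-codes its instruction, while the second derives word weights from labeled examples.

```python
from collections import Counter

# Traditional program: the instruction is written by hand.
def classify_with_rules(review):
    return "positive" if "great" in review.lower() else "negative"

# Learning-based program: word weights are derived from labeled examples.
def train(examples):
    weights = Counter()
    for text, label in examples:
        step = 1 if label == "positive" else -1
        for word in text.lower().split():
            weights[word] += step
    return weights

def classify_with_model(weights, review):
    # Words never seen in training contribute a weight of 0.
    score = sum(weights[word] for word in review.lower().split())
    return "positive" if score > 0 else "negative"

# Hypothetical solved examples of input (review) and output (label).
examples = [
    ("great battery and great screen", "positive"),
    ("love the camera", "positive"),
    ("terrible battery life", "negative"),
    ("screen cracked and support was terrible", "negative"),
]
weights = train(examples)

print(classify_with_rules("love the camera"))
print(classify_with_model(weights, "love the camera"))
```

With these toy examples, the hand-written rule labels "love the camera" as negative because it only knows the word "great", while the learned weights label it positive. Real systems use far larger datasets and more capable models, but the division of labor is the same.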

It is this dependence on data that turns many business use cases for AI into something of a chicken-or-egg problem: an AI program is needed to solve a problem, but in order to design the AI program, we need solutions to that very problem. An additional point to consider in engineering AI-powered solutions is the type, size, and kind of data. The most common cause for an AI program to fail or generate the wrong result is insufficient data. Insufficiency refers not only to the number of input-output examples, but also to the variety of input-output examples. Consider, for example, an AI program that learns whether an incoming email is spam or not. If the data used to train the program contains only examples of spam, the resulting system may have difficulty classifying emails correctly because it has never seen examples of non-spam.
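The spam scenario can be reproduced with a deliberately tiny word-overlap classifier. This is a sketch under stated assumptions: the emails, the labels, and the scoring rule are invented for illustration, and real spam filters are considerably more sophisticated.

```python
from collections import Counter, defaultdict

def train(examples):
    """Count how often each word appears under each label."""
    counts = defaultdict(Counter)
    for text, label in examples:
        counts[label].update(text.lower().split())
    return counts

def predict(counts, text):
    """Pick the label whose training words overlap most with the text."""
    words = text.lower().split()
    return max(counts, key=lambda label: sum(counts[label][w] for w in words))

# Hypothetical training sets: one with spam only, one with both classes.
spam_only = [
    ("win a free prize now", "spam"),
    ("free money click now", "spam"),
]
balanced = spam_only + [
    ("meeting moved to monday", "not spam"),
    ("lunch on monday", "not spam"),
]

print(predict(train(spam_only), "meeting on monday"))
print(predict(train(balanced), "meeting on monday"))
```

Trained on spam alone, the classifier knows only one label, so it calls every email spam, including an ordinary meeting reminder. Once non-spam examples are added, the same code labels that reminder correctly.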

Most often, data that exactly matches the required specification is not readily available in sufficient quantities. The job of a data scientist or AI engineer then becomes acquiring data that is as close to the required form as possible by scouring open data resources, and then tuning the AI program. Sometimes an AI program is available as an open-source implementation, but it was trained on a totally different dataset. The data scientist's job then becomes adapting or tuning the AI engine to the required data specifications. A good analogy for such a case is the art of cooking. You have the recipe (AI program) and the ingredients to make the dish (similar data). All you need now is to experiment in order to find the exact measurements of each ingredient to make a tasty dish (optimal AI program).
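In the cooking analogy, the experimenting step can be as simple as trying a few settings and keeping the one that works best on your own data. The sketch below assumes a pre-trained model that outputs a score between 0 and 1; the scores, labels, and candidate thresholds are hypothetical values chosen for illustration.

```python
def accuracy(threshold, samples):
    """Fraction of samples where (score >= threshold) matches the label."""
    hits = sum((score >= threshold) == label for score, label in samples)
    return hits / len(samples)

# Hypothetical scores from a pre-trained model, paired with the true
# labels for the new target data (True = positive class).
samples = [
    (0.2, False), (0.35, False), (0.45, False),
    (0.6, True), (0.8, True), (0.9, True),
]

# Try each candidate setting and keep the one with the best accuracy.
candidates = [0.3, 0.5, 0.7]
best = max(candidates, key=lambda t: accuracy(t, samples))

print("best threshold:", best)
```

In practice the setting should be chosen on one set of examples and checked on a separate held-out set, so that the tuning itself does not simply memorize the data.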

There are also pitfalls, such as overfitting: training so closely on a dataset that the AI program cannot generalize to new, unseen input. This was reportedly one of the reasons several AI prediction algorithms failed to anticipate the surprising results of the FIFA World Cup 2018 [4]. Of course, one can argue that applying science and technology to what essentially amounts to fortune-telling (predicting games of chance) is not scientific and is therefore bound to fail.
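A toy illustration of this pitfall, assuming made-up data: the underlying pattern is y = 2x, but the training labels contain noise. One "model" memorizes the training points exactly, while the other fits a simple straight line. The data values and function names are hypothetical.

```python
def mse(predict, points):
    """Mean squared error of a prediction function over (x, y) points."""
    return sum((predict(x) - y) ** 2 for x, y in points) / len(points)

# Toy data: the underlying pattern is y = 2x, but the training labels
# carry alternating noise of +1 and -1. The test labels are noise-free.
train = [(0, 1), (1, 1), (2, 5), (3, 5), (4, 9), (5, 9)]
test = [(0.4, 0.8), (1.6, 3.2), (2.4, 4.8), (3.6, 7.2), (4.4, 8.8)]

def memorizer(x):
    """Return the label of the nearest training point, noise included."""
    return min(train, key=lambda p: abs(p[0] - x))[1]

# Least-squares straight line fitted to the same training points.
n = len(train)
mean_x = sum(x for x, _ in train) / n
mean_y = sum(y for _, y in train) / n
slope = (sum((x - mean_x) * (y - mean_y) for x, y in train)
         / sum((x - mean_x) ** 2 for x, _ in train))
intercept = mean_y - slope * mean_x

def line(x):
    return slope * x + intercept

print("memorizer train/test error:", mse(memorizer, train), mse(memorizer, test))
print("line      train/test error:", mse(line, train), mse(line, test))
```

The memorizer scores a perfect zero error on the data it was trained on, yet it does noticeably worse than the straight line on the unseen points, because it has learned the noise along with the pattern.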

On the other hand, a brilliant example of user data dramatically improving an AI program over time is predictive text on smartphones. The number of hilarious or annoying autocorrects gradually falls as the AI program learns and adapts to the words you type most often.
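The adaptive behavior described above can be sketched with simple word-pair counts. This is only an illustration of the idea of learning from what a user types; the class name and sentences are hypothetical, and actual smartphone keyboards rely on much richer models.

```python
from collections import Counter, defaultdict

class NextWordSuggester:
    """Suggests the word that has most often followed the previous word."""

    def __init__(self):
        self.following = defaultdict(Counter)

    def learn(self, sentence):
        # Count every adjacent pair of words in the typed sentence.
        words = sentence.lower().split()
        for prev, nxt in zip(words, words[1:]):
            self.following[prev][nxt] += 1

    def suggest(self, word):
        candidates = self.following.get(word.lower())
        if not candidates:
            return None  # nothing learned yet for this word
        return candidates.most_common(1)[0][0]

suggester = NextWordSuggester()
suggester.learn("see you soon")
first = suggester.suggest("you")

# The user keeps typing a different phrase, and the suggestion adapts.
suggester.learn("see you tomorrow")
suggester.learn("call you tomorrow")
later = suggester.suggest("you")

print(first, "->", later)
```

After one sentence the suggester proposes "soon" after "you"; once "tomorrow" has been typed more often, the suggestion changes to match the user's habits.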

Data in the form of user-generated content on social media, together with data logged by a multitude of sensor devices, is being collected in such huge quantities that, according to an IBM report [5], 90% of the data in the world today has been created in the last two years. It is an exciting time to leverage data and design AI programs that make the world a better place.

References & Notes:

[1] Klaus Schwab. 2017. The Fourth Industrial Revolution. Crown Publishing Group, New York, NY, USA.

[2] https://www.datasciencecentral.com/profiles/blogs/traditional-programming-versus-machine-learning-in-one-picture

[3] There are multiple approaches to Artificial Intelligence-based software solutions, but the focus of this article is on the currently predominant techniques of machine learning and deep learning, which require data to train on and find patterns in.

[4] https://news.cgtn.com/news/3d3d774e7855444e78457a6333566d54/share_p.html

[5] https://www.ibm.com/blogs/insights-on-business/consumer-products/2-5-quintillion-bytes-of-data-created-every-day-how-does-cpg-retail-manage-it/
