Automated Generation of Metadata: Solutions for SAP, Oracle, and Microsoft Dynamics Inventories

Imagine a scenario where the descriptions and categories of your inventory products are generated automatically. This prospect is now within reach thanks to generative artificial intelligence.

As with any innovation project, the aim is to increase productivity, reduce costs, and manage resources better, with the ultimate goal of gaining a competitive advantage. This is precisely what integrating AI into the ERP enables.

The use case presented in this article, namely metadata generation, is unquestionably an opportunity for businesses in the retail sector that utilize ERPs like SAP, Oracle, and Microsoft Dynamics.

Benefits of Integrating Generative AI with SAP / Oracle / Microsoft Dynamics

Improved Search and Recommendations
Thanks to precisely generated metadata, search and recommendation functionalities within the ERP can be greatly enhanced. For instance, AI-driven search could yield more pertinent results for users seeking specific products or components.

Data Enrichment
Beyond basic metadata, generative AI can also contribute to enriching product data. For example, it can suggest potentially complementary products or add-ons based on metadata from other similar products.

Scalability
For companies with vast and constantly evolving inventories, manually updating or creating metadata for each product can be a challenging task. Generative AI can be scaled up to handle thousands of products, ensuring consistent metadata generation and updates.

Operation of Generative AI and Integration into SAP / Oracle / Microsoft Dynamics

1. Metadata Generation

Automated Generation of Descriptions
For new products, AI automatically generates descriptions based on similar products in the inventory or on brief information provided by the user.
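
As an illustration of how descriptions of similar products can guide the generation, here is a minimal sketch that assembles a few-shot prompt from existing catalog entries. The function name, field names, and sample products are assumptions made for this example; the resulting prompt would then be sent to whichever generative AI service you use, which is not shown here.

```python
import json


def build_metadata_prompt(examples, new_description, fields):
    """Assemble a few-shot prompt from existing catalog entries.

    `examples` is a list of dicts with "description" and "metadata" keys
    (a hypothetical structure chosen for this sketch).
    """
    lines = [
        "Generate product metadata as a JSON object with exactly these fields: "
        + ", ".join(fields) + ".",
        "",
    ]
    # Each existing product becomes one worked example for the model to imitate.
    for example in examples:
        lines.append("Description: " + example["description"])
        lines.append("Metadata: " + json.dumps(example["metadata"], sort_keys=True))
        lines.append("")
    # The new product is appended last, with the metadata left for the model to fill in.
    lines.append("Description: " + new_description)
    lines.append("Metadata:")
    return "\n".join(lines)


# Illustrative catalog entry, not real inventory data.
examples = [
    {
        "description": "Organic vine tomatoes, 500 g pack",
        "metadata": {"category": "Produce", "tags": ["tomato", "organic"], "unit": "pack"},
    },
]
prompt = build_metadata_prompt(
    examples,
    "Cream of mushroom soup, 284 ml can",
    ["category", "tags", "unit"],
)
print(prompt)
```

Keeping the examples in the same structure as the expected output is what lets the model replicate the format of your existing metadata.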


Categorization and Labeling
Generative AI can suggest or generate categories or labels for products based on their descriptions, images, or other attributes.

Localization
If you operate in multiple regions, AI can be trained to generate product metadata in several languages, facilitating the localization of inventory items.

2. Quality Control and Refinement

Feedback Loop
To continually enhance accuracy, a feedback mechanism is implemented where incorrect or inadequate metadata generated by AI is corrected by humans. These corrected data points serve as additional training data, refining the AI’s results over time.

Validation Process
Before newly generated metadata is accepted, a validation step is conducted to ensure accuracy and relevance.
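
A validation step of this kind can start with simple automated checks before a human reviews the record. The sketch below is only one possible approach: the required fields and the list of allowed categories are hypothetical and would come from your own ERP configuration.

```python
# Hypothetical rules; in practice these would mirror your ERP's item master.
ALLOWED_CATEGORIES = {"Produce", "Canned Goods", "Hardware"}
REQUIRED_FIELDS = ("description", "category", "tags")


def validate_metadata(record):
    """Return a list of problems; an empty list means the record passes the checks."""
    problems = []
    # Every required field must be present and non-empty.
    for field in REQUIRED_FIELDS:
        if not record.get(field):
            problems.append(f"missing or empty field: {field}")
    # A category, when given, must be one the catalog already knows.
    category = record.get("category")
    if category and category not in ALLOWED_CATEGORIES:
        problems.append(f"unknown category: {category}")
    return problems


good = {
    "description": "Cream of mushroom soup, 284 ml can",
    "category": "Canned Goods",
    "tags": ["soup"],
}
bad = {"description": "Heirloom tomato", "category": "Vegetables", "tags": []}

print(validate_metadata(good))  # []
print(validate_metadata(bad))
```

Records that fail can be routed to a person for correction, and the corrected version can feed the feedback loop described above.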

3. Integration with the ERP

The generative AI system can be integrated with SAP, Oracle, or Microsoft Dynamics. This can be achieved through API integrations or custom modules, ensuring that the generated metadata seamlessly integrates into the inventory management system.
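
To make the API route more concrete, here is a hedged sketch that prepares an HTTP request carrying generated metadata. The URL, the path layout, the payload shape, and the token are all placeholders invented for this example; they are not the actual interface of SAP, Oracle, or Microsoft Dynamics, each of which has its own documented APIs. The request is built but deliberately not sent.

```python
import json
import urllib.request


def build_erp_request(base_url, item_id, metadata, token):
    """Prepare (but do not send) a request that would push metadata to an ERP.

    The endpoint path and JSON shape are hypothetical.
    """
    body = json.dumps({"itemId": item_id, "metadata": metadata}).encode("utf-8")
    return urllib.request.Request(
        url=f"{base_url}/items/{item_id}/metadata",
        data=body,
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Bearer {token}",
        },
        method="PUT",
    )


request = build_erp_request(
    "https://erp.example.com/api",  # placeholder host
    "SKU-1001",
    {"category": "Canned Goods", "tags": ["soup"]},
    "TOKEN",  # placeholder credential
)
print(request.get_method(), request.full_url)
```

In a real integration, the request would only be sent after the metadata has passed validation, and the response would be checked for errors.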


Series on generative artificial intelligence

This article is part of a series we have produced to help businesses better understand generative AI and its possibilities.


Metadata Generation with Artificial Intelligence [Video featuring Olivier Blais] (French)

In this video, I delve into how generative AI provides businesses working with solutions like SAP, Oracle, and Microsoft Dynamics the opportunity to streamline the automatic generation of metadata for their inventory products. Applying this technology to this specific context undoubtedly stands as one of the most promising prospects in the realm of retail. I wish you an enjoyable viewing experience.

[This article is a verbatim translation of the video segment by Olivier Blais, generated by generative artificial intelligence tools and corrected by a human.]

Introduction to Generative AI for Metadata Generation

Hello everyone. This week, we’re going to discuss a use case of generative artificial intelligence that particularly interests me. Why? Because it’s a case that will truly save time and help us address a rather tedious issue, which is creating metadata.

I’m not sure if you’re aware, but for a store to display thousands of products on a website, for example, or to ensure that all products, all items are properly cataloged, it takes a lot of manual effort, a lot of time, and requires large teams to enter information into systems. Often, this information is duplicated, and sometimes it involves taking information from one system and converting it to be used in another system. Sometimes, it’s about analyzing a product description to be able to categorize it. And this is a requirement, something we always need, whether it’s in ERP-style systems or all sorts of other systems, and it’s a lot of work that’s needed.

Here, from the beginning, for years, we’ve been saying that we’ll try to do things as best as we can. So, we end up creating processes that are a little more efficient. We save minutes here and there by simplifying the metadata generation process.

Opportunity in Retail

However, with generative AI, we bring solutions that will change how we generate our information. Here, all we need is to have a product description, for example, and to know what the different fields are, what the requirements are, a few constraints to be able to generate metadata very precisely.

We’ll not only be able to save time, but also increase speed. We might see a reduction of fifty to seventy-five percent in the time it takes to generate metadata, but we’ll also gain in accuracy. We’ll be able to generate information of much higher quality.

How We Generate Metadata

How does this work? It’s based on the metadata we already have in our systems. So here, we don’t even need to leave our systems. We take information we already have and draw inspiration from it. We take these examples and provide them to the generative AI.

So, we take a product description, we take examples of metadata, and essentially say to replicate this structure. And there you have it.

Practical Application of AI for Metadata Generation

In fact, here’s an example. Let’s take a hardware store or a grocery store, for instance. Consider a grocery store; it has a lot of inventory items.

You go to the grocery store, there are tens of thousands of items. And every month, every week, there are even new items. So, what that means is that every week, you potentially have a team of a hundred people whose task is to enter information about a new type of tomato or new cans of soup in order to properly catalog and sell them. This is tedious, time-consuming, and doesn’t add much value.

What we’re talking about here is essentially taking information provided by suppliers, entering it into a generative AI solution, and getting back the necessary metadata to enter into an ERP solution like SAP. Once the metadata is in the SAP solution, the job is done, and you can start selling the item and put it on the shelves.

Feel free to share if you have other ideas for use cases; I’d be happy to discuss them.

Conclusion and Perspectives

In conclusion, the integration of generative artificial intelligence into business practices paves the way for significant and transformative advancements.

The approach to metadata generation relies on information already present in the company’s systems, thus avoiding unnecessary efforts and redundancies. By simply providing a product description and examples of metadata, companies can swiftly and accurately obtain the data needed to fuel their ERP systems. This innovation simplifies and streamlines inventory management, allowing teams to focus on higher-value tasks and accelerating the introduction of new products to the market.

The use case in the retail sector is just the beginning of a broader exploration of the possibilities offered by generative artificial intelligence.

The Essential Use Cases of Generative Artificial Intelligence

Generative artificial intelligence: why is everyone talking about it?

With the growing popularity of generative artificial intelligence technologies, such as PaLM 2 and ChatGPT, more and more companies are seeking ways to integrate AI into their daily operations. According to a McKinsey report, generative AI will have a significant impact on the economy by increasing the economic value of AI by 15 to 40%, representing an estimated annual value of $2.6 to $4.4 trillion. That’s huge!

Generative AI has the potential to revolutionize several sectors, and currently, generative AI solutions are already being deployed to simplify tasks and optimize processes. Google and Microsoft now offer tools specifically designed to facilitate the integration of this technology in enterprises. By the way, if you haven’t already read it, we recommend using generative AI tools in workplaces by opting for a solution tailored for businesses.

In short, generative AI offers the potential to automate, improve, and accelerate various tasks. In this article, our goal is to explore how this technology can enhance work and demonstrate how companies can benefit from it.

In all the examples below, we advise, as with all AI systems we develop, involving humans in the process. For us, generative AI improves the efficiency of your employees, but it is essential to keep humans in the loop.

Without further ado, here are a few examples.

Use of Generative AI

1. Content Generation

Generative AI opens up impressive new perspectives in the realm of dynamic content creation. Automatic text generation is its best-known use case, and it can be applied in various contexts. Here are the most relevant ones:

  • Generating Metadata for Products.
    Generative AI is revolutionizing how businesses manage product metadata in their inventory by automating tagging, description creation, and categorization. Through advanced natural language processing and image analysis, generative AI extracts essential attributes of products, generating accurate and relevant metadata. These operations optimize stock update processes and also enhance search functionality and user experience on e-commerce platforms. The scalability and efficiency of generative AI make it an invaluable tool for companies seeking to optimize their product information management.
  • External Conversational Agent.
    Conversational agents or chatbots powered by generative AI can interact with users in a natural and seamless manner while adhering to internal governance policies and your brand image. They can generate relevant and coherent responses based on posed questions, thereby improving user experience and customer service efficiency.
  • Document Generation.
    Generative AI can produce comprehensive documents such as reports, blog articles, summaries, etc., based on provided input information. This can be particularly useful for generating extensive content, for example, legal reports and case analyses for lawyers.

2. Text Summarization

Generative AI’s ability to distill the essence of a text and summarize it concisely finds diverse applications:

  • Customer Flow Analysis.
    By analyzing customer comments, reviews, and reactions, generative AI can generate summaries that provide valuable insights into customer trends and preferences, thus helping businesses make informed decisions.
  • Research Assistance for Experts.
    In technical or specialized fields, generative AI can assist experts by generating summaries of complex research or condensing technical documents into understandable key points. For instance, in the banking sector, generative AI can play a crucial role in supporting experts in understanding and interpreting complex research related to finance, economics, and markets. A concrete example would be analyzing and synthesizing detailed financial reports and academic research papers.
  • Item Segmentation into Categories.
    Generative AI can aid in segmenting large amounts of text into relevant categories, which is useful for organization and subsequent data analysis. In marketing, companies often collect vast amounts of data from various sources, including social media, surveys, and market analyses. Generative AI can be used to segment this data into relevant categories. For instance, a fashion company can use AI to classify customer comments based on style trends, color preferences, or reactions to different collections. To facilitate inventory management, businesses can segment items, stores, or customers using structured data. Generative AI rapidly identifies dominant customer opinions and behaviors, enabling better decision-making while maintaining effective stock management.



3. Generating Multiple Content Types

Generative AI is transforming the creation of computer code and software solutions:

  • Code Generation (Text-to-Code Conversion).
    By understanding natural language instructions, generative AI can convert functional specifications into source code, thereby accelerating the development process.
  • Image Personalization.
    Generative AI can create customized images based on textual descriptions, offering new possibilities for visual customization. In product design, design teams can quickly and automatically explore different visual variations of a product based on textual specifications. This can expedite the prototyping process and allow for the exploration of visual concepts before materializing them.
  • Recommendation Engine.
    Generative AI excels in creating tailor-made code recommendations and software architectures. This results in enhanced development team efficiency, enabling rapid detection of code anomalies and instant receipt of improvement suggestions.

4. Semantic Search

Generative AI is an asset for searching and analyzing complex data:

  • Internal Conversational Agent.
    Organizations can benefit from internal conversational agents that assist employees in quickly searching for and retrieving information from vast databases, including their internal database. Employees can interact with the agent naturally to ask complex questions and receive relevant answers, facilitating decision-making and access to internal knowledge.
  • Insight Generation.
    Generative AI can help identify trends and hidden insights in large and diverse datasets, offering a fresh perspective on research. This can be useful for analyzing unstructured data, identifying trends, creating customer segments, or predicting future trends. This capability allows businesses to rapidly extract impactful information from documents and transform them into actionable knowledge.
  • Customer 360° View.
    Generative AI can be used to aggregate and unify heterogeneous data into a comprehensive view of each customer. Using advanced machine learning and natural language processing techniques, AI can identify relationships between different data points and create enriched customer profiles. This enables sales, marketing, and customer service teams to have a deep understanding of each customer’s preferences, behaviors, and needs.


Start by selecting a low-complexity, high-value use case for your organization

When embarking on a generative AI project, it’s often advisable to begin with a Proof of Concept (PoC) that offers low complexity and rapid high value. A PoC is a practical demonstration that validates the technical feasibility and potential value of a solution before committing to a fully integrated system.

Let’s consider the concrete example of a Proof of Concept for a generative AI-powered virtual assistant. Such a system enables customer support agents to easily access internal knowledge sources, ask questions, and receive relevant real-time answers. Swiftly showcasing the power of such a solution on your data and within your corporate context can not only enhance employee productivity but also generate enthusiasm by highlighting the benefits of generative AI within the organization.

Furthermore, through an internal virtual assistant PoC, a company can test the effectiveness of generative AI before applying it to customer-facing applications. This helps comprehend limitations and necessary improvements while minimizing the risks associated with implementing new technology.

“With great power comes great responsibility.” 

-Uncle Ben

At Moov AI, we believe in the immense potential of generative artificial intelligence and advocate for a more responsible use of AI through the leadership of Olivier Blais in LIAD, ISO standards on AI, and with the Quebec Innovation Council. Just like with any AI project, we aim to reduce risk levels. You can watch Olivier’s conference on generative AI, where the risks associated with this type of project are addressed along with how to mitigate them. It’s crucial to maintain cautious optimism. While the technology is impressive, it needs to be explored with security at the forefront.

In conclusion 

In an environment where the possibilities for using generative AI are vast, it’s essential to be discerning when choosing which use cases to explore. Before adopting a generative AI solution, as with all AI projects, it’s vital to think about the business objectives you’ve set yourself. An AI solution must meet your business objectives.

Now that we’ve outlined the various use cases for generative AI, you need to ask yourself what the next steps are. The first thing we’d advise you to do is to consider the questions proposed by McKinsey.

  • To what extent can technology help or disrupt our industry and/or our company’s value chain?
  • What are our policies and positions? For example, do we wait cautiously to see how the technology develops, invest in pilot projects, or seek to develop a new business? Should the position vary according to areas of the business?
  • Given the limitations of the models, what are our criteria for selecting which use cases to target?
  • How do we go about creating an effective ecosystem of partners, communities and platforms?
  • What legal and community standards must these models comply with so that we can maintain the trust of our stakeholders?

If you want to know more about the best way to start an AI (or generative AI) project, we’ve got a good eBook for you.

Getting started with a proof of concept can be a beneficial approach, offering quick value while enabling your organization to become familiar with generative AI and develop internal traction in the face of innovation. By taking these preliminary steps, you’ll be better prepared to maximize the benefits of this emerging technology, while addressing the specific needs of your business.

Why you should use ChatGPT in a business context

The challenges of integrating innovation like generative artificial intelligence solutions in your company

OpenAI made a big impact in the field of artificial intelligence by unveiling ChatGPT, sparking a frenzy of adoption among millions of people. For the first time, we witnessed a true democratization of artificial intelligence. This innovation opened the eyes of everyday individuals and the business world to new possibilities. Generative AI enables everyone to explore the capabilities of this advanced technology almost instantly.

Following OpenAI’s success, other companies such as Google (PaLM 2, Bard), Anthropic (Claude), and Meta (LLaMA) also released their large language models (LLMs) to compete with OpenAI (ChatGPT, GPT-4). These solutions are all powerful generative AI tools that can generate precise and rich responses from prompts.

However, the rapid success of these LLMs raises several risks, including societal and reputational concerns. Questions have also been raised about their integration in a professional context. This brings us to a well-known concept in business: innovation management. It’s not just about using new technology for the sake of it, but rather implementing it with a specific goal to achieve a competitive advantage. Like any other AI solution, considering the adoption of generative AI must begin with the business objectives you have set.

The challenges lie in innovating with generative AI, deploying it at scale, integrating it into the current company system, and managing the associated risks. When a solution arrives on the market so abruptly, it is essential to understand it thoroughly before adopting it too quickly and potentially having to backtrack.

Thus, companies quickly face the limitations and potential risks of generative AI. Let’s be clear: using the free version to automate business processes is a bad idea. Despite its performance and opportunities, this tool is not a B2B solution. This raises the question: how can we extend these capabilities to our professional activities more appropriately while mitigating the risks?

Thankfully, there are solutions specifically designed for enterprise use and offering business capabilities. B2B tools provided by Google, Microsoft, and AWS cater to the specific needs of businesses, allowing them to fully leverage the benefits of generative AI while ensuring optimal security and efficiency.

Risks and limitations of generative AI

Before delving into the topic, let us explain why no company should use the public version of ChatGPT (or other similar tools) to blindly automate critical processes within their business.

Data security

Data security is a major concern when using public platforms like Bard and ChatGPT. It is crucial to adopt preventive measures now to avoid sharing sensitive information through these tools.

By default, any data entered into these tools leaves your environment: it is transmitted to the servers of the company that created the solution. All information provided via prompts to ChatGPT, for example, can be used by OpenAI.

Now, this is not for the purpose of “stealing information to dominate the world.” Instead, it’s to deepen the understanding of use cases and improve the technology. However, it is essential to recognize the potential risk this poses to the security of our company’s information. There are already examples of misusing ChatGPT, like the former Samsung employee who used it to optimize his code, inadvertently sharing sensitive company information with OpenAI’s servers. This represents an internal information breach.

Therefore, it is highly recommended to exercise caution and not share sensitive information that could compromise data confidentiality and security when using OpenAI’s APIs.

ChatGPT’s training data stops in 2021

It’s also important to note that ChatGPT was trained on data up to September 2021, which means its knowledge may not reflect the latest information. The same applies to any other LLM, since each is trained on past data. For example, if you ask a generative AI solution about the latest financial statements of Shopify, you might get outdated information. This highlights the importance of understanding the temporal limitations of generative AI solutions and not treating their responses as up to date in all respects. If you want recent information, it’s essential to include the most recent data in your query and ask for a response based on that information.
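
One way to include recent data in a query is to place dated source excerpts in the prompt and instruct the model to rely only on them. The sketch below assumes a hypothetical document structure with "date" and "text" keys and uses placeholder excerpts rather than real financial data; the call to the model itself is not shown.

```python
def build_grounded_prompt(question, documents):
    """Put dated source excerpts in the prompt, newest first.

    Dates are assumed to be ISO strings (YYYY-MM-DD), which sort correctly as text.
    """
    ordered = sorted(documents, key=lambda d: d["date"], reverse=True)
    context = "\n".join(f"[{d['date']}] {d['text']}" for d in ordered)
    return (
        "Answer using only the sources below. "
        "If they do not contain the answer, say so.\n\n"
        f"Sources:\n{context}\n\nQuestion: {question}"
    )


# Placeholder excerpts standing in for documents you would supply yourself.
docs = [
    {"date": "2023-03-31", "text": "Placeholder excerpt from the first-quarter report."},
    {"date": "2023-06-30", "text": "Placeholder excerpt from the second-quarter report."},
]
prompt = build_grounded_prompt("What was the most recent quarterly revenue?", docs)
print(prompt)
```

Asking the model to say when the sources do not contain the answer also helps limit the invented responses discussed in the next section.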

Hallucinations

Chatbot responses can be useful, humorous, or, in some cases, outright invented. Large language models can sometimes hallucinate. Hallucination here refers to false information in the generated text that may seem plausible but is actually incorrect. With generative AI text solutions, the responses are delivered with such confidence that they can easily mislead. If you want to write a poem about haystacks using ChatGPT for entertainment purposes, the impact of a potential hallucination is minimal. However, in a professional environment, where you consult information to make critical decisions, data accuracy is paramount.

When you prompt a generative AI solution without providing sufficient context, you receive the most statistically plausible response, and the LLM may misinterpret the request because of the missing context. This puts you at risk of receiving a response that seems appropriate but might contain false information. It is our responsibility to validate the generated information to ensure accuracy before publishing it, to avoid unintentionally creating “fake news.”

In short, we cannot fully trust what the machine regurgitates, and that is problematic in a business context.



Generative AI solutions adapted for enterprises

As mentioned earlier, ChatGPT is more of a B2C tool and comes with risks concerning the security of business information.

Thankfully, there are now tools specifically designed for enterprises provided by Google, Microsoft and AWS. These tools are recommended for professional use.

For example, Google Cloud’s suite now includes several tools such as Generative AI App Builder, Duet AI for Google Workspace, and generative AI support on Vertex AI. With these, you have the robust and secure structure of Google for your business projects.

Security is reinforced as the content remains within a secure and private shell, ensuring that the information transmitted to the model will not be stored publicly. You can also apply your data governance plan to comply with your internal security processes, which is much more reassuring.

Creating your own generative AI solution in-house

An interesting practice with professional generative AI versions is the possibility for a company to integrate its own technical documents and create an in-house solution by defining specific limits and parameters. This approach allows for developing a personalized virtual assistant accessible to all members of the company, offering easy access to internal knowledge bases. This initiative encourages collaboration and simplifies the dissemination of information within the organization, thereby enhancing overall efficiency and productivity. By customizing generative AI to meet the company’s needs, you can maximize the benefits of this technology while adhering to the organization’s specific policies and requirements. Both Google and Microsoft allow you to create a company-specific interface and offer a secure generative AI solution that considers the organization’s specific parameters.

This practical example provides an opportunity to embark on a generative AI project that offers great value while presenting low risks.

Validating before using

How do you estimate if generative AI performs well in solving your problem? It’s simple; just measure the success rate over 50-100 similar queries. This will help estimate the potential success when using the solution in a normal context and give you more confidence.
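
As a minimal sketch of that measurement, assume each test query has been reviewed by a person and marked as acceptable or not. The review results below are made-up numbers for illustration only.

```python
def success_rate(results):
    """Share of reviewed queries judged acceptable (True) by a human reviewer."""
    if not results:
        raise ValueError("at least one reviewed query is required")
    return sum(1 for ok in results if ok) / len(results)


# Illustrative review outcome: 43 acceptable answers out of 50 test queries.
reviews = [True] * 43 + [False] * 7
rate = success_rate(reviews)
print(f"{rate:.0%}")  # 86%
```

With only 50 to 100 queries the figure is a rough estimate, so it is worth repeating the measurement whenever the prompts, the data, or the underlying model change.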

It’s super powerful to be able to create query templates or a process once certain tasks have been validated by users. This way, you can generate a platform that can act as a more generalist personal assistant while automating or optimizing specific tasks. Project teams are encouraged to stay alert for tasks that bring the most value and try to generate these templates or processes to make them accessible to all users.
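
A query template can be as simple as a validated prompt with named placeholders. The sketch below uses Python's standard `string.Template`; the template wording and the placeholder names are assumptions chosen for this example.

```python
from string import Template

# A prompt that users have validated, with the variable parts left as placeholders.
SUBMISSION_TEMPLATE = Template(
    "Write a submission for our client $client_name from $company "
    "for a $project_type. Keep it under $max_words words."
)


def render_query(template, **values):
    """Fill in a template; a missing placeholder raises KeyError instead of passing silently."""
    return template.substitute(**values)


query = render_query(
    SUBMISSION_TEMPLATE,
    client_name="Olivier",
    company="Moov AI",
    project_type="proof of concept",
    max_words=500,
)
print(query)
```

Failing loudly on a missing value is a deliberate choice here: an incomplete prompt tends to produce a plausible but wrong answer, which is harder to notice than an error.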

An overview of a generative AI solution

Let’s consider a company that wants to automate the writing of submissions to address new business opportunities.

Task to automate: Writing submissions based on customer data from your CRM, past submissions, and various knowledge bases such as internal documentation, customer exchanges, or other data.

Query interface: A software interface integrated with your existing tools (Teams, Slack, Salesforce, etc.) that allows your users to write queries in natural language that activate the generative AI solution. For example: “Write a submission for our new client Olivier from Moov AI company for a proof of concept to automate submissions for a particular project using generative AI.”

API (input and output): API stands for “Application Programming Interface.” An API defines the methods and data formats that developers can use to access the functionalities of software, platforms, or third-party services. Here, two APIs are involved: an input API that passes the user’s query to the generative AI solution, and an output API that returns the result. Together, they facilitate information exchange and integration between your systems and the generative AI solution.

Data: The generative AI solution will have been previously personalized with your data. Think of generative AI tools as empty shells into which you can integrate your data and leverage the same response power as open solutions. The response of the generative AI solution will be based on your data.

This data can include your CRM data, customer email exchanges, knowledge base, documentation, past submissions, project reports, actual project costs, etc. Anything relevant and written can be integrated into these solutions.

Professional cloud platform: These solutions no longer need an introduction. Unlike solutions offered to the general public like ChatGPT, cloud solutions offer increased data security, the possibility to apply your data governance plan, and monitoring capabilities for your environments, data, and models. Your data and the various commands you issue will remain in your secure Cloud environment, safe from external eyes.

Generative AI solution and task executed: Your query will be processed by the generative AI solution using your data and your company as context. The result will be, for example, the drafting of a new submission tailored to your service offering and the specific needs of the client you just targeted. This draft submission will be complete and ready for review by a human colleague who will send it to the client. All of this in just a few minutes.
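
The components described above can be tied together in a short sketch. Everything here is an assumption made for illustration: the word-overlap ranking is a naive stand-in for real retrieval over your data, `fake_generate` is a stub where a call to your cloud provider's model would go, and the sample documents are invented.

```python
def retrieve(query, documents, top_k=2):
    """Rank documents by the number of words shared with the query.

    A naive stand-in for real retrieval; production systems typically use embeddings.
    """
    query_words = set(query.lower().split())
    scored = [(len(query_words & set(doc.lower().split())), doc) for doc in documents]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [doc for score, doc in scored[:top_k] if score > 0]


def handle_query(query, documents, generate):
    """Input API -> company data -> generative model -> draft awaiting human review."""
    context = retrieve(query, documents)
    prompt = "Context:\n" + "\n".join(context) + "\n\nTask: " + query
    draft = generate(prompt)
    # The draft is never sent automatically; a colleague reviews it first.
    return {"draft": draft, "status": "pending_human_review", "sources": context}


def fake_generate(prompt):
    """Stub for the model call; replace with your provider's client."""
    return "[draft submission placeholder]"


documents = [
    "Past submission: proof of concept for invoice automation",
    "Office holiday schedule",
    "CRM note: client interested in generative AI proof of concept",
]
result = handle_query(
    "Write a submission for a generative AI proof of concept",
    documents,
    fake_generate,
)
print(result["status"], result["sources"])
```

Returning the sources alongside the draft gives the reviewer a way to check what the text was based on before it goes to the client.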

Good tools for good work

The launch of ChatGPT by OpenAI has opened exciting new perspectives in the field of artificial intelligence. The democratization of this advanced technology has allowed millions of users to explore its capabilities almost instantly. However, it is essential to consider the risks and limitations associated with using these LLMs. Choosing generative AI tools built for business is essential for safe professional use.

Ultimately, by understanding the risks and using these technologies judiciously, we can leverage their full potential while ensuring the security and protection of sensitive data. Artificial intelligence continues to evolve, and it is important to adopt a balanced and thoughtful approach in its use to make the most of it.

How to leverage generative artificial intelligence solutions in business without drifting

With the advent of ChatGPT and other text and image generation tools, generative artificial intelligence (AI) solutions offer truly revolutionary prospects for your business. Generative AI solutions will have more impact on our businesses and ways of working than the arrival of the Internet.

In this presentation, Olivier shows you how to leverage these technologies right now to go beyond simple question-and-answer scenarios and achieve your development goals while accelerating your capabilities.

Through concrete examples and in-depth analysis, Olivier explores the various applications of generative AI, as well as the limitations and challenges associated with these tools, which, it must be reiterated, are still in their infancy.

Olivier also addresses the ethical considerations related to the use of generative AI tools and proposes ways to ensure the quality of solutions delivered using generative AI.


Series on generative artificial intelligence

This article is part of a series we have produced to help businesses better understand generative AI and its possibilities.


Conference on demand (French)

In-Depth Verbatim Conference Translation from French


[This article is a verbatim transcript, translated from French, of the conference presented by Olivier Blais:

‘How to Harness Generative Artificial Intelligence Solutions in Business Without Drift’.

It is worth noting that this transcription and translation were done using artificial intelligence tools. A job that would have taken 3 to 5 hours was completed in just a few minutes.]

Introduction

Olivier – Speaker

Thank you very much for your time. It’s greatly appreciated. I know we’re all busy here on a Thursday morning. We’re anxious to get back to the office, but at the same time, I think generative AI has something special. I think it taps into the imagination. I can understand why you’re here. I’m also super excited to talk about this topic. I want to know, has everyone arrived? The absentees are missing out, so let’s get started. Before we begin the presentation, I had a little survey for you. By a show of hands, I was wondering who in the room has certain concerns about generative AI? I see hands going up very quickly. Perfect. Thank you. Also, I had a question about the capabilities of generative AI. Who is excited about generative AI? Who wants to use it? Excellent. We’ll send a sales representative to talk to you shortly. No, that’s a joke. It’s a joke, but not really. I’m glad to have seen both worried and excited hands. In fact, for me, there’s still a duality between the two.

The duality of generative AI

[01min 24s]
I am confident but cautious about the technology. That’s what we’re going to talk about today. That’s why we’re discussing generative AI, but without going too far. How to effectively use these technologies to generate benefits without excessive risk. What I’m going to do is… Excuse me, first, I’m going to introduce myself. My name is Olivier Blais, co-founder of Moov AI. I am in charge of innovation for the company and a speaker. I speak from time to time; I like listening to myself speak. What we’re going to do is talk about generative AI, of course. We’ll start from the beginning. We’ll introduce the topic, but we’ll go further. Why? Because we are all AI developers at the moment. It’s special, but with generative AI, it’s a paradigm shift. I’m no longer just speaking to a couple of mathematicians who studied artificial intelligence. I’m speaking to Mr. and Mrs. Everyone because now everyone has the opportunity to use these technologies to generate results. So, everyone needs to be aware, everyone needs to understand how to effectively exploit the technology.

The Hype Cycle of Generative AI

[02min 38s]
But don’t worry, I’ll try to keep it fairly soft. We won’t dive into mathematics, I promise you. And I’ll also talk about responsible AI, which is key. Making sure that when we do things, we do them correctly. So, we’ll delve into a slightly more theoretical period, but first, I’m curious to know more about where you are in the hype cycle. It’s extremely hyped right now, ChatGPT, it’s all you hear about. I don’t even go on LinkedIn anymore because that’s all it is now. GPT this, GPT that. But in fact, it’s really a curve. When I saw this curve for the first time, it stuck with me. And everyone here, whether you realize it or not, you’re at one of these stages of the curve. I won’t ask everyone where they are, that would be really complex, but I find it interesting to know what the upcoming stages are in our journey. For me, initially, it was around January, I would say. At Moov AI, we started using GPT-2, GPT-3 since 2019.

[03min 56s]
We found ourselves making a documentary, for example, with Matthieu Dugal where we generated a conversational agent. It’s been a while, but since we heard about ChatGPT, that’s when it really awakened people and created hype. At first, we think it’s magic. You input something like “give me a poem about haystacks,” and it generates a poem about haystacks, and it’s incredible. It really feels like magic, but at some point, when you start using it for something useful, that’s when you fall into small “rabbit holes,” that’s when you encounter irregularities. For example, you might question it about a person, a public figure, and then it gets everything wrong. You might use it for calculations, and it can make mistakes in calculations. You scratch your head and think, “Okay, what’s happening?” And increasingly, you find yourself identifying weaknesses in these models. Ultimately, it’s not necessarily magic.

Chart of the generative AI adoption cycle

Risks and Limitations of Generative AI

[05min]
It’s a lot on the surface. However, there are some issues to consider with ChatGPT. Firstly, it’s not updated daily. The last update was in January 2021. This means that if you ask questions about current news or events, it may not be aware of them. So, we can identify this as a weakness. Additionally, when using ChatGPT, your data is transmitted to OpenAI, creating a potential security vulnerability. This reveals the weaknesses in these otherwise cool tools. While there are ways to mitigate these risks, it’s essential to be aware of and understand how to properly leverage the technology, using best practices and learning from examples.

Advantages and Best Practices of Generative AI

[06min 12s]
Despite these challenges, it’s fascinating to see what can be achieved with generative AI. While we acknowledge the problems, we strive to overcome them and mitigate the identified risks. Many people are currently using generative AI in functional, deployed applications. It works. And now, we can move forward and apply the same principles to real-world use cases. Speaking of applications, let’s take a step back. Our focus is not just on ChatGPT; it’s one solution among many. For instance, Google has its own solution. I refer to this as B2C (Business-to-Consumer) – something that is accessible to everyone, a democratized technology meant for widespread utilization.

Generative AI Tools for Businesses

[07min 34s]
It’s exciting because generative AI enables various kinds of text analysis and offers numerous use cases that benefit everyone. There are also tools available specifically for businesses (B2B). It’s important to understand the distinction between the two. We don’t recommend using consumer-oriented tools for enterprise purposes. Instead, we suggest utilizing tools tailored for businesses. Solutions like GPT-4, PaLM, ChatGPT, and BART are just a few examples among hundreds. We refer to these solutions as Large Language Models (LLMs). Some of you may have heard of LLMs before, and it’s good to see a few hands raised, indicating familiarity. LLMs are language models trained on scraped internet data from various sources, making them highly proficient in understanding and generating text.

Foundation Models and Language Models

[08min 55s]
This is great because it allows us to do everything we currently do with the technologies we use in our daily lives. However, it fundamentally stems from an approach called foundational models. This approach has been around for over ten years. It’s the ability to model the world. It may sound poetic, “I’m modeling the world,” but that’s essentially what it is. It means that there are human functions that we don’t need to recreate every time, such as image detection. We could create our own model to identify cats from dogs, another model to recognize numbers on a check, and yet another model to identify intruders on a security camera. We could develop them from scratch, but it would require millions of images. Instead, the concept of foundational models emerged, and we thought, “Wait, why don’t we invest more time upfront?” We can develop a model that excels at identifying elements in an image.

The Paradigm Shift with Generative AI

Afterward, we can leverage this capability for tasks related to image detection. It started with image detection, but we quickly realized the benefits in language as well. That’s why we now have generative AI tools that allow us to generate poems, summaries, and perform many other tasks. How were these tools created? They were developed by gathering vast amounts of publicly available internet data. Personally, I don’t have the capacity to do this, but major web players, including the GAFAM companies, can seize this opportunity because they have the storage capacity and advanced capabilities that surpass other organizations. As a result, they have developed highly effective models that understand the world but lack depth. Has anyone here used GPT-3, for example? OGs, I like it—people who were there before ChatGPT. With GPT-3, if you asked it a question, it would essentially provide you with what Wikipedia would say. It wasn’t very interesting from a user experience perspective. This is what sets the new solutions apart. As a second step, there has been an emphasis on dialogue adaptation. It’s fun how we are now getting responses. Organizations have started prioritizing the user experience. This speaks volumes about being more attuned to what users want and how they want to leverage the technology. It makes the experience more enjoyable and motivates people to use and explore it further. Additionally, it has significantly improved the quality because humans prefer to be responded to in a human-like manner. This has increased our adoption of the technology. And now, here we are. Let me explain the difference a bit because it changes many paradigms compared to regular AI solutions. With a regular AI solution, we typically have teams that gather data and train models. Often, each problem has its own model.

How generative AI works

[13min 04s]
And after that, once they’ve trained it, then it remains to be deployed so that it can be reused whenever there are new updates or new things that come up. Let me give you an example. Back then, because it’s different now, there was a team at Google, for instance, that focused on sentiment analysis. They would analyze tweets or texts and determine whether the sentiment expressed was positive or negative. So, you had a team solely dedicated to that. They would gather information from the web, identify the associated sentiment, train their model, and then deploy it. From then on, every time a new comment came in, they could identify whether it was positive or not. This had to be done for sentiment analysis, and the same process applied to every new thing you wanted to model. But that’s not the case anymore. The paradigm has changed because there are so many possibilities with language models that you no longer need…

The Possibilities of Generative AI

[14min 20s]
Firstly, it’s already deployed. So, for the average person, for businesses, you no longer need to go through the initial training phase. And that changes the game because it significantly reduces the scope of your project. It’s no longer a project that costs millions to produce because you’ve cut a huge portion of your development costs. Moreover, you don’t need to gather extremely precise data to address the specific problems your model was trained on. Here, you input a question or a prompt, and if your prompt is well-crafted, the result is what you expected to receive. These results now have limitless possibilities. You can have text, predictions, code, tables, images—there’s so much you can get. The benefits we can achieve with these new technologies are incredible. However, I appreciate the duality in the comment made by Google’s CEO on 60 Minutes a few weeks ago: “The urgency to deploy it in a beneficial way is harmful if deployed wrongly.”

[15min 44s]
That’s the real duality. There are so many positives that can come from it, but it can be catastrophic if done incorrectly. It’s the CEO of Google who said that, but honestly, it’s the responsibility of each individual to use these technologies appropriately. If we use them correctly, we can minimize the risks because there are risks. Here, firstly, one risk is that it’s used by everyone. I understand that we might have around 60 people here, but there are hundreds of millions of people who have used ChatGPT, for example. So, it’s now being used by almost everyone, and it’s crucial for each person to be aware of their usage because we could end up causing a disaster if we exploit it in the wrong way. I’ll skip that part. Here, I’ll give examples of things we can do because it’s always helpful to be able to… Sorry, I’ll go back here. These are different use cases, don’t worry, I won’t go through all of them, but there are plenty of use cases. Many of them are related to text analysis.

Number of ChatGPT users compared with TikTok, Instagram, Google Translate, and Netflix

[17min 00s]
We’re able to analyze a lot of texts, perform classification, and identify elements within a text. Additionally, we can also make predictions, even very conventional ones. So, we can reproduce certain machine learning models using LLM tools. By the way, everything we do with text, we can do with code as well. There might be people who… We’re used to writing text in French or English, but for those who are used to writing in Python, for example, or in C++ or C#, it’s even more efficient because it’s explicit. When you write a function, it’s explicit. Your language is implicit, all the sentence structures, what the words mean. So, let’s consider that the sky is the limit in terms of capabilities. As I mentioned earlier, I think we can agree that question/answer tasks are extremely good by default. So, if we want to build a chatbot, let’s ensure that the chatbots we want to develop have the ability to generate responses. We’ve reached that stage now. What we develop should have the ability to generate responses.

[18min 21s]
What it allows is the ability to address a much wider range of questions. It also reduces development time. You don’t have to think about each individual scenario separately. You give precise instructions to your chatbot, and most of the time, it will provide appropriate responses. You can also control it, which is a possibility. Another element that I really like is the fact that we can trust these models. It’s not just about saying, “Let’s ask questions and see what happens.” We can trust these solutions in certain cases. For example, here we could correct dictations with solutions. For instance, I posed a question here. I tried a little dictation of my own: “Manon bought three cats at the grocery store.” What we can see is the correction that is made. Truly, it was able to go much further. It has been demonstrated that for text correction, it’s incredible. It works really well, and we can trust tools like this for sensitive matters such as dictation correction for our children.

Various generative AI use cases

[19min 47s]
By the way, something quite amusing is that if I ask the Generative AI platform to correct the sentence “Manon buys three cats at the grocery store,” besides correcting the mistakes, they will say that it’s not really at the grocery store where you buy a cat. That’s interesting to know. Otherwise, I don’t really encounter the cat. It’s not necessarily the cat that you want to keep for years. That’s another example. Earlier, I mentioned it when talking about sentiment analysis. But these are things we can do. All the existing APIs will be rapidly replaced. Here, I asked the same question: can you identify the sentiment in the following sentences? You provide the list of sentences, and they return the correct sentiment. That’s exactly the tool that will be used in the upcoming APIs. Once again, we can trust the tools as long as they are used appropriately and optimized for the task you want to accomplish. I even pushed the envelope a bit because earlier, I talked a lot about the risk. Yes, there is a risk, but it’s important to understand the risk properly.
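The sentiment request described above (send a list of sentences, get one label per sentence back) can be sketched in a few lines of Python. This is only an illustration: the function names are made up for this example, and `fake_llm` is a stand-in that returns a canned reply so the sketch runs without any real model or API. In a real solution it would be replaced by a call to whichever business-grade LLM service you use.

```python
from typing import Callable


def build_sentiment_prompt(sentences: list[str]) -> str:
    """Turn a list of sentences into one numbered prompt."""
    numbered = "\n".join(f"{i}. {s}" for i, s in enumerate(sentences, start=1))
    return (
        "Identify the sentiment of each sentence below.\n"
        "Answer with one line per sentence, in the form "
        "'<number>: positive' or '<number>: negative'.\n\n" + numbered
    )


def parse_sentiments(response: str, count: int) -> list[str]:
    """Read '<number>: <label>' lines; anything unreadable stays 'unknown'."""
    labels = ["unknown"] * count
    for line in response.splitlines():
        number, sep, label = line.partition(":")
        number, label = number.strip(), label.strip().lower()
        if not sep or not number.isdecimal():
            continue
        index = int(number)
        if 1 <= index <= count and label in {"positive", "negative"}:
            labels[index - 1] = label
    return labels


def classify(sentences: list[str], llm: Callable[[str], str]) -> list[str]:
    """Send the prompt to the model and parse its answer."""
    return parse_sentiments(llm(build_sentiment_prompt(sentences)), len(sentences))


def fake_llm(prompt: str) -> str:
    """Hypothetical stand-in for a real model call (canned reply)."""
    return "1: positive\n2: negative"


if __name__ == "__main__":
    sentences = ["I love this product.", "This was a terrible experience."]
    print(classify(sentences, fake_llm))
```

Parsing the reply defensively matters here: a model can return extra text or skip a line, so anything that does not match the expected format is marked as unknown instead of being trusted.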

Example of a dictation corrected by ChatGPT

[21min 05s]
I thought, “Wouldn’t it be fun to use Generative AI tools to perform risk analysis of an AI solution?” I thought, “It would be amusing to ask ChatGPT to analyze the risk of ChatGPT itself. Let’s see if I can put it to the test.” I actually received some very interesting responses. First of all, it did a great job. I used a risk framework called NIST, which is highly recognized. I asked the question, “Here are the risks, can you assess the impact, the probability of their occurrence, and even provide justifications?” The task was really well done, and I am extremely satisfied. Here, I’ll give you three examples. The first example is about use cases. Is the use case a risky one? What’s interesting is that it’s not always straightforward, I think we know that, but it’s not due to hallucination or error; it’s a matter of perspective. So, the response I received was, “No, ChatGPT is designed for all use cases.

[22min 19s]
Since we don’t specifically target any use case, we have nothing to blame ourselves for.” I’m exaggerating, but I think it’s interesting because it gives us the impression that we try to avoid as a society. It’s like saying, “You know what? We’ll give the correct answer to everything, and then we can take a step back, and we’re not responsible for what happens.” Our responsibility is to prevent that from happening. It’s like saying, “No, no, look, for each use case, for each scenario we use, we’ll make sure it’s done properly.” So, we’ll be required… There’s an additional level. We can’t just rely on technologies to control each of these cases. It needs to be done in a subsequent step. Another aspect was about the methods for manufacturing, for creating the solution. And here, it made me think a lot. So, OpenAI creates its own model with two or three data scientists. I wonder who has the best capabilities for solution creation.

[23min 27s]
I think I’ll put a lot of my money on these platforms because they have highly qualified people, they have large teams. So, it made me think, and I agree with them. It’s true that the risk is lower. I think the risk is higher if Mr. or Mrs. Average Joe tries to create their own model because they may not have the best methodologies, they may not have the best expertise, and so on. So, there is also a significant benefit that can be gained with these solutions. Lastly, the last point I wanted to discuss is a risk in terms of legal security. And don’t worry, I’ll come back to it later. But something interesting, I got it here. What it told me is that ChatGPT uses third-party data, and that entails risks related to copyright and intellectual property. But all of this is intriguing. Firstly, transparency, being able to understand the model a bit more, but also that generative AI solutions are capable of performing tasks as critical as model risk assessment.

A risk assessment table of ChatGPT, produced by ChatGPT

[24min 42s]
The guy who is involved with ISO standards is very happy with this type of exercise. Now, I’ll take an even bigger step back. I don’t know if you all agree to use generative AI. I think yes, I saw many hands raised, but some may not be. But I have some bad news for those who are less enthusiastic about using generative AI. You don’t have a choice. Unfortunately or fortunately. Why is that? It’s because technology organizations… Here’s an example from Google, but Microsoft has a very similar roadmap as well. They didn’t just create generative AI tools for everyone to have fun writing poems or creating summaries. They also used them to improve the services they offer to their clients. For example, here are three different levels of generative AI offerings by Google. The clearest one is to say, “I’ll help data scientists and people developing AI solutions to develop generative AI solutions.” I think that’s a given.

[26min 06s]
Everyone is aware that this was coming. But Google takes it a step further and says, “Wait a minute, you don’t always have to have data scientists. There aren’t many people developing AI solutions, but there are many more developers in the world. So, how can we assist developers in their development?” And that’s where the capabilities we’re starting to understand now when using existing solutions come into play. There’s the possibility of using the solutions as they are. For example, Google talks about helping with conversation, helping with search. These are very clear use cases that developers can deploy in existing solutions. They can develop their own applications, select the features they need, and adapt and adjust these features to enhance the end-user experience. And we can go even further than that. So, it’s not just for the average person; there are also business users who have started using tools like Dialogflow.

[27min 22s]
Google has several tools, and these tools will also have generative AI capabilities. What this means is that generative AI is here to stay. We just need to be able to use it effectively. And I have some more news for you: development continues to accelerate. I understand that some people may be happy to know that there are some lingering questions. Apparently, GPT-5 is not being developed, but that doesn’t change the fact that development is ongoing and intensifying. I can provide examples. We have LLaMa, and I’m not talking about the animal, even though there’s a picture of a very technological llama. But LLaMa is a tool that allows you to create your own models, your own internal ChatGPT, for example. So, that continues. We can see the frenzy. Everyone wants a LLaMa. I’m exaggerating. Personally, I would suggest not using that and instead using the right technologies. That’s my take on it. We have much better performance with the existing solutions that have already been tested by millions of people.

[28min 51s]
But I understand that it’s something interesting. It allows us to develop everything on our own laptop. Sure, the geek in me finds it exciting, but the business person finds it a bit overkill for what we’re creating. There are also companies that have their own capabilities in development. For example, Coveo. Coveo is very clear that they have already developed some generative capabilities. Coveo, which is one of Quebec’s gems. And there are other companies like Databricks, a major player in the ecosystem, that is developing Dolly, I believe. So, it will intensify. There will be more and more competition in the market. And there’s also a trend, I’ll just briefly mention it, called “auto GPT,” which is the ability to train GPT, a generative solution, with another generative solution, creating a loop. It’s scary, I agree. Again, it’s important to control it, but for now, it’s a trend that is more prevalent in development, automating certain workflows.

Slide showing various innovations in the field of generative artificial intelligence

[30min 09s]
Really, it continues. It’s important to stay informed. It’s important to understand what is happening to ensure we use the best technologies to meet our needs. And to avoid risks, I’m going to talk about three different challenges we have currently. In terms of hallucinations, I’ve been talking about hallucinations for a while. What is a hallucination? I’m not talking about a hallucination in the desert. A hallucination is an error. Let me give you an example. Everyone has a brother-in-law who says things, he’s so convincing, but sometimes he doesn’t know what he’s talking about. I think everyone has had that brother-in-law or sister-in-law. That’s a hallucination. It’s an error. In the past, the OGs will remember that in a traditional model, you have errors, so sometimes you make incorrect predictions. In this case, it’s a wrong prediction, but it’s so convincing because it’s well-written. A hallucination is a bit more problematic because people who don’t necessarily have the ability to judge the output accurately could be deceived. So, here, you need to be careful every time you produce an output, every time you write a prompt, you need to look at what the result is.

[31min 43s]
Deepfakes, fake news, there are plenty of them, and there will be more and more. So, the ability to ask, “Write me a text about a certain topic,” without fact-checking and posting it on Facebook, is a problem. Why? Hallucination. I think we’re making the connection a bit. It’s much easier to write beautiful texts with false information than it used to be because before, you did it yourself or had it done by people in other countries. But now, it’s much easier. So, we need to ensure that every time we develop things, especially when it’s automated, we avoid the spread of fake news and deepfakes. And finally, in terms of privacy, I think everyone, if people are not aware of it now, I think we will be more and more aware of it. Let’s focus on it because it’s one of the problems we will increasingly see. There are horror stories right now, people copying, pasting trade secrets, putting them into tools. Ultimately, the information is distributed to big companies.

The main risks of generative AI: hallucinations, deepfakes, and privacy violations

[32min 52s]
And there, you have just created significant security vulnerabilities. But that’s why at Moov AI, our stance is somewhat similar to what I mentioned earlier: it’s cautious optimism. In fact, I’ll use the quote from Uncle Ben, for those who remember Spider-Man, “With great power comes great responsibility.” That’s why we have decided to embark on this journey. We want to assist our clients because if we don’t, people will do it themselves, and they might do it poorly, promoting fake news and causing more problems than benefits. That’s why it’s important to provide guidance and support. That’s why we are actively involved, for example, in advocating for Canada’s Artificial Intelligence and Data Act, part of Bill C-27. We are helping to accelerate these efforts. We also have a prominent role in ISO standards to regulate and oversee the development and use of artificial intelligence. Our goal is not only to develop useful things but also to ensure proper control over them.

Overview of the various efforts to secure the development of artificial intelligence.

[34min 13s]
This can be leveraged in three different ways. Firstly, in the field of education. For instance, we have Delphine here, who oversees the Moov AI Academy. We will ensure that we assist individuals in achieving their objectives. That’s for certain. We will also contribute to the development of high-quality solutions. We have already begun doing so by employing machine learning methodologies to demonstrate the effectiveness of our solutions. If we can do it for traditional solutions and prove their worth before deploying them, we can do the same for generative AI solutions. Lastly, we aim to fully comprehend the risks associated with the solutions we undertake. Our objective is not to create more problems but to capitalize on opportunities. Now, let’s briefly discuss risks because it is crucial to address them. I brought you here for that purpose. Just kidding! However, it is essential to have a good understanding of risks. Here, I will discuss four main risks: functional risks. When you build a feature, ultimately, a model is a feature.

Leveraging generative AI responsibly by educating, addressing specific problems, and understanding the risks associated with the solutions.

[35min 33s]
Contrary to what ChatGPT was saying, we are not merely creating a platform that provides answers to everything. Our goal is to develop features that meet your specific needs. How can we do this effectively? From a societal standpoint, how can we ensure that we create a solution that is fair and ethical? The key is to ask the right questions. We also need to consider information security and legal aspects. Now, let’s go through the different risks. When it comes to best practices for functional risk, it is important to define the task you want to accomplish clearly. We have examined various tasks extensively. Therefore, it is crucial to break down the problem in the way we want to approach it. Just because we have a powerful tool at our disposal and can input any prompt doesn’t mean it will provide optimal responses for all scenarios. We shouldn’t overlook scoping and settle for just having a search bar where we can do anything. Ideally, we should ensure that we achieve good performance relative to our specific goals. That’s truly the foundation, and I highly recommend everyone to follow this approach.

The main generative AI risks to mitigate: functional, societal, information security, and legal.

Best Practices for Functional Risks

[36min 53s]
Next, what we want to do is optimize our approaches and prompts. I will show you how to do that shortly. And finally, you need to perform validation. It’s a machine learning tool, an artificial intelligence tool. You want to validate it with multiple data points, prompts, and scenarios, just like we do in traditional artificial intelligence. Just because it works once or twice doesn’t mean we can assume it always works. So, one of my recommendations is to use conventional approaches, approaches that have been proven for validation, and validate what we develop. “Okay, yes, it works.” And this is quantifiable. “Okay. What I wanted to develop works well 90% of the time.” So, you’re able to quantify the percentage of correct answers you obtain. This is highly valuable because it allows you to determine whether you’re shooting yourself in the foot or not by continuing the development. Now, let me give you an example regarding prompt optimization. What I mentioned earlier, and it’s really… I think everyone understands that it’s quite simplistic, is that writing a prompt like “Write me a poem about haystacks” isn’t what you’re going to transform into a process or a product.
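The validation idea described above, running many prompts and measuring how often the answer is right, can be sketched as follows. This is a minimal illustration under stated assumptions: `evaluate`, `fake_generate`, and the test cases are invented for this example, the "correct" check is a simple case-insensitive substring match (a real project would likely need a stricter check or human review), and the canned answers replace a real model call.

```python
from typing import Callable


def evaluate(generate: Callable[[str], str], cases: list[tuple[str, str]]) -> float:
    """Return the share of cases whose answer contains the expected fact."""
    if not cases:
        raise ValueError("at least one validation case is required")
    passed = sum(
        1 for prompt, expected in cases
        if expected.lower() in generate(prompt).lower()
    )
    return passed / len(cases)


# Hypothetical validation set: (prompt, fact the answer must contain).
CASES = [
    ("What is the capital of Canada?", "Ottawa"),
    ("How many days are in a week?", "7"),
    ("What is 12 times 12?", "144"),
    ("What is the chemical formula of water?", "H2O"),
]

# Canned replies standing in for a real model; one is deliberately wrong.
CANNED = {
    "What is the capital of Canada?": "The capital of Canada is Ottawa.",
    "How many days are in a week?": "There are 7 days in a week.",
    "What is 12 times 12?": "12 times 12 is 124.",
    "What is the chemical formula of water?": "Water is H2O.",
}


def fake_generate(prompt: str) -> str:
    """Hypothetical stand-in for a call to a generative AI service."""
    return CANNED.get(prompt, "")


if __name__ == "__main__":
    score = evaluate(fake_generate, CASES)
    print(f"{score:.0%} of answers contained the expected fact")
```

The point of the sketch is the number at the end: with a fixed set of cases, every change to a prompt can be compared against the same score before deciding whether to continue development.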

Slide on best practices for mitigating functional risks.

[38min 29s]
It’s not about that. It won’t work well. I like to use the expression “future-proof.” It’s not something that will allow you to deploy a solution that will work in the long term. Instead, what you’ll want to do is… Yes, let me give an example. I apologize. I’ll give an example. It’s like if I ask a question to a financial chatbot that I develop, such as “Identify three interesting facts in Shopify’s latest income statement.” Is that a legitimate question? No, excuse me, ChatGPT is only trained until 2021. Okay, but it doesn’t know that. And if you ask it a question about, for example, the financial statements of 2019, it will give you random answers. The numbers won’t be accurate. Why? Because it’s really far back in the tool’s memory. Instead, what you want to do, and the best way to avoid shooting yourself in the foot, is to provide relevant information to the model. If instead, you manually find the financial statements and copy-paste them to test it.

Chart showing how prompts work in a generative artificial intelligence model.

[39min 42s]
I suggest doing it, by the way. You will be pleasantly surprised, but do it programmatically within a solution. And then, you say, “Can you identify three interesting facts in Shopify’s income statements based on this document?” It’s full of numbers, difficult to read even for yourself, but generative AI tools are capable of interpreting the information. And here, I’ve tested it. Will it give you a response? Yes, it will. And by the way, the numbers have been validated. I didn’t have them validated by Raymond Chabot Grant Thornton. I’m pushing it a bit, but not that much. But it provides real facts, the right information, and that’s future-proof. So, you’ve just created, yes, admittedly, a little extra complexity, but it’s worth it. Now, if I come back to my proposal, it would be to add information. Firstly, start with a knowledge base. Your knowledge base could be an FAQ, documents related to your company. For example, I want to know about my service offerings.
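Doing this "programmatically within a solution" can be as simple as a function that places the document inside the prompt ahead of the question. The sketch below is an assumption about how one might do it, not the method used in the talk: the function name, the delimiters, and the character limit are all hypothetical, and the document text is a placeholder to be replaced with the real statement.

```python
def build_grounded_prompt(question: str, document: str, max_chars: int = 8000) -> str:
    """Place the source document in the prompt so the model answers from it."""
    # Crude length guard (an assumption): real limits depend on the model used.
    excerpt = document[:max_chars]
    return (
        "Answer using only the document below. "
        "If the answer is not in the document, say so.\n\n"
        f"--- DOCUMENT ---\n{excerpt}\n--- END DOCUMENT ---\n\n"
        f"Question: {question}"
    )


if __name__ == "__main__":
    # Placeholder text: paste the actual income statement here.
    statement_text = "<text of the income statement goes here>"
    prompt = build_grounded_prompt(
        "Can you identify three interesting facts in this income statement?",
        statement_text,
    )
    print(prompt)
```

Asking the model to say when the answer is not in the document is one simple way to reduce invented figures, although it does not remove the need to check the output.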

An example prompt for surfacing interesting facts in Shopify’s latest income statement.
Example response to a prompt asking for interesting facts in Shopify’s latest income statement.

[40min 59s]
In my projects, I have post-mortems, I put them in a database, and after that, I can ask a generative AI tool questions, and it responds to me. You create a knowledge base, and then you go on to create, essentially, a recommendation tool, so you create a simple search tool that gives you recommendations. What information would you like to add to your prompt? Then you optimize it, and there you have a tool that works well, hallucinates less, and is ready for the future. It’s this type of tool that I propose to use because otherwise, we shoot ourselves in the foot. In terms of societal risks, we need to ask the right questions. I mean, quickly, we need to ask the right questions. If you don’t ask the right questions, please don’t automate anything. Automation, when you don’t know what you’re doing, is the enemy. We don’t automate anything before knowing how to ask the right questions. But again, we need to ask the right questions. We don’t know what we don’t know, right?
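The knowledge base, search tool, and enriched prompt described above can be sketched in a few lines. This is only a minimal illustration under stated assumptions: the knowledge base entries are hypothetical, the search is a naive word-overlap ranking rather than a production search engine, and the resulting prompt would still need to be sent to whichever generative AI service you use.

```python
import re

# Hypothetical knowledge base: in practice this could be an FAQ, post-mortems,
# or financial statements stored in a database.
KNOWLEDGE_BASE = [
    "Post-mortem 2022: the chatbot project slipped because training data arrived late.",
    "FAQ: our service offerings include AI strategy, data science, and MLOps support.",
    "Income statement note: revenue grew while operating expenses also increased.",
]


def tokenize(text):
    """Lowercase the text and return its set of alphanumeric words."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def retrieve(question, documents, top_k=1):
    """Rank documents by the number of words they share with the question."""
    question_words = tokenize(question)
    ranked = sorted(
        documents,
        key=lambda doc: len(question_words & tokenize(doc)),
        reverse=True,
    )
    return ranked[:top_k]


def build_prompt(question, documents, top_k=1):
    """Place the retrieved passages in the prompt so the model answers from them."""
    context = "\n".join(retrieve(question, documents, top_k))
    return (
        "Answer using only the document below.\n\n"
        f"Document:\n{context}\n\n"
        f"Question: {question}"
    )


prompt = build_prompt("What are your service offerings?", KNOWLEDGE_BASE)
print(prompt)
```

The point of the design is that the model receives the relevant information at question time instead of relying on what it memorized during training, which is what makes the approach hold up as documents change.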

Diagram showing how to optimize prompts in a generative artificial intelligence model.

Societal Risks and Asking the Right Questions

[42min 10s]
There are tools for that. I can share them with you. What I’m going to do is, later on, I’ll share a list of tools with you. One of the tools I like is called the reflexivity grid. I didn’t know it was a word, but apparently, it is. It’s about the ethical issues of AI systems, and it was developed by OBVIA, which is a Quebec organization. And here, for example, we have the ten commandments, the ten subjects, ten categories of risk. It’s really interesting because this grid provides very specific questions. For example… But these are questions that make you think. I’ll give you some examples. Can your system harm the user’s psychological well-being? In some cases, yes. About a month ago, there was a suicide that happened due to abusive use of a chatbot. I know it’s a “cherry-picked” example, but it just shows that there can be a connection between psychological well-being and the use of technology. We just need to be able to understand whether or not our system can affect society.

Best practices for mitigating societal risks: avoid excessive automation, question the potential ethical issues, and mitigate risks before development even begins.
Reflexivity grid on the ethical issues of AI systems (OBVIA)

[43min 31s]
In terms of privacy, of course, we discussed it earlier. In terms of caution, what’s the worst that can happen? Do we have mechanisms to prevent the worst from happening? The worst that can happen is information sharing. If it’s just internal information sharing, but then you always have someone who will validate and correct it, it’s not a big risk. But if you end up creating something that is automatic and it sends false information or makes financial decisions, we recommend having mechanisms to mitigate these risks. In terms of responsibilities, who is ultimately responsible for the solution? We can’t just say, “I rolled something out, I pressed ‘run’.” It runs, it’s supposed to perform a task, but there’s no one in the organization who is in charge. The answer is not “Google is in charge because it’s their solution.” No, no, no. You are in charge of what you develop. So, in other words, if we think we’re going to use ChatGPT to automate a department, I have some news for you.

Information Security and API Usage

[44min 44s]
There will be someone in charge of this new virtual department, and all the bad decisions that are made will be attributed to that person. Are we ready as an organization to do that? That’s another good question. I strongly suggest asking yourselves the right questions and trying to answer them as best as possible. That doesn’t mean you will cancel your projects, but you might structure them differently. In terms of information security, what I’m going to propose, as seen in bold, is that first and foremost, starting today, avoid putting sensitive information, whether in Bard or in ChatGPT. These are two solutions where information is passed to the creator. It’s not because of reasons like “We want to steal the information to dominate the world.” It’s to learn more about the usage patterns that are being captured in order to gain insights into the technology’s use. But we should avoid it because it poses a risk to the security of our company’s information. And personally, do not enter your social insurance numbers. There are many things we want to avoid doing.

[46min 05s]
I wonder if everyone knows my name based on my social insurance number. That’s not a good prompt. Also, the other element, that’s for B2C tools that are free. We know there is nothing free in this world. That’s an example. But for now, what I propose is to avoid using OpenAI’s APIs. Currently, OpenAI uses the information for retraining, so the current tools, for cases where information security is more critical, for example, when creating a chatbot where it’s really the end user who communicates. In that case, you cannot control the information that is disclosed. In this case, I would probably use other alternatives, such as using Google, using Microsoft directly. Professional services guarantee that the information will not leave our own environment. That’s much more reassuring. And then, here, I have a small proposal for companies that I’m starting to hear, which is a proposal that people are adopting, is to create their own generative AI platform. If we start seeing it more and more, here, the use case is very simple. If I work at Pratt & Whitney Canada and I start using ChatGPT, copy and paste, I want to validate spare parts, what are the instructions for this material, the spare parts needed for this particular repair?

Best practices for mitigating information security risks: avoid providing sensitive data to ChatGPT or Bard, avoid using OpenAI’s current APIs, use professional solutions (Microsoft, Google), and create your own generative AI platform.

Legal Risks and Copyright Considerations

[47min 51s]
It’s a good use case, but if you test it, it means you’ve just provided your engine’s technical specifications to OpenAI. You probably don’t want to do that. Instead, what can you do? You can use professional versions. Firstly, you could use professional versions, whether it’s Google or Microsoft, both offer these capabilities, and you create an interface, and that’s it. So, you’ve just created an interface, you can call it whatever you want. Pratgpt, you can have fun with it, add nice colors. You’ve just reduced a risk in using the technology. You’re not preventing it. The worst thing would be to prevent the use of generative AI because it’s impossible to prevent its use. Instead, you control the security around it, and you can even go further. Earlier, I was talking about a knowledge base. With my Pratgpt, I can give it access to all my technical documentation. That way, I can ask questions, and it provides accurate answers. You can create this interface and truly benefit your organization. That’s an example that, as such, solves many problems.

Creating your own generative AI through professional versions.

[49min 12s]
I’m a solution-oriented person, which is why I like proposing this solution because it’s so elegant. Ultimately, we’ll end up with legal risks. We talk about it a lot: copyright, plagiarism. Yes, there can be those issues. So, firstly, what I propose is to assess the risk of that happening. There are certain risks like sentiment analysis, where there are none. You’re asking if something is positive or negative based on information. So, there’s no risk there. In some cases, the risk will be zero or negligible, but in some cases, the risk will be very high. As a journalist, if I want help in creating my article, there’s a higher risk of plagiarism, intellectual property issues. If I want to automate code development, I might end up using portions that have a commercial license, which I shouldn’t be able to use. And you don’t know it because the output is so elegant that sometimes you can be caught off guard. In these extreme cases, what I propose is that, precisely in situations where there is a significant risk in terms of copyright, I would suggest avoiding the use of ChatGPT because GPT is one of the solutions currently available that hasn’t clarified that they only use publicly available information.

Best practices for mitigating legal risks: assess the risk of plagiarism and intellectual property theft, favor certain solutions depending on the training corpus used, and integrate tools for evaluating outputs.

Alternative Solutions for Copyright Risks

[50min 55s]
So their position is currently unclear. In these cases, it’s not very common, but still, when it happens, there’s GPT-3, there’s GPT-4, there’s PaLM, there are several other tools that can be used to address this issue. Additionally, you can approach the problem differently. Instead, you generate something and then have it validated. For example, from a journalistic perspective, there are tools that can check for plagiarism. So, you can add and refine your tool to minimize risks. I hope I haven’t put anyone to sleep talking about risks, but it’s important to me. So thank you for staying. I haven’t seen anyone yawning. I’ll give myself a pat on the back. In conclusion, I’ve said it before, and I’ll say it again. It’s important to get on board now, to embrace the technology. There are so many benefits that can be reaped.

[52min 08s]
With that, thank you very much for listening.

Canada’s bill C-27: what it is and how to take action to avoid costly impact

Several times I have written about the need for AI legislation, especially since the AI landscape sometimes seems like the wild west. Canada’s bill C-27 is one of the responses from the government that I expected.

In this article, I’ll explain what C-27 is, how your company will be impacted by it, and my take on its impact on the artificial intelligence field.

What is C-27?

C-27 is the Act to enact the Consumer Privacy Protection Act, the Personal Information and Data Protection Tribunal Act and the Artificial Intelligence and Data Act and to make consequential and related amendments to other Acts.

Wow, what a long-winded name! Essentially, C-27, also called the “Digital Charter Implementation Act”, is a Canadian bill that was tabled in June 2022 with the aim of protecting personal data about individuals. C-27 could be considered the equivalent of a “modernized” version of Europe’s GDPR (General Data Protection Regulation), with a broader scope, given that it covers AI systems’ trustworthiness and privacy rights.

This Act is comprehensive and applies to personal information that organizations collect, use, or disclose in the course of commercial activities; or personal information about an employee of, or an applicant for employment with, the organization and that the organization collects, uses, or discloses in connection with the operation of a federal work, undertaking, or business.

What C-27 means in plain English

Essentially, this act ensures that companies take seriously the privacy of the customer and employee information they collect, use, or disclose.

What are the key differences between GDPR and C-27?

Although they use different clauses and terms, C-27 basically covers the same rights as GDPR (rights to access, opt-out of direct marketing, data portability, or erasure). However, the scope of C-27 is broader, as it explicitly covers employee data.

C-27 also explicitly covers artificial intelligence applications since they use and generate data. More specifically, this Act will require that:

  • Consumers or employees impacted by an AI application can request clear and specific explanations related to the system prediction.
  • High-impact AI applications perform an evaluation of potential negative biases and unjustified discrimination that may negatively treat specific populations or individuals.
  • High-impact AI applications document risks associated with negative biases or harmful outcomes of the application, identify mitigation strategies, and demonstrate the monitoring of these risks.

Why should you care about Canada’s bill C-27?

First, it is a necessary piece of legislation to ensure that data about Canadian residents is kept secure. For example, only six months after the slow and painful implementation of GDPR in Europe, 44% of respondents in a Deloitte poll believed that organizations cared more about their customers’ privacy now that GDPR was in force. This is powerful.

However, this means that a considerable amount of work must be undertaken to comply with C-27. Almost half of all European organizations have made a significant investment in their GDPR compliance capabilities, and 70% of organizations have seen an increase in staff that are partly or entirely focused on GDPR compliance. However, 45% of these organizations are still not compliant with GDPR. According to the GDPR enforcement tracker, 1,317 fines have been issued since July 2018.

Is C-27 going to generate as much chaos for Canadian companies? Probably not. Canadian organizations have already started to adapt to this new era of data privacy. GDPR is not new anymore; it was announced in 2016 and took effect in May 2018. We have learned a lot since then. For example, 85% of Canadian organizations have already appointed a Chief Data Protection Officer (CDPO), and most third-party tools have adapted their products and services to respect data privacy.

In other words:

  • C-27 is going to be implemented. This is certain.
  • This is serious. In Europe, about 20% of individuals have already used their rights through the GDPR.
  • The more proactive you are, the more straightforward and painless your implementation will be.
  • It is not the end of the world. You can be compliant without spending millions of dollars.

All that said, you must start preparing your organization for the implementation of C-27.

Here are four actions you can take right now to be prepared for C-27

1. Control your data collection and management processes.

Maintain good data hygiene so that you will be able to better control personal data in your different tools, systems, and databases.

2. Start embracing data de-identification techniques to minimize the footprint of personal information in your organization.

A great way to limit the amount of personal data flowing into your databases is to limit its usage. This can be done by eliminating or reducing the number of databases, tables, and fields containing personal data, which will significantly reduce the complexity of complying with C-27. Here are a few de-identification techniques:

  • De-identify: modify personal information to reduce the chances that an individual can be directly identified from it.

    Hashing methods are an example of de-identification: business users cannot identify individuals from the hashed data, yet the IT and Security teams can still map the hashes back to identifiable data if required (for example, through a protected key or lookup table). De-identification techniques are allowed if appropriate processes and policies are in place to safeguard them.

    In AI systems, de-identified data retains some predictive power. For example, even without knowing the exact zip code, a model can learn that individuals sharing the same hashed zip code have similar characteristics. However, that predictive power is limited compared to the actual data. For example, it is impossible to calculate the distance between zip codes if they are hashed.
  • Anonymize: modify personal information irreversibly and permanently in accordance with generally accepted best practices to ensure that no individual can be identified from the information, whether directly or indirectly, by any means.

    This is a rigorous privacy method that should not be the default in a data science strategy. By default, organizations should de-identify the data as much as they can and only use anonymization when there is no other choice. For example, free form texts and call transcriptions can contain very private and identifiable information that is quite complex to de-identify. In those cases, anonymization is required.
  • Generate synthetic data: create completely fake and realistic data based on existing data so that it is possible to develop analytics and AI applications without risking privacy issues.

    Nowadays, many tools and algorithms let organizations generate realistic synthetic data without jeopardizing real personal data. This technique enables organizations to build AI applications with any type of data, identifiable or not, on tabular, text, or even image data.

    Accenture reports that even brain MRIs will soon be generated synthetically by some organizations, reducing potential security breaches, and enabling more transformative projects given that the data is less restrictive. Generating synthetic data is critical for this use case because the brain structure is unique, and an MRI scan can be used to identify an individual. Therefore, under typical privacy policies, using this identifiable data can be risky and usually would be prohibited or discouraged by organizations. Synthetic data opens the door to opportunities of generating value more easily while mitigating privacy risks.
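The hashing approach described under de-identification above can be sketched as follows. This is a minimal illustration, not a compliance recipe: the records, the key, and the idea of keeping a separate lookup table for authorized re-identification are assumptions made for the example. It uses a keyed hash (HMAC-SHA-256 from the Python standard library) so that someone without the key cannot simply hash guessed values and compare.

```python
import hashlib
import hmac

# Assumption: the key is held by the IT/Security team, away from business users.
SECRET_KEY = b"replace-with-a-key-from-your-secrets-manager"


def pseudonymize(value, key=SECRET_KEY):
    """Return a keyed SHA-256 digest (HMAC) of a personal value."""
    return hmac.new(key, value.encode("utf-8"), hashlib.sha256).hexdigest()


# Hypothetical records containing personal information.
records = [
    {"customer": "Alice", "zip_code": "12345"},
    {"customer": "Bob", "zip_code": "12345"},
    {"customer": "Carol", "zip_code": "67890"},
]

# What business users and AI systems get to see.
deidentified = [
    {"customer": pseudonymize(r["customer"]), "zip_code": pseudonymize(r["zip_code"])}
    for r in records
]

# Lookup table kept under restricted access for authorized re-identification.
reidentification_table = {pseudonymize(r["customer"]): r["customer"] for r in records}

# Two customers in the same zip code still share the same hashed value,
# so a model can group them without ever seeing the real zip code.
print(deidentified[0]["zip_code"] == deidentified[1]["zip_code"])
```

As the text notes, the grouping signal survives but anything that depends on the real value, such as the distance between two zip codes, is lost.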

You will also need to strengthen your security measures and demonstrate that your physical, organizational, and technical safeguards adequately protect personal data. A good first step is to document an ISP (information security policy). Along the way, you might discover irregularities that you will have to manage. Here is a link to some handy templates from SANS.

In conclusion, selecting the right strategy for de-identifying your data is key. Please be careful not to be too restrictive as deleting personal information can restrict the value you can derive from analytics and AI applications. Here is a useful resource from EDUCAUSE to guide you through this exercise.

3. Explicability is becoming a must when building any AI system.

Not only will individuals have the right to understand the reasons behind predictions, but it is also a helpful tool to validate the quality of your AI system.

Are the requirements for explicability restraining organizations from using more sophisticated AI and machine learning algorithms?

No. In fact, over the past decade, the academic community has collaborated to create tools and techniques that generate explanations for potentially very complex algorithms. Nowadays, the challenge comes not from the explainability itself but from explaining the reasons behind the prediction in simple terms. Good User Experience will be required to make the explanations meaningful.
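To illustrate the step from a raw explanation to simple terms, here is a minimal sketch. It assumes a hypothetical linear scoring model with made-up weights, where each feature’s contribution is simply its weight multiplied by its value; more complex algorithms would need dedicated explanation techniques, but the last step of phrasing the result for end users stays the same.

```python
# Hypothetical weights of a linear scoring model (illustrative values only).
WEIGHTS = {"income_k": 0.5, "late_payments": -4.0, "years_as_customer": 1.5}
BIAS = 10.0


def predict(features):
    """Score = bias + sum of weight * value for each feature."""
    return BIAS + sum(WEIGHTS[name] * value for name, value in features.items())


def explain(features, top_n=2):
    """Describe the top_n features that moved the score the most."""
    contributions = {name: WEIGHTS[name] * value for name, value in features.items()}
    ranked = sorted(contributions.items(), key=lambda item: abs(item[1]), reverse=True)
    sentences = []
    for name, contribution in ranked[:top_n]:
        direction = "raised" if contribution > 0 else "lowered"
        sentences.append(f"{name} {direction} the score by {abs(contribution):.1f} points")
    return sentences


applicant = {"income_k": 60, "late_payments": 3, "years_as_customer": 4}
score = predict(applicant)
explanation = explain(applicant)
print(score)
for sentence in explanation:
    print(sentence)
```

A UX designer would typically replace the raw feature names with wording that end users recognize.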

4. Ethical issues and negative bias risk management are other issues that organizations must tackle with C-27.

More concretely, organizations will have to take a risk management approach, which consists of listing potential risks, estimating likelihoods and impacts, and then establishing mitigation plans. This is a simple yet efficient mechanism to manage most risks in an AI project.
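The risk management approach above (list risks, estimate likelihood and impact, plan mitigations) can be kept in something as simple as a small risk register. The sketch below is illustrative only: the risks, the 1-to-5 scales, and the likelihood-times-impact score are assumptions, not requirements taken from C-27.

```python
from dataclasses import dataclass


@dataclass
class Risk:
    description: str
    likelihood: int  # assumed scale: 1 (rare) to 5 (almost certain)
    impact: int      # assumed scale: 1 (minor) to 5 (severe)
    mitigation: str

    @property
    def score(self):
        """Simple priority score: likelihood multiplied by impact."""
        return self.likelihood * self.impact


# Hypothetical entries for an AI project.
risk_register = [
    Risk("Model discriminates against a population", 2, 5, "Run bias audits before each release"),
    Risk("Personal data leaks through logs", 3, 4, "De-identify data before logging"),
    Risk("Predictions cannot be explained to users", 4, 2, "Add plain-language explanations"),
]

# Highest score first, so the team addresses the most pressing risks early.
prioritized = sorted(risk_register, key=lambda risk: risk.score, reverse=True)
for risk in prioritized:
    print(risk.score, risk.description, "->", risk.mitigation)
```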

To get you started, some actors in the industry have created very useful resources that allow you to complete a self-assessment. Here are 2 useful resources to identify and address ethical and negative bias risks:

  • Here is an excellent resource that lists and describes the most relevant risks for an AI system. This work aims to contribute by identifying relevant sources of risk for AI systems. For this purpose, the differences between AI systems, especially those based on modern machine learning methods, and classical software were analyzed, and the current research fields of trustworthy AI were evaluated.

    A taxonomy could then be created that provides an overview of various AI-specific sources of risk. These new sources of risk should be considered in the overall risk assessment of a system based on AI technologies, examined for their criticality, and managed accordingly at an early stage to prevent a later system failure.
  • OBVIA has partnered with Forum IA Québec to create an excellent reflexivity grid on the ethical issues of artificial intelligence systems (this tool is only available in French for the moment). Presented in the form of a questionnaire with open-ended answers, this grid was designed to help team members who design, implement, and manage AI Systems consider the ethical issues arising from the development and use of these new technologies.

    This grid is part of a participatory research and aims to develop useful ethical tools for practitioners. It is intended to be constantly evolving in light of the needs and experiences of the actors likely to use it.

    I think that self-assessment tools like this one are the way to go, as they ensure a certain rigor in the assessment while making the process less painful for end users.

C-27 will come with an extensive and strict set of requirements

In conclusion, C-27 will come with an extensive and strict set of requirements. Although it is for the greater good, organizations will need to put strong effort into their preparations. There are smart ways to be compliant while not jeopardizing your innovation process; purging all your data or abandoning AI and analytics applications is not a valid option. The silver lining in this situation is that the solutions for complying with C-27 are opportunities to generate additional value.

By controlling your data collection and management process, you will gain maturity, and this should positively impact data collection and quality.

By using de-identification techniques, resorting to anonymization only when necessary, and generating synthetic data, you will significantly reduce security risks while pursuing AI applications that seemed too risky before. This will help change management. Synthetic data can also be used to produce larger datasets, which will help build performant AI applications.

By investing in explicability for your AI applications, you will not only comply with C-27 but will also significantly reduce validation and change management efforts, as end users and stakeholders can be reassured when explanations line up with their reality.

Finally, by evaluating and acting upon ethical and negative bias risks, you ensure that your organization does not discriminate against consumers or employees, which can be catastrophic from a legal, reputational, and societal standpoint.

C-27 is good for the population and will help organizations make better use of their data.

Nine Critical Roles for a Successful AI Project

Delivering an AI system is hard. This is why you need a high-performance team with complementary talents and skills.

What is the optimal team to deliver AI projects successfully? It is a complex question to answer. It will vary depending on the type of project and the type of organization you work for. Using the learnings I gathered from mentoring many companies in different industries, here are nine critical roles that make or break AI projects.

An employee can fill multiple roles based on their skill set. The idea here is that your project team covers them all to maximize your chances of a successful project.

Nine critical roles in an AI project team

Composition by role of an AI project team

Data science

Data science is a no-brainer discipline in an AI project. We constantly hear that a Data Scientist is the only role you need to build an AI solution. First, it is not, and second, many activities need to happen for an AI project to succeed. A single Data Scientist is not equipped to address them all. I don’t believe in unicorns.

I usually split data science activities into these 3 separate roles:

Data analysis:

Data is the oil of an AI system. It is paramount to properly analyze the data used to train the AI system. Without oil, your engine may seize up permanently.

Machine learning (ML) modelling:

Once the data has been defined and cleaned, a machine learning model will be trained, validated, and tested to generate predictions. You can then experiment and hopefully incorporate your predictions into a system.
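The train, validate, and test steps rely on keeping three separate subsets of the cleaned data. Here is a minimal sketch of such a split using only the standard library; the 70/15/15 proportions, the fixed seed, and the `split_dataset` helper are assumptions chosen for illustration.

```python
import random


def split_dataset(rows, train=0.7, validation=0.15, seed=42):
    """Shuffle the rows reproducibly and cut them into three disjoint subsets."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    n_train = round(len(shuffled) * train)
    n_validation = round(len(shuffled) * validation)
    return (
        shuffled[:n_train],                            # used to fit the model
        shuffled[n_train:n_train + n_validation],      # used to tune and compare models
        shuffled[n_train + n_validation:],             # held out for the final check
    )


rows = list(range(100))  # stand-in for 100 cleaned data records
train_set, validation_set, test_set = split_dataset(rows)
print(len(train_set), len(validation_set), len(test_set))
```

Keeping the test set untouched until the end is what makes its result a fair estimate of how the predictions will behave once incorporated into a system.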

User Experience (UX) design:

UX design is one of the most overlooked parts of an AI system. In my opinion, it is so critical. How will the end-user access the output of the system? How do you make sure that they will understand and trust the results of the system? The UX designer can also work on the explainability of the model and translate it into an understandable non-technical language.

Development

Unfortunately, development is still underestimated. However, many hours of software development are required to develop and deploy an operational AI system. You will quickly realize how much non-machine-learning infrastructure, and how many processes and tools, are needed to have a functional solution.

Enhanced AI System Composition, from Hidden Technical Debt in Machine Learning Systems

You need specific expertise to operationalize AI and build a robust system around the data, machine learning models, and UX created by your data science-focused roles.

Solution architecture

As you can see in the figure above, many hardware and software elements are required to build a system. This skill set is critical for designing a sound software architecture for the AI system, one that meets the end-user requirements.

Database and software development

Breaking news: an AI solution is a software solution. A specific one, but a software solution nonetheless. Hence the robustness and efficiency requirements for databases, scripts, and APIs. If you only rely on Data Scientists to deliver an AI solution, you will be disappointed, considering that few Data Scientists master both software development and data science. Again, I don’t believe in unicorns.

Solution operationalization

Solution operationalization for an AI solution is a combination of DevOps and MLOps. DevOps is a set of practices that aims to shorten the development life cycle and provide continuous delivery with high software quality. Comparatively, MLOps is the process of automating and productionalizing machine learning applications and workflows (Source: phdata).

DevOps cycle (source: MLOps vs. DevOps: What is the Difference)

Business

Any AI project within an organization should have strong business considerations, as technology cannot solve any problem unless it is aligned with the business reality.

Industry knowledge

This is the most critical role in an AI project. Yes, a non-technical role. Product Owner (or PO) is a common name for this role. A great PO can generate immediate benefits while mitigating risks by developing business rules and heuristics and shaping the AI project’s business requirements. The PO also ensures the project team learns industry knowledge critical to the AI solution and stays aligned with the business stakeholders throughout the project.

Pareto Principle in AI Delivery

Project management

Simply put, most of the problems you will encounter in an AI project can be dealt with if you manage the project right. Project Managers guide the team so that it delivers high-quality projects that meet the business requirements given the timelines and the budget. It’s a fine line to walk, so I suggest you look for experience when hiring a project manager.

Change management

You can build the best AI system in the history of humanity, but if no one uses it, you just lost an opportunity. And money. And time. Communication, training, and support during user testing are vital activities to ensure maximum adoption from stakeholders and end-users.

The success of a project

The 9 proposed roles cover the entirety of an AI project. Identifying who will occupy each of these roles at the beginning of the project increases the chances of success.

At Moov AI, our project teams are typically composed of 5 colleagues who cover all of these roles and divide the tasks to be accomplished. As our teams are self-managed, they ensure that external stakeholders cover these roles if no one on the team has the skills to perform them.

Do you see other roles that should be covered in this article? Please feel free to comment above!

The Hitchhiker’s Guide to a Successful Data Science Practice

When we speak about data science in general, one of the biggest problems is the sheer lack of data scientists. According to a study published by McKinsey Global Institute, the U.S. economy could be short as many as 250,000 data scientists by 2024. That’s A LOT of people. And this is, of course, if we continue to grow at a steady rate.

Why exactly are data scientists so critical and scarce in today’s businesses? The definition of a data scientist according to DJ Patil, the former U.S. Chief Data Scientist, could provide us with some insights into the issue:

“A data scientist is that unique blend of skills that can both unlock the insights of data and tell a fantastic story via the data”

DJ Patil, former U.S. Chief Data Scientist

Although this definition is clear and succinct, it also points to all the problems leading to the data scientist shortage.

The 4 Commandments on How to Avoid a Data Scientist Shortage

We Shall Not Keep Our Skills to Ourselves

Keeping this unique blend of skills to ourselves, in the long run, hurts us more than it helps. It’s like any other technical skill: if you keep to yourself and refuse to share your knowledge, you end up working alone.

The data culture in your company will stay as it is so long as your team is seen as an isolated part of your organization.

In order for a company to become more data driven and to build a data science culture, we need to be generous with our knowledge. If we don’t, data science cannot grow organically within the company and will continue being a niche for specialists only. Remember what your mothers told you, sharing is caring.

We Shall Not Become Jacks of All Trades, Masters of None

A “unique blend of skills” suggests a lack of specialization, which can cause problems down the road. For example, a lack of business skills can cause us to work long, hard hours to solve the wrong problems.

On the other hand, a lack of statistical skills can lead to biased conclusions, steering our business in a completely erroneous direction. In order to become a proficient data scientist, it is imperative that our overall business intelligence and our more technical skills stay sharp and specialized to the tasks at hand.

Even the Einsteins of data science would be left helpless without proper knowledge of the domain in which they are operating.

We Shall Not Define Standards

There are no standards or criteria that make a great data scientist. It depends on the type of work to be done. The blend of skills is so unique to every opportunity that finding which data science professional will be best suited for a particular task is becoming a real issue.

The proof is in the pudding. You need to hire a data scientist who has solved similar issues to your own. Are you willing to hire junior profiles? Even junior data scientists should have practiced on generic datasets.

Lack of practice should no longer be an excuse thanks to services like Kaggle and Google Datasets.

Moreover, it is easy for data analysts and other data professionals to call themselves data scientists in order to surf the hype wave. Don’t get me wrong, all data science professions are vital in a data-driven business, but it’ll inevitably hurt your culture and performance if you have the wrong employee profile in the driver’s seat.

We Shall Not Take All the Heat (It’s Their Fault)

The fact that organizations often blindly trust data scientists to unlock insights and provide business direction is plain old-fashioned risky! Would you bet your company’s future on one group’s analysis? Nope! Me neither.

A good data science project is a business project first. This means that, without business vision and an understanding of the business processes, the project might not even be useful at all.

Therefore, without executive and business support, a data science project might not be relevant to the business in the slightest.

However, once management and the business are on board, if the data department is not involved in the implementation, the project may end up as nothing more than a statistical model on a computer using extracted data!

Typically, to put a data science project in place, you have to connect to the data sources in real time or in frequent batches, develop an API to run the statistical model on the new data and push the API’s predictions to a separate system, such as a CRM or ERP. Sometimes, these preliminary steps can take more time than the rest of the data science project.

Guidelines to Operationalize Data Science in Your Company

Now that you understand the commandments, here is the bible.

Well Defined Roles

It is completely fine, even ideal, to have a data scientist in your company. Data scientists bring innovation and a different set of skills compared to other roles. However, this role might need to change over time.

A data scientist is usually more effective working in an innovation team, helping to deliver proofs of concept. It is well known that data scientists tend to seek variety and project ownership. The data scientist also needs to train the organization on how to become more data driven in general.

A common way to make this happen is to develop a Center of Excellence or a Data Science Practice. This enables other analysts and data developers to become proficient in data science within their own teams while working on innovation projects.

Don’t Fear the Business Analyst

For regular day-to-day data science operations, business analysts should get the bulk of the responsibilities. Even if they lack certain statistical concepts and coding principles, many tools and training programs are available to build the skills needed to get the job done.

Moreover, the BA’s business background and network within the company give them an advantage: the ability to kick-start projects with ease. With all the existing tools, I find that the toughest responsibility in data science is to correctly understand the problem at hand.

Albert Einstein once said:

“If I were given one hour to save the world, I would spend 59 minutes defining the problem and one minute solving it.”

Fortunately, one of the best traits of the Business Analyst is to be good at solving problems.

Deploy Machine Learning with Your Actual Developers

For more complex algorithms and machine learning deployment, things should be left to your more seasoned developers. Keras, fast.ai, AutoML and other solutions are game changers, as they are easy to understand and optimize without in-depth knowledge of linear algebra or statistics.

As long as the right methodologies and techniques are used, results will come quicker, the code will be of higher quality, and the solution will be easier to monitor and maintain. It is also important to know when to use a machine learning algorithm and when other, more basic methods will do. Developers should be able to use the right tools at the right times.

A common pitfall that working with developers helps avoid is failing to set boundaries for the data science project.

For instance, if the statistical model performs poorly in some cases, instead of working on it for months to make it generalize (sometimes it is simply not possible), a developer might create different flows to deal with the uncertainty in the result.

Data Science is a Team Sport

What I mean to say in this article is that data scientists are currently being misused in a lot of companies. They are perhaps the only ones who understand the full process of insight generation and machine learning pipeline creation.

Expecting a data scientist to handle tasks that are outside their scope can create a lot of stress and frustration and can lead to a higher churn rate.

To become less dependent and more efficient, it is important to take data science projects seriously by putting a proper structure in place, backed by management support and by the interest of the business as a whole. If these ingredients are present, you will most likely see quick project adoption.

Moreover, the data science practice needs business analysts, developers and data developers, integrated with business units and IT functions, to make the most out of each project.

Open Data Science Conference West 2018: A Summary

Open Data Science Conference West 2018 in San Francisco

I am back from the Open Data Science Conference (ODSC West) in California. What a blast! Not only was I able to give my talk on the democratization of AI, but I also learned a lot of very interesting stuff!

I honestly am impressed by the projects and technologies presented throughout the week. The state of data science is way more advanced in 2018 than it was a year ago.

Here’s a quick summary of what I found the most interesting. I have decided to keep the summary succinct, so I will only talk about the techniques and concepts presented that I know best. I’ve grouped my learnings into two categories: business concepts, and technologies and tools.

Business Concepts

How to Retain Data Scientists in Your Company

We learned that a data scientist doesn’t stay in the same company for more than 18 months on average! According to Harvard Business Review, data scientists need support from their manager and organization, which is often not easy in the early stages of building a team. They also need ownership, which is hard to keep track of since data science is often seen as a support function for organizations.

Finally, data scientists are also seeking purpose. Unfortunately, companies that lack vision tend to give data scientists either no direction or very low-level tasks. These behaviors often drive employee churn.

My take on this: It is critical to keep them motivated. Data Science teams are often created organically by companies that want to benefit from the high number of opportunities that you can tackle with AI. However, it is critical to keep in mind that a Data Science team, like any team, needs a vision, a solid structure, and coaches in place to make sure that the data science professionals thrive.

Data Science Goes Beyond Statistical Models

Implementing TensorFlow or scikit-learn models is an important ability, but a data scientist must master business and mathematical concepts to have the biggest impact on the businesses they work for.

My take on this: Being a Data Scientist is about using the right tool to solve a problem. If the Data Scientist does not understand the business problem and does not understand the statistical implications of it, it’s not going to work. Data cleanliness and data science methodology are also important concepts to master in order to have performant and relevant statistical models.

The Hype is Real

The question is not even about proving the usefulness of A.I., but rather making the outside world’s expectations more realistic.

A lot of A.I. use cases are very successful, confirming that the hype is not going to disappear. For example, identifying fake news, predicting a reader’s feelings while reading a New York Times article, detecting behavioral health disorders in children early, and a very advanced image recognition tool were among the great projects presented.

My take on this: The question is not even about proving the usefulness of A.I., but rather making the outside world’s expectations more realistic. A common trait between these projects is that current technologies can already deliver good performance. It is less about technology and more about resolving a real problem.

It Is All about Data

Data is the real competitive advantage going forward, as statistical models are shared amongst the community. Because the AI research community keeps growing and is pretty much open source, months and years of research work quickly become available to everyone.

Any data scientist can then use these tools to develop best-in-class statistical models to solve their problems. This means that statistical models tend to be similar throughout the community. However, as machine learning is a set of statistical algorithms which identify and generalize patterns from already observed data points, a large, clean dataset is the best way to get more out of a statistical model than your competitors do.

My take on this: This couldn’t be more relevant. However, the truth is that most companies have a looooong way to go. Still, if you want to quickly and smartly invest in your data, there are techniques discussed below to help you do so. These techniques were demonstrated to augment or clean your dataset. Some companies present at the event also offer labelling services. This is a gold mine for companies who want to get started with data science.

Technologies & Tools

Data Augmentation

Monte Carlo simulation and active learning are increasingly used successfully to prepare data in an agile fashion, or in cases where data isn’t abundant enough.

Regarding Markov chain Monte Carlo (MCMC), the real advantage is that it provides a serious way to augment your dataset even if you do not already have an extensive one (vs. a generative adversarial network, which already requires a good amount of data).

Note: it is, however, crucial to have all of the clusters and use cases present within the population before thinking of synthetically augmenting the dataset. Hamiltonian MCMC using PyMC3 is a great technique, as it handles multiple features while converging better than other similar techniques.

My take on this: While data quality is super important and it is always better to have a big dataset, that is not always possible. Monte Carlo allows companies with smaller datasets to augment them so that they can use more advanced models, when done with care that is. Also, some use cases like forecasting and logistics simulation are more efficient with this technique.
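To make the augmentation idea a bit more concrete, here is a minimal sketch using only the Python standard library. It is not the Hamiltonian sampler from PyMC3 discussed at the conference: I swapped in a plain random-walk Metropolis sampler to keep it short. The normal model, the flat prior on the mean, the fixed spread and the function name augment_with_mcmc are all assumptions made for illustration.

```python
import math
import random
import statistics

def augment_with_mcmc(data, n_synthetic, steps_per_draw=20, seed=0):
    """Draw synthetic points from the posterior predictive of a normal model.

    Illustrative assumptions: observations are normal, the spread is fixed at
    the sample standard deviation, and the prior on the mean is flat.
    """
    rng = random.Random(seed)
    sigma = statistics.stdev(data)
    step = sigma / math.sqrt(len(data))  # proposal scale, a rough heuristic

    def log_posterior(mu):
        # Flat prior, so this is the normal log-likelihood up to a constant.
        return -sum((x - mu) ** 2 for x in data) / (2 * sigma ** 2)

    mu = statistics.mean(data)  # start the chain at a sensible value
    current = log_posterior(mu)
    synthetic = []
    while len(synthetic) < n_synthetic:
        for _ in range(steps_per_draw):  # thin the chain between draws
            proposal = mu + rng.gauss(0, step)
            candidate = log_posterior(proposal)
            # Metropolis rule: always accept uphill moves, sometimes downhill ones.
            if rng.random() < math.exp(min(0.0, candidate - current)):
                mu, current = proposal, candidate
        # One synthetic observation per retained value of the mean.
        synthetic.append(rng.gauss(mu, sigma))
    return synthetic

small_dataset = [9.8, 10.1, 10.4, 9.9, 10.3, 10.0]
augmented = small_dataset + augment_with_mcmc(small_dataset, 200)
```

The synthetic points can only reflect what the small dataset already contains, which is why every cluster and use case needs to be present before augmenting.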

Transfer Learning Is the Way to Go

Transfer learning is the method of adapting existing, proven models to our needs. As these models were already trained by big corporations with a lot of resources, it is usually the best method in the use cases presented so far. You simply have to retrain the last layers on your own problem, et voilà!

My take on this: Models that are reusable via transfer learning are especially available for image recognition and Natural Language Processing. If you use those types of models, please try transfer learning, you’ll be blown away!

(More on) Active Learning

Many people were talking about active learning during the conference, but no simple framework is available online. So here is a super simplified series of steps to get you started.

  1. You label a number of observations
  2. You fit a shallow (simple) statistical model on the labeled data
  3. You predict labels for all of the observations that were not labeled
  4. You review a random sample of the predicted labels the model is least sure about (highest expected error)
  5. You re-label those observations
  6. You go back to step 2 and repeat until the labels you review are correct
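
The steps above can be sketched in a few lines of standard-library Python. This is a toy under stated assumptions, not a framework: the data is one-dimensional, the "shallow model" is one centroid per class, the human reviewer is simulated by a hypothetical ask_reviewer rule, and instead of a random sample I simply review the least confident predictions first.

```python
import random

def ask_reviewer(x):
    """Stand-in for the human labeler (hypothetical rule: positive above 0.6)."""
    return 1 if x > 0.6 else 0

def fit(labeled):
    """Step 2: a deliberately shallow model, one centroid (mean) per class."""
    centroids = {}
    for cls in (0, 1):
        xs = [x for x, y in labeled if y == cls]
        centroids[cls] = sum(xs) / len(xs)
    return centroids

def predict(centroids, x):
    """Step 3: predicted label plus a confidence score (gap between centroid distances)."""
    d0, d1 = abs(x - centroids[0]), abs(x - centroids[1])
    return (0 if d0 <= d1 else 1), abs(d0 - d1)

def labeling_loop(pool, seeds, rounds=5, batch=5):
    labeled = [(x, ask_reviewer(x)) for x in seeds]            # step 1
    unlabeled = [x for x in pool if x not in seeds]
    for _ in range(rounds):                                    # step 6: repeat
        centroids = fit(labeled)                               # step 2
        # Steps 3 and 4: score the pool and put the least confident first.
        unlabeled.sort(key=lambda x: predict(centroids, x)[1])
        to_review, unlabeled = unlabeled[:batch], unlabeled[batch:]
        labeled += [(x, ask_reviewer(x)) for x in to_review]   # step 5
    return fit(labeled), labeled

rng = random.Random(0)
pool = [rng.random() for _ in range(200)]
# The seeds must cover both classes, otherwise the first fit has nothing to learn from.
centroids, labeled = labeling_loop(pool, seeds=[min(pool), max(pool)])
```

With 5 rounds of 5 reviews, only 27 of the 200 observations ever reach the reviewer, and most of them sit near the boundary where the model is unsure.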

Some benefits were discussed as well. For sure, it is quicker to do this once you have manually labeled 2,000 observations than to label millions of observations by hand. Also, you ensure that labels are standardized (vs. hiring multiple people with different criteria). Finally, this is a great way to correct your model while it is live, when performance starts to degrade.

My take on this: One of the biggest hassles for the data science community is labeling data for supervised machine learning. Too much time is spent manually labelling customer support tickets, user profiles, pictures and texts… this is insane, especially considering that manually labelling thousands of items is boring and it is too easy to produce unstandardized results. The smart approach is to label a small batch of data points, use the technique above, and iterate until all data points are labeled.

Model Creation Is Just the Beginning

Without real production experience, people can think that machine learning is mainly about building a model from past observations and validating it successfully. However, experienced data scientists collect many scars once their models are in production. In fact, the real and complex work happens once your model is in production!

A data scientist at Tesla demonstrated how edge cases are critical and part of their testing process. Overall accuracy or loss minimization is clearly not sufficient. Tesla treats its edge cases as regular software delivery test scenarios that have to pass in order to update the models in production.
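
As a rough illustration of that idea (not Tesla's actual process, which was not shown in code), a release gate could refuse a candidate model that fails any edge case, however good its overall accuracy. The function name release_gate, the 0.9 accuracy threshold and the toy "suspicious amount" rules below are all hypothetical.

```python
def release_gate(model, edge_cases, holdout, min_accuracy=0.9):
    """Return (ok, failures): ok only if accuracy is high enough AND every edge case passes."""
    failures = [(x, expected) for x, expected in edge_cases if model(x) != expected]
    accuracy = sum(model(x) == y for x, y in holdout) / len(holdout)
    return accuracy >= min_accuracy and not failures, failures

# Toy example: flag a transaction amount as suspicious (True) or not (False).
edge_cases = [(1000, True), (0, False), (-50, True)]  # hypothetical business rules
holdout = [(10, False), (20, False), (5000, True), (1500, True), (300, False),
           (2000, True), (50, False), (999, False), (1200, True), (700, False)]

def candidate(amount):
    return amount > 1000  # perfect on the holdout, wrong on two edge cases

def fixed_candidate(amount):
    return amount >= 1000 or amount < 0

ok, failures = release_gate(candidate, edge_cases, holdout)
ok_fixed, failures_fixed = release_gate(fixed_candidate, edge_cases, holdout)
```

The first candidate scores 100% on the holdout set and is still rejected, which is exactly the point: the edge cases behave like unit tests, not like one more metric.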

Other data scientists talked about various sampling biases that caused models that performed very well on training data to be terrible in production. It is vital to make sure that all production use cases and clusters are present within the training dataset. It is also important to compare the distributions of the two datasets to make sure that their values match quite closely.
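
One simple way to compare the two distributions, feature by feature, is the two-sample Kolmogorov-Smirnov statistic: the largest gap between the two empirical cumulative distributions. The sketch below is a plain-Python version for illustration; in practice a library routine such as scipy.stats.ks_2samp does the same job and also returns a p-value. The 0.2 alert threshold is an arbitrary assumption, not a recommendation from the conference.

```python
from bisect import bisect_right

def ks_statistic(train, prod):
    """Largest gap between the two empirical CDFs (0 = identical, 1 = no overlap)."""
    a, b = sorted(train), sorted(prod)
    return max(
        abs(bisect_right(a, v) / len(a) - bisect_right(b, v) / len(b))
        for v in a + b
    )

def drifted_features(train_columns, prod_columns, threshold=0.2):
    """Names of the features whose training and production values diverge."""
    return [
        name for name in train_columns
        if ks_statistic(train_columns[name], prod_columns[name]) > threshold
    ]

train = {"age": [25, 32, 41, 38, 29, 45], "basket": [10, 12, 9, 11, 13, 10]}
prod = {"age": [26, 31, 40, 39, 30, 44], "basket": [30, 35, 28, 33, 31, 29]}
suspects = drifted_features(train, prod)
```

Here the basket values in production never overlap with the training ones, so that feature is flagged while age is not.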

One of Google’s engineers came to discuss the importance of presenting the output of a model. Even though you might automate the decision making, it is always a good idea to understand why the model predicted a case the way it did, for periodic validation.

My take on this: Even if you have a live model, it is always a good idea to review how it performs on real production data, so that you know all your hard work defining a good statistical model pays off once your project is live. You may be surprised by how different it is from your expectations. Also, you will quickly realize that without ways to interpret your model, you will get lost trying to make sense of it. Techniques such as LIME and SHAP will clearly help you interpret your model.

It’s about People, Processes, Data and Technologies

Open Data Science Conference was a great reminder of the most important aspects of data science: people, processes, data, technologies.

People

It is important to surround data scientists well and give them all the ingredients to thrive and really make a difference. If the vision is well defined, and they are well supported and work on interesting, strategic problems, data science professionals will enable the organization to be more data driven.

Processes

In statistics and in computer science, processes and methodologies are critical. Without a doubt, data science is no exception. So far, the well-known methodologies are the model definition and model validation processes, which are key to getting valid results.

Lately, as the data science community matures, more practices are being discussed, such as model DevOps, which consists of validating the statistical model’s accuracy and performance while it is in production. Active learning has also been discussed. Essentially, active learning is the process of adapting a model based on feedback.
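
As a small sketch of what validating accuracy in production can look like, the class below keeps a rolling window of recent predictions and their real outcomes and raises a flag when accuracy drops. The class name AccuracyMonitor, the window size and the alert threshold are assumptions for illustration, not something presented at the conference.

```python
from collections import deque

class AccuracyMonitor:
    """Tracks accuracy over the most recent production predictions."""

    def __init__(self, window=500, alert_below=0.8):
        self.outcomes = deque(maxlen=window)  # old results fall off automatically
        self.alert_below = alert_below

    def record(self, predicted, actual):
        """Store whether one production prediction matched the real outcome."""
        self.outcomes.append(predicted == actual)

    def accuracy(self):
        """Rolling accuracy, or None until at least one outcome is recorded."""
        return sum(self.outcomes) / len(self.outcomes) if self.outcomes else None

    def needs_review(self):
        """True when the rolling accuracy has dropped under the alert threshold."""
        acc = self.accuracy()
        return acc is not None and acc < self.alert_below

monitor = AccuracyMonitor(window=4, alert_below=0.75)
for predicted, actual in [("churn", "churn"), ("stay", "stay"),
                          ("stay", "stay"), ("churn", "churn")]:
    monitor.record(predicted, actual)
healthy = monitor.needs_review()
for predicted, actual in [("stay", "churn"), ("stay", "churn")]:
    monitor.record(predicted, actual)
degraded = monitor.needs_review()
```

A flag like this is also a natural trigger for the feedback loop: the cases the model got wrong are the first ones worth reviewing and re-labeling.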

Data

The community now realizes that a performant statistical model really depends on its underlying data, which makes data the most important element of data science. Simulation and active learning are interesting and creative approaches to get a bigger dataset even if you do not have access to a lot of labeled data.

Technologies

Once again, new frameworks, tools and algorithms were presented. Research is important in this field. This year though, I was happy to hear about use cases that were successful using already existing technologies. Even using some of the newest algorithms, some projects had only small gains.

It is satisfying to know that current versions are good baselines in most cases. Imagine what you would be able to achieve once you master all four aspects of data science (people, processes, data, technologies) while working on strategic problems.