Canada’s bill C-27: what it is and how to take action to avoid costly impact

Several times I have written about the need for AI legislation, especially since the AI landscape sometimes seems like the wild west. Canada’s bill C-27 is one of the responses from the government that I expected.

In this article, I’ll explain what C-27 is, how your company will be impacted by it, and my take on its impact on the artificial intelligence field.

What is C-27?

C-27 is the Act to enact the Consumer Privacy Protection Act, the Personal Information and Data Protection Tribunal Act and the Artificial Intelligence and Data Act and to make consequential and related amendments to other Acts.

Wow, what a long-winded name! Essentially, C-27, also called the “Digital Charter Implementation Act”, is a Canadian bill introduced in June 2022 with the aim of protecting individuals’ personal data. C-27 could be considered the equivalent of a “modernized” version of Europe’s GDPR (General Data Protection Regulation), with a broader scope, given that it also covers AI systems’ trustworthiness alongside privacy rights.

This Act is comprehensive: it applies to personal information that organizations collect, use, or disclose in the course of commercial activities, as well as to personal information about an employee of, or an applicant for employment with, an organization, which the organization collects, uses, or discloses in connection with the operation of a federal work, undertaking, or business.

What C-27 means in plain English

Essentially, this act ensures that companies take seriously the privacy of the customer and employee information they collect, use, or disclose.

What are the key differences between GDPR and C-27?

Although they use different clauses and terms, C-27 covers essentially the same rights as GDPR (the rights to access, to opt out of direct marketing, to data portability, and to erasure). However, the scope of C-27 is broader, as it explicitly covers employee data.

C-27 also explicitly covers artificial intelligence applications since they use and generate data. More specifically, this Act will require that:

  • Consumers or employees impacted by an AI application can request clear and specific explanations of the system’s predictions.
  • High-impact AI applications must be evaluated for potential negative biases and unjustified discrimination that could result in unfair treatment of specific populations or individuals.
  • High-impact AI applications must document the risks associated with negative biases or harmful outcomes, identify mitigation strategies, and demonstrate that these risks are monitored.

Why should you care about Canada’s bill C-27?

First, it is a necessary legislative document to ensure that data about Canadian residents is kept secure. For example, only six months after the slow and painful implementation of GDPR in Europe, 44% of respondents to a Deloitte poll believed that organizations cared more about their customers’ privacy now that GDPR was in force. This is powerful.

However, this means that a considerable amount of work must be undertaken to comply with C-27. Almost half of all European organizations have made a significant investment in their GDPR compliance capabilities, and 70% of organizations have seen an increase in staff partly or entirely focused on GDPR compliance. Even so, 45% of these organizations are still not compliant with GDPR. According to the GDPR enforcement tracker, 1,317 fines have been issued since July 2018.

Is C-27 going to generate as much chaos for Canadian companies? Probably not. Canadian organizations have already started to adapt to this new era of data privacy. GDPR is not new anymore; it was announced in 2016 and took effect in May 2018. We have learned a lot since then. For example, 85% of Canadian organizations have already appointed a Chief Data Protection Officer (CDPO), and most third-party tools have adapted their products and services to respect data privacy.

In other words:

  • C-27 is going to be implemented. This is certain.
  • This is serious. In Europe, about 20% of individuals have already exercised their rights under the GDPR.
  • The more proactive you are, the more straightforward and painless your implementation will be.
  • It is not the end of the world. You can be compliant without spending millions of dollars.

All that said, you must start preparing your organization for the implementation of C-27.

Here are four actions you can take right now to be prepared for C-27

1. Control your data collection and management processes.

Maintain good data hygiene so that you will be able to better control personal data in your different tools, systems, and databases.

2. Start embracing data de-identification techniques to minimize the footprint of personal information in your organization.

A great way to limit the amount of personal data flowing into your databases is to limit its usage: eliminate or reduce the number of databases, tables, and fields containing personal data, which will significantly reduce the complexity of complying with C-27. Here are a few de-identification techniques:

  • De-identify: modify personal information to reduce the chances that an individual can be directly identified from it.

    Hashing methods are an example of de-identification: business users cannot identify individuals from the hashed data, but the IT and Security teams can still map the hashes back to identifiable data if required (see the short sketch after this list). De-identification techniques are allowed as long as appropriate processes and policies are in place to safeguard them.

    In AI systems, de-identified data still carries predictive power. For example, even without knowing the exact zip code, individuals from zip code 12345 will share similar characteristics. That predictive power is more limited than with the raw data, however; it is impossible, for instance, to calculate the distance between zip codes once they are hashed.
  • Anonymize: modify personal information irreversibly and permanently in accordance with generally accepted best practices to ensure that no individual can be identified from the information, whether directly or indirectly, by any means.

    This is a rigorous privacy method that should not be the default in a data science strategy. By default, organizations should de-identify the data as much as they can and only use anonymization when there is no other choice. For example, free-form texts and call transcriptions can contain very private and identifiable information that is quite complex to de-identify. In those cases, anonymization is required.
  • Generate synthetic data: create completely fake and realistic data based on existing data so that it is possible to develop analytics and AI applications without risking privacy issues.

    Nowadays, many tools and algorithms let organizations generate realistic synthetic data without jeopardizing real personal data. This technique enables organizations to build AI applications with any type of data, identifiable or not, on tabular, text, or even image data.

    Accenture reports that even brain MRIs will soon be generated synthetically by some organizations, reducing potential security breaches, and enabling more transformative projects given that the data is less restrictive. Generating synthetic data is critical for this use case because the brain structure is unique, and an MRI scan can be used to identify an individual. Therefore, under typical privacy policies, using this identifiable data can be risky and usually would be prohibited or discouraged by organizations. Synthetic data opens the door to opportunities of generating value more easily while mitigating privacy risks.
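To make the hashing idea above concrete, here is a minimal sketch in Python. The use of pandas and a salted SHA-256 hash is my own assumption for illustration; the Act does not prescribe any specific technique.

```python
import hashlib
import pandas as pd

# Hypothetical customer table containing direct identifiers.
df = pd.DataFrame({
    "email": ["alice@example.com", "bob@example.com"],
    "zip_code": ["12345", "12345"],
    "purchases": [3, 7],
})

SALT = "a-secret-managed-by-security"  # kept by IT/Security, who can re-map hashes if required

def pseudonymize(value: str) -> str:
    """Salted SHA-256 hash: business users cannot identify an individual from it."""
    return hashlib.sha256((SALT + value).encode()).hexdigest()[:16]

df["email"] = df["email"].map(pseudonymize)
df["zip_code"] = df["zip_code"].map(pseudonymize)  # identical zip codes still share a hash,
print(df)                                          # so grouping patterns remain usable
```

Note that the hashed zip codes can still be grouped for modelling, but distances between zip codes are lost, as discussed above.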

You will also need to strengthen your security measures to demonstrate that your physical, organizational, and technical safeguards adequately protect personal data. A good first step is to document an ISP (information security policy). In doing so, you might discover irregularities that you will have to manage. Here is a link to some handy templates from SANS.

In conclusion, selecting the right strategy for de-identifying your data is key. Be careful not to be too restrictive: deleting personal information can limit the value you can derive from analytics and AI applications. Here is a useful resource from EDUCAUSE to guide you through this exercise.

3. Explainability is becoming a must when building any AI system.

Not only will individuals have the right to understand the reasons behind predictions, but it is also a helpful tool to validate the quality of your AI system.

Do the explainability requirements prevent organizations from using more sophisticated AI and machine learning algorithms?

No. In fact, over the past decade, the academic community has collaborated to create tools and techniques that generate explanations for potentially very complex algorithms. Nowadays, the challenge comes not from the explainability itself but from explaining the reasons behind a prediction in simple terms. Good user experience will be required to make the explanations meaningful.
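As an illustration of such tools, here is a minimal sketch using SHAP with a scikit-learn model on synthetic data. The choice of library and model is mine, not something the bill mandates.

```python
import numpy as np
import shap
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier

# Toy data standing in for customer features (entirely synthetic).
X, y = make_classification(n_samples=500, n_features=5, random_state=0)
model = GradientBoostingClassifier().fit(X, y)

# SHAP attributes each prediction to feature contributions, which a UX layer
# can then translate into plain language for the person affected.
explainer = shap.TreeExplainer(model)
shap_values = explainer.shap_values(X[:1])
print(dict(zip([f"feature_{i}" for i in range(5)], np.round(shap_values[0], 3))))
```

The raw contribution numbers are only the starting point; turning them into a sentence an impacted consumer can understand is the UX work mentioned above.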

4. Ethical and negative-bias risks are other issues that organizations must tackle under C-27.

More concretely, organizations will have to take a risk management approach, which consists of listing potential risks, estimating likelihoods and impacts, and then establishing mitigation plans. This is a simple yet efficient mechanism to manage most risks in an AI project.
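To make the approach concrete, here is a minimal sketch of a risk register in Python. The risks, scores, and mitigations below are hypothetical examples, not requirements from the Act.

```python
# Each risk gets a likelihood and an impact on a 1-5 scale; their product prioritizes them.
risks = [
    {"risk": "Training data under-represents a demographic group",
     "likelihood": 4, "impact": 5,
     "mitigation": "Re-balance the dataset and monitor error rates per subgroup"},
    {"risk": "Model drifts after deployment",
     "likelihood": 3, "impact": 4,
     "mitigation": "Monitor input distributions and schedule periodic re-training"},
]

for r in sorted(risks, key=lambda r: r["likelihood"] * r["impact"], reverse=True):
    score = r["likelihood"] * r["impact"]
    print(f'{score:>2}  {r["risk"]} -> {r["mitigation"]}')
```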

To get you started, some actors in the industry have created very useful self-assessment tools. Here are two of them to help you identify and address ethical and negative-bias risks:

  • Here is an excellent resource that lists and describes the most relevant sources of risk for an AI system. The authors analyzed the differences between AI systems, especially those based on modern machine learning methods, and classical software, and evaluated the current research fields of trustworthy AI.

    From this analysis, they built a taxonomy that provides an overview of various AI-specific sources of risk. These sources of risk should be considered in the overall risk assessment of a system based on AI technologies, examined for their criticality, and managed at an early stage to prevent a later system failure.
  • OBVIA has partnered with Forum IA Québec to create an excellent reflexivity grid on the ethical issues of artificial intelligence systems (this tool is only available in French for the moment). Presented in the form of a questionnaire with open-ended answers, this grid was designed to help team members who design, implement, and manage AI Systems consider the ethical issues arising from the development and use of these new technologies.

    This grid is part of a participatory research effort and aims to develop useful ethical tools for practitioners. It is intended to evolve constantly in light of the needs and experiences of the actors likely to use it.

    I think that self-assessment tools like this one are the way to go, as they ensure a certain rigor in the assessment while making the process less painful for end users.

C-27 will come with an extensive and strict set of requirements

In conclusion, C-27 will come with an extensive and strict set of requirements. Although it is for the greater good, organizations will need to put real effort into their preparations. There are smart ways to be compliant without jeopardizing your innovation process; purging all your data or abandoning AI and analytics applications is not a valid option. The silver lining is that the solutions needed to comply with C-27 are also opportunities to generate additional value.

By controlling your data collection and management process, you will gain maturity, and this should positively impact data collection and quality.

By using de-identification techniques, anonymization only when necessary, and synthetic data generation, you will significantly reduce security risks while pursuing AI applications that previously seemed too risky. This will also help change management. Synthetic data can be used to produce larger datasets, which helps build performant AI applications.

By investing in explainability for your AI applications, you will not only comply with C-27 but also significantly reduce validation and change management efforts, since end users and stakeholders are reassured when explanations line up with their reality.

Finally, by evaluating and acting upon ethical and negative bias risks, you ensure that your organization does not discriminate against consumers or employees, which can be catastrophic from a legal, reputational, and societal standpoint.

C-27 is good for the population and will help organizations make better use of their data.

Nine Critical Roles for a Successful AI Project

Delivering an AI system is hard. This is why you need a high-performance team with complementary talents and skills.

What is the optimal team to deliver AI projects successfully? It is a complex question to answer, and the answer will vary depending on the type of project and the type of organization you work for. Based on the learnings and project delivery experience I have gathered from coaching many companies in different industries, here are the nine critical roles that make or break AI projects.

An employee can fill multiple roles based on their skillset. The idea is that your project team covers them all to maximize your chances of a successful project.

Nine critical roles in an AI project team

Composition by role of an AI project team

Data science

Data science is a no-brainer discipline in an AI project. We constantly hear that a Data Scientist is the only role you need to build an AI solution. First, that is not true; second, many activities need to happen for an AI project to succeed, and a single Data Scientist is not equipped to address them all. I don’t believe in unicorns.

I usually split data science activities into three separate roles:

Data analysis:

Data is the oil of an AI system, so it is paramount to properly analyze the data used to train it. Run the engine without oil, or with dirty oil, and it may seize altogether.

Machine learning (ML) modelling:

Once the data has been defined and cleaned, a machine learning model is trained, validated, and tested to generate predictions. You can then experiment and, hopefully, incorporate those predictions into a system.

User Experience (UX) design:

UX design is one of the most overlooked parts of an AI system, yet in my opinion it is critical. How will the end user access the output of the system? How do you make sure they will understand and trust its results? The UX designer can also work on the explainability of the model and translate it into understandable, non-technical language.

Development

Unfortunately, development is still underestimated. Many hours of software development are required to build and deploy an operational AI system, and you will quickly realize how much non-machine-learning infrastructure, and how many processes and tools, are needed for a functional solution.

Enhanced AI System Composition, from Hidden Technical Debt in Machine Learning Systems

You need specific expertise to operationalize AI and build a robust system around the data, machine learning models, and UX created by your data science-focused roles.

Solution architecture

As you can see in the figure above, many hardware and software elements are required to build a system. This skill set is critical for designing a software architecture that meets the end-user requirements.

Database and software development

Breaking news: an AI solution is a software solution. A specific one, but a software solution nonetheless. Hence the robustness and efficiency requirements for databases, scripts, and APIs. If you rely only on Data Scientists to deliver an AI solution, you will be disappointed: few Data Scientists master both software development and data science. Again, I don’t believe in unicorns.

Solution operationalization

Solution operationalization for an AI solution is a combination of DevOps and MLOps. DevOps is a set of practices that aims to shorten the development life cycle and provide continuous delivery with high software quality. Comparatively, MLOps is the process of automating and productionalizing machine learning applications and workflows (source: phdata).

DevOps cycle (source: MLOps vs. DevOps: What is the Difference)

Business

Any AI project within an organization should have strong business considerations, as technology cannot solve any problem unless it is aligned with the business reality.

Industry knowledge

This is the most critical role in an AI project. Yes, a non-technical role. Product Owner (or PO) is a common name for this role. A great PO can generate immediate benefits while mitigating risks by developing business rules and heuristics and shaping the AI project’s business requirements. The PO also ensures the project team learns industry knowledge critical to the AI solution and stays aligned with the business stakeholders throughout the project.

Pareto Principle in AI Delivery

Project management

Simply put, most of the problems you will encounter in an AI project can be dealt with if you manage the project right. Project Managers guide the team so that it delivers high-quality projects that meet the business requirements given the timelines and the budget. It’s a fine line to walk, so I suggest you look for experience when hiring a project manager.

Change management

You can build the best AI system in the history of humanity, but if no one uses it, you have just lost an opportunity. And money. And time. Communication, training, and support during user testing are vital to ensure maximum adoption by stakeholders and end users.

The success of a project

The nine proposed roles cover the entirety of an AI project. Identifying who will fill each of these roles at the beginning of the project increases the chances of success.

At Moov AI, our project teams are typically composed of five colleagues who cover all of these roles and divide the tasks to be accomplished. Because our teams are self-managed, they ensure that external stakeholders cover these roles if no one on the team has the skills to perform them.

Do you see other roles that should be covered in this article? Please feel free to comment above!

The Hitchhiker’s Guide to a Successful Data Science Practice

When we speak about data science in general, one of the biggest problems is the sheer lack of data scientists. According to a study published by the McKinsey Global Institute, the U.S. economy could be short by as many as 250,000 data scientists by 2024. That’s A LOT of people. And this is, of course, if we continue to grow at a steady rate.

Why exactly are data scientists so critical and scarce in today’s businesses? The definition of a data scientist according to DJ Patil, the former US Chief Data Officer, could provide us with some insight into the issue:

“A data scientist is that unique blend of skills that can both unlock the insights of data and tell a fantastic story via the data”

DJ Patil, the former US Chief Data Officer

Although this definition is clear and succinct, it also hints at the problems behind the data scientist shortage.

The 4 Commandments on How to Avoid a Data Scientist Shortage

We Shall Not Keep Our Skills to Ourselves

Keeping this unique blend of skills to ourselves hurts us more than it helps in the long run. It’s like any other technical skill: if you keep it to yourself and refuse to share your knowledge, you end up working alone.

The data culture in your company will stay as it is so long as your team is seen as an isolated part of your organization.

For a company to become more data-driven and to build a data science culture, we need to be generous with our knowledge. If we don’t, data science cannot grow organically within the company and will remain a niche for specialists only. Remember what your mother told you: sharing is caring.

We Shall Not Become a Jack of All Trades, Master of None

A “unique blend of skills” suggests a lack of specialization, which can cause problems down the road. For example, a lack of business skills can lead us to work long, hard hours solving the wrong problems.


On the other hand, a lack of statistical skills can lead to biased conclusions, steering the business in a completely erroneous direction. To become a proficient data scientist, it is imperative that our overall business intelligence and our more technical skills stay sharp and specialized to the tasks at hand.

Even the Einsteins of data science would be left helpless without proper knowledge of the domain in which they are operating.

We Shall Not Define Standards

There are no standards or criteria that make a great data scientist; it depends on the type of work to be done. The blend of skills is so unique to every opportunity that finding which data science professional is best suited for a particular task is becoming a real issue.

The proof is in the pudding: you need to hire a data scientist who has solved issues similar to your own. Are you willing to hire junior profiles? Even junior data scientists should have practiced on generic datasets.

Lack of practice should no longer be an excuse thanks to services like Kaggle and Google Datasets.

Moreover, it is easy for data analysts and other data professionals to call themselves data scientists in order to surf the hype wave. Don’t get me wrong, all data professions are vital in a data-driven business, but it will inevitably hurt your culture and performance if you have the wrong employee profile in the driver’s seat.

We Shall Not Take All the Heat (It’s Their Fault)

The fact that organizations often blindly trust data scientists to unlock insights and provide business direction is plain, old-fashioned risky! Would you bet your company’s future on one group’s analysis? Nope! Me neither.

A good data science project is a business project first. This means that, without business vision and an understanding of the business processes, the project might not even be useful at all.

Therefore, without executive and business support, a data science project might not be relevant to the business in the slightest.

However, once management and the business are on board, if the data department is not involved in the implementation, things may be limited to a statistical model running on a computer with extracted data!

Typically, to put a data science project in place, you have to connect to the data sources in real time or in frequent batches, develop an API to run the statistical model on the new data, and push the API’s predictions to a separate system such as a CRM or ERP. Sometimes, these preliminary steps take more time than the rest of the data science project.
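As a rough illustration of that middle step, here is a minimal sketch of a prediction API. It assumes a scikit-learn model serialized with joblib and served with Flask; the model path and feature names are hypothetical, since the text does not prescribe any particular stack.

```python
from flask import Flask, jsonify, request
import joblib

app = Flask(__name__)
model = joblib.load("model.pkl")  # hypothetical path to the trained statistical model

@app.route("/predict", methods=["POST"])
def predict():
    payload = request.get_json()  # new data arriving in real time or in frequent batches
    features = [[payload["age"], payload["income"], payload["tenure"]]]  # hypothetical fields
    prediction = model.predict(features)[0]
    # In a real pipeline, this result would then be pushed to a CRM or ERP system.
    return jsonify({"prediction": float(prediction)})

if __name__ == "__main__":
    app.run(port=5000)
```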

Guidelines to Operationalize Data Science in Your Company

Now that you understand the commandments, here is the bible.

Well Defined Roles

It is completely fine, even ideal, to have a data scientist in your company. Data scientists bring innovation and a different set of skills compared to other roles. However, this role might need to change over time.

A data scientist is most efficient working in an innovation team, helping execute proofs of concept. It is well known that data scientists tend to seek diversity and project ownership. The data scientist also needs to train the organization on how to become more data-driven in general.

A common way to make this happen is to develop a Center of Excellence or a Data Science Practice. This enables other analysts and data developers to become proficient in data science within their own teams, while the data scientists work on innovation projects.

Center of Excellence or a Data Science Practice

Don’t Fear the Business Analyst

For the regular day-to-day data science operations, business analysts should get the bulk of the responsibilities. Even if they might lack certain statistical concepts and coding principles, many tools and training resources are available to sharpen their skills and ensure the job gets done.

Moreover, the BA’s business background and network within the company give them a key advantage: the ability to kick-start projects with ease. With all the existing tools, I find that the toughest responsibility in data science is correctly understanding the problem at hand.

Albert Einstein once said:

“If I were given one hour to save the world, I would spend 59 minutes defining the problem and one minute solving it.”

Fortunately, one of the best traits of the Business Analyst is to be good at solving problems.

Deploy Machine Learning with Your Actual Developers

For more complex algorithms and machine learning deployment, things should be left to your more seasoned developers. Keras, Fast.ai, AutoML, and other solutions are game changers, as they are easy to understand and optimize without in-depth knowledge of linear algebra or statistics.

As long as the right methodologies and techniques are used, the results will come quicker, will consist of higher-quality code, and will be easier to monitor and maintain. It is also important to know when to use a machine learning algorithm and when simpler methods will do. Developers should be able to use the right tools at the right times.

A common pitfall that can be avoided by working with developers is failing to declare boundaries for the data science project.

For instance, if the statistical model performs poorly in some cases, instead of working on it for months to make it generalize (sometimes it is simply not possible), a developer might create different flows to deal with the uncertainty in the result.
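Here is a minimal sketch of such a flow, assuming a scikit-learn-style classifier and a hypothetical confidence threshold of my own choosing.

```python
def route_prediction(model, features, threshold=0.8):
    """Send low-confidence predictions to a separate flow instead of forcing automation."""
    proba = model.predict_proba([features])[0]
    confidence = float(proba.max())
    if confidence >= threshold:
        return {"flow": "automated", "label": int(proba.argmax()), "confidence": confidence}
    # Uncertain cases fall back to a manual or rule-based flow.
    return {"flow": "manual_review", "label": None, "confidence": confidence}
```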

Guide to Data Science Operationalization infographics


Data Science is a Team Sport


What I mean to say in this article is that data scientists are currently being misused in a lot of companies. They are perhaps the only ones who understand the full process of insight generation and machine learning pipeline creation.

Expecting tasks that are out of a data scientist’s scope can create a lot of stress and frustration and can lead to a higher churn rate.

To become less dependent on these scarce profiles and more efficient, it is important to take data science projects seriously: put a proper structure in place, backed by management support and by the overall business’s interest. If these ingredients are present, you will most likely see quick project adoption.

Moreover, the data science practice needs business analysts, developers, and data developers, integrated with business units and IT functions, to make the most out of each project.

Open Data Science Conference West 2018: A Summary

Open Data Science Conference West 2018 in San Francisco

I am back from the Open Data Science Conference (ODSC West) in California. What a blast! Not only was I able to present my talk on the democratization of AI, but I also learned a lot of very interesting things!

I am honestly impressed by the projects and technologies presented throughout the week. The state of data science is far more advanced in 2018 than it was a year ago.

Here’s a quick summary of what I found most interesting. I have decided to keep it succinct, so I will only talk about the techniques and concepts I know best. I’ve grouped my learnings into two categories: business concepts, and technologies and tools.

Business Concepts

How to Retain Data Scientists in Your Company

We’ve learned that, on average, a data scientist doesn’t stay at the same company for more than 18 months! According to Harvard Business Review, data scientists need support from their manager and organization, which is not easy in the early stages of building a team. They also need ownership, which is hard to come by since data science is often seen as a support function.

Finally, data scientists are also seeking purpose. Unfortunately, companies with a lack of vision often tend to give data scientists either no direction or very low-level tasks. These behaviors drive employee churn.

My take on this: It is critical to keep them motivated. Data Science teams are often created organically by companies that want to benefit from the high number of opportunities that you can tackle with AI. However, it is critical to keep in mind that a Data Science team, like any team, needs a vision, a solid structure, and coaches in place to make sure that the data science professionals thrive.

Data Science Goes Beyond Statistical Models

Implementing TensorFlow or scikit-learn models is an important ability, but a data scientist must also master business and mathematical concepts to have the biggest impact on the businesses they work for.

My take on this: Being a Data Scientist is about using the right tool to solve a problem. If the Data Scientist does not understand the business problem and does not understand the statistical implications of it, it’s not going to work. Data cleanliness and data science methodology are also important concepts to master in order to have performant and relevant statistical models.

The Hype is Real

The question is not even about proving the usefulness of A.I., but rather making the outside world’s expectations more realistic.

A lot of A.I. use cases are very successful, confirming that the hype is not going to disappear. For example, identifying fake news, predicting a reader’s feelings while reading a New York Times article, early detection of behavioral health disorders in children, and very advanced image recognition tools were among the great projects presented.

My take on this: The question is not even about proving the usefulness of A.I., but rather making the outside world’s expectations more realistic. A common trait between these projects is that current technologies can already deliver good performance. It is less about technology and more about resolving a real problem.

It Is All about Data

Data is the real competitive advantage going forward, as statistical models are shared amongst the community. As the AI research community constantly grows and is pretty much open source, it means that months and years of research work quickly become available to everyone.

Any data scientist can use these tools to develop best-in-class statistical models to solve their problems, which means models tend to be similar across the community. However, since machine learning is a set of statistical algorithms that identify and generalize patterns from already observed data points, a voluminous and clean dataset is the best way to exploit a statistical model better than your competitors can.

My take on this: This couldn’t be more relevant. However, the truth is that most companies have a looooong way to go. Still, if you want to quickly and smartly invest in your data, there are techniques discussed below to help you do so, demonstrated for augmenting or cleaning your dataset. Some companies present at the event also offer labelling services. This is a gold mine for companies who want to get started with data science.

Technologies & Tools

Data Augmentation

Monte Carlo simulation and active learning are increasingly used successfully to prepare data in an agile fashion, or in cases where data isn’t abundant enough.

Regarding Markov chain Monte Carlo (MCMC), the real advantage is that it provides a serious alternative for augmenting your dataset even if you do not already have an extensive one (versus a generative adversarial network, which already requires a good amount of data).

Note: it is, however, crucial to have all of the clusters and use cases present within the population before thinking of synthetically augmenting the dataset. Hamiltonian MCMC using PyMC3 is a great technique, as it handles multiple features while converging better than other similar techniques.
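For the curious, here is a minimal sketch of what MCMC-based augmentation can look like with PyMC3: fit a simple Bayesian model to a small sample, then draw posterior predictive samples as synthetic data. The distributions and sample sizes are arbitrary assumptions for illustration, not anything presented at the conference.

```python
import numpy as np
import pymc3 as pm

observed = np.random.normal(loc=50, scale=5, size=200)  # small, hypothetical dataset

with pm.Model():
    mu = pm.Normal("mu", mu=0, sigma=100)
    sigma = pm.HalfNormal("sigma", sigma=10)
    pm.Normal("obs", mu=mu, sigma=sigma, observed=observed)
    trace = pm.sample(1000, tune=1000)               # NUTS, a Hamiltonian MCMC sampler
    posterior_pred = pm.sample_posterior_predictive(trace)

synthetic = posterior_pred["obs"]  # draws that can augment the original sample
print(synthetic.shape)
```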

My take on this: While data quality is super important and it is always better to have a big dataset, that is not always possible. Monte Carlo allows companies with smaller datasets to augment them so that they can use more advanced models, when done with care that is. Also, some use cases like forecasting and logistics simulation are more efficient with this technique.

Transfer Learning Is the Way to Go

Transfer learning is the method of adapting existing, proven models to our needs. As these models were already trained by big corporations with a lot of resources, it is usually the best method for the use cases presented so far. You simply retrain the last layers on your own labels, et voilà!

My take on this: Models that are reusable via transfer learning are especially available for image recognition and natural language processing. If you use those types of models, please try transfer learning; you’ll be blown away!
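To give a flavor, here is a minimal Keras sketch for image classification. MobileNetV2 and the binary classification head are my own assumptions for illustration, not something prescribed at the conference.

```python
import tensorflow as tf

# Reuse a network pre-trained on ImageNet and only train a small new head.
base = tf.keras.applications.MobileNetV2(
    include_top=False, weights="imagenet", input_shape=(224, 224, 3), pooling="avg")
base.trainable = False  # keep the pre-trained weights frozen

model = tf.keras.Sequential([
    base,
    tf.keras.layers.Dense(1, activation="sigmoid"),  # new head for our own labels
])
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
# model.fit(train_ds, validation_data=val_ds, epochs=5)  # train_ds/val_ds are your own datasets
```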

(More on) Transfer Learning

Many people were talking about transfer learning during the conference, but no simple framework is available online. So here is a super simplified series of steps to get you started (a minimal code sketch follows the list).

  1. You label a number of observations.
  2. You fit a shallow (simple) statistical model on the labeled data.
  3. You predict all of the observations that were not labeled.
  4. You review a number of randomly selected predictions with a high prediction error.
  5. You re-label those observations.
  6. You go back to step 2 until the labels you review are correct.
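A minimal sketch of this loop, assuming scikit-learn, -1 as the marker for "not yet labeled", and a hypothetical human_review step for the manual correction:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def iterative_labelling(X, y, review_batch=50, n_rounds=5):
    """Sketch of the steps above; `y` uses -1 for observations that are not labeled yet."""
    model = LogisticRegression(max_iter=1000)            # step 2: a shallow model
    for _ in range(n_rounds):
        labelled = y != -1
        model.fit(X[labelled], y[labelled])
        unlabelled_idx = np.flatnonzero(~labelled)
        proba = model.predict_proba(X[unlabelled_idx])   # step 3: predict the rest
        uncertainty = 1 - proba.max(axis=1)              # stand-in for prediction error
        to_review = unlabelled_idx[np.argsort(uncertainty)[-review_batch:]]
        # steps 4-5: a human corrects these labels, e.g.
        # y[to_review] = human_review(X[to_review])       # hypothetical helper
    return model
```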

Some benefits were discussed as well. For sure, it is quicker to do this once you have manually labeled 2,000 observations than to label millions of observations by hand. You also ensure that labels are standardized (versus hiring multiple people with different criteria). Finally, this is a great way to correct your model while it is live, when performance starts to degrade.

My take on this: One of the biggest hassles for the data science community is labelling data for supervised machine learning. Too much time is spent manually labelling customer support tickets, user profiles, pictures, and texts. This is insane, especially considering that manually labelling thousands of items is boring and too easily produces unstandardized results. The smart approach is to label a small batch of data points, use the technique above, and iterate until all data points are labeled.

Model Creation Is Just the Beginning

Without real production experience, people can think that machine learning is mainly about building a model with past observations and validating it successfully. However, experienced data scientists carry many scars from production. In fact, the real and complex work happens once your model is in production!

A data scientist at Tesla demonstrated how edge cases are critical and part of their testing process. Overall accuracy or loss minimization is clearly not sufficient. Tesla treats its edge cases as regular software delivery test scenarios that have to pass in order to update the models in production.

Other data scientists talked about various sampling biases that caused models that performed very well on training data to be terrible in production. It is vital to make sure that all production use cases and clusters are present within the training dataset. It is also important to compare the distributions of the two datasets to make sure the values match closely.
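One simple way to run that comparison, sketched here with a Kolmogorov-Smirnov test from SciPy (a choice of mine, not one named by the speakers):

```python
from scipy.stats import ks_2samp

def check_feature_drift(train_values, prod_values, alpha=0.05):
    """Flag a feature whose training and production distributions no longer match."""
    stat, p_value = ks_2samp(train_values, prod_values)
    return {"ks_statistic": stat, "p_value": p_value, "drift_suspected": p_value < alpha}
```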

One of Google’s engineers discussed the importance of presenting the output of a model. Even if you automate the decision-making, it is always a good idea to understand why the model predicted a case the way it did, for periodic validation.

My take on this: Even if you have a live model, it is always a good idea to review how it performs on real production data, so that all the hard work you put into defining a good statistical model keeps paying off once your project is live. You can be surprised by how different reality is from your expectations. Also, you will quickly realize that without ways to interpret your model, you will be lost trying to make sense of it. Techniques such as LIME and SHAP will clearly help you translate your model.


It’s about People, Processes, Data and Technologies

The Open Data Science Conference was a great reminder of the most important aspects of data science: people, processes, data, and technologies.

People

It is important to surround data scientists well and give them all the ingredients to thrive and really make a difference. If the vision is well defined, if they are well surrounded, and if they work on interesting and strategic problems, data science professionals will enable the organization to become more data-driven.

Processes

In statistics and in computer science, processes and methodologies are critical, and data science is no exception. The best-known methodologies so far are the model definition and model validation processes. These processes are key to getting valid results.

Lately, as the data science community matures, more practices are being discussed, such as model DevOps, which consists of validating the statistical model’s accuracy and performance while it is in production. Active learning has also been discussed; essentially, active learning is the process of adapting a model based on feedback.

Data

The community now realizes that a performant statistical model really depends on its underlying data, making data the most important element of data science. Simulation and active learning are interesting and creative approaches to building a bigger dataset even if you do not have access to a lot of labeled data.

Technologies

Once again, new frameworks, tools, and algorithms were presented; research is important in this field. This year, though, I was happy to hear about use cases that were successful using existing technologies. Even with some of the newest algorithms, some projects saw only small gains.

It is satisfying to know that current versions are good baselines in most cases. Imagine what you will be able to achieve once you master all four aspects of data science (people, processes, data, and technologies) while working on strategic problems.