Productionizing Data Science: E-Learning Course on KNIME Server (11/25/2019)
Authors: The Evangelism Team at KNIME
End to end data science is a phrase you might have read before, but what does it mean?
The terms “read - transform - analyze - deploy” describe the typical phases of a data science project. We often have favorite phases we enjoy focusing on, but data science is only end to end if the final deployment phase is included as well. This is where data science brings value to a business: when the data science applications that have been created are subsequently productionized.
We can create data science with KNIME Analytics Platform by building workflows. KNIME Analytics Platform itself already offers us different deployment options, e.g. creating BIRT reports, applying your trained model, creating interactive composite views, writing processed data to a file or database, and so on. But what if you want to automate the deployment or collaborate in a team? Or make your workflow easily accessible via a web browser and let your data analysts interact with the workflow and see the results?
Then it makes sense to start using KNIME Server, our enterprise software for team based collaboration, automation, management, and deployment of data science workflows as analytical applications and services.
KNIME Server e-learning course
The new, free e-learning course discusses the areas of collaboration, automation, deployment, and management in three chapters, each containing a series of short videos plus questions to test your knowledge.
Chapter 1 - Collaboration
Connecting to KNIME Server and deploying items to it from KNIME Analytics Platform. The two videos in Chapter 1 explain how to establish your server connection and how to set access rights.
Chapter 2 - Automation and Deployment
The three videos in this chapter discuss how you can execute workflows on KNIME Server for automation and deployment. They look closely at remote execution on KNIME Server, the KNIME Remote Workflow Editor, and the KNIME WebPortal.
Chapter 3 - Management
The aim of this section is to show you additional functionality to manage your projects more effectively. The chapter is split into three videos, which explain versioning, workflow comparison, and node comparison.
Look up the course:
Have a look at the videos on our e-learning course page. The link below takes you to the first video in Chapter 1 - Collaboration, which shows how to connect to KNIME Server.
Upcoming KNIME Server related events:
Courses: we have a whole week of KNIME courses taking place between December 9-13, 2019 in Berlin. The course for KNIME Server is on December 11. Check it out on our events page.
Meetup: there’s a new release of KNIME Server (and KNIME Analytics Platform) coming soon. Our meetup on December 12 in Berlin is all about what’s new in these releases. Find out more and register via our events page.
The Best of Both Worlds: The Case for Visual Open Source Data Analytics (11/28/2019)
Author: Michael Berthold (KNIME). As first published in The New Stack.
There is a big push for automation in data science today. Given how complex programming data science applications can be, that is no surprise. It takes years to truly master scripting or programming languages for data analysis — and that’s ignoring that one needs to build actual data science expertise as well. However, code-free solutions can make the nuts and bolts of data science a lot more accessible. This means that the valuable time of data science teams can be spent on actually doing data science, so that organizations don’t have to rely on an external, preconfigured, and opaque mass-produced data science automation product.
This is great news in many respects. Visual, code-free environments open up the world and power of data analytics to more people, and their organizations benefit from a higher level of insight. Visual environments intuitively explain which steps have been performed in what configuration. Whether this amounts to a gut-check for programmers or helps a relative novice gain a better understanding, it is positive. And, in the case of many startups in which teams are stretched quite thin, code-free solutions can be huge time savers.
But, there is a flip side. Directly writing code is and always will be the most versatile way to create new analyses tailored to your organization’s specific needs. Data scientists often want to have access to the latest developments, which calls for a more hands-on approach. To get the most value out of data, experts need to be able to quickly try out a new routine, either written by themselves or by their colleagues.
The beautiful thing about modern data analytics is that, despite what you may think, it doesn’t have to be either/or. You can have the best of both code-free and custom-coded analytics, gaining ease of use and versatility at once. Here’s how.
Choose an open platform
When trying to determine how to bring meaningful data analytics into your organization, there is a lot to consider for sure — and right in the mix should be a visual open source solution. The big draw to open source platforms is that they don’t lock you into any one analytical language, and today’s open source options integrate multiple analytical languages, such as R and Python, while also incorporating visual design of SQL code, for example. It is also easy to grow from what’s available right now to incorporate the innovations of the future.
Additionally, a truly open platform allows you to choose what you and, more importantly, your data scientists are comfortable using. They can collaboratively utilize what they know best without having to learn the intricacies of every other coding paradigm implemented across your organization in order to provide value. This enables a range of possibilities that offers a great deal of customization.
Open source platforms are ideal for bridging the gap between commercial offerings and homegrown solutions to let users decide what, why, and how much they want to code.
How it can work
You likely are wondering what an open source data analytics platform could look like in practice. Let’s start with R and Python because they are the most important scripting languages for data analysis. With the right open source platform, one of your data scientists could design a workflow in which R is used to create a graphic and Python is used for the model building, just to pick an example. Those two languages work together in that workflow, which a different user can then pick up and re-use, perhaps never even looking at the underlying code pieces. Models and workflows can grow increasingly complex, but the principle is the same.
Data loading and integration is another area where an open source platform can be useful, and this is the part people don’t really talk about. Experts can write a few lines of SQL faster than putting together modules graphically, but not everyone is sufficiently fluent in SQL to do this. Those that aren’t should still be given the ability to mix and match their data. Visual open source platforms allow them access to the majority of the functionality available via SQL (while remembering the many little nuances of it for different databases).
Another example is big data. The right open source platform will enable workflows that model and control ETL operations natively in your big data environment. They can do this by using a connector to Hadoop, Spark and NoSQL databases, and it works just like running operations in your local MySQL database — only things are executed on your cluster (or in the cloud). And this is just the beginning, providing a mere flavor of how such integration can work with other distributed or cloud environments.
One last but very important example. Instead of building yet another visualization library or tightly coupling pre-existing ones, it’s possible for open source platforms to provide JavaScript nodes that allow users to quickly build new visualizations and then expose them to the user. Complex network representations can be generated using well-known libraries, and users can then display interactive visualizations and ultimately deploy web-based interactive workflow touch points. This is the really good stuff because it enables true guided analytics, meaning human feedback and guidance can be applied whenever needed, even while analysis is being conducted. It’s where interactive exchanges between data scientists, business analysts, and the machines doing the work in the middle function together to yield the best, most specific, relevant data analysis for your business.
Data analytics now and in the future
Data analytics will play an increasingly vital role in businesses moving forward. Speed, power, flexibility, and ease of use are demanded of any solution — and these requirements will grow even more complex as data proliferates at an incredible rate. The decisions you make today will influence the types of analysis and information you can glean tomorrow.
As you move forward, I would advise you to consider the data needs of your organization. Do you need very specific types of data analysis? Do you want to be in control of how that information is analyzed? Consider your team. Do you have an army of data scientists and expert coders, or do you have a shoestring crew — or maybe a healthy mix? As you weigh all of your needs, assets, and potential deficits, consider how a visual open source platform can help you and provide exactly what is required, both now and for the future.
KNIME Server Profiles Simplify DB Driver Installation (12/02/2019)
Authors: Jan Lindquist and Tornborg (Redfield)
In the modern business environment, companies have to support a heterogeneous combination of operating systems and technologies. Defining customizations and profiles for these combinations simplifies deployment, speeds up rolling out changes, and generally makes it easier to onboard new users.
KNIME Server has a feature that greatly simplifies the installation of database drivers. Installing drivers in KNIME Analytics Platform involves a number of steps: finding the specific driver, accepting its usage conditions, downloading it, moving it to a folder within your KNIME installation, and then pointing the installation to that folder. That means quite a few potential failure points, especially when users have different levels of operating system experience. Any manual configuration can introduce errors, so it is good to make this kind of process as automatic as possible - and KNIME Server profiles achieve exactly that.
Another important benefit of profiles appears when setting up the KNIME Server/Executor. If you are a KNIME system administrator, using profiles simplifies the configuration of the distributed executor: the knime.ini on the distributed executor simply has to point at the profiles that should be loaded, which then happens automatically without the drivers having to be installed manually.
This blog post describes the steps that make installation a snap, allowing for quick onboarding of new users.
Onboarding users
As a user of KNIME, how do I benefit from using profiles? Imagine you are building a workflow and your next step requires that you connect to a database. Instead of manually installing the database driver you need, you can select a specific profile instead, meaning you’re quickly back to building your workflow.
Here is a simple guide to follow! The prerequisite is that your KNIME Server administrator has already created the database driver profiles. In the example in this post, the server administrator had already set up profiles for BigQuery and Oracle.
Specifying a KNIME Server
First, you need to specify the KNIME Server you want to connect to - i.e. the Server where your server administrator has already set up the respective database driver profiles.
Select Preferences from the File menu, click the chevron next to KNIME, and select Customization Profiles. If KNIME Servers are already connected, you will see them listed in the box (see Fig. 1) and can select the one you want. If you prefer, you can add the server address manually. The benefit of using the direct link is that no user is required; consequently, there is no need for additional users in the license.
Fig. 1: Selecting a KNIME Server from the Customization Profiles tab in the Preferences menu
Selecting a profile
Once you have specified which KNIME Server you want and it has been found, a list of profiles on this server is shown in the bottom half of the dialog. In Fig. 1, you can see profiles for BigQuery (bqcd) and Oracle. Select the profiles you intend to use and click Apply and Close.
Adding the database connector
Now you can go back to your workflow and add the appropriate database connector to it, in this example the Oracle Connector node.
Fig. 2 Adding the Oracle Connector node to a workflow
Configure the node by selecting “Oracle” from the Database Dialect dropdown list and specifying the name of the driver in the Driver Name field. After configuring the node, you’re done!
Fig. 3: Configuring the Oracle Connector node by going to Connection Settings and selecting "Oracle" from the Database Dialect dropdown list
Conclusion
This is a much simpler method for installing drivers. We hope that this short article will help the KNIME community when using profiles and database connectors.
Please note that in order to use profiles, you require a KNIME Server license. If you have a KNIME Server but are not yet using profiles, we suggest trying this approach together with your KNIME administrator, or you can contact Redfield.
Additional info:
This short film gives you a brief overview of the different types of customization that can be set up.
About Redfield:
Redfield has been fully focused on providing advanced analytics and business intelligence since 2003. We implement the KNIME Analytics Platform for our clients and provide training, planning, development, and guidance within this framework. Our technical expertise, advanced processes, and strong commitment enable our customers to achieve acute data-driven insights via superior business intelligence, machine learning, and deep learning. We are based in Stockholm, Sweden.
IoT Anomaly Detection 101: Data Science to Predict the Unexpected (12/05/2019)
Author: Rosaria Silipo (KNIME). As first published in DarkReading.
Yes! You can predict the chance of a mechanical failure or security breach before it happens. Part one of a two-part series.
Data science and artificial intelligence (AI) techniques have been applied successfully for a number of years to predict or detect all kinds of events in very different domains.
If you run a quick web search on "machine learning use cases," you will find pages and pages of links to documents describing machine learning (ML) algorithms to detect or predict some kind of event group in some kind of data domain.
Generally, the key to a successful machine learning based application is a sufficiently general training set. The ML model, during training, should have a sufficient number of available examples to learn about each event group. This is one of the key points of any data science project: the availability of a sufficiently large number of event examples to train the algorithm.
Applying machine learning to IoT event prediction
Can security teams apply a machine learning algorithm to predict or recognize deterioration of mechanical pieces, or to detect cybersecurity breaches? The answer is, yes! Data science techniques have already been successfully utilized in the field of IoT and cybersecurity. For example, a classic usage of machine learning in IoT is demand prediction. How many customers will visit the restaurant this evening? How many cartons of milk will be sold? How much energy will be consumed tomorrow? Knowing the numbers in advance allows for better planning.
Healthcare is another very common area for usage of data science in IoT. There are many sports fitness applications and devices to monitor our vital signs, making an abundance of data available in near real time that can be studied and used to assess a person's health condition.
Another common case study in IoT is predictive maintenance. The capability to predict if and when a mechanical piece will need maintenance leads to an optimum maintenance schedule and extends the lifespan of the machinery until its last breath. Considering that many machinery pieces are quite sophisticated and expensive, this is not a small advantage. This approach works well if a dataset is available — and even better if the dataset has been labeled. Labeled data means that each vector of numbers describing an event has been preassigned to a given class of events.
Anomaly discovery: looking for the unexpected
A special branch of data science, however, is dedicated to discovering anomalies. What is an anomaly? An anomaly is an extremely rare episode, hard to assign to a specific class, and hard to predict. It is an unexpected event, unclassifiable with current knowledge. It's one of the hardest use cases to crack in data science because:
The current knowledge is not enough to define a class
More often than not, no examples are available in the data to describe the anomaly
So, the problem of anomaly detection can be easily summarized as looking for an unexpected, abnormal event of which we know nothing and for which we have no data examples. As hopeless as this may seem, it is not an uncommon use case.
Fraudulent transactions, for example, rarely happen and often occur in an unexpected modality
Expensive mechanical pieces in IoT will break at some point without much indication on how they will break
A new arrhythmic heart beat with an unrecognizable shape sometimes shows up in ECG tracks
A cybersecurity threat might appear and not be easily recognized because it has never been seen before
In these cases, the classic data science approach, based on a set of labeled data examples, cannot be applied. The solution to this problem is a twist on the usual algorithm learning from examples.
Fig. 1 Anomaly detection problems do not offer a classic training set with labeled examples for both classes: a signal from a normally functioning system and a signal from a system with an anomaly. In this case, we can only train a machine learning model on a training set with "normal" examples and use a distance measure between the original signal and the predicted signal to trigger an anomaly alarm.
In IoT data, signal time series are produced by sensors strategically located on or around a mechanical component. A time series is the sequence of values of a variable over time. In this case, the variable describes a mechanical property of the object, and it is measured via one or more sensors.
Usually, the mechanical piece is working correctly. As a consequence, we have tons of examples for the piece working in normal conditions and close to zero examples for the piece failure. This is especially true if the piece plays a critical role in a mechanical chain because it is usually retired before any failure happens and compromises the whole machinery.
In IoT, a critical problem is to predict the chance of a mechanical failure before it actually happens. In this way, we can use the mechanical piece throughout its entire life cycle without endangering the other pieces in the mechanical chain. This task of predicting possible signs of mechanical failure is called anomaly detection in predictive maintenance.
Author: Rosaria Silipo (KNIME). As first published in DarkReading.
The challenge is to identify suspicious events in training sets where no anomalies are encountered. Part two of a two-part series.
The problem of anomaly detection is not new, and a number of solutions have already been proposed over the years. However, before starting with the list of techniques, let's agree on a necessary premise: All anomaly detection techniques must involve a training set where no anomaly examples are encountered. The challenge consists of identifying suspicious events, even in the absence of examples.
We talk in this case of a training set formed of only "normal" events. The definition of "normal" is, of course, arbitrary. In the case of anomaly detection, a "normal" event refers just to the events represented in the training set. Here are some common approaches.
Statistical methods
Everything that falls outside of the statistical distribution calculated over the training set is considered an anomaly.
The simplest statistical method is the control chart. Here, the average and standard deviation for each feature are calculated on the training set. Thresholds are then defined around the average value as ±k standard deviations, where k is an arbitrary coefficient, usually between 1.5 and 3.0, depending on how conservative we want the algorithm to be. During deployment, a point trespassing the thresholds in either direction is a suspicious candidate for an anomaly event.
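A minimal sketch of such a control chart in Python (the synthetic training data, the coefficient k = 3, and the example points are placeholders for illustration):

```python
import numpy as np

def control_chart_limits(train, k=3.0):
    """Thresholds defined as mean +/- k * standard deviation on 'normal' training data."""
    mu, sigma = train.mean(), train.std()
    return mu - k * sigma, mu + k * sigma

def is_anomaly(x, limits):
    """Flag values that trespass the thresholds in either direction."""
    lo, hi = limits
    return (x < lo) | (x > hi)

# Synthetic "normal" training data plus one obvious outlier at deployment time
rng = np.random.default_rng(42)
train = rng.normal(loc=0.0, scale=1.0, size=1000)
limits = control_chart_limits(train, k=3.0)
new_points = np.array([0.2, -0.8, 5.7])
print(is_anomaly(new_points, limits))   # expected: [False False  True]
```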
Such methods are easy to implement and understand, fast to execute, and fit both static and time series data. However, they might be too simple to detect more subtle anomalies.
Clustering
Other proposed methods are often clustering methods. Since the anomaly class is missing from the training set, clustering algorithms might sound suitable for the task.
The concept here is clear. The algorithm creates a number of clusters on the training set. During deployment, the distance between the current data point and the clusters is calculated. If the distance is above a given threshold, the data point becomes a suspicious candidate for an anomaly event. Depending on the distance measure used and on the aggregation rules, different clustering algorithms have been designed and different clusters are created.
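As an illustration, here is a minimal sketch of this idea for static data points using k-means from scikit-learn (the synthetic data, the number of clusters, and the 99th-percentile threshold are arbitrary choices):

```python
import numpy as np
from sklearn.cluster import KMeans

# Train clusters on "normal" data only (synthetic 2-D example)
rng = np.random.default_rng(0)
normal = rng.normal(size=(500, 2))
km = KMeans(n_clusters=5, n_init=10, random_state=0).fit(normal)

# Distance of each training point to its nearest cluster center
train_dist = np.min(km.transform(normal), axis=1)
threshold = np.quantile(train_dist, 0.99)   # e.g. 99th percentile of "normal" distances

def is_anomaly(points):
    """A point is suspicious if it is farther from every cluster than the threshold."""
    return np.min(km.transform(points), axis=1) > threshold

print(is_anomaly(np.array([[0.1, -0.3], [8.0, 8.0]])))   # likely [False  True]
```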
This approach, however, does not fit time series data since a fixed set of clusters cannot capture the evolution in time.
Supervised machine learning
Surprised? Supervised machine learning algorithms can also be used for anomaly detection. They would even cover all data situations since supervised machine learning techniques can be applied to static classification as well as to time series prediction problems. However, since they all require a set of examples for all involved classes, we need a little change in perspective.
In the case of anomaly detection, a supervised machine learning model can only be trained on "normal" data — i.e., on data describing the system operating in "normal" conditions. The evaluation of whether the input data is an anomaly can only happen during deployment after the classification/prediction has been made. There are two popular approaches for anomaly detection relying on supervised learning techniques.
The first one is a neural autoassociator (or autoencoder). The autoassociator is trained to reproduce the input pattern onto the output layer. The pattern reproduction works fine as long as the input patterns are similar to the examples in the training set — i.e., “normal.” Things do not work quite as well when a new, differently shaped vector appears at the input layer. In this case, the network will not be able to adequately reproduce the input vector onto the output layer. If a distance is calculated between the input and the output of the network, the distance value will be higher for an anomaly than for a "normal" event. Again, defining a threshold on this distance measure should find the anomaly candidates. This approach works well for static data points but does not fit time series data.
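A rough sketch of the autoassociator idea, here approximated with a small scikit-learn regressor trained to reproduce its own input (the synthetic data, network size, and threshold are placeholders; a real application might use a proper deep learning autoencoder instead):

```python
import numpy as np
from sklearn.neural_network import MLPRegressor

# Autoassociator: train a small network to reproduce "normal" input patterns
rng = np.random.default_rng(1)
normal = rng.normal(size=(1000, 10))
autoassociator = MLPRegressor(hidden_layer_sizes=(4,), max_iter=2000, random_state=1)
autoassociator.fit(normal, normal)          # input and target are the same

def reconstruction_error(X):
    """Distance between input and network output; large for unfamiliar patterns."""
    return np.linalg.norm(X - autoassociator.predict(X), axis=1)

threshold = np.quantile(reconstruction_error(normal), 0.99)
new_patterns = np.vstack([rng.normal(size=10), rng.normal(size=10) + 6.0])
print(reconstruction_error(new_patterns) > threshold)   # likely [False  True]
```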
The second approach uses algorithms for time series prediction. The model is trained to predict the value of the next sample based on the history of previous n samples on a training set of "normal" values. During deployment, the prediction of the next sample value will be relatively correct — i.e., close to the real sample value, if the past history comes from a system working in "normal" conditions. The predicted value will be farther from reality if the past history samples come from a system not working in "normal" conditions anymore. In this case, a distance measure calculated between the predicted sample value and the real sample value would isolate candidates for anomaly events.
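A minimal sketch of this second approach, using a sliding window and a simple linear model as the next-sample predictor (the synthetic sine signal, window length, and threshold are arbitrary choices; any time series model could play the predictor role):

```python
import numpy as np
from sklearn.linear_model import LinearRegression

def make_windows(series, n=10):
    """Turn a 1-D series into (history of n samples, next sample) pairs."""
    X = np.array([series[i:i + n] for i in range(len(series) - n)])
    return X, series[n:]

# Train only on a "normal" signal (here a noisy sine wave)
t = np.arange(0, 200, 0.1)
normal_signal = np.sin(t) + 0.05 * np.random.default_rng(2).normal(size=len(t))
X, y = make_windows(normal_signal, n=10)
model = LinearRegression().fit(X, y)

# Distance between predicted and real next value on "normal" data defines the threshold
errors = np.abs(model.predict(X) - y)
threshold = np.quantile(errors, 0.99)

def is_anomalous_step(history, actual_next):
    """Flag a sample whose real value is far from the model's prediction."""
    return abs(model.predict(history.reshape(1, -1))[0] - actual_next) > threshold

print(is_anomalous_step(normal_signal[-11:-1], normal_signal[-1]))         # likely False
print(is_anomalous_step(normal_signal[-11:-1], normal_signal[-1] + 3.0))   # likely True
```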
Resources: Learn more by trying out the example workflows on the KNIME Hub by entering "anomaly detection" into the search field. Find the results here
Introducing More KNIME Hub Features (12/16/2019)
Authors: Kathrin Melcher, Tobias Schmidt, Christian Dietz (KNIME).
Do you remember the blog post “The KNIME Hub - Share and Collaborate”, where we introduced the new KNIME Hub and its first features? Since then our developers have been implementing a lot of additional functionality to make it even easier for you to find and share insights with the community. Learn more in this article.
Fig. 1. The KNIME Hub
Access components on the KNIME Hub
We introduced components in KNIME Analytics Platform 4.0. To recap: components are nodes that you create from a KNIME workflow so you can reuse them for repetitive tasks in your workflows. Like a normal node, they can have an interactive view and a dialog. You can use components across your own workflows or share them via KNIME Server or the KNIME Hub.
Example of a component
During our series of Data Science Learnathons in the US, we used a dataset where missing values are represented by 9999 instead of the red question mark, which is the placeholder for missing values in KNIME Analytics Platform. To impute missing values with the Missing Value node, our learnathon participants often built a loop to replace the placeholder value with the red question mark. This is a common task, as many different values are used as placeholders across different datasets, like n/a, NA, -99, -999, an empty string, or any other placeholder. To make this less time consuming, we built a reusable component, which allows you to define the placeholder value in the dialog. You can see it here on the KNIME Hub.
Documenting component meta information
When your component is shared, it’s important to make sure that it’s well documented. By adding meta information, the people you’re sharing the component with can locate it easily and quickly understand what it’s for. Meta information is the term used to refer to the component title, component description, the (customized) icon, its category (color), as well as information about the component’s ports.
To enter new or edit existing meta information, open the component (by right-clicking the component and selecting Component -> Open) and click the pencil “Edit” icon in the description view.
You can now enter a general description about what the component does, as well as information about the component’s ports, giving the ports names, and adding descriptions about them. You can also further customize your component, giving it an image and selecting its category (or color).
Fig. 2. Left: The Description view when you select a component. Right: The Description view after having clicked the Edit icon to now enter a description, custom image, category, and information about the component’s ports.
How can I upload components to the KNIME Hub and share them?
You can upload your components to the KNIME Hub to either your public or private space (more about public and private space below):
If you share them to your public space they will be visible to the entire KNIME community.
Sharing a component to your private space means that it remains visible only to you.
To share a component from your KNIME workbench, right click the component and select Share. You now get to decide whether you want to share it locally or share it to either your public or private space.
Fig. 3. Right-clicking my Set Values to Missing component in my KNIME workbench to share it. I can share it to my local node repository or to my public or private space.
What exactly are public and private spaces?
The KNIME Hub has public and private spaces for you to share and organize your workflows and components on the KNIME Hub. Workflows and components uploaded to your public space are publicly available and therefore shared with the entire KNIME community. And if you upload them to your private space (maximum 1GB) they are only visible to you. This feature is supported from KNIME Analytics Platform 4.0.
In short, it’s like having a portable box of workflows and components in one easy to reach location. For example, I've been working on a component for dealing with missing values. It’s really handy to be able to save this component in a central place - in my private space on the KNIME Hub.
In the screenshot below you can see my public and private space after logging in using my KNIME Account credentials. Amongst a whole lot of other things, you can also see my Set Values to Missing component.
Fig. 4. Screenshot of opened private and public spaces. In my private space you can see my Set Values to Missing component. When I had finished work on it, I copied it to my public space.
Open component on the KNIME Hub
By right-clicking a component, workflow, or workflow group in your public space and selecting Open -> in KNIME Hub, you can open it directly on the Hub. In the next screenshot, you can see the Set Values to Missing component and what the meta information looks like when the component is displayed on the KNIME Hub.
Fig. 5. The Set Values to Missing component shown on the KNIME Hub. Here you can see the component description as well as the information about the ports.
Find components on the KNIME Hub
If you’re already using the KNIME Hub, you’re probably used to seeing the thousands of nodes already available there and using the search function to find the ones you need. You can also search the KNIME Hub for components that have been uploaded there by KNIME and the community.
For example: Do you want to include map visualization in your workflow, but you discover there’s no specific node for the job? Well, there might be a component. Need basic building blocks for your Guided Automation workflow or for setting missing values? Looks like there could be some components on the Hub to get you started.
Enter what you’re looking for into the search box and then filter your search by clicking the “Components” tab.
Fig. 6. Searching the KNIME Hub for components that handle missing values. It produced a list of 61 results related to my term “missing values”.
Drag and drop components
You can make use of components on the KNIME Hub by dragging and dropping them from the KNIME Hub to your workflow. If the component you want to use requires a specific extension that you don’t have, the system detects this, and KNIME Analytics Platform will automatically prompt you to install it.
Fig. 7. Video showing how to drag and drop a component from the KNIME Hub to your workbench.
Extensions on the KNIME Hub
If you want to look for specific extensions and the nodes that are part of an extension, you can do this via the KNIME Hub search. In addition to the KNIME Extensions and Integrations provided by KNIME, thanks to the active KNIME community there are lots of extensions for KNIME Analytics Platform on a wide range of topics, from deep learning and chemoinformatics to image processing and text mining. The KNIME Hub gives you the option to search for extensions and install them easily from the browser via drag & drop.
Fig. 8. Searching for image processing extensions on the KNIME Hub.
Entering “image processing” in the search field on the KNIME Hub and filtering by the Extensions tab produces a list of all the extensions available for KNIME Analytics Platform on this topic. When you click on the first entry in the list, “KNIME Image Processing”, you access a page detailing all the nodes included in the particular extension plus a list of example workflows related to this extension.
If you now decide you would like to install this extension to your version of KNIME Analytics Platform you just drag the extension to your KNIME workbench.
Fig. 9 The KNIME Image Processing Extension on the KNIME Hub. The tab Included nodes shows you a list of all the nodes in this particular extension, plus a description about what each node does. The Related workflows tab lists example workflows, which use nodes from this extension. You can download these workflows to your workbench and try them out.
-----------------------
If you have an interesting component or workflow you’re thinking about uploading to the KNIME Hub, tell us about it - maybe you’d like to write an article about it for the KNIME Blog. Write to blog@knime.com
KNIME and AWS Machine Learning Service Integration (01/09/2020)
Author: Jim Falgout (KNIME)
Organizations are using cloud services more and more to attain top levels of scalability, security, and performance. In recent years, the Amazon Web Services (AWS) cloud platform has released several services that support machine learning (ML) and artificial intelligence (AI) capabilities to enable developers and data scientists to quickly and easily build, train, and deploy machine learning models at any scale.
In our latest release of KNIME Analytics Platform, we added more functionality to our KNIME Amazon Machine Learning Integration. Think of KNIME as a quick and powerful start to AWS services, now enabling greater interaction between them.
Today we want to focus on integrating AWS ML services with KNIME Analytics Platform to provide a simple, out of the box way to get started with AWS. The specific services we want to discuss in this article are:
Comprehend - ML based text processing
Translate - translating free text from a source to a target language
Personalize - a recommendation service using Amazon technology
The Comprehend, Translate, and Personalize Services already have an integration within KNIME. We’ll concentrate on those first and then show how to use the Python integration in KNIME to invoke other AWS ML Services such as Amazon Rekognition.
Using AWS Comprehend and Translate
KNIME Analytics Platform offers powerful text analysis capabilities. Integrating those capabilities with the AWS Comprehend and Translate services allows you to use them together quickly.
For example, KNIME supports Comprehend Syntax and Comprehend Entities functions to tag words in text. Text tagging is a common way to further analyze text by using parts of speech tags, entity tags, and other tag types for natural language processing.
Text analysis of RSS feeds relating to travel alerts
This type of analysis could be used - for example - to put together a travel risk map for corporate safety. Due to the increasing internationalization of business, more and more employees are becoming globally mobile. Providing reliable information and protecting workers abroad is the employer’s duty. This means assessing risks and implementing risk management.
You can download and try out our example workflow, the Travel Risk Map using AWS Comprehend and Translate, from the KNIME Hub here.
The following screenshot shows a fragment of our Travel Risk Map workflow. The workflow integrates multiple AWS ML Services. It captures RSS feeds relating to travel alerts. The alerts may be in different languages depending on the source. Alerts for the selected country are gathered and translated using Amazon Translate. Amazon Comprehend is then used to tag entities and discover key phrases within the alert text. Amazon Comprehend is also used to determine the sentiment of each alert.
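For readers who prefer to see the underlying service calls, here is a rough Boto3 sketch of what the Translate and Comprehend steps amount to for a single alert text (the region, credentials setup, and the sample alert string are placeholders; in the workflow itself, these calls are wrapped by the corresponding KNIME nodes):

```python
import boto3

alert = "Avis de sécurité: manifestations prévues dans le centre-ville."  # placeholder alert text

translate = boto3.client("translate", region_name="us-east-1")
comprehend = boto3.client("comprehend", region_name="us-east-1")

# Translate the alert to English (what the Amazon Translate node does for each row)
english = translate.translate_text(
    Text=alert, SourceLanguageCode="auto", TargetLanguageCode="en"
)["TranslatedText"]

# Tag entities, extract key phrases, and score sentiment (the Comprehend steps)
entities = comprehend.detect_entities(Text=english, LanguageCode="en")["Entities"]
phrases = comprehend.detect_key_phrases(Text=english, LanguageCode="en")["KeyPhrases"]
sentiment = comprehend.detect_sentiment(Text=english, LanguageCode="en")["Sentiment"]

print(sentiment, [e["Text"] for e in entities], [p["Text"] for p in phrases])
```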
The final result of the workflow is a word cloud with key phrases and entities from the alerts. Selecting a word in the word cloud provides the list of alerts for the selected word or phrase with its sentiment.
Fig. 1 Example workflow showing how to preprocess data for AWS ML Services and post process the results.
Several other Comprehend functions along with Translate are integrated in the current KNIME release. A quick way to find these integration nodes and any sample workflows that use them is on the KNIME Hub. Go to http://hub.knime.com and search for “Amazon” to see the full breadth of AWS integrations in KNIME. You can even drag and drop a node or component into KNIME to start using it. Any needed extensions are installed automatically.
AWS Personalize Service
The Amazon Personalize Service supports importing users, items, and interaction data into a dataset group within the service. Once the data are imported, a personalization solution can be built using the data.
Typical solutions include the capability to recommend items for a user, find items related to a particular item, and rank items by preference for a user. To use a solution, a campaign is launched that provides a scalable interface. We’ll demonstrate using KNIME Analytics Platform how to invoke a personalization campaign to provide item recommendations for a user.
Movie recommender
The KNIME workflow below demonstrates a full lifecycle of the Personalize Service. The first set of nodes load data into a dataset group within the service. The data used in the workflow are from the public Movielens dataset. The users, movies (items) and rankings (interactions) are loaded into Personalize. After loading the data, the dataset is used to create a solution. The solution uses the user personalization recipe. This recipe supports predicting which items a particular user will prefer. Once the solution is created, it is then deployed as a personalization campaign. The campaign provides an interface that supports passing in a set of users. The outcome is a set of recommended items for each user.
Once a campaign is deployed within the Amazon Personalization service it can be used over and over again to make user recommendations. Combining this with KNIME Server’s ability to invoke workflows with a REST API extends the usage of the recommendation model into production scenarios.
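To give an idea of what invoking a deployed campaign looks like at the service level, here is a hedged Boto3 sketch (the campaign ARN, region, and user ID are hypothetical placeholders; in the workflow, this step is handled by the corresponding KNIME nodes):

```python
import boto3

# Hypothetical campaign ARN, created when the solution is deployed as a campaign
CAMPAIGN_ARN = "arn:aws:personalize:us-east-1:123456789012:campaign/movie-recommender"

runtime = boto3.client("personalize-runtime", region_name="us-east-1")

# Ask the deployed campaign for recommended items (movies) for one user
response = runtime.get_recommendations(
    campaignArn=CAMPAIGN_ARN,
    userId="42",          # placeholder MovieLens user id
    numResults=10,
)
for item in response["itemList"]:
    print(item["itemId"], item.get("score"))
```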
Using KNIME and Boto3
Boto3 is the Amazon Web Services software development kit (SDK) for Python. The Boto3 library enables Python developers to write software that can make use of any of the AWS Services. KNIME Analytics Platform includes a group of nodes providing integration with Python. When used in conjunction with the Boto3 library, the Python nodes in KNIME can be used to build interactions with AWS Services.
Earlier in this blog we discussed using KNIME nodes that interface with the Amazon Comprehend, Translate, and Personalize services. But what if you want to use a service such as Amazon Rekognition in KNIME? That’s where utilizing the KNIME Python nodes and the Boto3 library together can help.
The Amazon Rekognition service supports image analysis such as facial recognition. The facial recognition capability returns information about each face recognized, such as gender, predicted age, whether glasses are being worn, position within the image, and other detailed information. Using a Python node in KNIME, code can be written that calls the Rekognition service to perform facial recognition on an image. The information returned from Rekognition can be gathered and output by the Python node. At that point, any other KNIME node can be used to process the information. It’s up to your imagination then as to how that information is used!
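As a sketch of what such a Python node could contain, here is a minimal Boto3 call to the Rekognition facial analysis API (the image file name and region are placeholders; inside KNIME, the results would be handed back as a table for downstream nodes):

```python
import boto3

rekognition = boto3.client("rekognition", region_name="us-east-1")

# Read the image as bytes; inside KNIME this code could live in a Python Script node
with open("photo.jpg", "rb") as f:        # placeholder file name
    image_bytes = f.read()

response = rekognition.detect_faces(
    Image={"Bytes": image_bytes},
    Attributes=["ALL"],                   # return age range, gender, glasses, etc.
)

for face in response["FaceDetails"]:
    print(face["AgeRange"], face["Gender"]["Value"],
          face["Eyeglasses"]["Value"], face["BoundingBox"])
```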
Using KNIME and Python to utilize the Amazon Rekognition service is just one example. As new services roll out from Amazon, the combination can be used to quickly prototype using the new service.
KNIME as a quick and powerful start to AWS Services
KNIME Analytics Platform helps users get going with their data analysis quicker because of its visual development environment. Users put together workflows piece by piece or in KNIME-speak "node by node". The process involves limited coding or no coding at all. As a result, KNIME lowers the threshold for users across fields to participate in the building of AI environments and enables the integration of multiple fields.
KNIME is a great quick start with AWS Services and enables customers to achieve greater speed to value. And with greater functionality now included in the integration to AWS Machine Learning and Artificial Intelligence Services, users have increased interaction between KNIME and Amazon Web Services.
Tune in to more articles in this series on KNIME and cloud connectivity. We are excited to hear what you think about it. Send your comments to blog@knime.com
References:
KNIME Amazon Machine Learning Integration: this feature contains nodes for interacting with AWS AI/ML Services: AWS Comprehend, Translate, and Personalize. See the complete list of nodes on the KNIME Hub https://kni.me/e/gWvsHS9JzMox0Ccj
Machine learning algorithms and the art of hyperparameter selection (01/16/2020)
A review of four optimization strategies
By: Rosaria Silipo and Mykhailo Lisovyi. As first published in The Next Web.
Machine learning algorithms are used everywhere from smartphones to spacecraft. They tell you the weather forecast for tomorrow, translate from one language into another, and suggest what TV series you might like next on Netflix.
These algorithms automatically adjust (learn) their internal parameters based on data. However, there is a subset of parameters that is not learned. These parameters have to be configured by an expert. Such parameters are often referred to as “hyperparameters” — and they have a big impact on our lives as the use of AI increases.
For example, the tree depth in a decision tree model and the number of layers in an artificial neural network are typical hyperparameters. The performance of a model can drastically depend on the choice of its hyperparameters. A decision tree can yield good results for moderate tree depth and have very bad performance for very deep trees.
The choice of the optimal hyperparameters is more art than science if we attempt it manually. Indeed, the optimal selection of the hyperparameter values depends on the problem at hand.
Since the algorithms, the goals, the data types, and the data volumes change considerably from one project to another, there is no single best choice for hyperparameter values that fits all models and all problems. Instead, hyperparameters must be optimized within the context of each machine learning project.
In this article, we’ll start with a review of the power of an optimization strategy and then provide an overview of four commonly used optimization strategies:
Grid search
Random search
Hill climbing
Bayesian optimization
The optimization strategy
Even with in-depth domain knowledge by an expert, the task of manual optimization of the model hyperparameters can be very time-consuming. An alternative approach is to set aside the expert and adopt an automatic approach. An automatic procedure to detect the optimal set of hyperparameters for a given model in a given project in terms of some performance metric is called an optimization strategy.
A typical optimization procedure defines the possible set of hyperparameters and the metric to be maximized or minimized for that particular problem. Hence, in practice, any optimization procedure follows these classical steps (a minimal code sketch of the loop follows the list):
1) Split the data at hand into training and test subsets
2) Repeat optimization loop a fixed number of times or until a condition is met:
a. Select a new set of model hyperparameters
b. Train the model on the training subset using the selected set of hyperparameters
c. Apply the model to the test subset and generate the corresponding predictions
d. Evaluate the test predictions using the appropriate scoring metric for the problem at hand, such as accuracy or mean absolute error. Store the metric value that corresponds to the selected set of hyperparameters
3) Compare all metric values and choose the hyperparameter set that yields the best metric value
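A minimal Python sketch of this loop, using a simple hold-out split and accuracy as the metric (the candidate hyperparameter sets, the decision tree model, and the iris data are placeholders; any model and scoring metric could be plugged in):

```python
from sklearn.datasets import load_iris
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

def optimize(model_factory, candidate_sets, X, y):
    """Generic loop: try each candidate hyperparameter set, keep the best score."""
    X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)  # step 1
    best_params, best_score = None, float("-inf")
    for params in candidate_sets:                       # step 2a: select next set
        model = model_factory(**params)
        model.fit(X_train, y_train)                     # step 2b: train
        preds = model.predict(X_test)                   # step 2c: predict
        score = accuracy_score(y_test, preds)           # step 2d: evaluate and store
        if score > best_score:
            best_params, best_score = params, score
    return best_params, best_score                      # step 3: best metric wins

# Example usage with a decision tree and three candidate depths
X, y = load_iris(return_X_y=True)
print(optimize(DecisionTreeClassifier, [{"max_depth": d} for d in (2, 4, 8)], X, y))
```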
Now, the question is how to pass from step 2d back to step 2a for the next iteration; that is, how to select the next set of hyperparameters, making sure that it is actually better than the previous set. We would like our optimization loop to move toward a reasonably good solution, even though it may not be the optimal one. In other words, we want to be reasonably sure that the next set of hyperparameters is an improvement over the previous one.
A typical optimization procedure treats a machine learning model as a black box. That means at each iteration for each selected set of hyperparameters, all we are interested in is the model performance as measured by the selected metric. We do not need (want) to know what kind of magic happens inside the black box. We just need to move to the next iteration and iterate over the next performance evaluation, and so on.
The key factor in all different optimization strategies is how to select the next set of hyperparameter values in step 2a, depending on the previous metric outputs in step 2d. Therefore, for a simplified experiment, we omit the training and testing of the black box, and we focus on the metric calculation (a mathematical function) and the strategy to select the next set of hyperparameters. In addition, we have substituted the metric calculation with an arbitrary mathematical function and the set of model hyperparameters with the function parameters.
In this way, the optimization loop runs faster and remains as general as possible. One further simplification is to use a function with only one hyperparameter to allow for an easy visualization. Below is the function we used to demonstrate the four optimization strategies. We would like to emphasize that any other mathematical function would have worked as well.
f(x) = sin(x/2) + 0.5⋅sin(2⋅x) + 0.25⋅cos(4.5⋅x)
This simplified setup allows us to visualize the experimental values of the one hyperparameter and the corresponding function values on a simple x-y plot. On the x axis are the hyperparameter values and on the y axis the function outputs. The (x,y) points are then colored according to a white-red gradient describing the point position in the generation of the hyperparameter sequence.
Whiter points correspond to hyperparameter values generated earlier in the process; redder points correspond to hyperparameter values generated later on in the process. This gradient coloring will be useful later to illustrate the differences across the optimization strategies.
The goal of the optimization procedure in this simplified use case is to find the one hyperparameter that maximizes the value of the function.
Let’s begin our review of four common optimization strategies used to identify the new set of hyperparameter values for the next iteration of the optimization loop.
Grid search
This is a basic brute-force strategy. If you do not know which values to try, you try them all. All possible values within a range with a fixed step are used in the function evaluation.
For example, if the range is [0, 10] and the step size is 0.1, then we would get the sequence of hyperparameter values (0, 0.1, 0.2, 0.3, … 9.5, 9.6, 9.7, 9.8, 9.9, 10). In a grid search strategy, we calculate the function output for each and every one of these hyperparameter values. Therefore, the finer the grid, the closer we get to the optimum — but also the higher the required computation resources.
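A minimal sketch of this grid search on the toy function introduced above (the range, step size, and the use of f(x) in place of a real train/evaluate cycle mirror the simplified setup of this article):

```python
import numpy as np

def f(x):
    """Toy 'metric' standing in for the black-box model evaluation."""
    return np.sin(x / 2) + 0.5 * np.sin(2 * x) + 0.25 * np.cos(4.5 * x)

grid = np.arange(0.0, 10.0 + 1e-9, 0.1)      # all candidates in [0, 10] with step 0.1
scores = f(grid)
best = grid[np.argmax(scores)]
print(f"best x = {best:.1f}, f(x) = {scores.max():.3f}")
```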
Figure 1: Grid search of the hyperparameter values on a [0, 10] range with step 0.1. The color gradient reflects the position in the generated sequence of hyperparameter candidates. Whiter points correspond to hyperparameter values generated earlier on in the process; red points correspond to hyperparameter values generated later on.
As Figure 1 shows, the range of the hyperparameter is scanned from small to large values.
The grid search strategy can work well in the case of a single parameter, but it becomes very inefficient when multiple parameters have to be optimized simultaneously.
Random search
For the random search strategy, the values of the hyperparameters are selected randomly, as the name suggests. This strategy is typically preferred in the case of multiple hyperparameters, and it is particularly efficient when some hyperparameters affect the final metric more than others.
Again, the hyperparameter values are generated within a range [0, 10]. Then, a fixed number N of hyperparameters is randomly generated. The fixed number N of predefined hyperparameters to experiment with allows you to control the duration and speed of this optimization strategy. The larger the N, the higher the probability to get to the optimum — but also the higher the required computation resources.
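The corresponding sketch for random search (the budget of 100 candidates and the random seed are arbitrary choices):

```python
import numpy as np

def f(x):   # same toy metric as in the grid search sketch
    return np.sin(x / 2) + 0.5 * np.sin(2 * x) + 0.25 * np.cos(4.5 * x)

rng = np.random.default_rng(0)
candidates = rng.uniform(0.0, 10.0, size=100)   # fixed budget N = 100
best = candidates[np.argmax(f(candidates))]
print(f"best x = {best:.2f}, f(x) = {f(best):.3f}")
```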
Figure 2: Random search of the hyperparameter values on a [0, 10] range. The color gradient reflects the position in the generated sequence of hyperparameter candidates. Whiter points correspond to hyperparameter values generated earlier on in the process; red points correspond to hyperparameter values generated later on.
As expected, the hyperparameter values in the generated sequence follow no increasing or decreasing order: white and red dots mix randomly in the plot.
Hill climbing
The hill climbing approach at each iteration selects the best direction in the hyperparameter space to choose the next hyperparameter value. If no neighbor improves the final metric, the optimization loop stops.
Note that this procedure is different from the grid and random searches in one important aspect: selection of the next hyperparameter value takes into account the outcomes of previous iterations.
Figure 3: Hill climbing search of the hyperparameter values on a [0, 10] range. The color gradient reflects the position in the generated sequence of hyperparameter candidates. Whiter points correspond to hyperparameter values generated earlier on in the process; red points correspond to hyperparameter values generated later on.
Figure 3 shows that the hill climbing strategy applied to our function started at a random hyperparameter value, x=8.4, and then moved toward the function maximum y=0.4 at x=6.9. Once the maximum was reached, no further increase in the metric was observed in the next neighbor, and the search procedure stopped.
This example illustrates a caveat related to this strategy: it can get stuck in a secondary maximum. From the other plots, we can see that the global maximum is located at x=4.0 with a corresponding metric value of 1.6. This strategy does not find the global maximum but gets stuck in a local one. A good rule of thumb for this method is to run it multiple times with different starting values and to check whether the algorithm converges to the same maximum.
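A minimal sketch of hill climbing on the same toy function, including the multiple-restart rule of thumb mentioned above (the step size, bounds, and number of restarts are arbitrary choices):

```python
import numpy as np

def f(x):   # same toy metric as above
    return np.sin(x / 2) + 0.5 * np.sin(2 * x) + 0.25 * np.cos(4.5 * x)

def hill_climb(start, step=0.1, lo=0.0, hi=10.0):
    """Move to the better neighbor until neither neighbor improves the metric."""
    x = start
    while True:
        neighbors = [n for n in (x - step, x + step) if lo <= n <= hi]
        best_neighbor = max(neighbors, key=f)
        if f(best_neighbor) <= f(x):          # no improvement: local maximum reached
            return x
        x = best_neighbor

# Restarting from several random points reduces the risk of a local maximum
rng = np.random.default_rng(3)
starts = rng.uniform(0.0, 10.0, size=5)
best = max((hill_climb(s) for s in starts), key=f)
print(f"best x = {best:.2f}, f(x) = {f(best):.3f}")
```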
Bayesian optimization
The Bayesian optimization strategy selects the next hyperparameter value based on the function outputs in the previous iterations, similar to the hill climbing strategy. Unlike hill climbing, Bayesian optimization looks at past iterations globally and not only at the last one.
There are typically two phases in this procedure:
During the first phase, called warm-up, hyperparameter values are generated randomly. After a user-defined number N of such random generations of hyperparameters, the second phase kicks in.
In the second phase, at each iteration, a “surrogate” model of type P(output | past hyperparameters) is estimated to describe the conditional probability of the output values on the hyperparameter values from past iterations. This surrogate model is much easier to optimize than the original function. Thus, the algorithm optimizes the surrogate and suggests the hyperparameter values at the maximum of the surrogate model as the optimal values for the original function as well. A fraction of the iterations in the second phase is also used to probe areas outside of the optimal region. This is to avoid the problem of local maxima.
Figure 4: Bayesian optimization of the hyperparameter values on a [0, 10] range. The color gradient reflects the position in the generated sequence of hyperparameter candidates. Whiter points correspond to hyperparameter values generated earlier on in the process; red points correspond to hyperparameter values generated later on. The gray points are generated in the first random phase of the strategy.
Figure 4 demonstrates that the Bayesian optimization strategy uses the warm-up phase to define the most promising area and then selects the next values for the hyperparameters in that area.
You can also see that intense red points are clustered closer to the maximum, while pale red and white points are scattered. This demonstrates that the definition of the optimal region is improved with each iteration of the second phase.
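A rough sketch of the two-phase idea, using a Gaussian process from scikit-learn as the surrogate model and a simple upper-confidence-bound rule to pick the next candidate (the warm-up size, number of iterations, and exploration coefficient are arbitrary choices, and this acquisition rule is a simplification of the probing strategy described above):

```python
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor

def f(x):   # same toy metric as above
    return np.sin(x / 2) + 0.5 * np.sin(2 * x) + 0.25 * np.cos(4.5 * x)

rng = np.random.default_rng(4)
grid = np.linspace(0.0, 10.0, 1001).reshape(-1, 1)   # candidate locations for the surrogate

# Warm-up phase: a few randomly generated hyperparameter values
X = rng.uniform(0.0, 10.0, size=(5, 1))
y = f(X).ravel()

# Second phase: fit the surrogate, then pick the next candidate where it looks best
for _ in range(20):
    surrogate = GaussianProcessRegressor(normalize_y=True).fit(X, y)
    mean, std = surrogate.predict(grid, return_std=True)
    x_next = grid[np.argmax(mean + 1.5 * std)]        # exploration term probes uncertain areas
    X = np.vstack([X, [x_next]])
    y = np.append(y, f(x_next))

print(f"best x = {X[np.argmax(y)][0]:.2f}, f(x) = {y.max():.3f}")
```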
Summary
We all know the importance of hyperparameter optimization while training a machine learning model. Since manual optimization is time-consuming and requires specific expert knowledge, we have explored four common automatic procedures for hyperparameter optimization.
In general, an automatic optimization procedure follows an iterative procedure in which at each iteration, the model is trained on a new set of hyperparameters and evaluated on the test set. At the end, the hyperparameter set corresponding to the best metric score is chosen as the optimal set. The question is how to select the next set of hyperparameters, ensuring that this is actually better than the previous set.
We have provided an overview of four commonly used optimization strategies: grid search, random search, hill climbing, and Bayesian optimization. They all have pros and cons, and we have briefly explained the differences by illustrating how they work on a simple toy use case. Now you are all set to go and try them out in a real-world machine learning problem.
Spotfire Web Application for KNIME: SWAK (01/27/2020)
Authors: Lionel Colliandre & Eric Le Roux, Discngine
Introduction
The authors of this article, Lionel Colliandre and Eric Le Roux, are both from Discngine, a company that operates in the life science field. The company serves the needs of chemists and biologists in research, helping to organize their data, providing informatics services for screening and data acquisition, and enabling better decision making.
Topic. Spotfire Web Application for KNIME: encompassing KNIME capabilities in Spotfire
So far, they have focused on technologies such as Pipeline Pilot® and TIBCO Spotfire®. However, in response to seeing a lot of traction among customers towards using KNIME Analytics Platform, they wanted to find out more! One focus of Lionel and Eric’s work with KNIME is building connectors. In this article, they explain how their new connector bridges Spotfire and KNIME and what kind of benefits this offers.
Their Spotfire Web Application for KNIME – SWAK – uses the Discngine Connector API to combine the two software platforms by way of a web interface. The purpose of SWAK is to give users an interface to control KNIME workflows that will design instructions to create and/or update Spotfire documents. These instructions, defined by the scientist and parameterized by the end user, are mediated through the Discngine Connector API.
Here, Lionel and Eric describe their SWAK and what kind of communication takes place between KNIME Server and Spotfire through the web application and the Discngine Connector.
What would data science be without visualization? Nothing!
TIBCO Spotfire® is widely used for visualizing massive amounts of data in a fast and interactive way. By encompassing KNIME capabilities inside Spotfire we can construct a dynamic platform that makes managing and visualizing data much easier for end users with little or no coding knowledge.
Fig. 1. Screenshot of the SWAK mashup with the left panel allowing control of the execution of KNIME workflows. The right-hand section shows the Spotfire document
In our experience, a common use of TIBCO Spotfire® begins with users loading their raw data files directly. Raw data files can, however, be difficult to fully exploit. Using KNIME Analytics Platform to pre-process the data is possible, but this means having to go through two steps, each one independent of the other. More interactivity would speed up the process particularly with regard to the type of analyses that are carried out again and again. Our Spotfire Web Application for KNIME addresses precisely the topic of interactivity.
KNIME capabilities inside Spotfire
The central piece of our SWAK solution is the Discngine Client Automation API, which is part of the Discngine Connector. This Javascript API allows you to programmatically interact and control Spotfire in both Analyst and Web Player clients.
The SWAK is a mashup application including web content written in React and the Spotfire client. The Javascript API allows KNIME developers to:
manage authentication, workflow registration, workflow execution, and input/output data management
dynamically create content within the Spotfire document: import data, create and update visuals inside the Spotfire client, etc.
In the backend, KNIME workflows are stored on KNIME Server. The Javascript API controls execution of the workflows via the KNIME Server REST API. Data output is saved in the Spotfire format through the Spotfire nodes included in KNIME. (The KNIME Spotfire Integration can be downloaded and installed from the KNIME Hub.) A second output consists of the generated Javascript instructions for the Discngine Client Automation API. These instructions are finally executed to create content within the Spotfire document based on the data preprocessed in KNIME.
Fig. 2. Use of the Discngine Client Automation API to add an interactive web panel inside Spotfire, allowing KNIME workflows to be called in the backend.
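To make the execution path more concrete, here is a minimal Python sketch of the kind of REST call that triggers a workflow on KNIME Server. The server URL, workflow path, credentials, endpoint route, and input parameters are all illustrative assumptions; in SWAK itself, such calls are driven from the Javascript layer rather than from Python.

```python
# Minimal sketch (not SWAK code): triggering a workflow execution on KNIME Server
# over REST. The endpoint route and payload below are assumptions for illustration;
# check your KNIME Server REST API documentation for the exact routes and parameters.
import requests

SERVER = "https://knime-server.example.com/knime/rest/v4"   # hypothetical server URL
WORKFLOW = "/SWAK/preprocess_raw_data"                      # hypothetical workflow path

session = requests.Session()
session.auth = ("analyst", "secret")                        # placeholder credentials

# Ask the server to execute the workflow with some input parameters (assumed endpoint).
response = session.post(
    f"{SERVER}/repository{WORKFLOW}:execution",
    json={"query-parameter": "value"},                      # assumed workflow inputs
    timeout=300,
)
response.raise_for_status()
print(response.json())                                      # job status / output references
```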
A way for KNIME developers to share their work
SWAK has been designed to serve as a central repository for Spotfire-enabled KNIME workflows. On the one side, KNIME developers create workflows based on the needs of their end users. They are shared in SWAK with preconfigured options. At the other end of the process, Spotfire users can execute recurrent data analyses with their own data based on the registered workflows in the central repository. The main advantage is that users can create and/or update Spotfire visuals and data on the fly according to the rules implemented in the workflows.
The graphic below shows how easy it is to share KNIME workflows in SWAK, leading to the creation of a catalog of KNIME workflows dedicated to Spotfire.
Fig. 3. Share your KNIME workflows and make them accessible from within the Spotfire platform in a few steps.
Main features
SWAK improves the interactivity you can have in KNIME and/or Spotfire alone. It opens up a lot of possibilities, including being able to:
Use standardized preprocesses to load data dynamically into Spotfire from a KNIME execution
Create and show predefined pages and visuals in Spotfire
Use Spotfire as a data source for KNIME workflows: marked rows, selected columns, or full data tables
Display advanced forms in the SWAK Web Panel: File upload, molecular sketchers
Write back data based on users’ actions
Conclusion
The SWAK (Spotfire Web Application for KNIME) is a platform KNIME developers can use to share their workflows with their colleagues and make them accessible from within the Spotfire platform. In turn, Spotfire users can benefit from KNIME Analytics Platform’s capabilities without the need to go through export/import processes.
To discover all the possibilities around the SWAK, you can request a demo via this form.
References:
Try out the KNIME Spotfire Integration here on the KNIME Hub.
Data Science: How to Successfully Create and Productionize Across the EnterprisebertholdThu, 01/30/2020 - 10:00
By Michael Berthold, KNIME. As first published in Techopedia.
There is a lot of talk about data science these days, and how it affects essentially all types of businesses. Concerns are raised by management teams about the lack of people to create data science, and promises are made left and right on how to simplify or automate this process.
Yet, little attention is paid to how the results can actually be put into production in a professional way.
Takeaway: Optimizing data science across the entire enterprise requires more than just cool tools for wrangling and analyzing data.
Obviously, we can simply hardcode a data science model or rent a pre-trained predictive model in the cloud, embed it into an application in-house and we are done. But that is not giving us the true value data science can provide: continuously adjusting to new requirements and data, applicable to new or variations of existing problems, and providing new insights that have profound impact on our business. (Read more on the Data Science job role here.)
In order to truly embed data science in our business, we need to start treating data science like other business-critical technologies and provide a professional path to production using reliable, repeatable environments for both the creation and the productionization of data science.
End to end data science: creation and productionization
Many of the processes we need to establish in order to support high-quality data science throughout an enterprise are similar to professional software development: solid design principles, release pipelines, and agile processes ensure quality, sharing, and reproducibility while maintaining the ability to react quickly to new requirements. (Read Enterprise Cloud 101.)
Applying these concepts to data science enables continuous and fast delivery of new or updated data science applications and services as well as prompt incorporation of user feedback.
The typical four stages of end-to-end data science need to be tightly coupled and yet flexible enough to allow for such an agile delivery and feedback loop:
Data definition, access, and wrangling:
This is the classic domain of data architects and data engineers. Their focus lies on defining how and where to store data, providing repeatable ways to access old and new data, extracting and combining the right data to use for a particular project, and transforming the final data into the right format.
Parts of these activities can be addressed with a solid data warehouse strategy, but in reality, the hybrid nature of most organizations does not allow for such a static setup. Combining legacy data with in-house and cloud databases, accessing structured and unstructured data, enriching the data with other data sources (e.g. social media data, information available from online providers) continuously poses new challenges to keep projects up to date.
This is also the reason why most of this function needs to be part of the overall data science practice and cannot be owned solely by IT — the success of many data projects relies on quick adjustments to changes in data repositories and the availability of new data sources.
Automation here can help with learning how to integrate data and making some of the data wrangling easier, but ultimately, picking the right data and transforming them “the right way” is already a key ingredient for project success.
Data analysis and visualization:
This is where all those topical buzzwords come in: Artificial Intelligence (AI), Machine Learning (ML), Automation, plus all the “Deep” topics currently on everybody’s radar. However, statistical data analysis, standard visualization techniques, and all those other classic techniques must still be part of the analysis toolbox.
Ultimately, the goal remains the same: creating aggregations/visualizations, finding patterns, or extracting models that we can use to describe or diagnose our process or predict future events, so as to prescribe appropriate actions. The excitement for modern technologies has often led to people ignoring the weakness of applying black box techniques, but recently, increasing attention is being paid to the interpretability and reliability of these approaches.
The more sophisticated the method, the less likely it is that we can understand how the model reaches specific decisions and how statistically sound that decision is.
In all but the simplest cases, however, this stage of the data science process does not operate in isolation. Inspecting aggregations and visualizations will trigger requests for more insights that require other types of data, extracted patterns will demand different perspectives, and predictions will initially be mostly wrong until the expert has understood the reasons why the model is “off” and has fixed data issues, adjusted transformations, and explored other models and optimization criteria.
Here, a tight feedback loop to the data wrangling stage is critical — ideally, the analytics expert can, at least partially, change some of the data access and transformation directly. And, in an ideal world, of course, all this work is done in collaboration with other experts, building on their expertise instead of continuously reinventing the wheel.
Organizing the data science practice:
Having a team of experts work on projects is great. Ensuring that this team works well together and their results are put into production easily and reliably is the other half of the job of whoever owns “data science” in the organization — and that part is often still ignored.
How do we keep those experts happy? We should enable them to focus on what they do best:
Solving data wrangling or analysis problems using their favorite environment.
Instead of forcing and locking them all into a proprietary solution, an integrative data science environment allows different technologies to be combined and enables the experts to collaborate instead of compete. This, of course, makes managing that team even more of a challenge.
The data science practice leader needs to ensure that collaboration results in the reuse of existing expertise, that past knowledge is managed properly, and best practices are not a burden but really do make people’s lives easier. This is probably still the biggest gap in many data science toolkits. Requiring backwards compatibility beyond just a few minor releases, version control, and the ability to audit past analyses are essential to establishing a data science practice and evolving from the “one-shot solutions” that still prevail.
The final piece in this part of the puzzle is a consistent, repeatable path to deployment. Still too often, the results of the analysis need to be ported into another environment, causing lots of friction and delays, and adding yet another potential source of error.
Ideally, deploying data science results — via dashboards, data science services, or full-blown analytical applications — should be possible within the very same environment that was used to create the analysis in the first place.
Creating business value:
Why are we doing all of the above? In the end, it is all about turning the results into actual value. Surprisingly, however, this part is often decoupled from the previous stages. Turning around quickly to allow the business owner to inject domain knowledge and other feedback into the process, often as early as what type of data to ingest, is essential.
In an ideal world this can either directly affect the analytical service or application that was built (and, preferably, without having to wait weeks for the new setup to be put in place) or the data science team has already integrated interactivity into the analytical application, which allows the domain user’s expertise to be captured.
Being able to mix & match these two approaches allows the data science team to deliver an increasingly flexible application, perfectly adjusted to the business need. The similarity to agile development processes becomes most obvious here:
The end user’s feedback needs to truly drive what is being developed and deployed.
Standardization, automation, or custom data science?
The last questions are:
Does every organization need the four personas above? Do we really need in-house expertise on every aspect of the above?
And the answer is most often:
Probably not.
The ideal data science environment provides the flexibility to mix & match. Maybe data ingestion only needs to be automated or defined just once with the help of outside consultants, while your in-house data science team provides business critical insights that need to be refined, updated and adjusted on a daily basis.
Or your business relies less on analytical insights and you are happy to trust automated or prepackaged ML but your data sources keep changing and growing continuously and your in-house data wrangling team needs full control over which data are going to be integrated and how.
Or you are just at the beginning of the data science journey and are focusing on getting your data in shape and creating standard reports. Still, investing in a platform that does cover the entire data science life cycle, when the time is ripe, sets the stage for future ambitions. And even if, right now, you are the data architect, wrangler, analyst, and user all in one person, preparing for the time when you add colleagues for more specialized aspects may be a wise move.
Productionizing data science
Successfully creating and productionizing data science in the real world requires a comprehensive and collaborative end-to-end environment that allows everybody from the data wrangler to the business owner to work closely together and incorporate feedback easily and quickly across the entire data science lifecycle.
This is probably the most important message to all stakeholders.
Even though these roles have existed in organizations before, the real challenge is to find an integrative environment that allows everybody to contribute what they do and know best. That environment covers the entire cycle and at the same time allows you to pick & choose: standard components here, a bit of automation there, and custom data science where you need it.
And finally, we need to be ready for the future. We need an open environment that allows us to add new data sources, formats, and analysis technologies to the mix quickly.
After all: do you know what kind of data you will want to digest in a few years? Do you know what tools will be available and what the newest trends will be?
Jupyter Notebooks offer incredible potential for disseminating technical knowledge thanks to their integrated text plus live code interface. This is a great way of understanding how specific tasks in the Computer-Aided Drug Design (CADD) world are performed, but only if you have basic coding expertise. While users without a programming background can simply execute the code blocks blindly, this rarely provides any useful feedback on how a particular pipeline works. Fortunately, more visual alternatives like KNIME workflows are better suited for this kind of audience.
Our team put together these tutorials for (a) ourselves as scientists who want to learn about new topics in drug design and how to actually apply them practically to data using Python/KNIME, (b) new students in the group who need a compact but detailed enough introduction to get started with their project, and (c) for the classroom where we can use the material directly or build on top of it.
Fig. 1: The visual capabilities of the KNIME Platform are evident. This is not a diagram of the TeachOpenCADD KNIME workflows, but the actual project as rendered in KNIME itself. Each box can be accessed individually for further configuration and workflow details.
The pipeline is illustrated using the epidermal growth factor receptor (EGFR), but can easily be applied to other targets of interest. Topics include how to fetch, filter and analyze compound data associated with a query target. The bundled project including all workflows is freely available on KNIME Hub. The Hub also lists the individual workflows for separate downloads if desired. Further details are given in the following sections.
Note: The screenshots shown below are taken from the individual workflows, which resemble the complete workflow but have different input and output sources. Double click the screenshots to see a larger display of the image.
Workflow 1: Acquire compound data from ChEMBL
Information on compound structure, bioactivity, and associated targets is organized in databases such as ChEMBL, PubChem, or DrugBank. Workflow W1 shows how to obtain and preprocess compound data for a query target (default target: EGFR) from the ChEMBL web services.
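As a rough Python counterpart to this workflow, the ChEMBL web services can also be queried with the chembl_webresource_client package; a minimal sketch is shown below. The target ID CHEMBL203 is used here under the assumption that it identifies EGFR, and the printed field names are those commonly returned by the activity endpoint.

```python
# Sketch: fetching bioactivity data for a query target from the ChEMBL web services.
# Assumes the chembl_webresource_client package and that CHEMBL203 is the ChEMBL ID
# of the default target, EGFR. This mirrors the idea of Workflow W1, not its exact steps.
from chembl_webresource_client.new_client import new_client

activities = new_client.activity.filter(
    target_chembl_id="CHEMBL203",   # query target
    standard_type="IC50",           # keep only IC50 measurements
)

# Print a handful of records; each record is returned as a dictionary.
for i, record in enumerate(activities):
    print(record["molecule_chembl_id"], record["standard_value"], record["standard_units"])
    if i == 4:
        break
```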
Workflow 2: Filter datasets by ADME criteria
Not all compounds are suitable starting points for drug development due to undesirable pharmacokinetic properties, which for instance negatively affect a drug's absorption, distribution, metabolism, and excretion (ADME). Therefore, such compounds are often excluded from data sets for virtual screening. Workflow W2 shows how to remove less drug-like molecules from a data set using Lipinski's rule of five.
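As an illustrative Python analogue of this filtering step (not the workflow itself), Lipinski's rule of five can be checked with RDKit roughly as follows; the strict zero-violation criterion used here is an assumption, as some variants allow one violation.

```python
# Sketch: flagging molecules that violate Lipinski's rule of five with RDKit.
from rdkit import Chem
from rdkit.Chem import Descriptors, Lipinski

def passes_ro5(smiles: str) -> bool:
    """Return True if the molecule fulfills all four rule-of-five criteria."""
    mol = Chem.MolFromSmiles(smiles)
    if mol is None:
        return False
    violations = sum([
        Descriptors.MolWt(mol) > 500,       # molecular weight
        Descriptors.MolLogP(mol) > 5,       # lipophilicity (logP)
        Lipinski.NumHDonors(mol) > 5,       # hydrogen bond donors
        Lipinski.NumHAcceptors(mol) > 10,   # hydrogen bond acceptors
    ])
    return violations == 0                  # strict variant: no violation allowed

print(passes_ro5("CC(=O)Oc1ccccc1C(=O)O"))  # aspirin -> True
```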
Workflow 3: Set alerts based on unwanted substructures
Compounds can contain unwanted substructures that may cause mutagenic, reactive, or other unfavorable pharmacokinetic effects or that may lead to non-specific interactions with assays (PAINS). Knowledge of unwanted substructures in a data set can be integrated into cheminformatics pipelines either to perform an additional filtering step before screening or - more often - to set alert flags on potentially problematic compounds (for manual inspection by medicinal chemists). Workflow W3 shows how to detect and flag such unwanted substructures in a compound collection.
Workflow 4: Screen compounds by compound similarity
In virtual screening (VS), compounds similar to known ligands of a target under investigation often form the starting point for drug development. This approach follows the similar property principle, stating that structurally similar compounds are more likely to exhibit similar biological activities. For computational representation and processing, compound properties can be encoded in the form of bit arrays, so-called molecular fingerprints, e.g. MACCS and Morgan fingerprints. Compound similarity can be assessed by measures such as the Tanimoto and Dice similarity. Workflow W4 shows how to use these encodings and comparison measures; here, VS is conducted as a similarity search.
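A minimal RDKit sketch of these encodings and similarity measures is shown below; the two example molecules are arbitrary and only serve to illustrate the calls.

```python
# Sketch: Morgan and MACCS fingerprints plus Tanimoto/Dice similarity with RDKit.
from rdkit import Chem, DataStructs
from rdkit.Chem import AllChem, MACCSkeys

query = Chem.MolFromSmiles("CC(=O)Oc1ccccc1C(=O)O")   # aspirin, as an example query
candidate = Chem.MolFromSmiles("OC(=O)c1ccccc1O")     # salicylic acid

# Morgan (circular) fingerprints with radius 2 and 2048 bits
fp_query = AllChem.GetMorganFingerprintAsBitVect(query, 2, nBits=2048)
fp_cand = AllChem.GetMorganFingerprintAsBitVect(candidate, 2, nBits=2048)
print("Morgan Tanimoto:", DataStructs.TanimotoSimilarity(fp_query, fp_cand))
print("Morgan Dice:    ", DataStructs.DiceSimilarity(fp_query, fp_cand))

# MACCS keys as an alternative encoding
maccs_query = MACCSkeys.GenMACCSKeys(query)
maccs_cand = MACCSkeys.GenMACCSKeys(candidate)
print("MACCS Tanimoto: ", DataStructs.TanimotoSimilarity(maccs_query, maccs_cand))
```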
Workflow 5: Group compounds by similarity
Clustering can be used to identify groups of similar compounds, in order to pick a set of diverse compounds from these clusters for e.g. non-redundant experimental testing or to identify common patterns in the data set. Workflow W5 shows how to perform such a clustering based on a hierarchical clustering algorithm.
Workflow 6: Find the maximum common substructure in a collection of compounds
In order to visualize shared scaffolds and thereby emphasize the extent and type of chemical similarities in a compound cluster, the maximum common substructure (MCS) can be calculated and highlighted. In Workflow W6, the MCS for the largest cluster from previously clustered compounds (W5) is calculated using the FMCS algorithm.
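The same idea can be sketched in Python with RDKit's FMCS implementation; the three molecules below are placeholders standing in for a compound cluster.

```python
# Sketch: maximum common substructure of a small compound set via RDKit's FMCS.
from rdkit import Chem
from rdkit.Chem import rdFMCS

mols = [Chem.MolFromSmiles(s) for s in (
    "CC(=O)Oc1ccccc1C(=O)O",   # aspirin
    "OC(=O)c1ccccc1O",         # salicylic acid
    "OC(=O)c1ccccc1",          # benzoic acid
)]

mcs = rdFMCS.FindMCS(mols, ringMatchesRingOnly=True, timeout=10)
print("MCS SMARTS:", mcs.smartsString)
print("Atoms/bonds in MCS:", mcs.numAtoms, mcs.numBonds)
```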
Workflow 7: Screen compounds using machine learning methods
With the continuously increasing amount of available data, machine learning (ML) has gained momentum in drug discovery, especially in ligand-based virtual screening to predict the activity of novel compounds against a target of interest. In Workflow W7, different ML models (RF, SVM, and NN) are trained on the filtered ChEMBL dataset to discriminate between active and inactive compounds with respect to a protein target.
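A scikit-learn sketch of this step is shown below. The fingerprint matrix and activity labels are random stand-ins for the real, filtered ChEMBL data, and the model settings are illustrative rather than those used in the workflow.

```python
# Sketch: training RF and SVM classifiers on fingerprint features to separate
# active from inactive compounds (random placeholder data, illustrative settings).
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import SVC
from sklearn.model_selection import train_test_split
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(42)
X = rng.integers(0, 2, size=(500, 2048))   # fingerprint bit matrix
y = rng.integers(0, 2, size=500)           # activity labels (1 = active)

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

for name, model in [("Random forest", RandomForestClassifier(n_estimators=200)),
                    ("SVM", SVC(probability=True))]:
    model.fit(X_train, y_train)
    proba = model.predict_proba(X_test)[:, 1]
    print(name, "AUC:", round(roc_auc_score(y_test, proba), 3))
```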
Workflow 8: Acquire structural data from PDB
The PDB database holds 3D structural data and meta information on experimentally resolved proteins. Workflow W8 shows how structural data can be automatically fetched from the PDB and processed.
Requirements
All the workflows have been tested on KNIME v4 and v4.1. In addition to some extensions provided by the KNIME team, TeachOpenCADD also requires:
Will They Blend? KNIME meets the Semantic WebMartynaMon, 02/10/2020 - 10:00
Author: Martyna Pawletta (KNIME)
Today: Ontologies – or let’s see if we can serve pizza via the semantic web and KNIME Analytics Platform. Will they blend?
Ontology, as a discipline, studies concepts that directly relate to “being”, i.e. concepts that relate to existence and reality as well as the basic categories of being and their relations. In information science, an ontology is a formal description of knowledge as a set of concepts within a domain. In an ontology we have to specify the different objects and classes and the relations - or links - between them. Ultimately, an ontology is a reusable knowledge representation that can be shared.
Fun reference: The Linked Open Data Cloud has an amazing graphic showing how many ontologies (linked data) are available on the web.
The Challenge
The Semantic Web and the collection of related Semantic Web technologies like RDF (Resource Description Framework), OWL (Web Ontology Language) or SPARQL (SPARQL Protocol and RDF Query Language) offer a bunch of tools where linked data can be queried, shared and reused across applications and communities. A key role in this area is played by ontologies and OWLs.
So where does the OWL come into this? Well, no - we don’t mean the owl as a bird here - but you see the need for ontologies, right? The term OWL can have different meanings, and this is one of the reasons why creating ontologies for specific domains might make sense.
Ontologies can be very domain specific and not everybody is an expert in every domain - but it’s a relatively safe bet to say that we’ve all eaten pizzas at some point in time - so let’s call ourselves pizza experts. Today’s challenge is to extract information from an OWL file containing information about pizza and traditional pizza toppings, store this information in a local SPARQL Endpoint, and execute SPARQL queries to extract some yummy pizza, em - I mean data. Finally, this data will be displayed in an interactive view which allows you to investigate the content.
Topic. KNIME meets the Semantic Web.
Challenge. Extract information from a Web Ontology Language (OWL) file.
Access Mode / Integrated Tool. KNIME Semantic Web/Linked Data Extension.
The ontology used in this blog post and demonstrated workflow is an example ontology that has been used in different versions of the Pizza Tutorial run by Manchester University. See more information on Github here.
The Experiment
Reading and querying an OWL file
In the first step the Triple File Reader node extracts the content of the pizza ontology in the OWL file format and reads all triples into a Data Table. Triples are represented as three columns containing a subject (URI), a predicate (URI), and an object (URI or literal), in short: sub, pred, obj. The predicate denotes the relationship between the subject and the object. As shown in the screenshot below (Fig. 1), in the example we see that the pizza FruttiDiMare is a subClassOf the class NamedPizza and has two labels: a preferred and an alternative one.
Figure 1. Screenshot showing the output of the Triple File Reader node containing a subject, predicate and object column.
Once the Triple File Reader is executed, a SPARQL Endpoint can be created using the Memory Endpoint node together with the SPARQL Insert node. This allows the execution of SPARQL queries. Note that our Triple File Reader does not officially support the OWL format. KNIME can read RDF files, and because OWL files are very similar, we can read these files too. However, not all information is necessarily retrieved, as OWL can have additional parameters.
The example in Figure 2 shows a SPARQL query node that contains a query to extract a basic list with all pizzas included in the owl file.
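For comparison, the same kind of query can be run outside KNIME with Python's rdflib. The sketch below assumes the pizza ontology file is available locally and that its classes live in the commonly used pizza namespace; adjust the file name and namespace URI to your copy.

```python
# Sketch: reading the pizza OWL file and listing all named pizzas with rdflib.
# File name and namespace URI are assumptions -- adapt them to the ontology you use.
from rdflib import Graph

g = Graph()
g.parse("pizza.owl", format="xml")   # OWL files are typically serialized as RDF/XML

query = """
PREFIX rdfs:  <http://www.w3.org/2000/01/rdf-schema#>
PREFIX pizza: <http://www.co-ode.org/ontologies/pizza/pizza.owl#>

SELECT ?pizza ?label WHERE {
  ?pizza rdfs:subClassOf pizza:NamedPizza .
  OPTIONAL { ?pizza rdfs:label ?label . }
}
"""

for row in g.query(query):
    print(row.pizza, row.label)
```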
A recommendation here: if the ontology you want to query is new to you, I would highly recommend quickly exploring its structure and classes first in another tool like Protégé. This makes it easier later on to create and write SPARQL queries.
Figure 2. Example workflow that shows how to read an OWL file, insert extracted triples into a SPARQL endpoint and execute a SPARQL query to extract all kinds of pizzas from the pizza ontology.
The SPARQL Query node has a checkbox on the top right (see Fig. 2) saying “Preserve N-Triples format”. Selecting this makes a difference in terms of what the output data will look like. The N-Triples format needs to be kept if the triples are to be inserted into an endpoint later.
The example below shows the effect of not checking (top) versus checking (bottom) the N-Triples checkbox. If the option is not selected, the angle brackets around URIs are not preserved, and for literals the quotes and the language/type tag (here @en) are removed.
Visualization
There are different ways in KNIME to visualize data. In the case of ontologies it really depends on what you are aiming to do. Here we will extract a bit more information than in the first example and create an interactive view within a component that allows us to explore the content of the pizza ontology.
In addition to the pizza labels, further information such as the toppings per pizza type and their spiciness was extracted, now using two SPARQL Query nodes (see Fig. 3). We also query for pizza toppings that are a subclass of the class VegetableToppings and use the Constant Value Column node to flag whether or not a topping is a vegetable.
Figure 3. Example workflow showing how the basic example from Fig. 2 can be extended and an interactive view created.
Finally, we create an interactive view where the extracted data can be explored (see Fig. 4). To open it, right-click the “Interactive View” component and select Interactive View.
Figure 4. Interactive view showing extracted data
Is it real?!
When I first looked at the dataset using the view, I saw the “Sloppy Giuseppe” pizza and immediately had to google it, as it was something completely new to me. I saw the toppings but wondered whether this is really Italian. This gave me the idea of adding another feature here, in addition to the tables and charts.
If you now click on the pizza name, a new window opens showing Google search results for that specific pizza type. I did this using the String Manipulation node, which creates a link. To make sure the link opens in a new window and not in your current view, the “target=_blank” option needs to be included.
The Results
Today we showed how to extract data from an OWL file, create a SPARQL endpoint, and execute SPARQL queries. Finally, we generated a view where the content can be explored.
After playing with such yummy data… hungry now? Let’s order a pizza then 😉
The example workflow shown in this article, Exploring a Pizza Ontology with an OWL file, can be downloaded from the KNIME Hub here.
Will They Blend? Experiments in Data & Tool Blending
In the Will They Blend blog series we experiment with the most interesting blends of data and tools. Whether it’s mixing traditional sources with modern data lakes, open-source devops on the cloud with protected internal legacy tools, SQL with noSQL, web-wisdom-of-the-crowd with in-house handwritten notes, or IoT sensor data with idle chatting, we’re curious to find out: will they blend? Want to find out what happens when website texts and Word documents are compared?
Follow us here and send us your ideas for the next data blending challenge you’d like to see at willtheyblend@knime.com.
How to keep bias out of your AI modelsadminThu, 02/13/2020 - 10:00
Artificial Intelligence models are empty, neutral machines.
They will acquire a bias when trained with biased data.
By Rosaria Silipo, KNIME. As first published in InfoWorld.
Bias in artificial intelligence (AI) is hugely controversial these days. From image classifiers that label people’s faces improperly to hiring bots that discriminate against women when screening candidates for a job, AI seems to inherit the worst of human practices when trying to automatically replicate them.
The risk is that we will use AI to create an army of racist, sexist, foul-mouthed bots that will then come back to haunt us. This is an ethical dilemma. If AI is inherently biased, isn’t it dangerous to rely on it? Will we end up shaping our worst future?
Artificial Intelligence and Bias is one of the topics planned for our new Birds of a Feather sessions during the KNIME Spring Summit in Berlin. Take part to meet KNIMErs and other attendees with similar interests.
Read more about our annual Spring Summit on March 30 - April 3, 2020 and send us your BoF topics when you sign up here
Machines will be machines
Let me clarify one thing first: AI is just a machine. We might anthropomorphize it, but it remains a machine. The process is not dissimilar from when we play with stones at the lake with our kids, and suddenly, a dull run-of-the-mill stone becomes a cute pet stone.
Even when playing with our kids, we generally do not forget that a pet stone, however cute, is still just a stone. We should do the same with AI: however humanlike its conversation or its look, we should not forget that it is still just a machine.
Some time ago, for example, I worked on a bot project: a teacher bot. The idea was to generate automatic informative answers to inquiries about documentation and features of the open source data science software KNIME Analytics Platform. As in all bot projects, one important issue was the speaking style.
There are many possible speaking or writing styles. In the case of a bot, you might want it to be friendly, but not excessively so — polite and yet sometimes assertive depending on the situation. The blog post “60 Words to Describe Writing or Speaking Styles” lists 60 nuances of different bot speaking styles: from chatty and conversational to lyric and literary, from funny and eloquent to formal and, my favorite of all, incoherent. Which speaking style should my bot adopt?
I went for two possible styles: polite and assertive. Polite to the limit of poetic. Assertive to bordering on impolite. Both are a free text generation problem.
As part of this teacher bot project, a few months ago I implemented a simple deep learning neural network with a hidden layer of long short-term memory (LSTM) units to generate free text.
The network would take a sequence of M characters as input and predict the next most likely character at the output layer. So given the sequence of characters “h-o-u-s” at the input layer, the network would predict “e” as the next most likely character. Trained on a corpus of free sentences, the network learns to produce words and even sentences one character at a time.
I did not build the deep learning network from scratch, but instead (following the current trend of finding existing examples on the internet) searched the KNIME Hub for similar solutions for free text generation. I found one, where a similar network was trained on existing real mountain names to generate fictitious copyright-free mountain-reminiscent candidate names for a line of new products for outdoor clothing. I downloaded the network and customized it for my needs, for example, by transforming the many-to-many into a many-to-one architecture.
The network would be trained on an appropriate set of free texts. During deployment, a trigger sentence of M=100 initial characters would be provided, and the network would then continue by itself to assemble its own free text.
Fig. 1 LSTM-based deep learning network for free text generation.
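A minimal Keras sketch of such a many-to-one architecture is shown below: the network reads M one-hot encoded characters and outputs a probability distribution over the next character. The layer size and vocabulary size are placeholders, not the values used in the original project.

```python
# Sketch: many-to-one character-level LSTM for free text generation (placeholder sizes).
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import LSTM, Dense

M = 100            # length of the input character sequence
VOCAB_SIZE = 60    # number of distinct characters (assumed)

model = Sequential([
    LSTM(256, input_shape=(M, VOCAB_SIZE)),    # hidden layer of LSTM units
    Dense(VOCAB_SIZE, activation="softmax"),   # probability of the next character
])
model.compile(optimizer="adam", loss="categorical_crossentropy")
model.summary()
```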
An example of AI bias
Just imagine a customer or user has unreasonable yet firmly rooted expectations and demands the impossible. How should I answer? How should the bot answer? The first task was to train the network to be assertive — very assertive to the limit of impolite. Where can I find a set of firm and assertive language to train my network?
I ended up training my deep learning LSTM-based network on a set of rap song texts. I figured that rap songs might contain all the sufficiently assertive texts needed for the task.
What I got was a very foul-mouthed network; so much so that every time I present this case study to an audience, I have to invite all minors to leave the room. You might think that I had created a sexist, racist, disrespectful — i.e., an openly biased — AI system. It seems I did.
Below is one of the rap songs the network generated. The first 100 trigger characters were manually inserted; these are in red. The network-generated text is in gray. The trigger sentence is, of course, important to set the proper tone for the rest of the text. For this particular case, I started with the most boring sentence you could find in the English language: a software license description.
It is interesting that, among all possible words and phrases, the neural network chose to include “paying a fee,” “expensive,” “banks,” and “honestly” in this song. The tone might not be the same, but the content tries to comply with the trigger sentence.
Fig. 2 An example of an AI-generated rap song. The trigger sentence in red is the start of a software license document.
More details about the construction, training, and deployment of this network can be found in the article “AI-Generated Rap Songs.”
The language might not be the most elegant and formal, but it has a pleasant rhythm to it, mainly due to the rhyming. Notice that for the network to generate rhyming text, the length M of the sequence of past input samples must be sufficient. Rhyming works for M=100 but never for M=50 past characters.
Removing bias from AI
In an attempt to reeducate my misbehaving network, I created a new training set that included three theater pieces by Shakespeare: two tragedies (“King Lear” and “Othello”) and one comedy (“Much Ado About Nothing”). I then retrained the network on this new training set.
After deployment, the network now produces Shakespearean-like text rather than a rap song — a definite improvement in terms of speech cleanliness and politeness. No more profanities! No more foul language!
Again, let’s trigger the free text generation with the start of the software license text and see how Shakespeare would proceed according to our network. Below is the Shakespearean text that the network generated: in red, the first 100 trigger characters that were manually inserted; in gray, the network-generated text.
Even in this case, the trigger sentence sets the tone for the next words: “thief,” “save and honest,” and the memorable “Sir, where is the patience now” all correspond to the reading of a software license. However, the speaking style is very different this time.
Now, keep in mind that the neural network that generated the Shakespearean-like text was the same neural network that generated the rap songs. Exactly the same. It just trained on a different set of data: rap songs on the one hand, Shakespeare’s theater pieces on the other. As a consequence, the free text produced is very different — as is the bias of the texts generated in production.
Fig. 3 An example of AI-generated Shakespearean-like text. The trigger sentence in red is the start of a software license document.
Summarizing, I created a very foul-mouthed, aggressive, and biased AI system and a very elegant, formal, almost poetic AI system too — at least as far as speaking style goes. The beauty of it is that both are based on the same AI model — the only difference between the two neural networks is the training data. The bias was really in the data and not in the AI models.
Bias in, bias out
Indeed, an AI model is just a machine, like a pet stone is ultimately just a stone. It is a machine that adjusts its parameters (learns) on the basis of the data in the training set. Sexist data in the training set produce a sexist AI model. Racist data in the training set produce a racist AI model. Since data are created by humans, they are also often biased. Thus, the resulting AI systems will also be biased. If the goal is to have a clean, honest, unbiased model, then the training data should be cleaned and stripped of all biases before training.
Seeing the Forest for the Trees - Cohort AnalysisadminMon, 02/17/2020 - 10:00
By Felix Kergl-Räpple and Maarit Widmann (KNIME)
How cohort analysis reveals a comprehensive view of our business
A marketing campaign or the publication of a new release can make customer numbers boom for a while. But what are the effects in the long run? Do the customers stay or churn? Do any of them return at some point? Does the overall revenue increase?
Patterns can be detected in customer behavior: for example, patterns in the regular behavior of customers, i.e. behavior that is not affected by our marketing campaigns or other similar actions. We might want to find a critical contract duration that determines whether or not customers become loyal. Another scenario worth analyzing is customer groups - to see whether different conditions in contracts with different starting points cause these customer groups to show equal loyalty - or not.
To answer these questions, we can analyze our sales data over time by time-based cohorts.
Identifying long-term customer behavior
Cohort analysis provides long-term feedback on our business decisions and customer engagement. Therefore, we use time as one dimension in the analysis. The second dimension is the metric that we analyze: contract value, customer count, number of orders, or anything else that quantifies the behavior of our customers. This “customer value” is shown separately for different cohorts. Cohorts are groupings in the data that share similar characteristics based on time, segment, or size. Given these three dimensions, we can identify patterns and trends that wouldn’t be visible in the individual records, thus providing a more complete look at our business.
Before starting the cohort analysis, we have to define:
The cohorts that we consider in our data. They could be, for example, customers who started doing business with you within the same time frame (time-based cohorts), customers buying similar products (segment-based cohorts), or medium- and large-size companies (size-based cohorts).
The information that we want to show for the different cohorts over time. It could be an established metric, such as annually recurring revenue or churn rate, or anything else that answers our questions about the customers, and serves the final goal to improve our business.
In this blog post we want to concentrate on time-based cohorts. In the next sections we introduce the steps you go through to build a cohort chart (Figure 1), from formatting the data to visualizing the selected metric by time and cohort.
Fig. 1: An example of a cohort chart to analyze customer count, or any other metric such as ARR, by time and time-based cohort
Example: analyze the number, value, and duration of contracts
Let’s start by having a look at an example cohort analysis and see how the company’s business is doing. The company issues contracts - software licenses, mobile contracts, or magazine subscriptions, for example. Based on the starting time of the contract, we assign each customer to a time-based cohort.
The results of the cohort analysis enable us to answer questions such as:
Can we detect a positive trend in terms of more customers and more revenue? Is the trend stable?
Do customers who entered into a contract in a particular year generate more revenue than customers who started in other years?
Which year(s) show the greatest customer churn?
Does the value of the contracts remain stable over time?
Has the average revenue per customer increased or decreased?
Data
In our example, the data contains information about contracts, upsells, downgrades, and churn events. There are 45 contracts and 12 customers. Contract periods range from January 2015 to December 2019. Each row in the data shows the start and end time of the contract period, the contract value, and an ID that identifies the customer. You can see a sample of the data in Figure 2.
Fig. 2: Data containing information on contract IDs, values, and periods. In the first step of the cohort analysis, this dataset is transformed into time series by assigning recurring values to single months within the periods.
The first step in the cohort analysis is to assign recurring values to the single months within the contract periods. Recurring values exclude one-time events, that is, they only consider the services that are constantly provided over a limited time period: subscriptions to software, support, content, etc.
Step 1: Calculate recurring values
When we calculate recurring values, we format the original contract data into time series data where each row contains a single month, a recurring value, and an ID. We can do this calculation with the “Calculate Recurring Values” component shown in Figure 3. The component is available for download on the KNIME Hub.
Fig. 3: Transforming contract data into time series data where each row contains a single month, a value, and an ID. The “Calculate Recurring Values” component which performs the calculation is available on the KNIME Hub.
Input table
An example of an input table for the “Calculate Recurring Values” component is shown in Figure 2. The table must contain two columns that define the start and end date of each contract period, one column for the contract value, and one column for the ID.
Output table
The output table of the component shows each individual month within the contract period, and the recurring values for each month. For example, if we had a row for a contract with a value of EUR 30,000 and a contract period of 12 months, the output table would show 12 rows for this contract, one for each month, and a monthly recurring value of EUR 2,500 (Figure 4).
Fig. 4: Example input and output data of the Calculate Recurring Values component that converts contracts data into time series data: records by contract period and ID are expanded to monthly recurring values by single month and ID.
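For readers who prefer code to components, the same expansion can be sketched in pandas; the column names below follow the example figures and are assumptions, not the component's actual configuration.

```python
# Sketch: expanding a contract row into monthly recurring values, as in the example
# above (EUR 30,000 over 12 months -> 12 rows of EUR 2,500). Column names are assumed.
import pandas as pd

contracts = pd.DataFrame({
    "ID": ["C1"],
    "ContractValue": [30000],
    "Start": ["2018-01-01"],
    "End": ["2018-12-31"],
})

rows = []
for _, c in contracts.iterrows():
    months = pd.period_range(c["Start"], c["End"], freq="M")
    monthly_value = c["ContractValue"] / len(months)
    for month in months:
        rows.append({"ID": c["ID"], "Month": month.to_timestamp(), "MRR": monthly_value})

recurring = pd.DataFrame(rows)
print(recurring)   # 12 rows with a monthly recurring value of 2,500
```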
Now, after formatting the data, we are ready to move on to the next step where we build the cohort chart. We can subsequently use the chart to answer the questions about the state of our business.
Step 2: Inspect revenue and customer count by time and cohort
In this second step, we calculate the selected metric separately for each month and time-based cohort. For example, we could have two customers who started in 2018, one with a monthly recurring revenue of EUR 2,000 and the other with monthly recurring revenue of EUR 3,000. These two customers would then constitute a single time-based cohort called “Started 2018”. The monthly recurring revenue for this cohort is therefore EUR 5,000 until at least one of the two customers upgrades, downgrades, or churns.
The time series data could also come from any other source. It could be the daily sales coming from subscriptions or grocery stores, for example. Regardless of what the data actually show, note that in order to perform cohort analysis, each record must contain a timestamp, identifier, and a value.
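A compact pandas sketch of this second step is shown below: each customer is assigned to a time-based cohort by the year of their first month, and the monthly recurring values are pivoted into a cohort table. The tiny example reuses the two customers from the paragraph above; the column names are assumptions.

```python
# Sketch: building a cohort table (months as rows, time-based cohorts as columns)
# from monthly recurring values. Data and column names are illustrative.
import pandas as pd

recurring = pd.DataFrame({
    "ID":    ["C1", "C1", "C2", "C2"],
    "Month": pd.to_datetime(["2018-01-01", "2018-02-01", "2018-01-01", "2018-02-01"]),
    "MRR":   [2000, 2000, 3000, 3000],
})

# Cohort = year of each customer's first month
first_month = recurring.groupby("ID")["Month"].transform("min")
recurring["Cohort"] = "Started " + first_month.dt.year.astype(str)

cohort_table = recurring.pivot_table(
    index="Month", columns="Cohort", values="MRR", aggfunc="sum", fill_value=0
)
print(cohort_table)   # the "Started 2018" cohort sums to EUR 5,000 per month
```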
The workflow in Figure 5 (which you can download from the KNIME Hub here) shows two components that enable you to analyze time-based cohorts using the following metrics:
annually/monthly recurring revenue (ARR/MRR)
annually/monthly recurring revenue relative to customer count
Example outputs of these components are shown in Figure 6. The line plot on the left shows the ARR for each cohort over time. The stacked area chart on the right shows the cumulative customer count for the different cohorts over time. The metric, the granularity of the cohorts, and the chart type can be defined in the configuration dialogs of the components.
From the line plot on the left in Figure 6 we can see that the ARR develops differently for the four time-based cohorts:
The customers who started in 2015 (blue line) increase their ARR value in the first year, but reach a low in the second half of 2016. Their ARR starts increasing again in 2017 and returns to its original value at the beginning of 2018.
The customers who started in 2016 (orange line) show an increase in ARR by the end of 2017. Their ARR starts to decline through to the beginning of 2019, where it then settles down to a constant value.
The customers who started their contracts in 2017 (green line) have a constant ARR value over the whole time period from the beginning of 2017 to the end of 2019.
The customers who started in 2018 (red line) increase their revenue until it sets to a constant value at the beginning of 2019.
From the stacked area chart on the right in Figure 6 we can see that the decreasing ARR of the “Started 2015” cohort (blue area) causes a low in total ARR at the end of 2016 but it starts increasing again due to the additional ARR coming from the “Started 2016” cohort (orange area) and the constant ARR coming from the “Started 2017” cohort (green area). The decreasing ARR for the “Started 2016” cohort cannot be compensated by the ARR coming from the “Started 2018” cohort (red area). This means that the maximum total ARR is reached at the beginning of 2018 before the ARR of “Started 2016” starts to decline.
Shared components
Now, it’s your turn to analyze your own customer data and build the cohort charts. Drag and drop the components from the KNIME Hub, and follow the steps as described above. You can use the configuration dialogs of the components to customize your cohort analysis: create time series with daily recurring values, extract cohorts based on the starting month, calculate the churn rate, or some of the other available metrics. If you want, you can also change the functionality of the components for your purpose: add new metrics and charts, for example.
Cohort analysis gives us a robust and comprehensive view of the state of our business. Cohort analysis gives us feedback over a long cycle of business. It smooths occasional fluctuations, giving us perspective on our customers’ behavior in the long term. Cohort analysis can reveal patterns in customer behavior that are only visible when we analyze customers by groups. For example, an increase in actual numbers could still mean a decrease in loyalty.
The steps in building a cohort chart from contract data include blending data, filling gaps in time series and checking for zero values, sorting, and pivoting, along with other data preprocessing operations. The components introduced in this blog post automate these steps, yet they let us define the key settings, such as the granularity of the cohorts and the metric to analyze.
Interested in more analysis of customer behavior? Learn more at our next Data Talks: Handling Customer Data meetup in Zurich on March 11, 2020 starting at 6:30 PM.
The talks:
Automating Inferences out of Customer Data: An Example of Fraud Detection in Credit Cards, by Maarit Widmann (KNIME)
Three New Techniques for Data Dimensionality Reduction in Machine LearningadminThu, 02/20/2020 - 10:00
Authors: Maarit Widmann and Rosaria Silipo (KNIME). As first published in The New Stack.
The full big data explosion has convinced us that more is better. While it is of course true that a large amount of training data helps the machine learning model to learn more rules and better generalize to new data, it is also true that an indiscriminate addition of low-quality data and input features might introduce too much noise and, at the same time, considerably slow down the training algorithm.
So, in the presence of a dataset with a very high number of data columns, it is good practice to wonder how many of these data features are actually really informative for the model. A number of techniques for data-dimensionality reduction are available to estimate how informative each column is and, if needed, to skim it off the dataset.
Back in 2015, we began publishing a review1 of the seven most commonly used techniques for data-dimensionality reduction: missing values ratio, low variance filter, high correlation filter, random forests/ensemble trees, principal component analysis (PCA), backward feature elimination, and forward feature construction.
Those are traditional techniques commonly applied to reduce the dimensionality of a dataset by removing all of the columns that bring either little information or no new information. Since then, we have started to use three additional techniques, also quite commonly used, and have decided to add them to the list as well.
Let’s start with the three techniques recently added and then move backwards in time with a review of the seven original techniques.
The Dataset
In our first review of data dimensionality reduction techniques, we used the two datasets from the 2009 KDD Challenge - the large dataset and the small dataset. The particularity of the large dataset is its very high dimensionality with 15,000 data columns. Most data mining algorithms are implemented columnwise, which makes them slower and slower as the number of data columns increases. This dataset definitely brings out the slowness of a number of machine learning algorithms.
The 2009 KDD Challenge small dataset is definitely lower dimensional than the large dataset but is still characterized by a considerable number of columns: 230 input features and three possible target features. The number of data rows is the same as in the large dataset: 50,000. In this review, for computational reasons, we will focus on the small dataset to show just how effective the proposed techniques are in reducing dimensionality. The dataset is big enough to prove the point in data-dimensionality reduction and small enough to do so in a reasonable amount of time.
Let’s proceed now with the (re)implementation and comparison of 10 state-of-the-art dimensionality reduction techniques, all currently available and commonly used in the data analytics landscape.
Three More Techniques for Data Dimensionality Reduction
Let’s start with the three newly added techniques:
Linear Discriminant Analysis (LDA)
A number m of linear combinations (discriminant functions) of the n input features, with m < n, are produced to be uncorrelated and to maximize class separation. These discriminant functions become the new basis for the dataset. All numeric columns in the dataset are projected onto these linear discriminant functions, effectively moving the dataset from the n-dimensionality to the m-dimensionality.
In order to apply the LDA technique for dimensionality reduction, the target column has to be selected first. The maximum number of reduced dimensions m is the number of classes in the target column minus one, or if smaller, the number of numeric columns in the data. Notice that linear discriminant analysis assumes that the target classes follow a multivariate normal distribution with the same variance but with a different mean for each class.2
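As a minimal scikit-learn sketch of LDA-based reduction (on a placeholder dataset rather than the KDD data):

```python
# Sketch: projecting a labeled dataset onto m = (number of classes - 1) discriminant
# functions with scikit-learn. The iris data is just a placeholder.
from sklearn.datasets import load_iris
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

X, y = load_iris(return_X_y=True)                   # 4 numeric columns, 3 classes
lda = LinearDiscriminantAnalysis(n_components=2)    # m = 3 classes - 1
X_reduced = lda.fit_transform(X, y)
print(X_reduced.shape)                              # (150, 2)
```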
Autoencoder
An autoencoder is a neural network with the same number n of output units as input units, at least one hidden layer with m units where m < n, trained with the backpropagation algorithm to reproduce the input vector onto the output layer. It reduces the numeric columns in the data by using the output of the hidden layer to represent the input vector.
The first part of the autoencoder — from the input layer to the hidden layer of m units — is called the encoder. It compresses the n dimensions of the input dataset into an m-dimensional space. The second part of the autoencoder — from the hidden layer to the output layer — is known as the decoder. The decoder expands the data vector from an m-dimensional space into the original n-dimensional dataset and brings the data back to their original values.3
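A minimal Keras sketch of such an autoencoder is shown below; the layer sizes (n = 230 input features, m = 16 hidden units) are placeholders chosen for illustration.

```python
# Sketch: autoencoder with n inputs, one hidden layer of m units, and n outputs.
# The encoder part is reused afterwards to produce the m-dimensional representation.
from tensorflow.keras.models import Model
from tensorflow.keras.layers import Input, Dense

n, m = 230, 16                                     # original and reduced dimensionality
inputs = Input(shape=(n,))
encoded = Dense(m, activation="relu")(inputs)      # encoder: n -> m
decoded = Dense(n, activation="linear")(encoded)   # decoder: m -> n

autoencoder = Model(inputs, decoded)               # trained to reproduce its input
encoder = Model(inputs, encoded)                   # extracts the reduced representation
autoencoder.compile(optimizer="adam", loss="mse")
# autoencoder.fit(X, X, epochs=50, batch_size=64)  # train on the normalized numeric columns
# X_reduced = encoder.predict(X)
```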
t-distributed Stochastic Neighbor Embedding (t-SNE)
This technique reduces the n numeric columns in the dataset to fewer dimensions m (m < n) based on nonlinear local relationships among the data points. Specifically, it models each high-dimensional object by a two- or three-dimensional point in such a way that similar objects are modeled by nearby points and dissimilar objects are modeled by distant points in the new lower dimensional space.
In the first step, the data points are modeled through a multivariate normal distribution of the numeric columns. In the second step, this distribution is replaced by a lower dimensional t-distribution, which follows the original multivariate normal distribution as closely as possible. The t-distribution gives the probability of picking another point in the dataset as a neighbor to the current point in the lower dimensional space. The perplexity parameter controls the density of the data as the “effective number of neighbors for any point.” The greater the value of the perplexity, the more global structure is considered in the data. The t-SNE technique works only on the current dataset. It is not possible to export the model to apply it to new data.4
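A minimal scikit-learn sketch (on a placeholder dataset, with illustrative parameter values):

```python
# Sketch: embedding a dataset into two dimensions with t-SNE. Note that there is no
# transform() for new data -- the embedding exists only for the dataset at hand.
from sklearn.datasets import load_digits
from sklearn.manifold import TSNE

X, _ = load_digits(return_X_y=True)
X_embedded = TSNE(n_components=2, perplexity=30, random_state=42).fit_transform(X)
print(X_embedded.shape)   # (1797, 2)
```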
Seven Previously Applied Techniques for Data Dimensionality Reduction
Here is a brief review of our original seven techniques for dimensionality reduction (a compact code sketch of the three threshold-based filters follows this review):
Missing Values Ratio. Data columns with too many missing values are unlikely to carry much useful information. Thus, data columns with a ratio of missing values greater than a given threshold can be removed. The higher the threshold, the more aggressive the reduction.
Low Variance Filter. Similar to the previous technique, data columns with little changes in the data carry little information. Thus, all data columns with a variance lower than a given threshold can be removed. Notice that the variance depends on the column range, and therefore normalization is required before applying this technique.
High Correlation Filter. Data columns with very similar trends are also likely to carry very similar information, and only one of them will suffice for classification. Here we calculate the Pearson product-moment correlation coefficient between numeric columns and the Pearson’s chi-square value between nominal columns. For the final classification, we only retain one column of each pair of columns whose pairwise correlation exceeds a given threshold. Notice that correlation depends on the column range, and therefore, normalization is required before applying this technique.
Random Forests/Ensemble Trees. Decision tree ensembles, often called random forests, are useful for column selection in addition to being effective classifiers. Here we generate a large and carefully constructed set of trees to predict the target classes and then use each column’s usage statistics to find the most informative subset of columns. We generate a large set (2,000) of very shallow trees (two levels), and each tree is trained on a small fraction (three columns) of the total number of columns. If a column is often selected as the best split, it is very likely to be an informative column that we should keep. For all columns, we calculate a score as the number of times that the column was selected for the split, divided by the number of times in which it was a candidate. The most predictive columns are those with the highest scores.
Principal Component Analysis (PCA). Principal component analysis (PCA) is a statistical procedure that orthogonally transforms the original n numeric dimensions of a dataset into a new set of n dimensions called principal components. As a result of the transformation, the first principal component has the largest possible variance; each succeeding principal component has the highest possible variance under the constraint that it is orthogonal to (i.e., uncorrelated with) the preceding principal components. Keeping only the first m < n principal components reduces the data dimensionality while retaining most of the data information, i.e., variation in the data. Notice that the PCA transformation is sensitive to the relative scaling of the original columns, and therefore, the data need to be normalized before applying PCA. Also notice that the new coordinates (PCs) are no longer real variables but system-produced ones. Applying PCA to your dataset therefore means losing its interpretability. If interpretability of the results is important for your analysis, PCA is not the transformation that you should apply.
Backward Feature Elimination. In this technique, at a given iteration, the selected classification algorithm is trained on n input columns. Then we remove one input column at a time and train the same model on n-1 columns. The input column whose removal has produced the smallest increase in the error rate is removed, leaving us with n-1 input columns. The classification is then repeated using n-2 columns, and so on. Each iteration k produces a model trained on n-k columns and an error rate e(k). By selecting the maximum tolerable error rate, we define the smallest number of columns necessary to reach that classification performance with the selected machine learning algorithm.
Forward Feature Construction. This is the inverse process to backward feature elimination. We start with one column only, progressively adding one column at a time, i.e., the column that produces the highest increase in performance. Both algorithms, backward feature elimination and forward feature construction, are quite expensive in terms of time and computation. They are practical only when applied to a dataset with an already relatively low number of input columns.
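As promised above, here is a compact sketch of the three threshold-based filters (missing values ratio, low variance, high correlation) on a pandas DataFrame; the thresholds are illustrative and not the optimized values used in the comparison below.

```python
# Sketch: missing values ratio, low variance filter, and high correlation filter.
# Thresholds (0.4, 1e-3, 0.9) are illustrative assumptions.
import pandas as pd
from sklearn.preprocessing import MinMaxScaler

def reduce_columns(df: pd.DataFrame) -> pd.DataFrame:
    # 1) Missing Values Ratio: drop columns with too many missing values
    df = df.loc[:, df.isna().mean() <= 0.4]

    # 2) Low Variance Filter: normalize the numeric columns, drop near-constant ones
    numeric = df.select_dtypes("number")
    scaled = pd.DataFrame(MinMaxScaler().fit_transform(numeric), columns=numeric.columns)
    df = df.drop(columns=scaled.columns[scaled.var() < 1e-3])

    # 3) High Correlation Filter: keep only one column of each highly correlated pair
    corr = df.select_dtypes("number").corr().abs()
    to_drop = [col for i, col in enumerate(corr.columns)
               if (corr.iloc[:i][col] > 0.9).any()]
    return df.drop(columns=to_drop)

demo = pd.DataFrame({
    "a": [1, 2, 3, 4], "b": [1.01, 2.0, 3.0, 4.02],    # b nearly duplicates a
    "c": [5, 5, 5, 5], "d": [None, None, None, 1.0],   # c is constant, d mostly missing
})
print(reduce_columns(demo).columns.tolist())           # -> ['a']
```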
Comparison in Terms of Accuracy and Reduction Rate
We implemented all 10 described techniques for dimensionality reduction, applying them to the small dataset of the 2009 KDD Cup corpus. Finally, we compared them in terms of reduction ratio and classification accuracy. For dimensionality reduction techniques that are based on a threshold, the optimal threshold was selected by an optimization loop.
For some techniques, final accuracy and degradation depend on the selected classification model. Therefore, the classification model is chosen from a bag of three basic models as the best-performing model:
Multilayer feedforward neural networks
Naive Bayes
Decision tree
For such techniques, the final accuracy is obtained by applying all three classification models to the reduced dataset and adopting the one that performs best.
Overall accuracy and area under the curve (AUC) statistics are reported for all techniques in Table 1. We compare these statistics with the performance of the baseline algorithm that uses all columns for classification.
Table 1: Number of input columns, reduction rate, overall accuracy, and AUC value for the 7 + 3 dimensionality reduction techniques based on the best classification model trained on the KDD challenge 2009 small dataset.
A graphical comparison of the accuracy of each reduction technique is shown in Figure 1 below. Here all reduction techniques are reported on the x-axis and the corresponding classification accuracy on the y-axis, as obtained from the best-performing model of the three basic models proposed above.
Fig. 1. Accuracies of the best performing models trained on the datasets that were reduced using the 10 selected data dimensionality reduction techniques.
The receiver operating characteristic (ROC) curves in Figure 2 show a group of best-performing techniques: missing value ratio, high correlation filter and the ensemble tree methods.
Fig. 2: ROC Curves showing the performances of the best classification model trained on the reduced datasets: each dataset was reduced by a different dimensionality reduction technique.
Implementation of the 7+3 Techniques
The workflow that implements and compares the 10 dimensionality reduction techniques described in this review is shown in Figure 3. In the workflow, we see 10 parallel branches plus one at the top. Each one of the 10 parallel lower branches implements one of the described techniques for data-dimensionality reduction. The first branch, however, trains the bag of classification models on the whole original dataset with 230 input features.
Each workflow branch produces the overall accuracy and the probabilities for the positive class by the best-performing classification model trained on the reduced dataset. Finally, the positive class probabilities and actual target class values are used to build the ROC curves, and a bar chart visualizes the accuracies produced by the best-performing classification model for each dataset.
You can inspect and download the workflow from the KNIME Hub.
Fig. 3: Implementation of the ten selected dimensionality reduction techniques. Each branch of this workflow outputs the overall accuracy and positive class probabilities produced by the best performing classification model. An ROC Curve and bar chart then compare the performance of the classification models trained on the reduced datasets. The workflow can be downloaded and inspected from the KNIME Hub.
Summary and Conclusions
In this article we have presented a review of ten popular techniques for data dimensionality reduction. We have expanded an earlier article describing seven of them (ratio of missing values, low variance in the values of a column, high correlation between two columns, principal component analysis (PCA), candidates and split columns in a random forest, backward feature elimination, and forward feature construction) by adding three additional techniques.
We trained a few basic machine learning models on the reduced datasets and compared the best-performing ones with each other via reduction rate, accuracy and area under the curve.
Notice that dimensionality reduction is not only useful to speed up algorithm execution but also to improve model performance.
In terms of overall accuracy and reduction rate, the random forest based technique proved to be the most effective in removing uninteresting columns while retaining most of the information for the classification task at hand. Of course, the evaluation, reduction, and consequent ranking of the ten described techniques were applied here to a classification problem; we cannot generalize to effective dimensionality reduction for numerical prediction or even visualization.
Some of the techniques used in this article are complex and computationally expensive. However, as the results show, even just counting the number of missing values, measuring the column variance and the correlation of pairs of columns — and combining them with more sophisticated methods — can lead to a satisfactory reduction rate while keeping performance unaltered with respect to the baseline models.
Indeed, in the era of big data, when more is axiomatically better, we have rediscovered that too many noisy or even faulty input data columns often lead to unsatisfactory model performance. Removing uninformative — or even worse — misinformative input columns might help to train a machine learning model on more general data regions, with more general classification rules, and overall with better performances on new, unseen data.
Tuning the Performance and Scalability of KNIME WorkflowsadminMon, 02/24/2020 - 10:00
Authors: Iris Adä and Phil Winters (KNIME)
Want a workflow that uses available in-database capabilities and moves to a production Spark setup, while also calling dedicated Google services, comparing a KNIME Random Forest to an H2O Random Forest, automatically choosing the better model, and writing the resulting scores back into your favourite CRM? No problem in KNIME.
Or you want to use AWS and Azure ML services together along with KNIME nodes to provide a focused Guided Analytics application to end users? Again, a straightforward build of nodes in a workflow can, after execution, be deployed to KNIME Server with one click and made instantly available via the KNIME WebPortal. Similarly, only minimal effort is needed to create a RESTful web service out of that same workflow on KNIME Server, which enables you to make your new achievements callable from existing applications.
However, if you have options, how do you make a decision if performance and scalability are key? There is now a KNIME white paper providing background around choice and tuning options, as well as an approach and sample workflows for determining the “right” combination for your specific requirements.
Choice and Tuning Options
The white paper first reviews the four major areas that can be managed to obtain optimal workflow performance and provide scalability options in KNIME:
KNIME setup and workflow options
Hardware and resource options
KNIME extensions
Additional capabilities provided by KNIME Server such as distributed executors
A Six Step Approach to Performance and Scalability
Thanks to the huge choice offered in KNIME Analytics Platform, there is no single “best” recommended approach for scaling a workflow. Instead, the recommendation is to build different scenarios with KNIME and execute and compare them so as to choose the best for your given situation.
In the white paper, a six step approach is detailed for evaluating and identifying an optimal workflow configuration for your requirements. The six steps are:
Create your workflows using native KNIME nodes
Identify relevant capabilities
Define possible scenarios
Deal with environment contexts
Set environment and run
Measure and compare
Nodes and Example Workflows
Within KNIME, there is a robust series of benchmark nodes provided to capture the performance statistics of every aspect of a workflow. The concept is straightforward and shown in Figure 1. You begin your workflow with a Benchmark Start (memory monitoring) node and end it with a Benchmark End (memory monitoring) node. When executed, the nodes capture statistics such as run time and memory usage. The nodes can be configured to go down to the individual node level, even within a metanode or component. In this way you can collect all relevant performance statistics.
Fig. 1: Benchmarking nodes that wrap around the workflow to be measured
The nodes are KNIME community nodes provided by KNIME trusted partner Vernalis Research Ltd. It's worth investigating the configuration options of the nodes. You can find the nodes on the KNIME Hub.
A series of example workflows is available to show you how to use the nodes:
Fig. 2: Overall control workflow example for running three different scenarios.
Conclusion
With this six step process for testing performance and scalability of scenarios in KNIME, you have an extremely powerful way to make your choices and come up with the best way to achieve a performant and scalable workflow for your data science problem.
From a Single Decision Tree to a Random ForestadminThu, 02/27/2020 - 10:00
Authors: Kathrin Melcher, Rosaria Silipo (KNIME). As first published in Dataversity.
Decision trees represent a set of very popular supervised classification algorithms. Their popularity comes from a few key strengths: they perform quite well on classification problems, the decisional path is relatively easy to interpret, and the algorithm to build (train) them is fast and simple.
There is also an ensemble version of the decision tree: the random forest. The random forest essentially represents an assembly of a number N of decision trees, thus increasing the robustness of the predictions.
In this article, we propose a brief overview of the algorithm behind the growth of a decision tree and discuss its quality measures, the tricks to avoid overfitting the training set, and the improvements introduced by a random forest of decision trees.
What's a decision tree?
A decision tree is a flowchart-like structure made of nodes and branches (Fig. 1). At each node, a split on the data is performed based on one of the input features, generating two or more branches as output. More and more splits are made in the upcoming nodes and increasing numbers of branches are generated to partition the original data. This continues until a node is generated where all or almost all of the data belong to the same class and further splits — or branches — are no longer possible.
This whole process generates a tree-like structure. The first splitting node is called the root node. The end nodes are called leaves and are associated with a class label. The paths from root to leaf produce the classification rules. If only binary splits are possible, we talk about binary trees. Here, however, we want to deal with the more generic instance of non-binary decision trees.
Let's go sailing
Let’s visualize this with an example. We collected data about a person’s past sailing plans, i.e., whether or not the person went out sailing, based on various external conditions — or “input features” — e.g., wind speed in knots, maximum temperature, outlook, and whether or not the boat was in winter storage. Input features are often also referred to as non-target attributes or independent variables. We now want to build a decision tree that will predict the sailing outcome (yes or no). The sailing outcome feature is also known as either the target or the dependent variable.
If we knew about the exact classification rules, we could build a decision tree manually. But this is rarely the case. What we usually have are data: the input features on the one hand and the target feature to predict on the other. A number of automatic procedures can help us extract the rules from the data to build such a decision tree, like C4.5, ID3 or the CART algorithm (J. Ross Quinlan). In all of them, the goal is to train a decision tree to define rules to predict the target variable. In our example the target variable is whether or not we will go sailing on a new day.
Fig. 1. Example of a decision tree (on the right) built on sailing experience data (on the left) to predict whether or not to go sailing on a new day.
Building a decision tree
Let’s explore how to build a decision tree automatically, following one of the algorithms listed above.
The goal of any of those algorithms is to partition the training set into subsets until each partition is either “pure” in terms of target class or sufficiently small. To be clear:
A pure subset is a subset that contains only samples of one class
Each partitioning operation is implemented by a rule that splits the incoming data based on the values of one of the input features
To summarize: A decision tree consists of three different building blocks: nodes, branches and leaves. The nodes identify the splitting feature and implement the partitioning operation on the input subset of data; the branches depart from a node and identify the different subsets of data produced by the partitioning operation; and the leaves, at the end of a path, identify a final subset of data and associate a class with that specific decision path.
In the tree in figure 1, for example, the split in the first node involves the “Storage” feature and partitions the input subset into two subsets: one where Storage is “yes” and one where Storage is “no.” If we follow the path for the data rows where Storage = yes, we find a second partitioning rule. Here the input dataset is split in two datasets: one where “Wind” > 6 and one where “Wind” <= 6. How does the algorithm decide which feature to use at each point to split the input subset?
The goal of any of these algorithms is to recursively partition the training set into subsets until each partition is as pure as possible in terms of output class. Therefore, at each step, the algorithm uses the feature that leads to the purest output subsets.
Quality measures
At each iteration, in order to decide which feature leads to the purest subset, we need to be able to measure the purity of a dataset. Different metrics and indices have been proposed in the past. We will describe a few of them here, arguably the most commonly used ones: information gain, Gini index, and gain ratio.
During training, the selected quality measure is calculated for all candidate features to find out which one will produce the best split.
Entropy
Entropy is a concept that is used to measure information or disorder. And, of course, we can use it to measure how pure a dataset is.
If we consider the target classes as the possible states of a point in a dataset, the entropy of a dataset p can be mathematically defined as the negative sum, over all classes, of the probability of each class multiplied by its logarithm:

\mathrm{Entropy}(p) = - \sum_{i=1}^{N} p_i \log_2 (p_i)

where p is the whole dataset, N is the number of classes, and p_i is the frequency of class i in the same dataset. For a binary classification problem, the entropy thus falls in the range between 0 and 1.
To get a better understanding of entropy, let’s work on two different example datasets, both with two classes, respectively represented as blue dots and red crosses (Fig. 2). In the example dataset on the left, we have a mixture of blue dots and red crosses. In the example of the dataset on the right, we have only red crosses. This second situation — a dataset with only samples from one class — is what we are aiming at: a “pure” data subset.
Fig. 2. Two classes: red crosses and blue dots. Two different datasets. A dataset with a mix of points belonging to both classes (on the left) and a dataset with points belonging to one class only (on the right).
Let’s now calculate the entropy for these two binary datasets.
For the example on the left, the probability is 7/13 for the class with red crosses and 6/13 for the class with blue dots. Notice that here we have almost as many data points for one class as for the other. The formula above leads to an entropy value of 0.99.
For the example on the right, the probability is 13/13 = 1.0 for the class with the red crosses and 0/13 = 0.0 for the class with the blue dots. Notice that here we have only red cross points. In this case, the formula above leads to an entropy value of 0.0.
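These two values can be reproduced with a few lines of Python; this is a small illustrative check, not part of the original workflow.

# Entropy of the two toy datasets: mixed (7 crosses, 6 dots) vs. pure (13 crosses, 0 dots).
from math import log2

def entropy(class_counts):
    total = sum(class_counts)
    return -sum((c / total) * log2(c / total) for c in class_counts if c > 0)

print(entropy([7, 6]))   # ~0.996, the "almost 1" value quoted above for the mixed dataset
print(entropy([13, 0]))  # 0.0 for the pure dataset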
Entropy can be a measure of purity, disorder or information. Due to the mixed classes, the dataset on the left is less pure and more confused (more disorder, i.e., higher entropy). However, more disorder also means more information. Indeed, if the dataset has only points of one class, there is not much information you can extract from it no matter how long you try. In comparison, if the dataset has points from both classes, it also has a higher potential for information extraction. So, the higher entropy value of the dataset on the left can also be seen as a larger amount of potential information.
The goal of each split in a decision tree is to move from a confused dataset to two (or more) purer subsets. Ideally, the split should lead to subsets with an entropy of 0.0. In practice, however, it is enough if the split leads to subsets with a total lower entropy than the original dataset.
Fig. 3. A split in a node of the tree should move from a higher entropy dataset to subsets with lower total entropy.
Information gain (ID3)
In order to evaluate how good a feature is for splitting, the difference in entropy before and after the split is calculated.
That is, first we calculate the entropy of the dataset before the split, and then we calculate the entropy of each subset after the split. Finally, the sum of the output entropies, weighted by the size of the subsets, is subtracted from the entropy of the dataset before the split:

\mathrm{Gain} = \mathrm{Entropy}(\mathrm{before}) - \sum_{j=1}^{K} \frac{N_{j,\mathrm{after}}}{N_{\mathrm{before}}} \, \mathrm{Entropy}(j,\mathrm{after})

where "before" is the dataset before the split, K is the number of subsets generated by the split, (j, after) is subset j after the split, and N_{j,after} / N_{before} is the fraction of data points falling into subset j. This difference measures the gain in information, or the reduction in entropy. A positive information gain means that we move from a confused dataset to a number of purer subsets.
At each step, we would then choose to split the data on the feature with the highest value in information gain as this leads to the purest subsets. The algorithm that applies this measure is the ID3 algorithm. The ID3 algorithm has the disadvantage of favoring features with a larger number of values, generating larger decision trees.
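A short Python sketch of the information gain computation for one candidate split is shown below; the class counts are made up for illustration and reuse the small entropy helper from the example above.

# Information gain = entropy(parent) - weighted sum of the entropies of the child subsets.
from math import log2

def entropy(counts):
    n = sum(counts)
    return -sum((c / n) * log2(c / n) for c in counts if c > 0)

def information_gain(parent_counts, child_counts_list):
    n_parent = sum(parent_counts)
    weighted = sum((sum(child) / n_parent) * entropy(child) for child in child_counts_list)
    return entropy(parent_counts) - weighted

# Example: a parent node with 7 "yes" / 6 "no" samples, split into two child subsets.
print(information_gain([7, 6], [[6, 1], [1, 5]]))  # positive gain: the children are purer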
Gain ratio (C4.5)
The gain ratio measure, used in the C4.5 algorithm, introduces the SplitInfo concept. SplitInfo is defined as the negative sum of the weights multiplied by the logarithm of the weights, where the weights are the ratio of the number of data points in the current subset to the number of data points in the parent dataset:

\mathrm{SplitInfo} = - \sum_{j=1}^{K} w_j \log_2 (w_j), \qquad w_j = \frac{N_{j,\mathrm{after}}}{N_{\mathrm{before}}}

The gain ratio is then calculated by dividing the information gain from the ID3 algorithm by the SplitInfo value:

\mathrm{GainRatio} = \frac{\mathrm{Gain}}{\mathrm{SplitInfo}}

where "before" is the dataset before the split, K is the number of subsets generated by the split, and (j, after) is subset j after the split.
Gini index (CART)
Another measure for purity — or actually impurity — used by the CART algorithm is the Gini index.
The Gini index is based on Gini impurity. Gini impurity is defined as 1 minus the sum of the squares of the class probabilities in a dataset:

\mathrm{GiniImpurity}(p) = 1 - \sum_{i=1}^{N} p_i^{2}

where p is the whole dataset, N is the number of classes, and p_i is the frequency of class i in the same dataset.
The Gini index is then defined as the weighted sum of the Gini impurities of the different subsets after a split, where each subset is weighted by the ratio of its size to the size of the parent dataset:

\mathrm{GiniIndex} = \sum_{j=1}^{K} \frac{N_{j,\mathrm{after}}}{N_{\mathrm{before}}} \, \mathrm{GiniImpurity}(j,\mathrm{after})

where K is the number of subsets generated by the split and (j, after) is subset j after the split.
For a dataset with two classes, the range of the Gini index is between 0 and 0.5: 0 if the dataset is pure and 0.5 if the two classes are distributed equally. Thus, the feature with the lowest Gini index is used as the next splitting feature.
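A corresponding sketch for the Gini measures, again with made-up class counts, purely for illustration:

# Gini impurity of a subset and Gini index of a split (weighted sum of child impurities).
def gini_impurity(counts):
    n = sum(counts)
    return 1.0 - sum((c / n) ** 2 for c in counts)

def gini_index(child_counts_list):
    n_parent = sum(sum(child) for child in child_counts_list)
    return sum((sum(child) / n_parent) * gini_impurity(child) for child in child_counts_list)

print(gini_impurity([7, 6]))          # ~0.497: close to the 0.5 maximum for two balanced classes
print(gini_index([[6, 1], [1, 5]]))   # lower value: this split produces purer subsets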
Identifying the splits
For a nominal feature, we have two different splitting options. We can either create a child node for each value of the selected feature in the training set, or we can make a binary split. In the case of a binary split, all possible feature value subsets are tested. In this last case, the process is more computationally expensive but still relatively straightforward.
In numerical features, identifying the best split is more complicated. All numerical values could actually be split candidates. But this would make computation of the quality measures too expensive an operation! Therefore, for numerical features, the split is always binary. In the training data, the candidate split points are taken in between every two consecutive values of the selected numerical feature. Again, the binary split producing the best quality measure is adopted. The split point can then be the average between the two partitions on that feature, the largest point of the lower partition or the smallest point of the higher partition.
Size and overfitting
Decision trees, like many other machine learning algorithms, are subject to overfitting the training data. Trees that are too deep can lead to models that are too detailed and don't generalize to new data. On the other hand, trees that are too shallow might lead to overly simple models that can't fit the data. The size of the decision tree is therefore of crucial importance.
Fig. 4. The size of the decision tree is important. A tree that is large and too detailed (on the right) might overfit the training data, while a tree that is too small (on the left) might be too simple to fit the data.
There are two ways to avoid an over-specialized tree: pruning and/or early stopping.
Pruning
Pruning is applied to a decision tree after the training phase. Basically, we let the tree be free to grow as much as allowed by its settings, without applying any explicit restrictions. At the end, we proceed to cut those branches that are not populated sufficiently so as to avoid overfitting the training data. Indeed, branches that are not populated enough are probably overly concentrating on special data points. This is why removing them should help generalization on new unseen data.
There are many different pruning techniques. Here, we want to explain the two most commonly used: reduced error pruning and minimum description length pruning, MDL for short.
In reduced error pruning, at each iteration, a low populated branch is pruned and the tree is applied again to the training data. If the pruning of the branch doesn’t decrease the accuracy on the training set, the branch is removed for good.
MDL pruning uses description length to decide whether or not to prune a tree. Description length is defined as the number of bits needed to encode the tree plus the number of bits needed to encode the misclassified data samples of the training set. When a branch of the tree is pruned, the description lengths of the non-pruned tree and of the pruned tree are calculated and compared. If the description length of the pruned tree is smaller, the pruning is retained.
Early stopping
Another option to avoid overfitting is early stopping, based on a stopping criterion.
One common stopping criterion is the minimum number of samples per node. A branch stops growing when it would create a node containing a number of data samples equal to or lower than this minimum. A higher value of this minimum number therefore leads to shallower trees, while a smaller value leads to deeper trees.
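In code, this kind of early stopping is usually exposed as a hyperparameter; a hedged scikit-learn sketch follows, where the dataset is synthetic and the threshold of 50 samples is an arbitrary illustration.

# Sketch: stop splitting when a leaf would receive too few samples.
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=500, n_features=10, random_state=0)
shallow = DecisionTreeClassifier(min_samples_leaf=50, random_state=0).fit(X, y)
deep = DecisionTreeClassifier(min_samples_leaf=1, random_state=0).fit(X, y)
print(shallow.get_depth(), deep.get_depth())  # the larger minimum yields the shallower tree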
Random forest of decision trees
As we said at the beginning, an evolution of the decision tree to provide a more robust performance has resulted in the random forest. Let’s see how the innovative random forest model compares with the original decision tree algorithms.
Many is better than one. This is, simply speaking, the concept behind the random forest algorithm. That is, many decision trees can produce more accurate predictions than just one single decision tree by itself. Indeed, the random forest algorithm is a supervised classification algorithm that builds N slightly differently trained decision trees and merges them together to get more accurate and stable predictions.
Let’s stress this notion a second time. The whole idea relies on multiple decision trees that are all trained slightly differently and all of them are taken into consideration for the final decision.
Fig. 5. The random forest algorithm relies on multiple decision trees that are all trained slightly differently; all of them are taken into consideration for the final classification.
Bootstrapping of training sets
Let’s focus on the “trained slightly differently.”
The training algorithm for random forests applies the general technique of bagging to tree learners. While a single decision tree is trained on the whole training set, in a random forest the N decision trees are each trained on a different subset of the original training set, obtained via bootstrapping of the original dataset, i.e., via random sampling with replacement.
Additionally, the input features can also be different from node to node inside each tree, as random subsets of the original feature set. Typically, if m is the number of input features in the original dataset, a subset of √m randomly extracted input features is used to train each node in each decision tree.
Fig. 6. The decision trees in a random forest are all slightly differently trained on a bootstrapped subset of the original dataset. The set of input features also varies for each decision tree in the random forest.
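A compact sketch of these two sources of randomness, bootstrapped rows and a √m feature subset, is shown below; it is illustrative only, since library implementations (KNIME's or scikit-learn's) handle this internally.

# Each tree sees a bootstrap sample of the rows; each split considers only sqrt(m) random features.
import numpy as np

def bootstrap_sample(X, y, rng):
    idx = rng.integers(0, len(X), size=len(X))    # sampling with replacement
    return X[idx], y[idx]

def random_feature_subset(n_features, rng):
    k = max(1, int(np.sqrt(n_features)))          # typical default: sqrt(m) features
    return rng.choice(n_features, size=k, replace=False)

rng = np.random.default_rng(42)
X = np.random.rand(100, 16)
y = np.random.randint(0, 2, size=100)
X_boot, y_boot = bootstrap_sample(X, y, rng)
print(random_feature_subset(X.shape[1], rng))     # e.g. 4 of the 16 feature indices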
The majority rule
The N slightly differently trained trees will produce N slightly different predictions for the same input vector. Usually, the majority rule is applied to make the final decision. The prediction offered by the majority of the N trees is adopted as the final one.
The advantage of such a strategy is clear. While the predictions from a single tree are highly sensitive to noise in the training set, predictions from the majority of many trees are not — providing the trees are not correlated. Bootstrap sampling is the way to decorrelate the trees by training them on different training sets.
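Majority voting itself is just a mode over the N individual predictions; a minimal sketch with toy predictions:

# Final prediction = the class predicted by the majority of the trees.
from collections import Counter

tree_predictions = ["yes", "yes", "no", "yes", "no"]   # toy predictions from N = 5 trees
majority_class, votes = Counter(tree_predictions).most_common(1)[0]
print(majority_class, votes)  # "yes" with 3 of 5 votes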
Out-Of-Bag (OOB) error
A popular metric to measure the prediction error of a random forest is the out-of-bag error.
Out-of-bag error is the average prediction error calculated on all training samples xᵢ, using only the trees that did not have xᵢ in their bootstrapped training set. Out-of-bag error estimates avoid the need for an independent validation dataset but might underestimate actual performance improvement.
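If you train the forest outside KNIME, most libraries report this estimate directly; for example, a hedged scikit-learn sketch on synthetic data:

# The OOB score is estimated from the samples left out of each tree's bootstrap sample.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=1000, n_features=20, random_state=0)
forest = RandomForestClassifier(n_estimators=200, oob_score=True, random_state=0).fit(X, y)
print("OOB accuracy:", forest.oob_score_)   # OOB error = 1 - OOB accuracy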
Conclusions
In this article, we reviewed a few important aspects of decision trees: how a decision tree is trained to implement a classification problem, which quality measures are used to choose the input features to split, and the tricks to avoid the overfitting problem.
We have also tried to explain the strategy of the random forest algorithm to make decision tree predictions more robust; that is, to limit the dependence from the noise in the training set. Indeed, by using a set of N decorrelated decision trees, a random forest increases the accuracy and the robustness of the final classification.
Now, let’s put it to use and see whether we'll be going sailing tomorrow!
Fig. 7. Workflow to implement the training and evaluation of a decision tree and of a random forest of decision trees.
References
Download the workflow for training and evaluating a decision tree and a random forest of decision trees. It's available on the KNIME Hub.
Tutorial: Importing Bike Data from Google BigQueryadminMon, 03/02/2020 - 10:00
Takeaway: learn how to grant access and connect to Google BigQuery, as well as upload data back to Google BigQuery from KNIME
Author: Emilio Silvestri (KNIME)
To match the increasing number of organizations turning to cloud repositories and services to attain top levels of scalability, security, and performance, KNIME provides connectors to a variety of cloud service providers. We recently published an article about KNIME on AWS, for example. Continuing with our series of articles about cloud connectivity, this blog post is a tutorial introducing you to KNIME on Google BigQuery.
BigQuery is the Google response to the Big Data challenge. It is part of the Google Cloud Console and is used to store and query large datasets using SQL-like syntax. Google BigQuery has gained popularity thanks to the hundreds of publicly available datasets offered by Google. You can also use Google BigQuery to host your own datasets.
Note. While Google BigQuery is a paid service, Google offers 1 TB of queries for free. A paid account is not necessary to follow this guide.
Since many users and companies rely on Google BigQuery to store their data and for their daily data operations, KNIME Analytics Platform includes a set of nodes to deal with Google BigQuery, which is available from version 4.1.
In this tutorial, we want to access the Austin Bike Share Trips dataset. It contains more than 600,000 bike trips made between 2013 and 2019. For every trip, it reports the timestamp, the duration, the stations of departure and arrival, plus information about the subscriber.
In Google: Grant access to Google BigQuery
In order to grant access to BigQuery:
Navigate to the Google Cloud Console and sign in with your Google account (i.e. your gmail account).
Once you’re in, either select a project or create a new one. Here are instructions to create a new project, if you're not sure how.
After you have created a project and/or selected your project, the project dashboard opens (Fig. 1), containing all the related information and statistics.
Fig. 1. Dashboard tab in Google Cloud Platform main page.
Now let's access Google BigQuery:
From the Google Cloud Platform page click the hamburger icon in the upper left corner and select API & Services > Credentials.
Click the blue menu called +Create credentials and select Service account (Fig. 2)
Fig. 2. Creating credentials. From the menu on the left, select API & Services > Credentials > Create credentials
Now let’s create the service account (Fig. 3):
In the field “Service account name” enter the service account name (of your choice).
In this example we used the account name KNIMEAccount.
Google now automatically generates a service account ID from the account name you provide. This service account ID has an email address format. For example in Figure 3 below you can see that the service account ID is: knimeaccount@rock-loop-268012.iam.gserviceaccount.com
Note. Remember this Service Account ID! You will need it later in your workflow.
Click Create to proceed to the next step.
Select a role for the service account. We selected Owner as the role for KNIMEAccount.
Click Continue to move on to the next step.
Scroll down and click the Create key button
In order to create the credential key, make sure that the radio button is set to P12. Now click Create.
The P12 file, containing your credentials is now downloaded automatically. Store the P12 file in a secure place on your hard drive.
Fig. 3. Creating a Service account. Step 1: Select Service Account name
Fig. 4. Creating a Service account. Step 2: Select Service Account permissions
Fig. 5. Creating a Service account. Step 3: Generate a P12 authentication key
In KNIME: Connect to Google BigQuery
Uploading and configuring the JDBC Driver
Currently (as of KNIME Analytics Platform version 4.1), the JDBC driver for Google BigQuery isn't one of the default JDBC drivers, so you will have to add it to KNIME.
Download the latest JDBC driver for Google BigQuery, then unzip the file and save it to a folder on your hard disk. This is your JDBC driver file.
Add the new driver to the list of database drivers:
In KNIME Analytics Platform, go to File > Preferences > KNIME > Databases and click Add
The “Register new database driver” window opens (Fig. 6).
Enter a name and an ID for the JDBC driver (for example name = bigQueryJDBC and ID=dbID)
In the Database type menu select bigquery.
Complete the URL template form by entering the following template string: jdbc:bigquery://<host>:<port>;ProjectId=<database>;
Click Add directory. In the window that opens, select the JDBC driver file (see item 2 of this step list)
Click Find driver class, and the field with the driver class is populated automatically
Click OK to close the window
Now click Apply and close.
Fig. 6. Adding the JDBC Driver to KNIME database drivers
Extracting Data from Google BigQuery
We are now going to start building our KNIME workflow to extract data from Google BigQuery. In this section we will be looking at the Google Authentication and Austin Bikeshare query parts of this workflow:
Fig. 7. Final workflow. The upper branch performs Google BigQuery connection and query
We start by authenticating access to Google: In a new KNIME workflow, insert the Google Authentication (API Key) node.
This is the Google Authentication (API Key) node. It's part of the KNIME Twitter and Google Connectors extension, available on the KNIME Hub.
How to configure the Google Authentication (API Key) node:
The information we have to provide when configuring the node here is:
The service account ID, in the form of an email address, which was automatically generated when the service account was created; in our example it is knimeaccount@rock-loop-268012.iam.gserviceaccount.com
And the P12 key file
Now that we have been authenticated, we can connect to the database, so add the Google BigQuery Connector node to your workflow.
This is the Google BigQuery Connector node. It's part of the KNIME BigQuery extension, available on the KNIME Hub.
How to configure the Google BigQuery Connector node:
Under “Driver Name” select the JDBC driver, i.e. the one we named bigQueryJDBC.
Provide the hostname, in this case we’ll use bigquery.cloud.google.com, and the database name. As database name here, use the project name you created/selected on the Google Cloud Platform.
Click OK to confirm these settings and close the window
Fig. 8. Google BigQuery Connector node configuration dialog. Provide the driver, the hostname and the name of your project on Google Cloud Platform
BigQuery has essentially become a remote database. Therefore, we can now use the DB nodes available in KNIME Analytics Platform. In these nodes you can either write SQL statements or fill GUI-driven configuration windows to implement complex SQL queries. The GUI-driven nodes can be found in the DB -> Query folder in the Node Repository.
Fig. 9. The DB->Query folder in the Node Repository
Now that we are connected to the database, we want to extract a subset of the data according to a custom SQL query.
We are going to access the austin_bikeshare trips database within a specific time period.
Let’s add the DB Table Selector node, just after the BigQuery Connector node.
This is the DB Table Selector node. It's part of the KNIME Database extension, which you can find and download from the KNIME Hub.
How to configure the DB Table Selector node:
Open the configuration, click Custom Query and enter the following SQL statement in the field called SQL Statement:
SELECT *,
  EXTRACT(DAY FROM start_time AT TIME ZONE "US/Central") AS day,
  EXTRACT(MONTH FROM start_time AT TIME ZONE "US/Central") AS month,
  EXTRACT(YEAR FROM start_time AT TIME ZONE "US/Central") AS year
FROM `bigquery-public-data.austin_bikeshare.bikeshare_trips`
Basically, we are retrieving the entire bikeshare_trips table, stored in the austin_bikeshare schema which is part of the bigquery-public-data project offered by Google. Moreover, we already extracted the day, month and year from the timestamp, according to the Austin timezone. These fields will be useful in the next steps.
Remember: When typing SQL statements directly, make sure you use the specific quotation marks (``) required by BigQuery.
We can refine our SQL statement by using a few additional GUI-driven DB nodes. In particular, we added a Row Filter node to extract only the days in the [2013, 2017] year range and a GroupBy node to produce the trip count for each day.
Finally, we append the DB Reader node to import the data locally into the KNIME workflow.
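If you want to sanity-check the same public table outside KNIME, a hedged Python sketch using the google-cloud-bigquery client is shown below. It assumes a JSON service-account key file and the google-cloud-bigquery and pandas packages; none of this is needed for the KNIME workflow itself, and the file name is a placeholder.

# Optional cross-check of the public table outside KNIME (not required for the workflow above).
from google.cloud import bigquery
from google.oauth2 import service_account

credentials = service_account.Credentials.from_service_account_file("service-account.json")
client = bigquery.Client(credentials=credentials, project=credentials.project_id)

sql = """
SELECT EXTRACT(YEAR FROM start_time AT TIME ZONE "US/Central") AS year, COUNT(*) AS trips
FROM `bigquery-public-data.austin_bikeshare.bikeshare_trips`
GROUP BY year
ORDER BY year
"""
print(client.query(sql).to_dataframe())   # yearly trip counts as a pandas DataFrame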
Uploading Data back to Google BigQuery
After performing a number of operations, we would like to store the transformed data back on Google BigQuery within the original Google Cloud project.
First create a schema where the data will be stored.
Go back to the Cloud Platform console and open the BigQuery application from the left side of the menu
On the left, click the project name in the Resources tab.
On the right, click Create dataset.
Give a meaningful name to the new schema and click Create dataset. For example, here we called it “mySchema” (Fig. 10)
Note that, for the free version (called “sandbox”), the schema can be stored on BigQuery only for a limited period of at most 60 days.
Fig. 10. Creating a personal dataset on Google BigQuery
In your KNIME workflow, now add the DB Table Creator node and connect it to the Google BigQuery Connector node.
The DB Table Creator node. You can find it on the KNIME Hub.
How to configure the DB Table Creator node:
Insert the name of the previously created schema. As above, we filled in “mySchema”
Provide the name of the table to create. This node will create the empty table where our data will be placed, according to the selected schema. We provided the name “austin_bike”.
Note: be careful to delete all space characters from the column names of the table you are uploading. Otherwise, they will be automatically renamed during table creation, which leads to a conflict because the column names no longer match.
Fig. 11. DB Table Creator node configuration. The schema name and a name for the new table are required
Add the DB Loader node and connect it to the DB Table Creator node and to the table whose content you want to load.
The DB Loader node. Find it and download it from the KNIME Hub.
How to configure the DB Loader node:
In the configuration window, insert the schema and the name of the previously created table. Again, for our example, we filled in “mySchema” as schema name and “austin_bike” as table name.
Fig. 12. DB Loader node configuration. Provide the same schema and table names
If all the steps are correct, executing this node will copy all the table content into our project schema on Google BigQuery.
Data Visualization 101: Five Easy Plots to Get to Know Your DataadminThu, 03/05/2020 - 16:00
Here are five different methods of sharing your data analysis with key stakeholders
Author: Paolo Tamagnini (KNIME). As first published in DevPro Journal.
There are many different scenarios when building a data science workflow. No matter how complex the data analysis, every data scientist needs to deal with an important final step: communicating their findings to the different stakeholders — decision-makers, managers, or clients. This final step is vital because if the findings cannot be understood, trusted or valued, then the entire analysis will be discarded and forgotten.
Besides the usual set of soft skills, data scientists can use data visualization to send a clear message in just a few slides. Data visualization uses colors, shapes, position, and other visual channels to encode information so that humans can understand data way faster than by reading some text or looking at an Excel spreadsheet.
Below you will find my personal top five preferred charts to visualize data. These charts have been generated using KNIME Analytics Platform.
1. Scatter Plot
A scatter plot represents input data rows as points in a two-dimensional plot. It is useful for bivariate visual exploration as you can easily display in a two-dimensional space a strong relationship between two features (columns) in the data. Interactively experimenting with different input columns on the x-y axis and with different graphical properties can be an efficient strategy to find those relationships.
2. Sunburst Chart
A sunburst chart displays categorical features through a hierarchy of rings. Each ring is sliced according to the nominal values in the corresponding feature and to the selected hierarchy. This is a powerful chart for multivariate analysis.
3. Stacked Area Chart
The stacked area chart plots multiple numerical features on top of each other using the previous line as the base reference. The areas in between the lines are colored for easier comparison. This chart is commonly used to visualize trending topics.
4. Bar Chart
A bar chart visualizes one or more aggregated metrics for different data partitions with rectangular bars where the heights are proportional to the metric values. The partitions are defined by the values in a categorical feature.
5. Line Plot
The line plot maps numerical values in one or more data features (y-axis) against values in a reference feature (x-axis). Data points are connected via colored lines. If the reference column on the x-axis contains sorted time values, the line plot graphically represents the evolution of a time series.
I have shown you my personally preferred charts to visualize data: scatter plot, sunburst chart, stacked area chart, bar chart, and line plot. They are basic charts, yet very powerful. They can reveal interesting information about bivariate analysis and the relationship between pairs of input features (scatter plot), multivariate analysis of nominal input features (sunburst chart), feature evolution over time such as topic trending (stacked area chart), comparison of aggregated metrics rather than single data points (bar chart), and finally the evolution of a time series (line plot).
These are, of course, not the only available charts to visualize and gain insights about the data we are analyzing. What are your preferred charts to visualize data?
"Considering the plethora of articles, applications, web tutorials and challenges on the data science subject that we’re seeing in the last 3-5 years, it can be pretty surprising to find only a few of them dedicated to time series analysis and forecasting. We’re living in the golden era of data analytics, with plenty of data and algorithms of any kind... but topics like deep learning, artificial intelligence and NLP are attracting basically all of the attention of the practitioners, while the concept of Time Series forecasting is often neglected. Since many forecasting methods date back decades, it’s quite easy to say "nothing particularly interesting there... Let’s focus on that brand-new machine learning algorithm!”. But this could be a great mistake, since more accurate and less biased forecasts can be one of the most effective drivers of performance in business, operations, finance and science (and by the way, to be clear, the innovation in Time Series Analysis is nowadays still animated, if you dig deeper under the surface). Knowing the basics of Time Series Analysis is one essential step in the data science world that enables important capabilities in dealing with sequence and temporal dynamics, in order to move beyond the typical cross-sectional data analysis. Facing the fundamentals of forecasting with time series data, focusing on important concepts like seasonality, autocorrelation, stationarity, etc is a key part of this type of analysis."
-Professor Daniele Tonini
Forecasting can feel like, and in many ways truly is, a completely different beast than other data science problems such as classification or numeric prediction. It fits into a different paradigm than the traditional “data in; prediction out” of the prior cases.
In conjunction with Professor Daniele Tonini of Bocconi University in Milan, Italy, we’ve built a set of components in KNIME to help get started on this task. A component is a KNIME node that encapsulates the functionality of a KNIME workflow, such as training an ARIMA model. Components can be reused and shared locally, via KNIME Server, or on the KNIME Hub.
The components for time series analysis cover various tasks from aggregating and inspecting seasonality in time series to building an AutoRegressive Integrated Moving Average (ARIMA) model and checking the model residuals. These components use the KNIME Python integration, extending the analytical capabilities of KNIME Analytics Platform for time series analysis with the statsmodels module in Python. However, the code only executes in the background, and you can define the settings for each task, as for any other KNIME node: in the component’s configuration dialog.
The time series components are available on the KNIME Hub. Drag and drop them into your workflow editor and start building your KNIME workflows for time series analysis!
Figure 1: Accessing a time series component on the KNIME Hub: drag and drop the component into your workflow editor.
In this blog post we want to take a moment to introduce just a few of these new components and talk about how they slot together into a Time Series Analysis pipeline.
Steps in Time Series Analysis
Let's say you have a sensor attached to something, maybe the electrical meter on your house, and you want to plan your budget. You need to forecast how much power you’ll use in the future.
Time Aggregation
The first step we might take after accessing this sensor data is to reduce it into a manageable shape. Aggregating this data, perhaps to the hour, will not only reduce the data size substantially, but also smooth out some of the noise.
For this operation, we’ll need the Aggregation Granularity component that generates daily total energy consumption from hourly data. Many granularities and aggregation methods are available, and we could also generate, for example, monthly average values from daily data. (Figure 2)
Figure 2: Aggregating time series by selected granularity and aggregation method using the Aggregation Granularity component. In this example we calculate daily total values of the “cluster_26” column.
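Since the components build on the KNIME Python integration, the underlying operation is presumably similar to a pandas resample; a hedged sketch follows, with a synthetic hourly series standing in for the real data and "cluster_26" reused from the figure above as the column name.

# Roughly the idea behind the Aggregation Granularity component: daily totals from hourly readings.
import numpy as np
import pandas as pd

hours = pd.date_range("2010-03-01", periods=24 * 14, freq="h")
energy = pd.DataFrame({"cluster_26": np.random.rand(len(hours))}, index=hours)

daily_total = energy["cluster_26"].resample("D").sum()     # granularity "day", aggregation "sum"
monthly_mean = energy["cluster_26"].resample("MS").mean()  # e.g. monthly averages, another option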
Timestamp Alignment
Next we need to verify our data is clean. One assumption time series models require is continuously spaced data. But what if our sensor was busted or we used no power for an hour?
If we find gaps in our time series, we fill them in. We just need to define the granularity of the spacing, which determines a gap as a missing minute, hour, or day, for example. The inserted timestamps will be populated with missing values that can be replaced by, for example, linear interpolation of the time series. (Figure 3)
Figure 3: Filling gaps in time series using the Timestamp Alignment component. In this example, the energy consumption for the last hour on March 24, 2010 is not reported. Therefore, the timestamp is added to the time series and populated with a missing value.
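A hedged pandas sketch of the same gap-filling idea is shown below, using a tiny toy series with one missing hour; it is not the component's exact code.

# Insert the missing hourly timestamp, then replace the resulting missing value by interpolation.
import pandas as pd

idx = pd.to_datetime(["2010-03-24 21:00", "2010-03-24 22:00", "2010-03-25 00:00"])
consumption = pd.Series([1.2, 1.4, 1.1], index=idx)

aligned = consumption.asfreq("h")              # inserts 2010-03-24 23:00 with a missing value
filled = aligned.interpolate(method="linear")  # linear interpolation of the gap, as described above
print(filled)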
Inspecting Seasonality and Trend
Ok, so we’ve got aggregated, cleaned data. Before we get into modeling, it’s always good to explore it visually. Many popular models assume a stationary time series, meaning that its statistics remain the same over time. Therefore, we decompose the time series into its trend and seasonalities and finally fit the model to its irregular (residual) part.
We can inspect seasonality in time series in an autocorrelation (ACF) plot. Regular peaks and lows in the plot tell about seasonality in the time series, which can be removed by differencing the data at the lag with the greatest correlation. To find this local maximum, we use the Inspect Seasonality component (Figure 4). To remove the seasonality at the local maximum, we use the Remove Seasonality component. The second, third, etc. seasonalities can be removed by repeating this procedure. (Figure 5.)
Figure 4: Inspecting seasonality in time series using the Inspect Seasonality component. The highest correlation is found between the current value and the value 24 hours before.
Figure 5: Removing seasonality in time series using the Remove Seasonality component. In this example, the first (daily) seasonality is removed by differencing the data at the lag 24, which is the first (daily) seasonality given by the Inspect Seasonality component. The autocorrelation plot shows now the second (weekly) seasonality, which can be removed using the Remove Seasonality component another time.
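Under the hood these components rely on the statsmodels module (see above); a hedged sketch of the same two steps, finding the lag with the highest autocorrelation and differencing at that lag, might look like this.

# Inspect seasonality via the ACF, then remove it by differencing at the dominant lag.
import numpy as np
from statsmodels.tsa.stattools import acf

def dominant_seasonal_lag(series, max_lag=200):
    correlations = acf(series, nlags=max_lag)
    return int(np.argmax(correlations[1:]) + 1)   # skip lag 0, which is always 1.0

def remove_seasonality(series, lag):
    return series.diff(lag).dropna()              # seasonal differencing at the given lag

# With hourly energy data the dominant lag is typically 24 (daily seasonality); repeating the
# procedure on the differenced series then reveals the weekly seasonality at lag 168.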
Automatically Decomposing a Signal
In the previous section we inspected our signal and differenced it. Then we inspected the second seasonality and differenced it. This is a perfectly good strategy and gives you complete control over the processing of your data. We’ve also published a component called Decompose Signal, which automatically checks your signal for trend and two levels of seasonality, returning the decomposed elements and the residual of your signal for further analysis (Figure 6). Perhaps with our ARIMA components in the next section?
Figure 6: The output view of the Decompose Signal component. The view shows the progress of the Signal as it progresses through the decomposition stages as well as the respective ACF Plots. In this example, we remove a (very minor) trend, and two layers of seasonality shown in the line plots on the left. The two first seasonalities are daily (24) and weekly (168).
Building ARIMA Models
We’re nearly there now: we’ve satisfied the stationarity requirement of models like ARIMA. We can build this model easily with the new ARIMA Learner and Auto ARIMA Learner components.
When we build an ARIMA model, we need to define three parameters that determine which temporal structures are captured by the model. These parameters describe the relationship between the current value and the lagged values (AR order), the relationship between the current forecast error and the lagged forecast errors (MA order), and the degree of differencing required to make the time series stationary (I order). The best AR, I, and MA orders can be determined by exploring the (partial) autocorrelation plots of the time series.
With the AR, I, and MA orders defined, we can then train the ARIMA model. However, defining the parameter values is not always straightforward. In such a case, we can give the maximum values for these parameters, train multiple ARIMA models, and select the best performing one based on an information criterion AIC or BIC. We can do this with the Auto ARIMA Learner component. Both components output the models, their summary statistics and in-sample forecasting residuals. (Figure 7.)
Figure 7: Training an ARIMA model using the ARIMA Learner and Auto ARIMA Learner components. The ARIMA Learner component trains a model with pre-defined AR, I, and MA orders. The Auto ARIMA Learner component trains multiple models with different combinations of the AR, I, and MA orders within the defined range, and produces the best performing model in its output. The data outputs of the components show the model summary statistics and in-sample forecasting residuals.
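Since both components build on statsmodels, the core of what they do can be sketched as follows; this is a simplified illustration, and the actual components add summary statistics, error handling, and KNIME I/O on top.

# Train one ARIMA(p, d, q) model, or grid-search the orders and keep the lowest-AIC model.
import itertools
from statsmodels.tsa.arima.model import ARIMA

def fit_arima(series, order=(2, 1, 1)):
    return ARIMA(series, order=order).fit()

def auto_arima(series, max_p=3, max_d=1, max_q=3):
    best = None
    for order in itertools.product(range(max_p + 1), range(max_d + 1), range(max_q + 1)):
        try:
            result = ARIMA(series, order=order).fit()
        except Exception:
            continue                                  # skip orders that fail to converge
        if best is None or result.aic < best.aic:     # model selection by AIC (BIC works the same way)
            best = result
    return best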
Forecasting by ARIMA Models
Finally we’re ready to model: we need to generate the forecast so we can fill in the energy line item on our budget!
The ARIMA Predictor component applies the model to the training data and produces both in-sample and out-of-sample forecasts. For the out-of-sample forecasts we need to define the forecast horizon. In our case here, we select a forecast horizon “1” if we’re interested in the estimated energy consumption in the next hour, 24 for the next day, 168 for the next week, and so on. For the in-sample forecasts we can either use the forecast values to generate new in-sample forecasts (dynamic prediction), or use the actual values for all in-sample forecasts. In addition, if the I order of the ARIMA model is greater than zero, we define whether we forecast the original (levels), or differenced (linear) time series. (Figure 8.)
Figure 8: Producing in-sample and out-of-sample forecasts with the ARIMA Predictor component. We can control, firstly, the number of out-of-sample forecasts; secondly, whether actual or forecast values (dynamic prediction) are used for in-sample forecasting; and thirdly, whether the original or differenced time series is forecast, if the model’s I order is greater than zero.
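In statsmodels terms, the two kinds of forecasts look roughly like this; the series is a synthetic stand-in and the orders are placeholders, not the component's exact code (the component's levels/linear option corresponds to forecasting the original or the differenced series).

# Out-of-sample and in-sample forecasts from a fitted ARIMA model.
import numpy as np
from statsmodels.tsa.arima.model import ARIMA

series = np.random.randn(500).cumsum()          # stand-in for the energy consumption series
result = ARIMA(series, order=(2, 1, 1)).fit()

next_day = result.forecast(steps=24)            # out-of-sample: forecast horizon 24 (next day)
in_sample = result.predict(start=1, end=len(series) - 1, dynamic=False)    # actuals as inputs
dynamic_fc = result.predict(start=400, end=len(series) - 1, dynamic=True)  # dynamic prediction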
Analyze Forecast Residuals
Before we deploy our model and really take advantage of it, it’s important that we are confident in it. KNIME Analytics Platform has many tools for analyzing model quality, from ROC curves and confusion matrices with the scorer nodes to statistical scores like MSE. We’ve also added a component specifically built for ARIMA residuals: the Analyze ARIMA Residuals component. This component applies the Ljung-Box test of autocorrelation for the first 10 lags and reports whether or not the stationarity assumption is rejected, based on a 95% confidence level (Figure 9). It also shows the ACF plot for the residuals, and a few other diagnostic plots.
Figure 9: Output view of the Analyze ARIMA Residuals component that shows the ACF Plot as well as the LB-Test statistics, and if you scroll further down the residual time plot and normality measures. To use the Analyze ARIMA Residuals component simply select the residual data column and the proper degrees of freedom for your input model.
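The test itself is available in statsmodels; a hedged sketch of the core check follows, assuming a recent statsmodels version, with a synthetic series and placeholder orders rather than the component's exact settings.

# Ljung-Box test of residual autocorrelation for the first 10 lags.
import numpy as np
from statsmodels.tsa.arima.model import ARIMA
from statsmodels.stats.diagnostic import acorr_ljungbox

series = np.random.randn(500).cumsum()                     # stand-in series
result = ARIMA(series, order=(2, 1, 1)).fit()
lb = acorr_ljungbox(result.resid, lags=[10], model_df=3)   # model_df: p + q of the fitted model
p_value = float(lb["lb_pvalue"].iloc[0])
print("no significant residual autocorrelation" if p_value > 0.05 else "autocorrelation remains")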
Summing up!
In this blog post, we have introduced a few steps in time series analysis and how to perform the different operations from preprocessing to visualizing, decomposing, modeling, and forecasting time series using the time series components available on the KNIME Hub.
Of course, the required operations in preprocessing time series are often more than just aggregating time series and inspecting seasonality. Besides ARIMA models, time series can also be forecast using classical methods, machine learning based methods, and neural networks.
Our time series course provides a more comprehensive view of properties, descriptive analytics, and forecasting methods for time series. In the course we also show how to forecast time series using a rolling window and thus enhance the prediction accuracy for larger forecast horizons.
Keep an eye out for our Time Series Analysis courses coming up near you in the future on our events page.