
Motifs and Mutations - The Logic of Sequence Logos

Published: March 16, 2020

Author: Franziska Rau (KNIME)


In a previous blog article, Blast from the Past, we traveled back in time to investigate ancient DNA. As we learned in that article, DNA consists of a sequence of nucleotides, which can be viewed as a very long string like this: AGTCGCAGAGT...

Comparing the genome sequences of different species has revealed that humans and chimpanzees are identical across roughly 96 percent of their DNA sequence1.

Why, then, do we look so different? Small differences with huge effects can appear in regulatory regions of our genome. A regulatory sequence is a segment of DNA to which specific proteins can bind, thereby influencing gene expression (the synthesis of a functional gene product). These sequences are often conserved within a species, as small changes can have deleterious effects. Short conserved sequence patterns with biological significance are called motifs. What happens if changes appear in these motifs? And how can we find out? This is where bioinformaticians come into play.

Topic: examining motifs using multiple sequence alignment in KNIME with the SeqAn Community extension
Keywords: sequence motif, sequence logo, multiple sequence alignment, beta thalassemia

In this blog post, we have selected a motif in which changes can lead to an inherited blood disorder known as beta thalassemia, and we want to take a closer look at it. We do this by introducing you to one of the most fundamental bioinformatics methods: multiple sequence alignment. To realize this in KNIME Analytics Platform, we make use of community extensions that allow us to easily analyze biological sequences. To visualize the results, we create a sequence logo using a Generic JavaScript View. A sequence logo is a frequently used graphical representation of how well nucleotides are conserved across the positions of an alignment.

It is of course also possible to visualize non-DNA letters, should you want to show people a sequence of your interest like this:

(Image: an example sequence logo built from non-DNA letters)

Introduction 

Aligning multiple sequences is one of the most common tasks in the field of bioinformatics, as it allows these sequences to be systematically compared. A multiple sequence alignment (MSA) can provide information about related sequences while taking mutations, insertions, deletions, and rearrangements into account2. It is possible to align either nucleotide or protein sequences with the goal of finding motifs or conserved regions, analyzing domains, or detecting phylogenetic relationships.

Often many sequences are compared with each other. This makes it difficult to immediately recognize patterns or conserved regions. To simplify this, a sequence logo can be used, which allows for a compressed representation of multiple sequences without any loss of information. 

In this example, we will have a look at the promoter region of the HBB (Hemoglobin Subunit Beta)3 gene. A promoter region is the part of a DNA sequence that is important for the initiation of transcription of a gene. This, in turn, affects the production of specific proteins, as in the case here of the beta-globin protein. Beta-globin is a subunit of hemoglobin, a larger protein located within red blood cells with the job of transporting oxygen throughout the body.

Mutations in the HBB gene can trigger certain diseases. The promoter region we are looking at in this example is the so-called TATA- or ATA-box, to which a protein called the TATA-binding protein binds. This interaction plays an important role in the initiation of transcription. If transcription is negatively affected by mutations, this can decrease or even stop production of beta-globin altogether4. As a result, beta-thalassemia5 can develop, a condition in which the number of red blood cells is lower than normal. This can lead to minor symptoms such as pale skin, weakness, or fatigue. In more severe cases, blood transfusions are required, which can lead to an excess of iron in the body, resulting in problems with the heart, liver, and hormone levels.

Analyzing DNA motifs using KNIME Analytics Platform

To satisfy your curiosity as to how mutations in motifs can help us learn more about the genetic basis of specific diseases, we created an example workflow (see fig. 1), which shows just how it works. You can download the Seqan Tcoffee (Multiple Alignment) and Sequence Logo workflow from the KNIME Hub.

First, we load the different sequences containing the mutations as a FASTA file, using the Input File node. In the next step, we insert the SeqanTcoffee node from the SeqAn Community Extensions to create a multiple alignment. If you’re not sure how to install these extensions, refer to the website: SeqAn nodes in KNIME.

We now take this multiple alignment and create a sequence logo using the Generic JavaScript View. This pinpoints the positions at which mutations in the motif have occurred.


Fig. 1. This example workflow shows how to handle FASTA files and create a multiple sequence alignment from several sequences. The results are visualized in a sequence logo created with the Generic Javascript View node located in the Sequence Logo component.

To give you a more detailed insight into the individual steps, we will describe the nodes we used in figure 1 in the following sections. Stay tuned!

Biological Sequence Format - FASTA

In bioinformatics, the FASTA6 file format is commonly used for representing either nucleotide or amino acid sequences. In our example, we used a multi-FASTA file with different ATA-box motifs as the input. The ATA-box motifs shown in fig. 2 belong to people who are suffering from beta-thalassemia.


Fig. 2. The input FASTA file, containing different ATA-box motifs, shows that it is not easy to see the conserved regions at first glance. 

The file begins with a single-line description of the sequence, followed by the sequence itself. The description line always starts with '>', a convention recognised by many algorithms and tools, and typically contains the gene name, the species, or simply a comment. KNIME Analytics Platform also provides functionality for reading FASTA files, using either the Input File node or the Load FASTA Files node from the Vernalis Community Extension. The FASTA file shown in figure 2, including the corresponding sequences, serves as the input for the T-Coffee multiple sequence alignment in the next step.
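If you prefer to inspect such a file programmatically first, a multi-FASTA file is easy to parse by hand. Here is a minimal Python sketch; the file name atabox_motifs.fasta is a placeholder, not the actual file shipped with the workflow:

```python
def read_fasta(path):
    """Parse a (multi-)FASTA file into a dict of {description: sequence}."""
    sequences = {}
    header = None
    with open(path) as handle:
        for line in handle:
            line = line.strip()
            if not line:
                continue
            if line.startswith(">"):
                header = line[1:]          # drop the leading '>'
                sequences[header] = ""
            elif header is not None:
                sequences[header] += line  # a sequence may span several lines
    return sequences

# Example: print each record's description and its ATA-box motif sequence
for name, seq in read_fasta("atabox_motifs.fasta").items():
    print(name, seq)
```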

Multiple Sequence Alignment - T-Coffee

To analyze this specific ATA-box motif, we use the SeqanTcoffee node from the SeqAn Community Extensions. T-Coffee7 (Tree-Based Consistency Objective Function for alignment Evaluation) is a method based on a progressive approach that increases the accuracy of aligning multiple sequences. The first step of the algorithm is to generate primary libraries, which contain sets of pairwise alignments. By default, two libraries are generated: global pairwise alignments using ClustalW8 and local alignments using Lalign from the FASTA package9. It is also possible to calculate the pairwise alignments beforehand and to use common libraries such as BLAST10 and MUMmer11. That is why the T-Coffee node has three input ports: the first receives a multi-FASTA file as input, and the other two optional ports can read in already aligned sequences in different file formats.

In the next step the initial libraries are combined into a single primary library. A distance matrix is calculated from that library and used to compute a guide tree, which represents the relationships between the sequences. To build the tree, clustering methods such as Neighbor-Joining12 or UPGMA13 are used. In the final step the multiple sequence alignment is built from the guide tree by adding the sequences sequentially, beginning with the most similar pair and progressing to the most distantly related.
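As a hedged aside (this is not the SeqAn node's internal code), UPGMA is equivalent to average-linkage hierarchical clustering, which a few lines of SciPy can illustrate on a toy distance matrix:

```python
from scipy.cluster.hierarchy import linkage
from scipy.spatial.distance import squareform

# Toy pairwise distance matrix for four sequences (the values are made up)
distances = [
    [0.0, 0.1, 0.4, 0.5],
    [0.1, 0.0, 0.4, 0.5],
    [0.4, 0.4, 0.0, 0.2],
    [0.5, 0.5, 0.2, 0.0],
]

# UPGMA corresponds to average-linkage clustering on the condensed distance matrix
guide_tree = linkage(squareform(distances), method="average")
print(guide_tree)  # each row records one merge; the merge order defines the guide tree
```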

While the sequences are added sequentially, the alignments are scored. Gap-open and gap-extension penalties are used for this. Since gap penalties were already applied when calculating the pairwise scores for the primary library, gap-open and gap-extension penalties are set to low values in the progressive alignment by default. These values can be adjusted, depending on the purpose. If your interest is to find closely related matches, a higher gap penalty should be used to reduce gap openings.
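To get a feeling for how gap-open and gap-extension penalties shape an alignment, here is a small Biopython sketch you can experiment with outside KNIME; the scores below are arbitrary example values, not the T-Coffee defaults:

```python
from Bio import Align

# Global pairwise alignment with separate gap-open and gap-extension penalties
aligner = Align.PairwiseAligner()
aligner.mode = "global"
aligner.match_score = 2          # reward for identical nucleotides
aligner.mismatch_score = -1      # penalty for substitutions
aligner.open_gap_score = -5      # opening a gap is expensive ...
aligner.extend_gap_score = -0.5  # ... extending an existing gap is cheap

alignment = aligner.align("CATAAAAGGC", "CATAAAGGC")[0]
print(alignment)        # the aligned sequences, with the inserted gap
print(alignment.score)  # raising the gap-open penalty discourages new gaps
```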

Generic JavaScript View

KNIME provides a number of possibilities for visualizations via JavaScript. In case the built-in JavaScript views are not sufficient for your use case, you can always use the Generic JavaScript View node to implement your own customized visualizations. In a recent blog post, From A for Analytics to Z for Zika Virus, we discussed how to create your own interactive views using the Generic JavaScript View node. In today’s example, we use the Generic JavaScript View to create a sequence logo that can be used in combination with other views such as the Table View. We can get a useful, interactive view by combining both nodes in a component, as can be seen in figure 3. You can download the shared Sequence Logo component from the KNIME Hub.


Fig. 3. The component contains the Generic JavaScript View and a Table View to create an interactive composite view.

The component uses the output of the multiple alignment to create a logo that shows how well nucleotides are conserved at each position. Highly conserved nucleotides should be displayed as large letters; if we find many gaps or different nucleotides at a position, we want those to be represented by small letters. To achieve that, we calculate the maximal entropy for each position in the sequence. To calculate the individual height of each nucleotide per position, we multiply the maximal entropy by the relative frequencies. Entropy is typically measured in bits, the basic unit of information.
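For readers who want to see the arithmetic spelled out, here is a minimal Python sketch of the classic per-position calculation for a DNA logo: the information content of a column is the maximal entropy (2 bits for four nucleotides) minus the observed entropy, and each letter's height is its relative frequency times that information content. This is a hedged illustration of the standard formulation, not the component's actual JavaScript code:

```python
import math
from collections import Counter

def logo_heights(aligned_seqs, alphabet="ACGT"):
    """Per-position letter heights (in bits) for a DNA sequence logo."""
    max_entropy = math.log2(len(alphabet))   # 2 bits for A, C, G, T
    heights = []
    for i in range(len(aligned_seqs[0])):
        column = [seq[i] for seq in aligned_seqs if seq[i] in alphabet]  # skip gaps
        counts = Counter(column)
        total = sum(counts.values())
        freqs = {base: n / total for base, n in counts.items()}
        entropy = -sum(p * math.log2(p) for p in freqs.values())
        information = max_entropy - entropy   # how conserved this column is
        heights.append({base: p * information for base, p in freqs.items()})
    return heights

# A perfectly conserved position yields a single letter 2 bits tall
print(logo_heights(["ATAAAA", "ATAAAA", "ATACAA"]))
```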

This logo can be used to visualize certain motifs that occur repeatedly in multiple sequences. It simplifies the evaluation of the results, because we can easily spot where changes have occurred. 

A very important feature of the JavaScript nodes is that they all support interactivity between the different visualizations in the component view. This also makes it possible to click on a nucleotide in the sequence logo and see, via the JavaScript Table View, in which samples it occurs at that position. You can easily use the created code on your own data, adjust it, and enjoy the view!

Result

Let’s have a look at the result of the Generic JavaScript View, the sequence logo of the promoter region of the HBB gene. Figure 4 shows the graphical representation of the ATA-box motifs from healthy individuals. The repeating sequence of the ATA box, which typically consists of the nucleotide sequence 5'-ATAAAA-3', is clearly recognizable, especially because these nucleotides are displayed the largest.


Fig. 4. The sequence logo shows the wildtype ATA-box motif, which consists of the typical repeating sequences 5'-ATAAAA-3'.

So far, so good - but what changes in this motif will lead to beta-thalassemia? When we look at the logo in figure 5, we see that other nucleotides occur at the same positions as the ATA-box motif. 


Fig. 5. Sequence logo of samples associated with beta-thalassemia. In this sequence logo we can see that the typical ATA-box motif changed in size and other nucleotides also appear at the same positions as compared to wildtype.

This means that at some point, one nucleotide has been replaced by a different one. In most cases this is harmless and happens constantly in our body, but here we observe these changes in patients with beta-thalassemia. This leads to the hypothesis of a connection between the observed mutations and the disease. Indeed, it has been experimentally verified that these nucleotide changes hinder effective binding of the TATA-binding protein that is needed for the synthesis of HBB.

Summing up

This was a small example of how alignments and visualization tools can be used in KNIME Analytics Platform. You can easily build upon that workflow and adjust it to your needs. Our goal was to show what a comparison could look like between motifs from healthy people and people who are suffering from a disease.

In the first step, we created a multiple sequence alignment of the sequences from healthy individuals and people affected by beta-thalassemia using the tool T-Coffee. More specifically, we used sequences of a regulatory region, the ATA-box motif, of the gene HBB (Hemoglobin Subunit Beta). The resulting alignment served as input for a Generic JavaScript View in which we created a sequence logo to visualize the results. This kind of logo is often used in bioinformatics because it enables us to quickly see where in the sequence changes have occurred. Given that sometimes hundreds of sequences are compared with each other, the logo is a simplification that provides a quick overview. The sequence logo made it possible for us to detect mutations in the motifs in a simple way and thereby derive hypotheses about the genetic basis of beta-thalassemia.

References

1. New Genome Comparison Finds Chimps and Humans Very Similar at DNA Level

2. Multiple sequence alignment modeling: methods ... - Oxford Journals

3. HBB gene - Genetics Home Reference - NIH

4. The Mechanism by which TATABox Polymorphisms ... - ASSA

5. Beta thalassemia - Genetics Home Reference - NIH

6. FASTA format - The Yang Zhang Lab - University of Michigan

7. T-Coffee: A Novel Method for Fast and Accurate Multiple Sequence ....

8. Clustal W and Clustal X Multiple Sequence ....

9. UVA FASTA Server

10. BLAST: Basic Local Alignment Search Tool

11. The MUMmer Home Page

12. Neighbor-joining method: a new method for reconstructing ....

13. UPGMA Method


Exploring a Chemistry Ontology with KNIME

Published: March 23, 2020

Author: Martyna Pawletta (KNIME)

We are often asked if it’s possible to work with ontologies in KNIME Analytics Platform.


People can mean many different things by “working with ontologies”, so today let’s focus on one particular ontology and on basic tasks, namely reading and querying it in order to build an interactive tool at the end. For this purpose, we dive into the world of chemistry and use the ChEBI ontology (Chemical Entities of Biological Interest).

Even if chemistry is not a domain of interest for you, this blog post can still be of high value, as we show, for example, how to read an OWL file, how to create queries in SPARQL, as well as different possibilities for visualizing ontology content in an interactive composite view. How you adapt this to your own use case, ontology, and extracted dataset, we will leave to you and your imagination.


ChEBI

Especially in the life sciences, ontologies are very popular and frequently used for different purposes such as data integration, curation, defining standards, or labeling. Just how important ontologies are is illustrated by the fact that repositories such as BioPortal1 already contain more than 800 ontologies.

With the workflow described in this blog post, we will demonstrate a way to explore ChEBI2, a freely available ontology that classifies chemical compounds and provides information about their roles, such as their application or their biological and chemical role. It is organized into three main classifications: chemical entity, role, and subatomic particle. In this workflow, we use the molecular entity and role classes (see Fig. 1).

ChEBI can be downloaded in different file formats - today we will work with an OWL file which can be downloaded here.


Fig. 1: Overview of ChEBI classes (screenshot taken from here). The yellow boxes show which parts will be used and explored in the workflow. The remaining parts are ignored.

Let’s start!

How to read and query ontologies stored in the OWL format is described in the blog post Will They Blend? KNIME Meets the Semantic Web. In the example described there, we used a pizza ontology to show how easy it is to explore that type of data.

With the following example workflow, we play with the terms and content of the ChEBI ontology while combining searches, results and data in order to create interactive views where the content can be explored. We hope to learn about compounds, their biological and chemical roles, as well as definitions and other sources that contain references to a particular compound.

This analysis was realized in the workflow depicted in Fig. 2 and contains the following main steps:

Step 1. Reading the OWL file into a SPARQL Endpoint

Step 2. Substructure search & selecting a role of a chemical compound

Step 3. View compounds matching the substructure search and role. Here one compound needs to be selected

Step 4. Show the selected compound in a network with all their parent classes, hierarchies and roles. Select a disease in the Tag Cloud to merge some more data in the next step

Step 5. Viewing results from selection in Step 4


Fig. 2: Example workflow showing how to explore the ChEBI ontology stored in OWL format.

Step 1. Reading the OWL file into a SPARQL Endpoint

Analogous to the previously described use case of a pizza ontology, we use the Triple File Reader node to read the OWL file and insert the list of triples into a SPARQL Endpoint which is connected to a Memory Endpoint Node (See Fig. 2, Step 1). With this in place, and successfully executed, we now have the basis to start writing and executing SPARQL queries as well as filtering information from the list of Triples.

Quick reminder here: RDF triple - also known as semantic triple - always contains three columns: subject, predicate, and object. Read more here


Fig. 3: Schema showing how to interpret RDF triples.
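Outside KNIME, the same triples can be loaded and queried with a few lines of Python. The following is a hedged sketch using rdflib; the file name chebi.owl is a placeholder for wherever you saved the download, and the query only uses generic RDFS predicates (labels and subclass links), so it should work on most OWL ontologies:

```python
from rdflib import Graph

# Load the OWL file into an in-memory triple store
# (the full ChEBI OWL is large, so parsing may take a while)
g = Graph()
g.parse("chebi.owl", format="xml")  # ChEBI OWL is serialized as RDF/XML

# SPARQL: labels of classes together with the labels of their direct parents
query = """
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
SELECT ?childLabel ?parentLabel WHERE {
    ?child rdfs:subClassOf ?parent .
    ?child rdfs:label ?childLabel .
    ?parent rdfs:label ?parentLabel .
}
LIMIT 10
"""

for child_label, parent_label in g.query(query):
    print(f"{child_label}  is a  {parent_label}")
```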

Step 2. Substructure search paired with role information

Imagine a scientist is investigating a new compound in development. She knows the chemical structure and the application of that compound but is curious to see if there are other compounds in ChEBI with similar properties. In this example workflow a SMILES for a substructure search can be added and the application or biological/chemical role of the compound can be selected (see Fig. 4).

To do this, the “Enter Search Options” component is used to create a search query that includes the above-mentioned properties (right-click the component → select Interactive View).

To allow insertion of a SMILES string, we have used the String Widget node. Here, in Fig. 4, we added a phenothiazine substructure.

Let’s select dopaminergic antagonist as the role. Dopaminergic antagonists are frequently used in anti-psychotic drugs for treating schizophrenia, bipolar disorder, or stimulant-induced psychosis.


Fig. 4: The “Enter Search Options” component contains two input options: one for a substructure search and one for searching for compounds that have a specific role assigned in the ontology.

Little hint here: As an alternative to the String Widget, a Molecule String Input node could also be used. This would give you the opportunity to draw a chemical structure instead of pasting a SMILES string. 
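If you want to reproduce the substructure check in code, RDKit (the same toolkit behind the RDKit KNIME nodes) does this in a few lines. The SMILES strings below are illustrative and written from memory, so double-check them before relying on them:

```python
from rdkit import Chem

# Hypothetical example: does promazine contain the phenothiazine core?
# Both SMILES strings should be verified against a trusted source.
phenothiazine = Chem.MolFromSmiles("c1ccc2c(c1)Nc1ccccc1S2")    # query substructure
promazine = Chem.MolFromSmiles("CN(C)CCCN1c2ccccc2Sc2ccccc21")  # candidate compound

if promazine.HasSubstructMatch(phenothiazine):
    print("Match: the compound contains the phenothiazine scaffold")
else:
    print("No substructure match")
```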

Step 3. Let’s view first results!

In the following view (“Result View” component in the example workflow) you can inspect the results of the substructure search. Here the Tile View and RDKit Highlighting were used to show all compounds matching the entered search options. All these compounds are dopaminergic antagonists as selected in the previous view.

In this example we selected a compound called Promazine to see more information.

Exploring Chemistry Ontology using KNIME

Fig. 5: Result view showing chemical compounds with the highlighted substructure using the Tile View node.


Step 4. Show a network with class information

In this example workflow, we would also like to show a way to visualize an ontology as a network. We do this with the Network Mining Extension. The view in this step is also interactive, which means that whenever a node in the network is selected, the table under the network shows more information for the selected entity, such as its definition.


Fig. 6: Network view showing the selected compound as well as a network including the subClassOf connections as “has role” and “is a”. Additionally, more information such as definitions and synonyms can be selected and made visible for a node in the network.

If you scroll down, a second network becomes visible. This network starts from the selected compound, here Promazine, and shows how it has been classified in ChEBI. It shows all “is a” links from the compound through to the chemical entity class. Additionally, the “has role” links were added to show which roles are linked (blue nodes).


Fig. 7: Network view showing the selected compound as well as a network including the subClassOf connections as “has role” and “is a”. 

Step 5. Show compounds sharing two roles

Looking at the network view in Fig. 6 and investigating the roles of a compound, you might spot another interesting role and want to see more compounds with that role. You can then go one step further and select two different roles from the table, for example the already known dopaminergic antagonist in combination with the H1-receptor antagonist, a role that plays a part in relieving allergic reactions.

In the last component, “View Compounds Sharing Selected Roles” (Step 5), we see all the compounds that have both selected roles, along with additional information such as definitions, synonyms, or references to other ontologies, databases, and sources.


Fig. 8: Last interactive view of the workflow showing compounds with additional information having both selected “roles” from the network view.

Extensions needed to run the workflow: the KNIME extensions providing the Semantic Web (SPARQL), RDKit, and Network Mining nodes used above.

Instructions on how to install extensions can be found here.

Wrapping up

We started with an OWL file containing the ChEBI ontology and went through different steps of data exploration and visualization. We showed how to read an OWL file, how to create queries in SPARQL and presented different possibilities for visualizing ontology content in an interactive composite view. With this, we learned about the Promazine compound and the biological and chemical roles of that compound. We also discovered more about similar compounds and their roles, definitions, and synonyms.

The resulting data extracted from the ChEBI ontology can be directly explored using KNIME Analytics Platform. The workflow can also be deployed to KNIME Server, where a domain expert who is not necessarily a KNIME or ontology expert can analyze the data in the WebPortal without needing to write SPARQL queries.

The workflow described in this blog post, the ChEBI Ontology Explorer, can be downloaded from the KNIME Hub.

References

1. Whetzel PL, Noy NF, Shah NH, Alexander PR, Nyulas C, Tudorache T, Musen MA. BioPortal: enhanced functionality via new Web services from the National Center for Biomedical Ontology to access and use ontologies in software applications. Nucleic Acids Res. 2011 Jul;39(Web Server issue):W541-5. Epub 2011 Jun 14.

2. Hastings J, Owen G, Dekker A, Ennis M, Kale N, Muthukrishnan V, Turner S, Swainston N, Mendes P, Steinbeck C. (2016). ChEBI in 2016: Improved services and an expanding collection of metabolites. Nucleic Acids Res.

Seven things to do after installing KNIME Analytics Platform

Published: March 26, 2020

You have just downloaded KNIME Analytics Platform, what next?

Author: Rosaria Silipo (KNIME)


Here are seven steps for a fast and practical, learning-by-doing start to using it. After you’ve got started, take a look at more educational material, like for example one of our e-learning courses, onsite courses, cheatsheets, e-books, videos, local meetup events, and more. Our “sat nav” for finding the educational resources that most suit your skills and time constraints is here in the blog article Get on Board and Navigate the Learning Options at KNIME!

The seven things to do 

  1. Explore the welcome page 
  2. Learn from a pre-built workflow 
  3. Get familiar with the workbench of KNIME Analytics Platform
  4. Find more resources on the KNIME Hub or EXAMPLES server
  5. Install extensions
  6. Customize a workflow in the LOCAL workspace
  7. Become part of the KNIME community!

1. Explore the welcome page


Fig. 1. First-time KNIMEr welcome page. This special welcome page opens when you start KNIME Analytics Platform.

Start the KNIME application from the folder where KNIME Analytics Platform was installed or from the shortcut on the desktop or in the start menu. Select the workspace folder (the default folder “knime-workspace” is automatically created in the installation folder) that stores all your work in KNIME Analytics Platform and click Launch! 

When you start the software for the first time, you are asked whether or not to allow KNIME to collect anonymous usage data. These data are used exclusively to improve the usability of the platform (particularly the Workflow Coach - see step 3 for more information). 

After that, the KNIME workbench opens to show the Welcome to KNIME page (in the middle). The very first time you open KNIME this welcome page shows tiles that lead you to resources specifically designed to help someone who is using KNIME for the first time. Via this page you can:

  • Get started: explore a pre-built demonstration workflow to familiarize yourself with the concept of workflows and nodes for a practical start
  • Find many more examples: investigate the KNIME Hub to find the example workflow that best fits your needs. The KNIME Hub is an external space provided to the KNIME community to host and share workflows and other resources. It is a great place to search for starting examples.
  • Try some Guided Onboarding: sign up for some basic tutoring from Emil, our teacher bot.

The next time you open KNIME Analytics Platform, we assume you have already explored the basics, and the welcome page now points you to interesting blog articles, news, new updates to extensions, and useful tips & tricks instead! 

If you are reading this post, you are probably a first-time KNIMEr. So, we will ignore the welcome page for repeat KNIMErs and follow the steps proposed in the welcome page, as you can see above in Fig. 1.

2. Learn from a pre-built workflow


Fig. 2. First-time KNIMEr welcome page: Get started.

Get started. Let’s start from the first advised step: let’s click the Open workflow button in the Get started tile and learn from a pre-built workflow. 

After clicking Open workflow, the welcome page closes and an example workflow opens in the centre of the workbench, in the so-called Workflow Editor. As you can see, KNIME Analytics Platform is not script based but GUI-based; that is, a graphical user interface helps you build your application (a workflow), i.e. a pipeline or sequence of operations, which we call nodes.

Each node carries out a task to implement a particular operation: read data, visualize data, train a machine learning algorithm, change the data structure, normalize the data, and so on. Nodes are those colorful blocks you see in the workflow editor panel.

Can you see the traffic lights below each node? Each light represents the status of the node, starting with red on the left, through to yellow in the centre, and green on the right.


Red: the node is not yet configured. It has to be configured or properly connected in order to be executed.


Yellow: the node has been configured correctly and can be executed at any time.


Green: the node has been executed (run) successfully and the data are now available to any downstream nodes.


Yellow triangle: executing the node produced warnings.


Red cross: an error occurred and execution of the node is interrupted.

In the example workflow we have prepared for you, all the nodes you can see are already successfully executed, and therefore show the green status. If you were building the workflow from scratch, the status of each new node would be red, since it is neither configured to run a task nor successfully or unsuccessfully executed. Let’s see now how to change a node status.

Right-click a node in the workflow, to open its context menu with a number of useful commands (Fig. 3). 

Let’s right-click the first node in the pipeline: the File Reader node. In the context menu you can see the Reset option. This returns the node to its state before it had been executed. Select Reset. The node status should return to yellow: configured but not executed. Notice that resetting the first node in the workflow, resets all subsequent nodes as well.

The next command to select in the context menu is Configure. Configure opens the node configuration dialog where you can make the settings required for the node task. For example, open the context menu of the Color Manager node and select “Configure”. The node configuration dialog opens. Try changing some settings, for example the color map. 

Selecting Execute runs the node’s task. If it executes successfully, the node status changes from yellow to green. Notice that only configured nodes can be executed.

The last items in the context menu differ depending on the node, but all of them lead to the output data table(s) produced when you execute the node. In the File Reader node, for example, the last menu item is File Table and shows the data table read from the input file. In the Color Manager node the menu item is Table with Colors and shows the input table with the color assigned to each data row. Try to change the color map in the configuration window of the Color Manager node and see how the output data table changes.

Notice that this is a very useful debugging option. You can execute each individual node and check here whether the node has produced the data according to the workflow design.

The menu items Delete, Cut, Copy, and Paste do what their name suggests: delete, cut, copy, and paste the selected node. 


Fig. 3. The workbench of KNIME Analytics Platform. The Getting started workflow is open in the workflow editor. Notice the options in the context menu (right-click) of a node.

3. Get familiar with the KNIME Workbench

As we mentioned in the last step, the Workflow Editor is a part of the KNIME workbench. Let’s explore this some more. All you need to analyze your data is probably here. It is important to discover where everything is.

In the top left corner is the KNIME Explorer. It displays the list of workflows available in the selected local workspace (LOCAL), in the KNIME Hub spaces (private and public), and the list of available KNIME Servers you can connect to.

  • LOCAL refers to the content in the selected workspace. If you have just started, it is probably empty apart from the folder Example Workflows. This folder contains a few basic examples for generic data science tasks and common case studies. It is a great resource to learn more about what KNIME Analytics Platform can do.
  • My-KNIME-Hub is the space automatically assigned to you when you set up an account at knime.com. After you have logged in to your account, you can upload your workflows to this public space and make them available to the community via the KNIME Hub, or simply store them in your private space for easy remote access.
  • The only server available that you will see in the KNIME Explorer view the first time you start KNIME is the EXAMPLES Server, a public server with many example workflows produced at KNIME as examples for specific functionality or customizable solutions to case studies (in 50_Applications). Note that all of the example workflows you see on this server are also available via the KNIME Hub.

Using the KNIME Explorer

Double-click a workflow in the KNIME Explorer view to open it in the workflow editor. If the workflow is hosted on a server, you will need to save it locally (i.e. save it to your local workspace via File -> Save As...) if you want to save any changes you make in it.

Right-click a workflow to open its context menu. Here you have the options to import/export/deploy, and reset/execute your workflow (plus more). This is also where you find the options to create a new workflow and a new workflow group (folder).

The Workflow Coach

Underneath the KNIME Explorer, you will find the Workflow Coach. This is a recommendation engine. When you select a node in your workflow, the Workflow Coach will suggest the next most likely node to add, based on the world-wide statistics of KNIME users. You can add nodes from the workflow coach to the workflow editor in the same way as you would from the node repository - by drag and drop, or by a double-click.

Node Repository

Below the Workflow Coach is the Node Repository view. It contains all the nodes available for this installation of KNIME Analytics Platform. Nodes are organized by categories, from IO to Analytics, from Scripting to Views, and Workflow Control. The category KNIME Labs deserves a few additional words. This category contains all the most recently developed nodes. They are fully functional, but still in their infancy, in their 1.0 version: they might change. This is where you can preview new features and plug-ins before they are added to the full version of KNIME.

The Node Repository contains a very high number of nodes. The search box at the top helps you find them, either via exact match (default) or via fuzzy match (after clicking the lens on its left).

Description

On the right side of the workbench is the Description view. It gives you information about the currently active workflow, or about an individual node selected either in the Node Repository or in the workflow editor. So, if you encounter a mysterious node, do not despair! The Description view explains what the node does, the settings required in the configuration dialog, the data specs for the node’s input and output, and the scientific reference for the algorithm implemented (if any).

KNIME Hub Search

Under the Description, you find one more reference to the KNIME Hub: the KNIME Hub Search box. This allows you to search for workflows on the KNIME Hub from within the workbench.

Console

Finally, the Console view hosts all warnings and errors related to your workflow execution and configuration, while the Outline view shows a full picture of your workflow.

If you have been following my instructions so far, there is a good chance that you have been clicking around randomly and involuntarily closing a view or two. No worries! Go to the View menu which you’ll find in the horizontal main menu bar at the very top of the workbench.


Fig. 4. The View menu selected in the horizontal main menu bar in KNIME Analytics Platform.

Here you can find the missing view and reinstate it into the workbench. The item Reset Perspective… brings the views in the KNIME workbench to their default layout.

Now that you are there, explore all the other commands of the Main menu. In particular, under File, notice Import Workflow and Export Workflow to import workflows created by other users and export your workflows for further usage.

Tool Bar

Right below the Main Menu is the Tool Bar.


Fig. 5. The tool bar at the top of KNIME Analytics Platform.

Here you’ll find the tool buttons for creating a new workflow, saving an existing one, executing selected nodes, executing all nodes, resetting selected nodes, and resetting all nodes. Also worthy of notice is the grid button (penultimate button): it’s responsible for the grid and its properties in the workflow editor.

4. Find more resources on the KNIME Hub or EXAMPLES server


Fig. 6. First-time KNIMEr welcome page: Find many more examples.

By now you have heard "KNIME Hub" quite a few times. The KNIME Hub is a public repository where you can find and download workflows, nodes, components, and extensions produced and shared by the KNIME community. It is a great resource to jump start your practice in KNIME Analytics Platform.

Notice that the KNIME Hub requires a username and password to log in. Use the username and password you set when you signed up for the KNIME Forum.

Another great source of example workflows is the EXAMPLES server, available in the KNIME Explorer panel inside the workbench. 

How to search for examples on the KNIME Hub

From a web browser

The easiest way to access the KNIME Hub is to go to the URL https://hub.knime.com/ from any web browser. There you can type the terms for your search. Entering “my first workflow”, for example, will take you to all the available first-time example entities. Searchable entities are (for now): nodes, components, workflows, and extensions. To narrow your search, select the entity tab you are interested in from the top bar, for example “Workflows” (Fig. 7). The search then returns a list of workflows related to your search term. Select the workflow you are interested in. This takes you to the workflow’s page. Click Open workflow or Download workflow to respectively open the workflow in KNIME Analytics Platform or download the .knwf file to your machine.

Try entering “time series” or “logistic regression” in the search box and then explore all related nodes, components, workflows, or extensions. 


Fig. 7. Searching for workflows according to key-terms “my first workflow” on the KNIME Hub.

From within KNIME Analytics Platform

Within KNIME Analytics Platform, you can type your search terms into the KNIME Hub Search box under the Description view. The search query will then open a web browser to show the results in the KNIME Hub page.

From the first-time KNIMer Welcome Page

You can access the KNIME Hub from the first-time KNIMEr welcome page. The Find many more examples tile takes you straight there.

How to share resources on the KNIME Hub

This is a tip for when you will want to share your experience and knowledge with other KNIMErs in the community. You can share your work from the folder My-KNIME-Hub -> Public in the KNIME Explorer view. Just place your workflows or components in that folder and they will be automatically available to others for searching, viewing, and downloading on the KNIME Hub.

How to find examples on the EXAMPLES server

In the KNIME Explorer view, double-click EXAMPLES. The EXAMPLES server now opens in read-only mode, offering hundreds of example workflows. Most of them describe a function in KNIME Analytics Platform. However, in folder 50_Applications you can find solutions to real-world use cases. A search box is available at the top of the KNIME Explorer panel, allowing you to search for workflows on specific topics. Type for example “customer” to get all example workflows related to customer analysis tasks. 

  • Want to find a churn prediction workflow? Navigate to “50_Applications/_18_ChurnPrediction”.
  • Want to build a graph to visualize a social network? Navigate to “08_Other_Analytics_Types/05_NetworkMining/07_Pubmed_Author_Network_Analysis”.
  • Interested in Market Basket Analysis? Navigate to “50_Applications/_16_MarketBasketAnalysis”.
  • Does the problem you are trying to solve concern sentiment analysis? Navigate to “08_Other_Analytics_Types/01_Text_Processing/03_SentimentClassification”.

Now: drag & drop (or copy & paste) the example workflow to your LOCAL workspace in the KNIME Explorer panel. Double-click the newly created copy in the LOCAL workspace to open it and change it accordingly.

As in the KNIME Hub, you can search for the workflow that is closest to your current task, download it by drag & drop to your LOCAL workspace, and from there adapt it to your data and your business problem.


Fig. 8. Workflow 08_Other_Analytics_Types/01_Text_Processing/03_SentimentClassification also available on the KNIME Hub.

5. Install Extensions

The basic KNIME Analytics Platform does not include all the nodes that you might see in more complex applications. Those nodes are part of the Extensions packages and are usually installed separately. Unless at installation time the package containing all free extensions was selected, you will need to install the KNIME Extensions now. In KNIME Analytics Platform, go to the horizontal menu bar and select File -> Install KNIME Extensions, and then follow the instructions. 

You will be presented with a list of extensions you can install. The most essential packages are KNIME & Extensions, KNIME Labs Extensions, and KNIME Community Contributions - Other. The search box at the top of this window enables you to search for more specific extensions.

Prompts to download required extensions

After you have downloaded a workflow from either the KNIME Hub or the EXAMPLES server, you are ready to customize it to fit your data and your business case. Save the workflow to your LOCAL workspace in the KNIME Explorer, for example the sentiment analysis workflow in 08_Other_Analytics_Types/01_Text_Processing/03_SentimentClassification on the EXAMPLES server.

This workflow requires the Text Processing extension, which is not part of the core installation. So now you need to install at least the KNIME Labs Extensions/KNIME TextProcessing extension. Note that when you open a workflow, KNIME Analytics Platform will alert you about any missing extension and ask if you want to install it on the spot.


Fig. 9. Window to select extension packages to install in the core KNIME Analytics Platform.

6. Customize a Workflow in the LOCAL Workspace

Now you have your workflow open in the workflow editor, for example a copy of the workflow in 08_Other_Analytics_Types/01_Text_Processing/03_SentimentClassification or on the KNIME Hub (Fig. 8). Let’s customize it!

Change the file path in the configuration window of the File Reader node to point to your own data and adjust other parameters – such as headers, comment lines, presence of short lines, locale, etc … – if needed.

The gray nodes after the File Reader node are metanodes. A metanode is a container of other nodes that can be created to hide the complexity of the analysis. The metanodes named Document Creation and Preprocessing contain all the required text cleaning procedures. Double-click them to see their content. If your data need more or less cleaning, just remove or add the corresponding Text Processing nodes.

To create a new node, drag & drop or double-click the node in the Node Repository view. To connect the newly created node to existing nodes, click the output port of the preceding node and release the mouse at the input port of the following node.

After the File Reader node, you can replace the decision tree model with another machine learning method of your choice. You can look for the relevant nodes either in the Node Repository view or on the KNIME Hub. You can import nodes from the KNIME Hub directly into your workflow by drag & drop.

An entity that is similar to a metanode is the component. The difference is that components have their own configuration dialog and a view, formed by the configuration and view items of special nodes, contained inside the component. To learn more about components, check the videos What is a Component? and Sharing and Linking Components.

7. Become part of the KNIME Community!


Fig. 10. First-time KNIMEr welcome page: Guided Onboarding.

During this blog post, we have played with the KNIME workbench, discovered where all examples are, downloaded and altered an existing workflow, and got to know what is where.

The seventh and last thing to do is to become part of the KNIME community! The KNIME community is very rich in information and very active in providing support. If you’d like to become part of the community and benefit from all the really useful resources it provides, we recommend the following:

  • Sign up for Emil’s Guided Onboarding emails. As suggested in the Guided Onboarding tile on the first-time KNIMEr welcome page, register to receive a few introductory emails about how KNIME works (Fig. 10). The emails we send are kept short and sweet: just a few useful emails to get started.
  • Become a KNIME Forum member. The KNIME Forum is the place to go to ask questions and look for answers provided by the community. Set up your own forum account, with a username and password.
  • Do an introductory e-Learning course. Start the e-Learning course Introductory Course to Data Science available on the KNIME website.
  • Register for an in-person KNIME course. KNIME also offers online and onsite courses with teachers. You can explore the full list of courses and attend the one that fits your needs.

Notes on our course levels

Notice that all KNIME courses are organized by level of complexity.

  • L1 (level 1) courses cover basic concepts of KNIME Analytics Platform;
  • L2 (level 2) courses cover more advanced functionalities of KNIME Analytics Platform;
  • L3 (level 3) courses cover deployment options;
  • L4 (level 4) courses dig deeper into more specialized fields

More course information:

Wrapping up

We have reached the end of the seven things we recommend doing after installing KNIME Analytics Platform. If you have followed all of them, you now have the basic skills to move around KNIME Analytics Platform and build your first workflows. We hope to meet you soon in the KNIME community with your questions and answers, initially learning from other KNIMErs and soon offering your own advice and tips & tricks.

Guided Labeling Blog Series - Episode 1: An Introduction to Active Learning

Published: March 30, 2020

Author: Paolo Tamagnini (KNIME)


One of the key challenges of utilizing supervised machine learning for real-world use cases is that most algorithms and models require lots of data, and that data comes with quite a few specific requirements.

First of all, you need a sample of data that is large enough to represent the actual reality your model needs to learn. Nowadays there is a lot of discussion about the harm generated by biased models, and such models are often trained with biased data. A rough rule of thumb is that the more data you have, the less biased your data might be. The size of the sample not only impacts the fairness of your model, but of course its performance, too. This is especially significant if you are dealing with deep learning, which requires more data than other machine learning algorithms.

Assuming you have access to all of these data, you now need to make sure they are labeled. These labels, also called the ground truth class, will be used as the target variable in the training of your predictive model.

There are a couple of strategies for labeling your data. If you get lucky, you can join your organization's data with a publicly available online dataset. For example, by using platforms like kaggle.com, connecting to a domain-specific database via a free API service, or accessing one of the many government open data portals, you can add the missing labels you need to your dataset. Good luck with that!

While there are tons of publicly available datasets, there are also many potential users out there ready to go through your data and label it for you. This is where it starts to get expensive. Crowdsourcing is only useful for simple labeling tasks that generic users can complete. One of the most common platforms for crowdsourcing labeling tasks is MTurk, where for each row of your dataset you can pay a fee for some random person to label it. Labeling images of cars, buses, and traffic lights is a true classic in the crowdsourcing domain. CAPTCHA and reCAPTCHA are all about cheating people into labeling huge datasets (and of course also about proving the user is a real person surfing the web).

Even if you have access to such technologies, your data might not be shareable outside your organization, or specific domain expertise might be required and you need to make sure the person labeling your rows can be trusted. That means an expensive, irreplaceable, soon-to-be-bored domain expert labeling thousands and thousands of data points/rows on their own. Each row of the dataset could contain any kind of data and could be displayed to the expert in very different formats, for example as a chart, a document, or an image. A long, painful, and expensive process awaits a business that needs its data labeled. So how can we efficiently improve the labeling process to save money and time? Well, with a technique called active learning!

Active Learning Sampling

In the active learning process, the human is brought back into the loop and helps guide the algorithm. The idea is simple: not all examples are equally valuable for learning, so the process first picks the examples it deems most valuable, and the human labels them, enabling the algorithm to learn from them. This cycle, or loop, continues until the learned model converges or the user decides to quit the application.

To initialize this iterative process we need a few starting labels, but as we have no labels at all, there is little we can use to select which rows should be labeled first. The system therefore picks a few random rows, shows them to our expert, and gets the manually applied labels in return. Now, based on just a small number of labels, we can train a first model. This initial model is probably quite biased, as it is trained on so few samples. But this is only the first step. Now we are ready to improve the model iteration by iteration.

With our trained model we can score all the rows for which we still have missing labels and start the first iteration of the active learning cycle. The next step in this cycle is called active learning sampling. Active learning sampling is about selecting what the human-in-the-loop should be labeling next to best improve the model. It is carried out during each iteration of the human-in-the-loop cycle. 

To select a subset of rows, we rank them using a metric and then show the top ranked rows to the expert. The expert can browse the rows in decreasing rank and either label them or skip them, one after the other. Once the end of the provided sample is reached, the expert can tell the application to retrain the model adding the new labels to the training set and then repeat the human-in-the-loop cycle again.
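To make the loop concrete, here is a hedged Python sketch of the whole cycle. The ranking function below uses simple prediction uncertainty as a stand-in; the actual ranking strategies (label density and model uncertainty) are the topic of the next episodes, and ask_oracle is a hypothetical callback standing in for the human expert:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

def rank_unlabeled(model, X_unlabeled):
    """Placeholder ranking: higher score = more valuable to label next."""
    probs = model.predict_proba(X_unlabeled)
    return 1.0 - probs.max(axis=1)          # uncertainty of the current model

def active_learning_loop(X, ask_oracle, n_start=10, batch_size=20, n_iterations=5):
    """X: feature matrix; ask_oracle: function returning the label for a row index."""
    rng = np.random.default_rng(42)
    labeled_idx = list(rng.choice(len(X), size=n_start, replace=False))  # random seed labels
    labels = {i: ask_oracle(i) for i in labeled_idx}

    model = RandomForestClassifier()
    for _ in range(n_iterations):
        # Retrain on everything labeled so far
        model.fit(X[labeled_idx], [labels[i] for i in labeled_idx])

        # Rank the still-unlabeled rows and send the top ones to the expert
        unlabeled_idx = [i for i in range(len(X)) if i not in labels]
        scores = rank_unlabeled(model, X[unlabeled_idx])
        top = [unlabeled_idx[j] for j in np.argsort(scores)[::-1][:batch_size]]
        for i in top:
            labels[i] = ask_oracle(i)   # human-in-the-loop labeling step
            labeled_idx.append(i)
    return model, labels
```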


Fig. 1: A diagram depicting the human-in-the-loop cycle of active learning. The domain expert, referred to here as the "oracle" as is typical in the active learning literature, labels data points at each iteration. The model retrains, and active learning sampling re-ranks any rows whose labels are still missing. In the next iteration of the cycle the user labels the top-ranked rows and the model trains again. Using a good active learning sampling technique, you can achieve good model performance with fewer labels than you would usually need.

All clear, right? Of course not! We still need to see how to perform active learning sampling. In the next active learning articles in this series, we will show how the rows are ranked and selected for labeling in each iteration. This is the real core of the active learning strategy. If such sampling is not effective, then it is simply better to randomly label as many rows as you can and then train your model, without going through this sophisticated human-in-the-loop process. So how do we intelligently select what to label after each model retraining?

The two strategies we'll look at in the upcoming active learning blog articles to perform our active sampling are:

  • Episode 2: Label Density. This is based on comparing the distribution of the columns of the entire dataset with that of the already labeled rows.
  • Episode 3: Model Uncertainty. This is based on the prediction probabilities of the model on the still unlabeled rows.

In future articles, we show how all of this can be implemented in a single KNIME workflow, i.e. both strategies are used in the same workflow. The result is a web-based application where a sequence of iterating interactive views guides the expert through the active learning process of training a model.

Even more articles are sure to follow, so stay tuned for the next articles in our series of Guided Labeling Blog Posts.

Interactive exploration and analysis of scientific datasets using Google BigQuery & KNIME Analytics Platform

Author: Martyna (KNIME)

Published: April 6, 2020

Accessing scientific datasets in Google BigQuery

The availability of scientific datasets in Google BigQuery opens new possibilities for the exploration and analysis of public life sciences data. In particular, the Google Cloud Platform (GCP) provides a place where SQL queries can be created easily and intuitively in order to explore huge datasets extremely fast. Here we present a practical example of how you can work effectively with datasets stored in BigQuery, using the open-source KNIME Analytics Platform.

In this blog post we will cover a use case relevant for life sciences research. We will focus on answering some questions from the area of pharmaceutical research by linking and querying different datasets stored in BigQuery.

But don’t worry - even if you're not a life science expert, you still might find it useful to see how easy it can be to connect to BigQuery, construct complex queries without needing to write SQL, and explore the results of the queries using KNIME Analytics Platform.

SciWalker Open Data

This example was inspired by the SciWalker Open Data sets that were added to Google BigQuery and announced at the American Chemical Society meeting in San Diego this year. You can find the abstract in the Chemical Information Bulletin, page 86/87 here.

SciWalker is a comprehensive resource that contains chemistry-related data such as molecules, nucleotide and peptide sequences (211 million unique molecules overall) that are linked to additional scientific information. The datasets also include clinical and drug-related data with links to different ontologies, which allow us to compare data coming from different sources that use different wording.

Once your BigQuery account is configured, you can create your first query using the KNIME database nodes, as demonstrated in the short example below. These nodes let you create SQL queries in a visual way, without needing to write SQL yourself (although you can add SQL if you want or need to).

  • To learn more about nodes provided for databases check out our KNIME Hub where you'll also find more example workflows shared either by KNIME or the KNIME community.
  • Additionally you will find documentation, the KNIME Database Extension Guide, here.

Selecting and downloading data

In the short workflow below we select data from two tables: one contains general information about clinical trials, and the other contains references to literature that has been linked to those clinical trials. They can be joined using the DB Joiner node on the nct_id column and filtered for certain columns such as IDs, title, study phase, and the PubMed ID from the reference table using the DB Column Filter node. Additionally, we group the data according to nct_id and count how many PubMed references have been registered per study.

In the last step the DB Reader node is used in order to execute the query and download the data into a KNIME table.

Interactive exploration and analysis of scientific datasets using Google BigQuery and KNIME

Fig. 1 The workflow to select data from two tables: one contains general information about clinical trials and the other references to literature that has been linked to those clinical trials
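The DB nodes assemble this query visually; for readers who like to see roughly what ends up being executed, here is a hedged sketch in R using DBI and bigrquery. The project, dataset, and table/column names (other than nct_id) are placeholders for illustration, not the actual SciWalker identifiers.

```r
library(DBI)
library(bigrquery)

# Connect to BigQuery; "my-gcp-project" is a placeholder billing project.
con <- dbConnect(bigrquery::bigquery(),
                 project = "my-gcp-project",
                 billing = "my-gcp-project")

# Roughly what the DB Joiner / DB Column Filter / DB GroupBy nodes assemble:
# join trials and references on nct_id, keep a few columns, and count the
# PubMed references registered per study.
sql <- "
  SELECT t.nct_id, t.title, t.phase,
         COUNT(r.pubmed_id) AS n_pubmed_refs
  FROM `sciwalker_open_data.clinical_trials` AS t
  JOIN `sciwalker_open_data.trial_references` AS r
    ON t.nct_id = r.nct_id
  GROUP BY t.nct_id, t.title, t.phase"

studies <- dbGetQuery(con, sql)   # executes the query and downloads the result
head(studies)
```

In the KNIME workflow itself none of this SQL has to be written by hand - the DB Joiner, DB Column Filter, DB GroupBy, and DB Reader nodes generate and execute it for you.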

Time to play

Now that you've connected to a BigQuery resource and queried it with KNIME's database nodes, we will demonstrate how to interactively explore the data in a few simple steps. In each step you can use an interactive view to select the data you're interested in, which are then used to create further queries and pull the matching data from BigQuery - and all this without writing code!

Interactive exploration and analysis of scientific datasets using Google BigQuery and KNIME

Fig. 2. The workflow Explore Scientific Data Stored on BigQuery using KNIME

Step 1

In the very first step of our exploration journey we retrieve a list of diseases that are listed in the clinical data (clinicaltrials.gov) datasets and standardized according to the disease ontology that is part of the SciWalker data collection. We then use this list to create an autocomplete menu which we can use to select the disease we want to investigate further. For example here we will investigate Schizophrenia.

Step 2

Selecting a disease brings us, after some data querying, joining, wrangling and preprocessing, to the next step where we can explore compounds that have been registered for clinical studies on schizophrenia. We calculate some chemical properties and merge the data with additional information about the clinical trial. In a second table PubMed references from each study are visible.

To make the view even more interactive, we added web links to the study and reference IDs that will bring you directly to the web pages describing those studies/references.

Let’s select “methotrexate” here, which is known as a chemotherapy agent and immune system suppressant, and see what happens in the next step.

Interactive exploration and analysis of scientific datasets using Google BigQuery and KNIME

Fig. 3. Interactive view, with additional web links to the study and reference IDs that bring you directly to the web pages describing those studies/references

Step 3.

Here we once again take advantage of the ontologies available in SciWalker. 

Interactive exploration and analysis of scientific datasets using Google BigQuery and KNIME

The view below shows which chemical classes “methotrexate” belongs to, along with how many other compounds from each of those chemical classes have been registered for clinical studies. One class should be selected here in order to go to the next step. We selected “pteridines”, which does not seem to be that popular (only 21 compounds registered for clinical studies). In the next step, let's check which 21 compounds those are and for which diseases the studies have been conducted.

Interactive exploration and analysis of scientific datasets using Google BigQuery and KNIME

Fig. 4. View showing which chemical classes "methotrexate" belongs to, plus how many other compounds from each of those chemical classes have been registered for clinical studies.

Step 4.

This view shows a tag cloud with disease and condition names for which studies have been registered for compounds in the selected compound class (here: pteridines). When you select a disease from the Tag Cloud, the list of compounds in the selected class that are associated with that disease are displayed in the table below. 

When we select “Rheumatoid arthritis” we see that three compounds within the class of pteridines are linked to it, and that methotrexate has been tested for both schizophrenia and rheumatoid arthritis.

Interactive exploration and analysis of scientific datasets using Google BigQuery and KNIME

Fig. 5. View showing a tag cloud with disease and condition names for which studies have been registered for compounds in the selected compound class

Step 5.

The last view shows all compounds found in the clinical trials dataset that have been tested for both schizophrenia and rheumatoid arthritis. If you are curious which compounds those are - check out the workflow, Explore Scientific Data Stored on BigQuery using KNIME, on the KNIME Hub here.

Prerequisites to run the example:

  • BigQuery account
  • Simba Driver
  • KNIME Analytics Platform (4.1)
  • KNIME Big Data Extension
  • KNIME Community Extensions - Cheminformatics (including RDKit)

Wrapping up

In this blog post we highlighted how to interactively explore and analyze scientific data using Google BigQuery and KNIME Analytics Platform together. We showed that combining these two tools allows us to take advantage of the breadth of data available in BigQuery using the interactive query construction, data analysis, and visualization capabilities of KNIME Analytics Platform. Maybe this sparks further ideas or questions, or even helps you create new hypotheses.

Though we’ve focused on life-sciences data here, the combination of KNIME and Google BigQuery can be applied in many different fields, so feel free to give it a try no matter what your use case or industry!

If this makes you curious, just set up KNIME and start playing with the workflow demonstrated today or look for other examples here on the KNIME Hub. 

If you want to explore and do more experiments using freely available scientific datasets on Google BigQuery - check out the Marketplace. There is a lot more data to explore! 

This blog article was written by Martyna Pawletta & Greg Landrum (KNIME).

Guided Labeling Blog Series - Episode 2: Label Density

By paolotamag, Tue, 04/14/2020 - 10:00
Guided Labeling. Episode 2 - Label Density

The Guided Labeling series of blog posts began by looking at when labeling is needed - i.e. in the field of machine learning, where most algorithms and models require huge amounts of data that come with quite a few specific requirements. These large masses of data need to be labeled to be usable. Data that are structured and labeled properly can then be used to train and deploy models.

In the first episode of our Guided Labeling series, An Introduction to Active Learning, we looked at the human-in-the-loop cycle of active learning. In this cycle, the system starts by picking examples it deems most valuable for learning and the human labels them. Based on these initial labeled data, a first model is trained. With this trained model, we score all the rows for which we still have missing labels and then start active learning sampling. This is about selecting or re-ranking what the human in the loop should be labeling next to best improve the model. 

There are different active learning sampling strategies, and in today’s blog post, we want to look at the label density technique.

Label density

When labeling data points the user might wonder about any of these questions:

  • “Is this row of my dataset representative of the distribution?”
  • “How many other still unlabeled data points are similar to this one that I already labeled?”
  • “Is this row unique in the dataset - is it an outlier?” 

Those are all fair questions. For example, if you only label outliers, then your labeled training set won’t be as representative as if you had labeled the most common cases. On the other hand, if you only label common cases of your dataset, then your model will perform badly whenever it sees something even slightly different from what you have labeled.

The idea behind the Label Density strategy is that, when labeling a dataset, you want to label where the feature space has dense clusters of data points. What is the feature space?

Feature space

The feature space represents all the possible combinations of column values (features) you have in the dataset. For example, if you had a dataset with only people’s weight and height, you would have a 2-dimensional Cartesian plane. Most of your data points here will probably be around 170 cm and 70 kg, so around these values there will be a high density in the two-dimensional distribution. To visualize this example we can use a 2D density plot.

 

Guided Labeling. Episode 2 - Label Density

Figure 2: A 2D density plot clearly visualizes the areas with denser clusters of data points - here in dark blue. This type of visualization only works when the feature space is defined by just two columns. In this case the two columns are people’s weight and height, and each data point (each marker on the plot) is a different person.
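If you would like to reproduce a plot like Figure 2 yourself, here is a minimal sketch with synthetic weight/height data (not the data behind the original figure); the two simulated groups mirror the dense areas described in the next paragraph.

```r
library(MASS)   # for kde2d

# Synthetic people: two dense groups in the weight/height feature space.
set.seed(7)
height <- c(rnorm(300, 163, 4), rnorm(300, 172, 5))
weight <- c(rnorm(300, 62, 4),  rnorm(300, 80, 6))

dens <- kde2d(height, weight, n = 100)                 # 2D kernel density estimate
image(dens, col = colorRampPalette(c("white", "darkblue"))(50),
      xlab = "height [cm]", ylab = "weight [kg]")
points(height, weight, pch = 16, cex = 0.3)            # overlay the data points
```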

In Figure 2, density is not simply concentric around the center of the plot. There is more than one dense area in this feature space: for example, there is one dense area featuring a high number of people around 62 kg and 163 cm, and another area with people who are around 80 kg and 172 cm. How do we make sure we label in both dense areas, and how would this work if we had dozens of columns and not just two?

The idea is to explore and move through the dataset's n-dimensional feature space from dense area to dense area until we have prioritized all the most common feature combinations in the data. To measure the density of the feature space, we compute a distance measure between a given data point and all the others surrounding it within a certain radius.

Euclidean distance measure

In this example we use the Euclidean distance measure on top of the weighted mean subtractive clustering approach (Formula 1), but other distance measures can be used too. By means of this distance measure to the data points in its proximity, we can rank each data point by density. If we take the example in Fig. 2 again, we can now locate which data points lie in a dark blue area of the plot simply by using Formula 1. This is powerful because it also works no matter how many columns you have.

Guided Labeling. Episode 2 - Label Density

Formula 1: To measure the density score at iteration k of the active learning loop for each data point xi, we compute this sum based on the weighted mean subtractive clustering approach. In this case we use the Euclidean distance between xi and all the other data points xj within a radius of ra.
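Since the formula itself is only available as an image, here it is written out in LaTeX. This is the standard subtractive-clustering density and should be read as a reconstruction (an assumption on our side), not a verbatim copy of the figure:

```latex
% Density score of data point x_i at iteration k (reconstructed, standard
% subtractive-clustering form with neighborhood radius r_a)
D^{(k)}_i = \sum_{j \neq i} \exp\!\left( - \frac{\lVert x_i - x_j \rVert^2}{\left( r_a / 2 \right)^2} \right)
```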

This ranking, however, has to be updated each time we add more labels. We want to avoid always labeling in the same dense areas and instead continue exploring for new ones. Once a data point is labeled, we don’t want the other data points in its dense neighborhood to be labeled as well in future iterations. To enforce this, we reduce the rank of data points within the radius of the labeled one (Formula 2).

Guided Labeling. Episode 2 - Label Density

Formula 2: To measure the density score at the next iteration k+1 of the active learning loop, we update it based on the new labels Lk from the past iteration k, for each data point xj within a radius of rb of each labeled data point xy.
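Again as a hedged reconstruction of the update described in the caption (the standard subtractive-clustering revision step, not a verbatim copy of the figure):

```latex
% Density update after collecting the labels L_k of iteration k: every point
% x_j loses density according to its proximity (radius r_b) to each newly
% labeled point x_y
D^{(k+1)}_j = D^{(k)}_j - \sum_{x_y \in L_k} D^{(k)}_y \,
              \exp\!\left( - \frac{\lVert x_j - x_y \rVert^2}{\left( r_b / 2 \right)^2} \right)
```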

Once the density rank is updated, we can retrain the model and move to the next iteration of the active learning loop. In the next iteration we explore new dense areas of the feature space thanks to the updated rank, and we show new samples to the human in the loop in exchange for labels (Fig. 3).

Guided Labeling. Episode 2 - Label Density

Figure 3: Active Learning Iteration k: the user labels where the density score is highest, then the density score is locally reduced where new labels were assigned. Active Learning Iteration k + 1: the user labels now in another dense area of the feature space, since the density score was reduced in previously explored areas. Conceptually the yellow cross stands for where new labels are assigned and the red one where the density has been reduced.
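Putting the two formulas together, one density-ranking iteration can be sketched in a few lines of R. This is only an illustration of the reconstructed formulas above (the iris data is a stand-in feature space and the radii are arbitrary), not the implementation used in the KNIME workflow.

```r
# Sketch of density-based ranking and its update; X is a numeric matrix with
# one normalized row per data point.
density_scores <- function(X, r_a) {
  D <- as.matrix(dist(X))                    # pairwise Euclidean distances
  rowSums(exp(-(D^2) / (r_a / 2)^2)) - 1     # drop each point's self-term
}

reduce_density <- function(scores, X, labeled_idx, r_b) {
  for (y in labeled_idx) {
    d2 <- rowSums(sweep(X, 2, X[y, ])^2)     # squared distances to the labeled point
    scores <- scores - scores[y] * exp(-d2 / (r_b / 2)^2)
  }
  pmax(scores, 0)                            # keep the ranking non-negative
}

X <- scale(as.matrix(iris[, 1:4]))           # stand-in feature space
scores <- density_scores(X, r_a = 1)
to_label <- order(scores, decreasing = TRUE)[1:5]          # rows shown to the expert
scores <- reduce_density(scores, X, to_label, r_b = 1.5)   # damp explored areas
```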

Wrapping up

In this episode we've looked at:

  • label density as an active sampling strategy
  • labeling in all dense areas of feature space
  • measuring the density of the feature space with the Euclidean distance measure and the weighted mean subtractive clustering approach

In the next blog article in this series, we’ll be looking at model uncertainty. This is an active sampling technique based on the prediction probabilities of the model on still unlabeled rows. Coming soon!

Virtual Screening with KNIME

By Pharmacelera, Mon, 04/20/2020 - 10:00

by Enric Herrero (Pharmacelera)

What is virtual screening in pharmaceutical R&D?

Drug discovery projects are long R&D processes that take more than 10 years to reach the patient and have a high risk of failure. In small molecule research, the goal of these projects is to identify chemical structures that interact with key receptors, have drug-like properties, and are not already known. In this context, finding good starting points is critical in any drug discovery project. These starting points can come from extensive experimental testing, which is a costly and long process, or they can rely on the use of computers to speed up the process. In a virtual screening, computers are used to analyze libraries of millions of compounds to identify which ones are most promising.

PharmScreen for KNIME

PharmScreen for KNIME is a set of nodes from Pharmacelera designed to help chemists in their drug discovery projects find leads with a higher chance of becoming a drug. PharmScreen nodes find candidate molecules with greater chemical diversity by searching proprietary, public, or commercial compound libraries.

Virtual Screening Pharmacelera

The Ligand Preparation node enables compound libraries to be prepared for a virtual screening campaign. This preparation includes conformer generation, minimization, and partial charge and LogP calculation with semi-empirical quantum mechanical methods.


The Virtual Screening node enables you to search in a compound library for promising candidates for your drug discovery project. Field-based alignment and comparison of compounds is performed to find more chemical diversity and minimize the project risks related to IP or undesired molecular properties.

Both nodes are parallelized to take advantage of all the computing power of your PC, workstation, or cluster without having to go through any configuration hurdles.

Main features

Pharmacelera’s nodes enable you to perform a variety of tasks such as:

  • Increase the chemical diversity of your candidate molecules
  • Enrich your compound library
  • Find alternative scaffolds not covered by existing IP
  • Overcome pharmacological limitations of your hits
  • Evaluate the selectivity of your hits for target / anti-target
  • Repurpose your candidate molecules for other therapeutic areas

What is the underlying science?

PharmScreen uses a unique and superior 3D representation of molecules based on electrostatic, steric, and hydrophobic interaction fields derived from semi-empirical Quantum-Mechanics (QM) calculations. Such fields describe with high accuracy the factors that determine ligand / receptor interactions. These chemo-type agnostic descriptors enable identification of the compounds with similar physico-chemical properties but with different and diverse molecular scaffolds.

Virtual Screening Pharmacelera
Fig.1 PharmScreen field alignment

Molecular recognition is a central biochemical process. It defines drug interaction with biomolecules. This recognition is largely driven by hydrophobicity: hydrophobic areas of drug compounds tend to match hydrophobic areas of binding sites and cavities of macromolecules.

Hydrophobicity is often neglected in existing in-silico tools, which tend to focus their algorithms on electrostatic, hydrogen bonds, and steric components. As a consequence, chemical space is not properly mined and the proposed new chemical structures tend to be constrained and repetitive.

PharmScreen for KNIME offers a robust solution to this problem based on new molecular hydrophobicity descriptors. These differential descriptors overcome the above-mentioned drawbacks and lead to clear improvements: a more complete and original description of chemical space is achieved, which enables more chemical diversity to be found.

KNIME workflow: remote virtual screening in a cluster

A potential use of Pharmacelera’s nodes is to perform a virtual screening campaign on a workstation or remote cluster. Molecule libraries can be large, so, in order to speed up the process, this example workflow partitions the dataset into multiple parts and executes them across multiple machines.

Virtual Screening Pharmacelera
Fig.2 Remote execution workflow for distributed virtual screening in a cluster

This example Pharmacelera_VS_MultiServer workflow shows a simple way to deploy virtual screening, requiring only the IP addresses of the remote Linux cluster machines, access information, the reference molecule, and the molecule library.

A selection of the most promising molecule candidates is retrieved both in SDF and CSV formats for further postprocessing and analysis. You can download and try out the Pharmacelera_VS_MultiServer example workflow from the KNIME Hub. 

Resources

  • The Pharmacelera Extensions can be found on the KNIME Hub here.
  • Related workflows using these nodes are listed here

About the author

Enric Herrero is the co-founder of Pharmacelera. He is a full stack engineer specialized in customized solutions for data analysis. He has worked for many years in the design of hardware accelerators to improve the performance and energy efficiency of neural networks and efficient memory systems for multicore processors. He currently uses these skills to lead the development of PharmScreen and PharmQSAR.

Pharmacelera is a trusted KNIME technology partner. PharmScreen nodes enable KNIME users to find more chemical diversity in their virtual screening campaigns. Try out Pharmacelera’s KNIME nodes on the KNIME Hub, and request a demo via this form.

Virtual Screening Pharmacelera

Pharmacelera helps biotech and pharmaceutical companies improve the productivity of their R&D process with the use of advanced computational tools based on quantum mechanics algorithms and artificial intelligence.

Analyzing Gene Expression Data with KNIME

By Jeany, Mon, 04/27/2020 - 10:00

Express Yourself!

All individuals are unique and so are our data needs. From simple CSV files to REST APIs to Google’s BigQuery or customized shared components, KNIME Analytics Platform offers many ways to access and analyze your data. Today, we will demonstrate how to access all of these aforementioned data sources through the use case of analyzing and annotating gene expression data. Gene expression analysis is widely used in bioinformatics because it enables researchers to find gene products with increased or decreased synthesis in individuals with, for example, particular diseases.

Analyzing Gene Expression Data with KNIME

Gene expression is the process by which information from a gene is used in the synthesis of a functional gene product. As we learned in a recent blog article, Motifs and Mutations - The Logic of Sequence Logos, DNA mutations yield different effects of varying impact. Some have no noticeable effect at all and some can lead to severe diseases. As we also saw in that blog post, mutations that change gene expression are often associated with harmful effects. Hence, analyzing gene expression data directly is a straightforward way to find connections between genes and diseases. The first step in gene expression is called transcription, during which DNA is transcribed to RNA. Advances in massively parallel sequencing enable the rapid sequencing of this RNA (RNA-Seq) in a genome-wide manner in order to quantify the amount of synthesized gene product. In our use case today we analyze RNA-Seq data from tumors and matched normal tissue from three patients with oral squamous cell carcinomas1. We investigate all statistically significant over/under expressed genes and select interesting ones by looking into their functional annotations. Using hierarchical clustering, we select a cluster of similarly expressed genes and investigate their pathway enrichment. Lastly, we search for compounds that target the gene products we picked. As illustrated in Figure 1, the overall workflow consists of the following steps:

  1. Input data
  2. Find differentially expressed genes
  3. View results of differential gene expression analysis
  4. Clustering 
  5. Pathway enrichment
  6. Display compounds targeting gene product of interest
Analyzing Gene Expression Data with KNIME

Figure 1. Overview of the workflow. Differentially expressed genes are discovered using R and then displayed in an interactive view. Subsequently, genes are hierarchically clustered based on their expression pattern, and the results are shown via a dendrogram alongside a heatmap. We then perform a pathway enrichment analysis and look for compounds targeting the gene product of interest.

The user can select the files containing RNA-Seq data for samples with and without a disease of interest (positive and control, respectively). This data then gets used in the R Snippet to find differentially expressed genes. The user can investigate those genes and select genes of interest based on statistics from the gene expression analysis. We then cluster the genes based on similar expression profiles and investigate their biological pathways. In the last step we search for compounds targeting the selected gene products.

Input data

As mentioned in the previous section, today’s example uses RNA-Seq data from normal and tumor cells from patients with oral squamous cell carcinomas2. The standard procedure to generate these data consists of the following steps: the RNA of the cells is reverse transcribed to cDNA and then sequenced using massively parallel sequencing, resulting in short sequenced reads. Subsequently, these reads are mapped back to the reference genome to identify the genes from which they originated. This results in a count for each position in a gene, representing the amount of gene product. In our data set, read counts for 10,542 genes were collected.

Find differentially expressed genes

One of the strengths of KNIME Analytics Platform lies in its openness to other tools. This allows you to easily harness the power of tools such as R with all of its libraries. In our case today we want to utilize a commonly used R library for differential expression analysis of RNA-Seq expression profiles: edgeR3. edgeR implements a range of statistical methods, including likelihood tests based on generalized linear models (GLMs). GLMs are most commonly used to model binary or count data, which makes them perfectly suited to model the aforementioned read counts. GLMs model a response by a linear function of explanatory variables and allow for constraints such as a restriction on the range of the response Y or the variance of Y depending on the mean. Hence, a generalized linear model is made up of three components: (1) a linear predictor, (2) a link function that describes how the mean depends on the linear predictor, and (3) a variance function that describes how the variance depends on the mean [var(Yi) = φV(μi), with φ being the dispersion parameter].
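Written out in standard GLM notation (the textbook formulation the sentence above refers to, not a formula taken verbatim from the post), the three components are:

```latex
% (1) linear predictor, (2) link function, (3) variance function
\eta_i = \mathbf{x}_i^{\top} \boldsymbol{\beta}, \qquad
g(\mu_i) = \eta_i, \qquad
\operatorname{var}(Y_i) = \varphi \, V(\mu_i)
```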

Therefore, in our R snippet, we use the read count data from tumor vs. normal tissue and estimate the dispersion parameter. In the next step, we fit the generalized linear model and apply a likelihood-ratio test. This results in a log fold-change (logFC) and a p-value for each gene. The logFC describes how much a quantity changes between an original and a subsequent measurement. In our case that means how much the read counts per gene differ between tumor and normal cells. As we do this simultaneously for all 10,542 genes, we have to make sure to apply a multiple testing correction4. We use the default method provided by edgeR, Benjamini-Hochberg, which has the false discovery rate (FDR) as output.

As we are interested in the relative changes in expression levels between conditions, we do not have to account for factors such as varying gene length. However, we have to account for differing sequencing depth and RNA composition. Sequencing depth is adjusted in edgeR as part of the basic modeling procedure. To adjust for RNA composition effects, where highly expressed genes can cause the remaining genes to be under-sampled in that sample, we use the function calcNormFactors. In addition to the fold-change and the FDR for each gene, we extract the depth-normalized read counts (counts-per-million) for each gene in our analysis.
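A minimal sketch of these edgeR steps, with a simulated count matrix standing in for the real 10,542-gene data set (object names and sample layout are illustrative, not the exact ones used in the workflow's R Snippet):

```r
library(edgeR)

# Simulated stand-in for the read count matrix: 1000 genes, 3 tumor and
# 3 matched normal samples. The real workflow reads these counts from files.
set.seed(42)
count_matrix <- matrix(rnbinom(1000 * 6, mu = 50, size = 5), ncol = 6,
                       dimnames = list(paste0("gene", 1:1000),
                                       paste0("sample", 1:6)))
group  <- factor(rep(c("normal", "tumor"), 3))
design <- model.matrix(~ group)

y <- DGEList(counts = count_matrix, group = group)
y <- calcNormFactors(y)            # adjust for RNA composition effects
y <- estimateDisp(y, design)       # estimate the dispersion parameter
fit <- glmFit(y, design)           # fit the negative binomial GLM
lrt <- glmLRT(fit)                 # likelihood-ratio test: tumor vs. normal

res    <- topTags(lrt, n = Inf)$table   # logFC, p-value, BH-adjusted FDR per gene
logcpm <- cpm(y, log = TRUE)            # depth-normalized counts for later steps
```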

View of differentially expressed genes

We now have statistics for 10,542 genes, so in the following steps we want to narrow down the results to genes of interest. For that, we create an interactive composite view in which the user can select genes for further analysis. As can be seen in Figure 2, we display the fold change vs. the FDR on a logarithmic scale in a scatter plot. The colors indicate whether or not the FDR exceeds 0.01. In addition, we show an interactive table with details of the results (FDR, -log10(FDR), p-value, gene name) and a range slider that allows the user to interactively filter by logFC. We can now, for example, extract only genes that are significantly upregulated in tumor cells by box-selecting the genes in the upper right corner of the scatter plot.

Analyzing Gene Expression Data with KNIME

Figure 2. View of differentially expressed genes. A scatter plot of the fold change vs. the FDR and a table with details of the result are shown. The user can select genes in the plot or the table, or filter by fold change using the range slider.

Clustering 

Having narrowed down our search space, let’s have a closer look into the biological processes associated with these genes. Similar expression patterns of genes often point to a common function5. To find those genes with similar expression patterns, we perform a hierarchical clustering on the normalized read counts and display the results in a hierarchical cluster tree. For this, we can use a shared component that can be found on the KNIME Hub, the component Hierarchical Clustering and Heatmap. This component allows you to perform a hierarchical clustering on numerical columns of your choice and to display a heatmap sorted according to the clustering results. We combine this component with a table containing more detailed information about the genes (see Fig. 3), allowing us to interactively identify and pick a cluster of interest. In our case we select the one showing high (orange) values in tumor cells and low (blue) values in the matched normal tissue. As we can learn from the details in the table of our composite view, this cluster includes MMP11 (matrix metallopeptidase 11) which may play an important role in the progression of epithelial malignancies, and COL4A6 (collagen type IV alpha 6 chain) which is the major structural component of glomerular basement membranes. 

To further investigate shared function of the selected genes we perform a pathway enrichment analysis in the next step.

Analyzing Gene Expression Data with KNIME

Figure 3. View of the heatmap with normalized read counts and the dendrogram showing the hierarchical clustering of the counts. The heatmap is sorted according to the clustering. This combination of the heatmap with the dendrogram can be easily achieved using the shared component “Hierarchical Clustering and Heatmap”. Additionally, a table with more detailed information is shown.
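Conceptually, the clustering performed by the shared component boils down to a few base-R calls. The sketch below reuses the logcpm matrix from the edgeR sketch above and picks an arbitrary subset of genes; the component's actual options and rendering differ.

```r
# Cluster genes on their normalized counts and draw a heatmap ordered by the
# resulting dendrogram.
logcpm_selected <- logcpm[1:50, ]                    # stand-in for the selected genes
hc <- hclust(dist(logcpm_selected), method = "complete")
heatmap(logcpm_selected,
        Rowv = as.dendrogram(hc), Colv = NA,         # order rows by the clustering
        scale = "row",                               # highlight per-gene patterns
        col = colorRampPalette(c("blue", "white", "orange"))(64))
```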

Pathway enrichment

A pathway consists of a set of genes related to a specific biological function. As genes are often annotated to a lot of pathways, a pathway enrichment analysis allows us to find those pathways that are enriched in the input set of genes more than would be expected by chance6. Pathway enrichment analysis is, therefore, a widely used tool for gaining insight into the underlying biology of differentially expressed genes, as it reduces complexity and has increased explanatory power7. Again, we can use the KNIME Hub to easily drag and drop a shared component called Pathway Enrichment Analysis. This component makes use of the Reactome Pathway Database, a resource which is open-source, curated, and peer-reviewed. It provides a pathway enrichment web service which can be easily accessed by KNIME Analytics Platform. The component takes as input a set of Ensembl gene IDs and automatically performs the pathway enrichment analysis; the results can be seen in Figure 4. The pathways with the most significant enrichment are “Collagen chain trimerization” and “Degradation of the extracellular matrix”. Both MMP11 and COL4A6 are part of these two pathways. Indeed, collagen is an essential part of the extracellular matrix, and extracellular matrix interactions are known to be involved in the process of tumor invasion and metastasis in oral squamous carcinoma8. This further corroborates our hypothesis that these genes play an important role in the tumor cells of our patients with oral squamous cell carcinomas.

Analyzing Gene Expression Data with KNIME

Figure 4. Pathway enrichment view. The pathways with the highest enrichment are “Collagen chain trimerization” and “Degradation of the extracellular matrix”.
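The shared component talks to Reactome's analysis web service; calling that service directly from R could look roughly like this. The endpoint path and the shape of the JSON response are assumptions based on Reactome's public AnalysisService documentation, and the Ensembl IDs below are placeholders rather than the genes selected above.

```r
library(httr)
library(jsonlite)

# Placeholder Ensembl gene IDs standing in for the selected genes.
genes <- c("ENSG00000000001", "ENSG00000000002")

# Assumed endpoint: POST a plain-text list of identifiers to the Reactome
# AnalysisService and read back the enrichment result as JSON.
resp <- POST("https://reactome.org/AnalysisService/identifiers/projection",
             body = paste(genes, collapse = "\n"),
             content_type("text/plain"))
result <- fromJSON(content(resp, as = "text", encoding = "UTF-8"))
str(result, max.level = 1)   # inspect which pathway/statistics fields come back
```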

View compounds targeting gene product of interest

In this final step, we want to check if we can possibly interfere with the disease for which we did our expression data analysis. For that, we look for compounds that target the selected gene products. As we have seen in a previous blog article, Interactive exploration and analysis of scientific datasets using Google BigQuery and KNIME Analytics Platform, Google BigQuery offers effortless access to public life sciences data. For this, you need to set up a BigQuery account first; you can find more details on how to do that in this blog article: Tutorial - Importing Bike Data from Google BigQuery. In particular, we can easily query bioactivity data from the ChEMBL database using the KNIME Google BigQuery Connector in combination with the KNIME Database nodes. For our query we gather all human synonyms for the gene products of choice and extract compounds known to target them. This allows us to retrieve information for all those compounds, including the name, the assay ID, the type of measurement (e.g. IC50 or Ki), and the structure as a SMILES string.
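Expressed directly as SQL (via the same DBI connection idea sketched earlier), the ChEMBL part of such a query might look roughly like this. The dataset path is a placeholder, and the table/column names follow the standard ChEMBL schema, which the BigQuery copy may expose under slightly different names.

```r
# Hedged sketch: find compounds with measured activity against a gene product,
# given its synonyms (MMP11 / Stromelysin-3 as the example from the text).
sql <- "
  SELECT md.chembl_id, td.pref_name AS target, act.assay_id,
         act.standard_type, act.standard_value, cs.canonical_smiles
  FROM `chembl.component_synonyms`  AS syn
  JOIN `chembl.target_components`   AS tc  ON tc.component_id = syn.component_id
  JOIN `chembl.target_dictionary`   AS td  ON td.tid = tc.tid
  JOIN `chembl.assays`              AS ass ON ass.tid = td.tid
  JOIN `chembl.activities`          AS act ON act.assay_id = ass.assay_id
  JOIN `chembl.molecule_dictionary` AS md  ON md.molregno = act.molregno
  JOIN `chembl.compound_structures` AS cs  ON cs.molregno = md.molregno
  WHERE syn.component_synonym IN ('MMP11', 'Stromelysin-3')"

compounds <- dbGetQuery(con, sql)   # `con`: a BigQuery connection as shown earlier
```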

From the SMILES strings we create images of the molecules and display them in a tile view. As can be seen in Figure 5, we only found results for MMP11, matrix metalloproteinase-11, also known as stromelysin-3. MMP11 is known to be involved in extracellular matrix breakdown in normal physiological processes and has been implicated in promoting cancer development by inhibiting apoptosis as well as enhancing migration and invasion of cancer cells9. Additionally, it has been revealed that MMP-11 expression in oral squamous cell carcinoma samples can predict the progression and the survival of oral squamous cell carcinoma patients10. Moreover, MMP11 has recently been identified as a potential therapeutic target in lung adenocarcinoma11.

Analyzing Gene Expression Data with KNIME

Figure 5. Tile view with compounds targeting the gene product of interest. The ChEMBL ID for the ligand, the name of the target, the assay ID, the action, the type of measurement, and its value are shown.

Summary

Today we learned how to perform a classic task in bioinformatics: differential gene expression analysis for a disease of interest. We created an interactive view that allowed the user to select significantly under/over expressed genes. From there we further narrowed that set of genes down to genes with similar expression patterns and common function. In the last step, we searched for compounds targeting the discovered gene products thereby offering the possibility to interfere with the disease under investigation. We applied our workflow to data from normal and tumor cells from patients with oral squamous cell carcinoma. Through our analysis we were able to identify a gene that has been independently implicated as a therapeutic target for carcinomas. Moreover, it has been shown that the expression of that gene can predict disease progression as well as survival of oral squamous cell carcinoma patients. All this was facilitated by KNIME’s openness for other tools which enabled us to use our favourite R library, extract data from Google’s BigQuery and use shared components to customize our analysis. 

All steps of the analysis can also be performed on the WebPortal through the interactive views of the components.

Author: Jeany Prinz (KNIME)

References

1. Tuch, B., Laborde, R., Xu, X., Gu, J., Chung, C., & Monighetti, C. et al. (2010). Tumor Transcriptome Sequencing Reveals Allelic Expression Imbalances Associated with Copy Number Alterations. Plos ONE, 5(2), e9317. doi: 10.1371/journal.pone.0009317

2. Tuch, B., Laborde, R., Xu, X., Gu, J., Chung, C., & Monighetti, C. et al. (2010). Tumor Transcriptome Sequencing Reveals Allelic Expression Imbalances Associated with Copy Number Alterations. Plos ONE, 5(2), e9317. doi: 10.1371/journal.pone.0009317

3. Robinson, M., McCarthy, D., & Smyth, G. (2009). edgeR: a Bioconductor package for differential expression analysis of digital gene expression data. Bioinformatics, 26(1), 139-140. doi: 10.1093/bioinformatics/btp616

4. Noble, W. (2009). How does multiple testing correction work?. Nature Biotechnology, 27(12), 1135-1137. doi: 10.1038/nbt1209-1135

5. Alon, U., Barkai, N., Notterman, D., Gish, K., Ybarra, S., Mack, D., & Levine, A. (1999). Broad patterns of gene expression revealed by clustering analysis of tumor and normal colon tissues probed by oligonucleotide arrays. Proceedings Of The National Academy Of Sciences, 96(12), 6745-6750. doi: 10.1073/pnas.96.12.6745

6. Reimand, J., Isserlin, R., Voisin, V. et al. Pathway enrichment analysis and visualization of omics data using g:Profiler, GSEA, Cytoscape and EnrichmentMap. Nat Protoc 14, 482–517 (2019). https://doi.org/10.1038/s41596-018-0103-9

7. Khatri P, Sirota M, Butte AJ (2012) Ten Years of Pathway Analysis: Current Approaches and Outstanding Challenges. PLoS Comput Biol 8(2): e1002375. https://doi.org/10.1371/journal.pcbi.1002375

8. Lyons, A., & Jones, J. (2007). Cell adhesion molecules, the extracellular matrix and oral squamous carcinoma. International Journal Of Oral And Maxillofacial Surgery, 36(8), 671-679. doi: 10.1016/j.ijom.2007.04.002

9. ZHANG, X., HUANG, S., GUO, J., ZHOU, L., YOU, L., ZHANG, T., & ZHAO, Y. (2016). Insights into the distinct roles of MMP-11 in tumor biology and future therapeutics (Review). International Journal Of Oncology, 48(5), 1783-1793. doi: 10.3892/ijo.2016.3400

10. Hsin, C., Chou, Y., Yang, S., Su, S., Chuang, Y., Lin, S., & Lin, C. (2017). MMP-11 promoted the oral cancer migration and FAK/Src activation. Oncotarget, 8(20). doi: 10.18632/oncotarget.15824

11. Yang, H., Jiang, P., Liu, D., Wang, H., Deng, Q., & Niu, X. et al. (2019). Matrix Metalloproteinase 11 Is a Potential Therapeutic Target in Lung Adenocarcinoma. Molecular Therapy - Oncolytics, 14, 82-93. doi: 10.1016/j.omto.2019.03.012


Ten common issues when using Excel for data operations

By rs, Mon, 05/04/2020 - 10:00

I know you are still using Excel sheets to transform and/or analyze your data! I know, because most of us still use it to some extent. There is nothing wrong with using Excel. Excel spreadsheets are a great tool to collect and transform small amounts of data. However, when the game becomes harder and requires larger amounts of data, Excel starts showing its limitations.

You do not believe me? Then let’s start with the list of most common issues when working with an Excel spreadsheet to transform data. For this post I used answers provided by fellow data scientists in this thread on LinkedIn. Thank you to everyone for contributing!

Ten common issues with Excel for data operations

 

1. No Error Control

One main issue that came out of many conversations with fellow data scientists: Excel spreadsheets have no error control and are therefore error prone.

According to Meta Brown and Karen Hardie, “It's easy to inadvertently change a cell or make mistakes - I’ve seen people suddenly realise that a macro was wrong by one cell after using the process for a long time and then have to go back and figure out when that happened.”

There is no debugging tool and no testing framework to inspect whether all cells keep working as expected, for example after a change.

John Peck also commented that “Excel is great for simple, ad hoc calculations, but its lack of structure and difficulty in automating and documenting the contents make its use error prone. Analyses built in Excel tend to grow and sprawl making them difficult to validate and to use on repetitive tasks.”

This last hint on the difficulty of using Excel spreadsheets for repetitive tasks takes us to issue #2.

2. Little Reusability

This one comes from the pool of my own personal mistakes when using Excel spreadsheets for professional data management. It had to do with the data input. Usually, data are stored in one or more source columns in an Excel spreadsheet, while the other columns contain the macros and formulas for the processing. Often, when reusing the spreadsheet for the current month’s analysis, the new data were copied and pasted manually into the dedicated source column(s). However, since there were usually more data rows for the current month than for the previous month, the plain copy/paste of the data would cover regions of the sheet where macros had not yet been defined, producing wrong, unverified sums and macro results.

The lack of a verified, reliable, repeatable way to collect data from multiple sources limits reusability to very simple processes.

And if you're thinking of using Excel as a data source: Roger Fried warns against it!

3. Problematic Scalability

In professional data wrangling projects, we usually deal with very large amounts of data. Therefore scalability is often a concern when moving forward with the project. Excel spreadsheets show their shortcomings when large amounts of data are involved.

David Langer lists “speed of iteration of analyses” as one of the main problems of using an Excel spreadsheet for professional data transformations. “My experience has been that current Excel row limitations (I'm ignoring PowerPivot here) aren't a concern in the vast majority of cases. What kicks me out of Excel most of the time is speed of iteration. For example, in linear regression modeling,” he says.

For Giovanni Marano, “performance degradation and crashes, when running operations on big datasets” are a big limitation for serious professional usage of Excel spreadsheets, while Anna Chaney confirms that “Excel doesn’t have enough memory to load larger datasets”.

David Montfort points to the limit in number of processable rows: “Excel has a row limit which can be an issue with very large datasets. Also, other programs offer better statistical and data visualization tools”.

So lack of memory, limits on the number of rows, generally slow execution, and performance degradation all represent a serious scalability issue when implementing professional data wrangling and data management projects.

4. Low Coverage of Data Operations 

Again, Excel spreadsheets do well for small datasets and for a reduced pool of data operations. However, when projects become bigger and require more sophisticated data operations, some of those operations are simply not available in Excel.

Alessio Nicolai and his colleague Giovanni Marano focus on “ad-hoc” analyses (which don't require a scalable process). They identified the following limitations in data operations available to an Excel spreadsheet:

  • Operations on a filtered dataset are limited (filtered-out data are only “hidden”)
  • No availability of intermediate steps in data preparation (e.g. when filtering)
  • Formulas limitations (e.g. no MAXIFS/MINIFS without using computationally expensive array formulas)
  • Distinct count in pivot tables is not available 
  • The equivalent of Joiner (Vlookup) is clunky and does not allow the Full Outer join
  • Multi-key joiners / full outer joiners not possible without work-arounds
  • Analysis tools (like regressions, correlations) are way too basic
  • Number of rows in the spreadsheet are limited

Amit Kulkarni adds the difficulty of referencing filtered sets for, say, VLOOKUP functions, and Sayed Bagher Nashemi Natanzi (Milad) would like to have more options for sorting and filtering.

5. Lack of Automation

Deeply connected with the lack of reusability is the lack of automation, as pointed out by Tyler Garrett below.

Copy and paste operations are common when using Excel spreadsheets to introduce new data, new cells, and new functions. These are all operations that cannot be automated, because they require starting the tool’s GUI and a certain degree of expertise. Every time you need to calculate new values, you have to reopen Excel, perform these manual operations, and recalculate.

“It is great for prototyping, documenting, entry level input to get a ETL, analytics, or data science process started, but truly the value starts to disappear when the computer is offline. The "availability" being dependent on computers being ON, the "validity" being relevant only if users are experts (but even we make mistakes), and lack rules keeping it from being acid compliant :)”

6. Not open

We have referred to a Copy & Paste action often so far. Of course, this is not the only way to get data into Excel. You can connect to databases and some other external tools. However, there is a plethora of data sources, data types, and data formats that are usually needed within the scope of a data wrangling project. The openness of a tool allows you to connect, import, and process a number of different data sources and types, and to integrate scripts and workflows from other popular tools.

Transparency is another sign of the openness of a tool. Being able to understand the formulas and operations at a glance is important when passing your work on to someone else or when interpreting a colleague’s work.

Alberto Marocchino has indicated this as another fault in the usage of Excel spreadsheets in data analysis. In particular he pointed out that: 

  • You do not know if a cell contains a formula or a value (data and analysis are merged together) 
  • Formulas are hidden in cells 
  • There is no direct pipeline for dashboard export 
  • It pushes data correction back to a DB 

“Excel can be a wonderful tool, it depends on the use. It is general purpose and since most of the computer users stick with windows it is a native way to visually interact with CSV. But probably 'general tool' is not necessarily a synonym for quality when it comes to hardcore data analysis.”

This difficulty in documenting and communicating what happens in the Excel spreadsheet takes us directly to the next issue.

7. Difficult Collaboration 

Nowadays no data scientist or data engineer works alone anymore. We are all part of bigger or smaller labs and we all need to communicate around the applications we build. Team debugging, feature discussions, best practices, documentation are all necessary tasks in the daily work. Excel is really not made for collaboration in big teams.

It resides on your local machine, preferably one running a Windows OS. Even moving the spreadsheet to a Mac might require some extra effort.

David Springer indicates the “major issue with Excel when processing data as mostly the default, non-portable, proprietary data format”. 

Documentation is a big part of collaboration. Michael Reithel observes that “Manual modifications to a spreadsheet are often undocumented and consequently lost over time making it hard to reproduce results.”

Those are just a few of the issues that make collaboration around Excel hard to implement.

8. Time Consuming

The lack of scalability, the manual operations, and the limitations on the amount of data make the whole process around an Excel spreadsheet quite time-consuming, as reported by Hrvoje Gabelica and Tyler Garrett.

Both encourage investigating other solutions that allow for automation, scheduling, openness, and better scalability.

9. Not user-friendly

All in all, an Excel spreadsheet is not user-friendly. It seems easy to use at the beginning, when taking your first steps in the world of data processing. However, when more complex operations are required, or when collaboration would come in handy, it turns out it is not that user-friendly after all.

Giovanni Marano lists two main reasons for that: 

  • Excel’s Macros for repeated processes are not user-friendly and hard to code/debug in VBA 
  • When multiple formulas/operations are set up in a spreadsheet, you don’t have an easy overview of the dependencies between them, and – unless you use complex VBA coding – you need to run the whole execution in one go

Evert Homan says that pivoting data in Excel is cumbersome. I would add that the lack of overview and the difficulty of adding documentation make data processing in Excel quite user-hostile, even for simple tasks.

We can conclude with Davide Imperati’s statement: “It is the perfect device to generate corrupt data”, since we do not always understand the processing functions.

10. Productionizing is hard

Finally, after implementation, we need to move our application into production. Without scheduling, automatic import of new data from many different data sources, or automatic reset of macros before re-execution, moving into production can be quite a hard task.

This leaves Excel as an excellent tool for small datasets and perhaps prototyping, but unsuitable for professional data management projects.

Try something new

The issues listed here are just the ten most common ones data engineers have to deal with when working with Excel spreadsheets to store, clean, and transform their data. If you are still hooked on Excel and fighting to get the data into the right format, try investigating a few alternative solutions for data analysis. Not all data science tools require programming or scripting skills. Some of them are based on visual programming, where the drag & drop of visual icons and their connection into a pipeline takes the place of scripting.

KNIME Analytics Platform is open-source and open software for data analysis, with more than 3000 data operations. It can take your data from most sources and most formats to whatever shape you need them in, and it can export your results in most available formats on most available platforms (open). It relies on a graphical user interface (GUI) where, by drag & drop, you can easily assemble a pipeline of operations (called a “workflow”), which can be reused at any time. Thanks to its GUI, it is easy to combine documentation and functionality within the same project. Together with the KNIME Server, it also allows for easy productionization, collaboration, sharing, scheduling, and automation.

Just download KNIME Analytics Platform for free from the KNIME website, install it on your machine, and start assembling workflows right away! 

To quickly transition your knowledge and maybe your existing spreadsheets into repeatable and reliable workflows, you can rely on the free booklet “From Excel to KNIME” and start migrating!

Sometimes, we might want to perform all complex data operations within KNIME Analytics Platform and then export the results back into an Excel spreadsheet. The second and latest version of this booklet introduced a few nodes from the community extension “Continental Nodes for KNIME”, which allow you to export the results back into an Excel spreadsheet with a specific look&feel.

Ten common issues with using Excel for data operations

In case you need help on specific issues, ask the KNIME community on the KNIME Forum for technical questions, on the KNIME Hub for examples, and at KNIME events to learn more from others’ experience.

We hope to see you soon at one of KNIME’s next virtual events!

Author: Rosaria Silipo (KNIME)

How to move data science into production

By berthold, Thu, 05/07/2020 - 10:00

By Michael Berthold (KNIME). As first published in InfoWorld.

With new Integrated Deployment extensions, data scientists can capture entire KNIME workflows for automatic deployment to production or reuse

How to move data science into production

Deploying data science into production is still a big challenge. Not only does the deployed data science need to be updated frequently but available data sources and types change rapidly, as do the methods available for their analysis. This continuous growth of possibilities makes it very limiting to rely on carefully designed and agreed-upon standards or work solely within the framework of proprietary tools.

KNIME has always focused on delivering an open platform, integrating the latest data science developments by either adding our own extensions or providing wrappers around new data sources and tools. This allows data scientists to access and combine all available data repositories and apply their preferred tools, unlimited by a specific software supplier’s preferences. When using KNIME workflows for production, access to the same data sources and algorithms has always been available, of course. Just like many other tools, however, transitioning from data science creation to data science production involved some intermediate steps.

In this post, we are describing a recent addition to the KNIME workflow engine that allows the parts needed for production to be captured directly within the data science creation workflow, making deployment fully automatic while still allowing every module to be used that is available during data science creation.

Why is deploying data science in production so hard?

At first glance, putting data science in production seems trivial: Just run it on the production server or chosen device! But on closer examination, it becomes clear that what was built during data science creation is not what is being put into production.

I like to compare this to the chef of a Michelin star restaurant who designs recipes in his experimental kitchen. The path to the perfect recipe involves experimenting with new ingredients and optimizing parameters: quantities, cooking times, etc. Only when satisfied, are the final results — the list of ingredients, quantities, procedure to prepare the dish — put into writing as a recipe. This recipe is what is moved “into production,” i.e., made available to the millions of cooks at home that bought the book.

This is very similar to coming up with a solution to a data science problem. During data science creation, different data sources are investigated; that data is blended, aggregated, and transformed; then various models (or even combinations of models) with many possible parameter settings are tried out and optimized. What we put into production is not all of that experimentation and parameter/model optimization — but the combination of chosen data transformations together with the final best (set of) learned models.

This still sounds easy, but this is where the gap is usually biggest. Most tools allow only a subset of possible models to be exported; many even ignore the preprocessing completely. All too often what is exported is not even ready to use but is only a model representation or a library that needs to be consumed or wrapped into yet another tool before it can be put into production. As a result, the data scientists or model operations team needs to add the selected data blending and transformations manually, bundle this with the model library, and wrap all of that into another application so it can be put into production as a ready-to-consume service or application. Lots of details get lost in translation.

For our Michelin chef above, this manual translation is not a huge issue. She only creates or updates recipes every other year and can spend a day translating the results of her experimentation into a recipe that works in a typical kitchen at home. For our data science team, this is a much bigger problem: They want to be able to update models, deploy new tools, and use new data sources whenever needed, which could easily be on a daily or even hourly basis. Adding manual steps in between not only slows this process to a crawl but also adds many additional sources of error.

The diagram below shows how data science creation and productionization intertwine. This is inspired by the classic CRISP-DM cycle but puts stronger emphasis on the continuous nature of data science deployment and the requirement for constant monitoring, automatic updating, and feedback from the business side for continuous improvements and optimizations. It also distinguishes more clearly between the two different activities: creating data science and putting the resulting data science process into production.

How to move data science into production

Often, when people talk about “end-to-end data science,” they really only refer to the cycle on the left: an integrated approach covering everything from data ingestion, transforming, and modeling to writing out some sort of a model (with the caveats described above). Actually consuming the model already requires other environments, and when it comes to continued monitoring and updating of the model, the tool landscape becomes even more fragmented. Maintenance and optimization are, in many cases, very infrequent and heavily manual tasks as well. On a side note: We avoid the term “model ops” purposely here because the data science production process (the part that’s moved into “operations”) consists of much more than just a model.

Removing the gap between data science creation and data science production

Integrated deployment removes the gap between data science creation and data science production by enabling the data scientist to model both creation as well as production within the same environment by capturing the parts of the process that are needed for deployment. As a result, whenever changes are made in data science creation, these changes are automatically reflected in the deployed extract as well. This is conceptually simple but surprisingly difficult in reality.

If the data science environment is a programming or scripting language, then you have to be painfully detailed about creating suitable subroutines for every aspect of the overall process that could be useful for deployment — also making sure that the required parameters are properly passed between the two code bases. In effect, you have to write two programs at the same time, ensuring that all dependencies between the two are always observed. It is easy to miss a little piece of data transformation or a parameter that is needed to properly apply the model.

Using a visual data science environment can make this more intuitive. The new Integrated Deployment node extensions from KNIME allow those pieces of the workflow that will also be needed in deployment to be framed or captured. The reason this is so simple is that those pieces are naturally a part of the creation workflow. This is because first, the exact same transformation pieces are needed during model training, and second, evaluation of the models is needed during fine tuning. The following image shows a very simple example of what this looks like in practice:

How to move data science into production

The purple boxes capture the parts of the data science creation process that are also needed for deployment. Instead of having to copy them or having to go through an explicit “export model” step, now we simply add Capture-Start/Capture-End nodes to frame the relevant pieces and use the Workflow-Combiner to put the pieces together. The resulting, automatically created workflow is shown below:

How to move data science into production

The Workflow-Writer nodes come in different shapes that are useful for all possible ways of deployment. They do just what their name implies: write out the workflow for someone else to use as a starting point. But more powerful is the ability to use Workflow-Deploy nodes that automatically upload the resulting workflow as a REST service or as an analytical application to KNIME Server or deploy it as a container — all possible by using the appropriate Workflow-Deploy node.

The purpose of this article is not to describe the technical aspects in great detail. Still, it is important to point out that this capture and deploy mechanism works for all nodes in KNIME — nodes that provide access to native data transformation and modeling techniques as well as nodes that wrap other libraries such as TensorFlow, R, Python, Weka, Spark, and all of the other third-party extensions provided by KNIME, the community, or the partner network.

With the new Integrated Deployment extensions, KNIME workflows turn into a complete data science creation and productionization environment. Data scientists building workflows to experiment with built-in or wrapped techniques can capture the workflow for direct deployment within that same workflow. For the first time, this enables instantaneous deployment of the complete data science process directly from the environment used to create that process.

Guided Labeling Blog Series - Episode 3: Model Uncertainty

Guided Labeling Blog Series - Episode 3: Model UncertaintypaolotamagMon, 05/11/2020 - 13:58

Welcome to the third episode of our series on Guided Labeling!

Guided Labeling Model Uncertainty

In this series, we've been exploring the topic of guided labeling by looking at active learning and label density. In the first episode we introduced the topic of active learning and active learning sampling and moved on to look at label density in the second article. Here are the links to the two previous episodes:

In this third episode, we are moving on to look at Model Uncertainty.

Using label density we explore the feature space and retrain the model each time with new labels that are both representative of a good subset of unlabeled data and different from already labeled data of past iterations. However, besides selecting data points based on the overall distribution, we should also prioritize missing labels based on the attached model predictions. In every iteration we can score the data that still need to be labeled with the retrained model. What can we infer given those predictions by the constantly re-trained model?

Before we can answer this question there is another common concept in machine learning classification related to the feature space: the decision boundary. The decision boundary defines a hyper-surface in the feature space of n dimensions, which separates data points depending on the predicted label. In Figure 1 we point again to our data set with only two columns: weight and height. In this case the decision boundary is a line drawn by a machine learning model to predict overweight and underweight conditions. In this example we use a line, however we could have also used a curve or a closed shape.

Guided Labeling Model Uncertainty

Fig. 1: In the 2D feature space of weight vs height we train a machine learning model to distinguish overweight and underweight subjects. The model prediction is visually and conceptually represented by the decision boundary - a line dividing the subjects in the two categories.

So let’s say we are training an SVM model - starting with no labels and using active learning. That is, we are trying to find the right line. We label a few subjects in the beginning using label density. Subjects are labeled by simply applying a heuristic called body mass index - no need for a domain expert in this simple example.

In the beginning, the position of the decision boundary will probably be wrong as it is based on only a few data points in the most dense areas. However, the more labels you add, the closer the line will move to the actual separation between the two classes. Our focus here is to move this decision boundary to the right position using as few labels as possible. In active learning, this means using as little time as possible of our expensive human-in-the-loop expert. 

To use fewer labels we need data points positioned around the decision boundary, as these are the data points that best define the line we are looking for. But how do we find them, not knowing where this decision boundary lies? The answer is, we use model predictions - and, to be more precise - we use model uncertainty.

Guided Labeling Model Uncertainty

Fig. 2: In the 2D feature space, the dotted decision boundary belongs to the model trained in the current iteration k. To move the decision boundary in the right direction we use uncertainty sampling, asking the user to label new data points near the current decision boundary. We then identify misclassifications, which subsequently lead to a better decision boundary in the next iteration after the model is retrained.

Looking for misclassification using uncertainty

At each iteration the decision boundary moves when a newly labeled point contradicts the model prediction. The intuition behind using model uncertainty is that a misclassification is more likely to happen when the model is uncertain of its prediction. When the model has already achieved decent performance, model uncertainty is symptomatic of a misclassification, i.e. a wrong prediction, being more probable. In the feature space, model uncertainty increases as you get closer to the decision boundary. To quickly move our decision boundary to the right position, we therefore look for misclassifications using uncertainty. In this manner, we select data points that are close to the actual decision boundary (Fig. 2). 

So here we go: at each iteration we score all unlabeled data points with the re-trained model. Next, we compute the model uncertainty, take the most uncertain predictions and ask the user to label them. By retraining the model with all of the corrected predictions, we are likely to move the decision boundary in the right direction and achieve better performance with fewer labels. 
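
Outside of KNIME, the same loop can be sketched in a few lines of Python. This is a minimal illustration of uncertainty sampling with a generic scikit-learn-style classifier, not the implementation behind the KNIME Active Learning nodes; the ask_expert function is a placeholder for the human-in-the-loop labeling step:

import numpy as np

def prediction_entropy(probs):
    # probs: array of shape (n_rows, n_classes) with class probabilities per row.
    # Returns the normalized entropy per row: 0 = fully certain, 1 = maximally uncertain.
    eps = 1e-12
    h = -np.sum(probs * np.log(probs + eps), axis=1)
    return h / np.log(probs.shape[1])

def uncertainty_sampling_round(model, X_labeled, y_labeled, X_unlabeled, ask_expert, k=5):
    # 1. Retrain the model on everything labeled so far.
    model.fit(X_labeled, y_labeled)
    # 2. Score the unlabeled pool and rank rows by prediction entropy.
    probs = model.predict_proba(X_unlabeled)
    top = np.argsort(prediction_entropy(probs))[::-1][:k]
    # 3. Ask the human-in-the-loop expert to label the k most uncertain rows.
    new_labels = ask_expert(X_unlabeled[top])
    # 4. Move the newly labeled rows from the pool into the training set.
    X_labeled = np.vstack([X_labeled, X_unlabeled[top]])
    y_labeled = np.concatenate([y_labeled, new_labels])
    X_unlabeled = np.delete(X_unlabeled, top, axis=0)
    return model, X_labeled, y_labeled, X_unlabeled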

How do we measure model certainty/uncertainty?

There are different metrics; we are going to use the entropy score (Form. 3). This is a concept common in information theory. High entropy is a symptom of high uncertainty. This strategy is also known as uncertainty sampling and you can find the details in the blog article Labeling with Active Learning, which was first published in Data Science Central.

Prediction Entropy Formula

Guided Labeling Model Uncertainty

Given a prediction for row x by the classification model, we can retrieve a probability vector P(l|x) which sums up to 1 and contains the n probabilities that the row belongs to each possible target class l_i. Using such a prediction vector, we can compute an entropy score between 0 and 1 to quantify the uncertainty of the model in predicting P(l|x).
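
The formula shown in the image above is not reproduced here, but the normalized prediction entropy it corresponds to can be written as follows (the division by \log n is assumed from the text, since that is what keeps the score between 0 and 1):

E(x) = -\frac{1}{\log n} \sum_{i=1}^{n} P(l_i \mid x)\,\log P(l_i \mid x)

E(x) = 0 means the model is completely certain about row x, while E(x) = 1 means all n classes are predicted as equally likely.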

Wrapping up

In today's episode, we've taken a look at how model uncertainty can be used as a rapid way of moving our decision boundary to the correct position using as few labels as possible, i.e. taking up as little time as possible of our expensive human-in-the-loop expert. 

In the fourth episode of our Guided Labeling Blog Series we will go on to use uncertainty sampling to exploit the key areas of the feature space to ensure an improvement of the decision boundary. Stay tuned to our blog post channel for the next episode and also for more posts on other data science topics!

The Guided Labeling Blog Series

By Paolo Tamagnini (KNIME)

 

KNIME on Databricks - Guide

KNIME on Databricks - Guideandisa.dewiMon, 05/18/2020 - 08:28

Continuing with our series of articles about cloud connectivity, this blog post is an introduction to using KNIME on Databricks. It's written as a guide, showing you how to connect to a Databricks cluster within KNIME Analytics Platform, as well as several ways to read data from Databricks and write data back to it.

KNIME on Databricks

A Guide in 5 Sections

This "how-to" is divided into the following sections:

  • Connect to Databricks
  • Connect to a Databricks Cluster
  • Connect to a Databricks File System
  • Reading and Writing Data in Databricks
  • Databricks Delta

What is Databricks?

Databricks is a cloud-based data analytics tool for big data management and large-scale data processing. Developed by the same group behind Apache Spark, the cloud platform is built around Spark, allowing a wide variety of tasks from processing massive amounts of data, building data pipelines across storage file systems, to building machine learning models on a distributed system, all under a unified analytics platform. One advantage of Databricks is the ability to automatically split workload across various machines with on-demand autoscaling.

The KNIME Databricks Integration

KNIME Analytics Platform includes a set of nodes to support Databricks, which is available from version 4.1. This set of nodes is called the KNIME Databricks Integration and enables you to connect to your Databricks cluster running on Microsoft Azure or Amazon AWS. You can access and download the KNIME Databricks Integration from the KNIME Hub.

Note: This guide is explained using the paid version of Databricks. The good news is: Databricks also offers a free community edition for testing and education purposes, with access to 6 GB clusters, a cluster manager, a notebook environment, and other limited services. If you are using the community edition, you can still follow this guide without any problem.

Connect to Databricks

Add the Databricks JDBC driver to KNIME

To connect to Databricks in KNIME Analytics Platform, first you have to add the Databricks JDBC driver to KNIME with the following steps.

1. Download the latest version of the Databricks Simba JDBC driver at the official website. You have to register to be able to download any Databricks drivers. After registering, you will be redirected to the download page with several download links, mostly for ODBC drivers. Download the JDBC Drivers link located at the bottom of the page.

  • NOTE: If you’re using a Chrome-based web browser and the registration somehow doesn’t work, try to use another web browser, such as Firefox.

2. Unzip the compressed file and save it to a folder on your hard disk. Inside the folder there is another compressed file; unzip this one as well. Inside, you will find a .jar file, which is your JDBC driver file.

  • NOTE: Sometimes you will find several zip files inside the first folder; each file refers to the JDBC version supported by the driver. KNIME currently supports JDBC drivers that are JDBC 4.1 or JDBC 4.2 compliant.

3. Add the new driver to the list of database drivers:

  • In KNIME Analytics Platform, go to File > Preferences > KNIME > Databases and click Add
  • The “Register new database driver” window opens. 
  • Enter a name and an ID for the JDBC driver. For example, ID=Databricks, and name=Databricks
  • In the Database type menu select databricks.
  • The URL template should be automatically detected. If not, enter the following URL template: jdbc:spark://<host>:<port>/default. The <host> and <port> placeholders will be automatically replaced with your cluster information. This URL points to the schema default, which will be the standard schema for the database session. If you want to change the session's standard schema, replace the default part of the URL with your own schema name. You can always access other schemas as well by entering the schema name in the node dialogs when working with database objects.
  • Click Add file. In the window that opens, select the JDBC driver file (see item 2 of this step list)
  • Click Find driver classes, and the field with the driver class is populated automatically 
  • Click OK to close the window
  • Now click Apply and close.
KNIME on Databricks
Figure 1. Adding Databricks JDBC driver to KNIME
If you are somehow not able to download and add the official JDBC driver, don’t despair! KNIME Analytics Platform provides an open source Apache Hive driver that you can directly use to connect to Databricks. However, it is strongly recommended to use the official JDBC driver provided by Databricks. If you do want to use the open source Apache Hive driver, you can skip this section and go directly to the next section.

Connect to a Databricks cluster

In this section we will configure the Create Databricks Environment node to connect to a Databricks cluster from within KNIME Analytics Platform. 

Note: The Create Databricks Environment node is part of the KNIME Databricks Integration, available on the KNIME Hub.

Before connecting to a cluster, please make sure that the cluster is already created in Databricks. For a detailed instruction on how to create a cluster, follow the tutorial provided by Databricks. During cluster creation, the following features might be important:

Autoscaling: Enabling this feature allows Databricks to dynamically reallocate workers for the cluster depending on the current load demand.

Auto termination: You can specify an inactivity period, after which the cluster will terminate automatically.

The autoscaling and auto termination features, along with other features during cluster creation might not be available in the free Databricks community edition.

After the cluster is created, open the configuration window of the Create Databricks Environment node. The information we have to provide when configuring this node is:

  • The full Databricks deployment URL
    • The URL is assigned to each Databricks deployment. For example, if you use Databricks on AWS and log into https://1234-5678-abcd.cloud.databricks.com/, it is your Databricks URL
    • Warning: The URL looks different depending on whether it is deployed on AWS or Azure.
KNIME on Databricks
Figure 2. Databricks deployment URL on AWS

 

KNIME on Databricks
Figure 3. Databricks deployment URL on Azure

In the free Databricks community edition, the deployment URL is https://community.cloud.databricks.com/.

The Cluster ID

Cluster ID is the unique ID for a cluster in Databricks. To get the cluster ID, click the Clusters tab in the left pane and then select a cluster name. You can find the cluster ID in the URL of this page: /#/settings/clusters/<cluster-id>/configuration.

The examples below show how to find the cluster ID on both AWS and Azure Databricks.

KNIME on Databricks
Figure 4. Cluster ID on AWS Databricks

 

KNIME on Databricks
Figure 5. Cluster ID on Azure Databricks 
  • The URL in the free Databricks community edition is similar to the one on Azure Databricks (see Figure 5).

Workspace ID

Workspace ID is the unique ID for a Databricks workspace where you can create Spark clusters or schedule workloads. It is only available for Databricks on Azure, or if using the free Databricks community edition. If you’re using Databricks on AWS, just leave it blank.

You can also find the workspace ID in the deployment URL. The number after o= is the workspace ID, for example, https://<databricks-instance>/?o=3272736592385

KNIME on Databricks
Figure 6. Workspace ID on Azure Databricks

Authentication

Token is strongly recommended as the authentication method in Databricks. To generate an access token:

1. In your Databricks workspace, click on the user profile icon on the upper right corner and select User Settings. 

2. Navigate to the Access Tokens tab.

3. Click Generate New Token as shown in Figure 7, and optionally enter the description and the token lifetime.

KNIME on Databricks
Figure 7. The Access Tokens tab

4. Finally, click the Generate button, as shown in Figure 8.

KNIME on Databricks
Figure 8. Generate new token

5. Store the generated token in a safe location.
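
As a side note, the generated token is a standard Databricks personal access token, so you can sanity-check it outside of KNIME before configuring the node. A minimal Python sketch using the public Databricks REST API (the host and token values below are placeholders):

import requests

# Placeholder values - replace with your own deployment URL and token.
DATABRICKS_HOST = "https://1234-5678-abcd.cloud.databricks.com"
TOKEN = "dapiXXXXXXXXXXXXXXXXXXXXXXXX"

# The token is sent as a standard Bearer token; listing clusters is a cheap
# way to confirm that both the URL and the token are valid.
response = requests.get(
    f"{DATABRICKS_HOST}/api/2.0/clusters/list",
    headers={"Authorization": f"Bearer {TOKEN}"},
)
response.raise_for_status()
print([c["cluster_name"] for c in response.json().get("clusters", [])])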

Access tokens are unfortunately not available in the free Databricks community edition. Use the username and password option as an alternative.

To configure more advanced options, you can check the Advanced tab in the Create Databricks Environment node. For example, the following settings might be useful:

  • Create Spark context checkbox is enabled by default to run KNIME Spark jobs on Databricks. However, if your cluster runs with Table Access Control, you have to disable this option because TAC doesn’t support a Spark execution context. 
  • Enabling the Terminate cluster on context destroy checkbox will terminate the cluster when the node is reset, when the Destroy Spark Context node is executed, or when the workflow or KNIME is closed. This might be important if you need to release resources immediately after being used. However, use this feature with caution! Another option is to enable the auto termination feature during cluster creation, where the cluster will auto terminate after a certain period of inactivity.

Additionally, the DB Port tab contains all database-related configurations, which are explained in more detail in the KNIME database documentation.

KNIME on Databricks
Figure 9. Create Databricks Environment node configuration window.

That’s it! After filling all the necessary information in the Create Databricks Environment node, you can execute the node and it will automatically start the cluster if required and wait until the cluster becomes ready. This might take some minutes until the required cloud resources are allocated and all services are started.

The node has three output ports:

  • Red port: JDBC connection which allows connecting to KNIME database nodes.
  • Blue port: DBFS connection which allows connecting to remote file handling nodes as well as Spark nodes.
  • Gray port: Spark context which allows connecting to all Spark nodes.

The Remote File Handling nodes are available under IO > File Handling > Remote in the node repository.

These three output ports allow you to perform a variety of tasks on Databricks clusters via KNIME, such as connecting to a Databricks database and performing database manipulation via KNIME database nodes, or executing Spark jobs via KNIME Spark nodes, while pushing all of the computation down into the Databricks cluster. 

Connect to the Databricks File System

Another node in the KNIME Databricks Integration package is called the Databricks File System Connection node. It allows you to connect directly to Databricks File System (DBFS) without having to start a cluster as is the case with the Create Databricks Environment node, which is useful if you simply want to get data in or out of DBFS. 

In the configuration dialog of this node, you have to provide the domain of the Databricks deployment URL, e.g. 1234-5678-abcd.cloud.databricks.com, as well as the access token or username/password as the authentication method. Please check the Connect to a Databricks cluster section for information on how to get the Databricks deployment URL and generate an access token.

KNIME on Databricks
Figure 10. Databricks File System Connection node configuration window
  • Note: The Databricks File System Connection node is a part of the KNIME Databricks Integration, available on the KNIME Hub.

Reading and Writing Data in Databricks

Now that we are connected to our Databricks cluster, let’s look at the following KNIME example workflow to read data from Databricks, do some basic manipulation via KNIME, and write the result back into Databricks. You can access and download the workflow Connecting to Databricks from the KNIME Hub.

KNIME on Databricks
Figure 11. The KNIME example workflow

We are going to read an example dataset flights provided by Databricks. The dataset contains flight trips in the United States during the first three months in 2014. 

Because the dataset is in CSV format, let’s add the CSV to Spark node, just after the Create Databricks Environment node by connecting it to the DBFS (blue) port and Spark (gray) port. In the configuration window, simply enter the path to the dataset folder, for the flights dataset the path is /databricks-datasets/flights/departuredelays.csv, and then execute the node. 

The dataset is now available in Spark and you can utilize any number of Spark nodes to perform further data processing visually. In this example, we do a simple grouping by origin airports and calculate the average delay using the Spark GroupBy node. 
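
For readers who know PySpark: conceptually, the two nodes above correspond to something along the lines of the following sketch, to be run in a Databricks notebook where the spark session already exists. The column names origin and delay come from the flights dataset; this is an illustration rather than the exact code the nodes generate.

# Rough PySpark equivalent of the CSV to Spark + Spark GroupBy nodes above.
flights = (
    spark.read
    .option("header", "true")
    .option("inferSchema", "true")
    .csv("/databricks-datasets/flights/departuredelays.csv")
)

# Average departure delay per origin airport.
avg_delay_per_origin = (
    flights.groupBy("origin")
           .avg("delay")
           .withColumnRenamed("avg(delay)", "avg_delay")
)
avg_delay_per_origin.show(10)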

To write the aggregated data back to Databricks, let’s say in Parquet format, add the Spark to Parquet node. The node has two input ports, connect the DBFS (blue) port to the DBFS port of the Create Databricks Environment node, and the second port to the Spark GroupBy node. To configure the Spark to Parquet node:

1. Under Target folder, provide the path on DBFS to the folder where you want the Parquet file(s) to be created.

2. Target name is the name of the folder that will be created, in which the Parquet file(s) will then be stored.

3. If you check the option Overwrite result partition count, you can control the number of output files. However, this option is strongly discouraged, as it might lead to performance issues.

4. Under the Partitions tab you can define whether to partition the data based on specific column(s). 

KNIME supports reading various file formats into Spark, such as Parquet or ORC, and vice versa. The nodes are available under Tools & Services > Apache Spark > IO in the node repository. 

It is possible to import Parquet files directly into a KNIME table. Since our large dataset has now been reduced a lot by aggregation, we can safely import it into a KNIME table without worrying about performance issues. To read our aggregated data from Parquet back into KNIME, let’s use the Parquet Reader node. The configuration window is simple: enter the DBFS path where the Parquet file resides. Under the Type Mapping tab, you can control the mapping from Parquet data types to KNIME types.

Now that our data is in a KNIME table, we can create some visualization. In this case, we do further simple processing with sorting and filtering to get the 10 airports with the highest delay. The result is visualized in a Bar Chart.

KNIME on Databricks
Figure 12. 10 airports with highest delay visualized in a Bar Chart 

 

Now we would like to upload the data back to Databricks in Parquet format, as well as write it to a new table in the Databricks database. The Parquet Writer node writes the input KNIME table into a Parquet file. To connect to DBFS, please connect the DBFS (blue) port to the DBFS port of the Create Databricks Environment node. In the configuration window, enter the location on DBFS where the Parquet file will be written. Under the Type Mapping tab, you can control the mapping from KNIME types to Parquet data types.

To create a new table, add the DB Table Creator node and connect the DB (red) port to the DB port of the Create Databricks Environment node. In the configuration window, enter the schema and the table name. Be careful when using special characters in the table name; e.g. the underscore (_) is not supported. Append the DB Loader node to the DB Table Creator node with the KNIME table you want to load, and connect the DB (red) port and the DBFS (blue) port to the DB port and DBFS port of the Create Databricks Environment node, respectively. Executing this node will load the content of the KNIME table into the newly created table in the database.

At the end there is an optional step to execute the Destroy Spark Context node to delete the Spark context, and if the option is enabled in the Create Databricks Environment node, the cluster will also be terminated to save resources. However, use this method with caution especially if you share the cluster with other people!

Note: Parquet Reader and Parquet Writer nodes are part of the KNIME Extension for Big Data File Formats, available on the KNIME Hub.

To summarize, there are several ways to read data from Databricks:

  • To read from a data source and convert them to Spark, you can choose any node under Tools & Services > Apache Spark > IO > Read in the node repository, depending on your choice of data source. KNIME supports a variety of data sources, such as Parquet, ORC, CSV, etc. 
  • To import Parquet or ORC dataset into a KNIME table, use the Parquet Reader or ORC Reader node, respectively.
  • To read from a Databricks database, you can use the DB Table Selector node, where you can select a table and perform some processing with the KNIME database nodes. Additionally, the Hive to Spark and Spark to Hive nodes support moving database data to and from Spark.

Note: Always connect the input DBFS (blue) port to the DBFS port of the Create Databricks Environment node.

As with reading, there are also several ways to write data back into Databricks:

  • To convert Spark DataFrame back into a certain data source format, you can select any node under Tools & Services > Apache Spark > IO > Write in the node repository. 
  • The Parquet Writer node allows you to convert a KNIME table into Parquet files and write them locally or on a remote file system.
  • To write into a Databricks database, one way to do it is with a DB Loader node to bulk load the data if you have a large amount of data. 

Databricks Delta

Databricks Delta Lake is a storage layer between the Databricks File System (DBFS) and the Apache Spark API. It provides additional features, such as ACID transactions on Spark, schema enforcement, time travel, and many others.

KNIME on Databricks
Figure 13. Databricks Delta on KNIME

 

To create a Delta table in KNIME using DB Table Creator node:

1. Connect the first port to the DB port (red) of the Create Databricks Environment node, and the second port to the KNIME table you want to write into the Databricks database.

2. In the configuration window, enter the table name and schema as usual, and configure the other settings according to your needs. The important addition that makes this table a Delta table is to insert a USING DELTA statement under the Additional Options tab (see Figure below; a SQL-level sketch of the resulting statement follows after these steps).

KNIME on Databricks
Figure 14. Additional Options tab inside the DB Table Creator node configuration window

3. Execute the node and you will have a newly created, empty Delta table. Fill the table with data using, for example, the DB Loader node.
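
If you are curious what the USING DELTA option corresponds to on the SQL side, the statement that ends up being executed is roughly of the following form. The schema, table, and column names below are made up for illustration; in KNIME you only add the USING DELTA fragment under Additional Options and the DB Table Creator node assembles the rest.

# Illustrative only: run in a Databricks notebook where `spark` is available.
# Roughly the kind of statement generated when "USING DELTA" is added.
spark.sql("""
    CREATE TABLE IF NOT EXISTS my_schema.airport_delays (
        origin     STRING,
        avg_delay  DOUBLE
    )
    USING DELTA
""")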

Time Travel on Databricks Delta

Databricks Delta offers a lot of additional features to improve data reliability, such as time travel. Time travel is a data versioning capability allowing you to query an older snapshot of a Delta table (rollback).

To access the version history in a Delta table on the Databricks web UI:

1. Navigate to the Data tab in the left pane. 

2. Select the database and the Delta table name.

3. The metadata and a preview of the table will be displayed. If the table is indeed a Delta table, it will have an additional History tab beside the Details tab (see Figure below). 

4. Under the History tab, you can see the versioning list of the table, along with the timestamps, operation types, and other information.

KNIME on Databricks
Figure 15. Delta table versioning history

In KNIME, accessing older versions of a Delta table is very simple:

1. Use a DB Table Selector node. Connect the input port with the DB port (red) of the Create Databricks Environment node.

2. In the configuration window, enter the schema and the Delta table name. Then enable the Custom query checkbox. A text area will appear where you can write your own SQL statement.

a) To access older versions using version number, enter the following SQL statement:

SELECT * FROM #table# VERSION AS OF <version>

Where <version> is the version of the table you want to access. Check Figure 13 to see an example of a version number.

b) To access older versions using timestamps, enter the following SQL statement, where <timestamp> is the timestamp. To see the supported timestamp formats, please check the Databricks documentation.

SELECT * FROM #table# TIMESTAMP AS OF <timestamp>

3. Execute the node. Then right click on the node, select DB Data, and Cache no. of rows to view the table.

KNIME on Databricks
Figure 16. Configuration window of the DB Table Selector node

Wrapping up

We hope you found this guide on how to connect and interact with Databricks from within KNIME Analytics Platform useful.

by Andisa Dewi (KNIME)

Summary of the resources mentioned in the article

More blog posts about KNIME and Cloud Connectivity

 

How to create an interactive dashboard in three steps with KNIME Analytics Platform

How to create an interactive dashboard in three steps with KNIME Analytics Platformemilio_sMon, 05/25/2020 - 10:00

Everybody loves charts, graphs...visualizations! They are neat, fast, and straightforward. Even with messy and disorganized data, a good visualization is the key to showing insights and features that are difficult to point out in a raw table. In this blog post I will show you how to build a simple, but useful and good-looking dashboard to present your data - in three simple steps!

Create an interactive dashboard in 3 steps
  • Step 1: Create some beautiful charts
  • Step 2: Wrap them up into a component
  • Step 3: Deploy the interactive view as a web page

The dataset

In this blog post we will dig into the Netflix Movies and TV shows dataset, freely available on Kaggle. It contains all the shows offered in the US by the streaming platform as of January 2020. Each entry carries the title of the show, whether it is a Movie or a TV Show, the director and cast, the country and year of production, the date when it was added to the catalog, the duration, the category, and a short description. Enough information to pull out some interesting visualizations!

Prestep: importing and preprocessing the data

You can download the dataset directly from the Kaggle page. Once it is on your machine, import the data into a new workflow by drag and drop. As often happens, some preprocessing is needed. Inspecting the raw data we can see that the date_added column has a verbose format that makes it difficult to work with. So, I converted it to the Date&Time format and grouped all the steps in the Preprocessing metanode. The workflow developed for this blog post is available on the KNIME Hub and can be downloaded here: Create an interactive dashboard in 3 steps: Netflix dataset. After importing it into your KNIME Analytics Platform, you can have a look at the content of the “Preprocessing” metanode in more detail. 
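
For reference, the date handling inside the “Preprocessing” metanode corresponds to something like the following pandas sketch. The exact steps in the metanode may differ, the file name is the one used by the Kaggle download (check your own copy), and the date_added / year_added column names come from the dataset:

import pandas as pd

# Assumed file name of the Kaggle download.
shows = pd.read_csv("netflix_titles.csv")

# date_added arrives as verbose text such as "January 1, 2020";
# convert it to a proper datetime and derive the year for later grouping.
shows["date_added"] = pd.to_datetime(shows["date_added"].str.strip(), errors="coerce")
shows["year_added"] = shows["date_added"].dt.year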

Step 1: Create a few beautiful charts

There are two kinds of people: the ones who watch Netflix and the ones who lie. But we can also split the population into two different categories: movie people and TV series people! 

Also, do we know how many movies and series are on Netflix? Which one is the most popular category? How long is the longest movie? With the right chart, this is soon said.

For example, a Sunburst Chart (Figure 1) can easily point out how the shows are distributed among the categories. Attach a Sunburst Chart node to the Preprocessing metanode, configure it to group first by type (movie or TV show), then by category, and execute it. Now, right-click on the Sunburst Chart node and select “Interactive View: Sunburst Chart”. The view shown in Fig. 1 will pop up: we can see that there are twice as many movies as TV shows, and that the most populated categories are International, Dramas and Comedies. Did you know that? 

How to create a great dashboard with KNIME in 3 steps
Fig. 1. Sunburst chart: the number of movies offered is double than TV series. Hover on a portion of the chart to show the percentage.

One more aspect we can explore is the evolution of the catalog over the years: I grouped the shows by the year_added column and displayed the result in a Line Plot (Fig. 2). Apparently, the number of productions added to the offer keeps increasing every year: in the first month of 2020 Netflix had already added more shows than in the whole of 2015!

How to create a great dashboard with KNIME in 3 steps
Fig. 2. Line plot showing the number of movies and TV series added to the Netflix catalog over the years.

Then I built a Bar Chart (Fig. 3) to visualize the number of seasons produced for the TV shows and a Histogram (Fig. 4) that groups movies by duration. Did you expect so many TV shows to be left with only one season? Did you know that there are movies longer than 4 hours?

How to create a great dashboard with KNIME in 3 steps
Fig. 3. Bar chart showing the number of seasons for TV shows.
How to create a great dashboard with KNIME in 3 steps
Fig. 4. A histogram that groups movies according to their duration.

There are plenty of opportunities for data visualization in KNIME Analytics Platform. You can find dedicated nodes in Node Repository > Views > Javascript and even build your own visualization using the Generic Javascript View node. 

If you are running out of imagination, there is an entire selection of workflows on the EXAMPLES Server full of useful visualizations that you can easily readapt to your needs. 

Charts are also customizable! See for example the Bar Chart in Figure 3, where I changed the default blue to the official Netflix red. 

How to create a great dashboard with KNIME in 3 steps
Fig. 5. This is a screenshot of the workflow that performs these visualizations. The preprocessing related to each visualization is grouped into the metanodes.

Download and try out the workflow yourself, called "Create an interactive dashboard in 3 steps: Netflix shows" from the KNIME Hub.

All produced charts and plots are interactive. You can change the visualized data, the plot properties, the selected points and more directly from the interactive view by clicking on the upper right Setting icon, circled in red in Figure 3. For more in-depth customizations, check the guide showing how to integrate CSS code to make your JavaScript visualizations shine.

Step 2: Wrap them up into a component

If one plot is nice...two plots are nicer! Let’s organize all our wonderful graphics in a complete dashboard. Select all four nodes used for the visualizations and right-click > “Create Component…”. This creates a new gray node: the component.

How to create a great dashboard with KNIME in 3 steps
Fig. 6. The final workflow. All of the visualization nodes have been grouped into a component.

This ensemble visualization can also be enriched and customized. CTRL + double click on the component to open it. Add a Text Output Widget node and type the description you want to add to your visualization. 

We can make the dashboard more interactive by adding, for example, a Table View node for selection. I set it to display only the shows selected in the Histogram and Bar Chart. 

This is a good way to inspect the content of the different bins. 

For example, do you know which is the longest movie on Netflix? Open the interactive view of the component, select the last histogram bin - which contains only one movie - and look at the table view. If you have heard about that movie you can easily imagine why it lasts so long!

It’s now time to organize our dashboard to make it neater and easier to understand. From inside the component, click the last icon of the toolbar (see screenshot in Fig. 7) to open the Node Usage and Layout window. Here you can arrange your charts, set their position and dimensions, and create groups. If you have created a nested component, it will be handled as a grouped visualization.

How to create a great dashboard with KNIME in 3 steps
Fig. 7. Open the Node Usage and Layout window to organize your dashboard.

Step 3: Deploy the interactive view as a web page

You can also inspect the component's interactive view as a web page in a web browser. To perform this operation, you need to deploy your workflow to a KNIME Server instance, using the one-click-deployment. Do this by going to the KNIME Explorer panel, right-clicking your workflow and selecting “Deploy to Server…”. Now choose the desired destination and click OK. 

To visualize the dashboard, right-click the uploaded workflow and select Open -> In Web Portal. Your browser will let you execute the workflow and visualize the dashboard built by the component (Figure 8).

Summary

In this blog post we discovered how simple it is to create an interactive dashboard for your data in KNIME Analytics Platform. Set up your charts, wrap up the nodes into a component and customize it if needed, execute locally or on the KNIME WebPortal and play with your visualization. As easy as a pie (chart)!

How to create an interactive dashboard in 3 steps with KNIME
Fig. 8. This is the dashboard visualization as it would appear on the KNIME WebPortal.

Want more visualizations? Here are some more advanced ideas you can easily implement. You'll find these visualizations in this more advanced version of the example workflow on the KNIME Hub here: https://kni.me/w/grHmwo1F0xiQPdO7  

How to create a great dashboard with KNIME in 3 steps
Fig. 9. World map showing movies produced by each country.

 

How to create a great dashboard with KNIME in 3 steps
Fig. 10. Tag cloud showing the words used most in descriptions of shows.

Author: Emilio Silvestri (KNIME)

Resources

The workflows shown in this article are both available for you to download and try out yourself on the KNIME Hub:

Predicting employee attrition with machine learning

Predicting employee attrition with machine learningClearPeaksTue, 06/02/2020 - 10:00

 

Predicting customer attrition with machine learning

Data is the new oil. More and more data is being captured and stored across industries and this is changing society and, therefore, how businesses work. Traditionally, BI tried to give an answer to the general question: what has happened in my business? Today, companies are involved in a digital transformation that enables the next generation of BI: Advanced Analytics (AA). With the right technologies and a data science team, businesses are trying to give an answer to a new game-changing question: what will happen in my business?

We are already hearing how AA is helping many companies increase profits. However, some businesses are late in the adoption of AA, while others are trying to adopt AA but are just failing for various reasons. ClearPeaks is already helping many businesses to adopt AA, and in this blog article we will review, as an illustrative example, an AA use case involving Machine Learning (ML) techniques to help HR departments to retain talent.

Employee attrition refers to the percentage of workers who leave an organization and are replaced by new employees. A high rate of attrition in an organization leads to increased recruitment, hiring and training costs. Not only is it costly, but qualified and competent replacements are hard to find. In most industries, the top 20% of people produce about 50% of the output. (Augustine, 1979).

Join Marc Guirao's free webinar on June 17, 2020 at 10 AM (CEST) - Using Machine Learning to Predict Customer Attrition. Register here.

The use case: employee attrition

This use case takes HR data and uses machine learning models to predict which employees are more likely to leave, given some attributes. Such a model would help an organization predict employee attrition and define a strategy to reduce this costly problem.

The input dataset is an Excel file with information about 1470 employees. For each employee, in addition to whether the employee left or not (attrition), there are attributes / features such as age, employee role, daily rate, job satisfaction, years at the company, years in current role, etc.

The steps we will go through are:

  1. Data preprocessing
  2. Data analysis
  3. Model training
  4. Model validation
  5. Model predictions
  6. Visualization of results

The "training" workflow

1. Data preprocessing

First, let’s look at the data. The dataset was released by IBM as open data some time ago, and you can download it from Kaggle. We import the Excel file with an Excel Reader node in KNIME and then drag & drop the Statistics node.

Predicting employee attrition with machine learning
Fig. 1. Input data & check statistics

Good news: there are no missing values in the dataset. With the Statistics view we can also see that the variables EmployeeCount, Over18, and StandardHours have a single value in the whole dataset; we will remove them as they have no predictive value.

Predicting employee attrition with machine learning
Fig. 2. Checking missing values and single-valued variables in KNIME

Let’s add a Column Filter node to exclude the mentioned useless variables. We will also exclude the EmployeeNumber variable as it’s just an ID. Next, we can generate some features to give more predictive power to our model:

  • We categorized Monthly Income: values from 0 to 6503 were labeled “low”, and values over 6503 were labeled “high”.
  • We categorized Age: 0 to 24 corresponds to “Young”, 24 to 54 corresponds to “Middle-Age” and over 54 corresponds to “Senior”.
  • We aggregated the fields EnvironmentSatisfaction, JobInvolvement, JobSatisfaction, RelationshipSatisfaction and WorkLifeBalance into a single feature (TotalSatisfaction) to have an overall satisfaction.
Predicting employee attrition with machine learning
Fig. 3. Generating features with KNIME

2. Data analysis

At this point we will analyze the correlation between independent variables and the target variable, attrition. We have created the following visualizations using Tableau, but you can find some KNIME visualizations in the BIRT report of the workflow that accompanies this blog post.

Predicting employee attrition with machine learning
Fig. 4. Attrition vs Business Travel

We see that employees who travel frequently tend to leave the company more often. This will be an important variable for our model.

Predicting employee attrition with machine learning
Fig. 5. Attrition vs Marital status
Predicting employee attrition with machine learning
Fig. 6. Attrition vs Overtime

In a similar way, single people working overtime hours tend to leave at a higher rate than those who work regular hours and are married or divorced.

Predicting employee attrition with machine learning
Fig. 7. Attrition vs Monthly income

Finally (and as we could expect), low salaries make the employees more likely to leave.

3. Model training

Predicting employee attrition with machine learning
Fig. 8. Unbalanced Data

As the graph shows, the dataset is unbalanced. When training models on such datasets, class unbalance biases the decision rule of the learning algorithm towards the majority class, so that the predictions are optimized for the majority class in the dataset. There are three ways to deal with this issue:

  1. Upsampling the minority class or downsampling the majority class.
  2. Assign a larger penalty to wrong predictions from the minority class.
  3. Generate synthetic training examples.

In this example we will use the first approach.

To begin, let’s split the dataset into training and test sets using an 80/20 split; 80% of the data will be used to train the model and the remaining 20% to test its accuracy. Then we can upsample the minority class, in this case the positive class. We added the Partitioning and SMOTE nodes in KNIME.

Predicting employee attrition with machine learning
Fig. 9. Partitioning and SMOTE in KNIME

After partitioning and balancing, our data is finally ready to be the input of the machine learning models. We will train 4 different models: Naïve Bayes, Random Forest, Logistic Regression and Gradient Boosting. In this step, you should start modifying model parameters, performing feature engineering and trying different data balancing strategies to improve the performance of the models. Try more trees in the Random Forest model, include new variables, or penalize wrong predictions from the minority class until you beat the performance of your current best model.
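
As a rough, non-KNIME reference for the same sequence of steps (80/20 split, SMOTE on the training portion only, then model training and evaluation), a Python sketch could look like the one below. It uses scikit-learn and imbalanced-learn; the file and column names are assumptions based on the public IBM dataset (the article reads it as an Excel file, here a CSV copy is assumed):

import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report
from imblearn.over_sampling import SMOTE

# Assumed file name - use the path of your own copy of the IBM HR dataset.
df = pd.read_csv("WA_Fn-UseC_-HR-Employee-Attrition.csv")
y = (df["Attrition"] == "Yes").astype(int)
X = pd.get_dummies(df.drop(columns=["Attrition"]), drop_first=True)

# 80/20 split, then SMOTE on the training portion only, so the
# test set keeps the original (unbalanced) class distribution.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)
X_train_bal, y_train_bal = SMOTE(random_state=42).fit_resample(X_train, y_train)

model = RandomForestClassifier(n_estimators=300, random_state=42)
model.fit(X_train_bal, y_train_bal)
print(classification_report(y_test, model.predict(X_test)))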

Predicting employee attrition with machine learning
Fig. 10 Train and test the model

You can download the training KNIME workflow from the KNIME Hub by following this link

4. Model validation

Finally, after testing our models with the test set, we concluded that the best model was the Random Forest (RF). We can save the trained model using the Model Writer node. We based our decision on the statistics shown in the following table:

Predicting employee attrition with machine learning
Fig. 11. Comparative table

RF has the highest accuracy, meaning it gets 89.1% of its predictions right. Moreover, and more importantly, it has the highest F1-score, which balances precision and recall and is the measure to use if the sample is unbalanced. The ROC curve is also a good measure for choosing the best model. AUC stands for area under the curve, and the larger it is, the better the model. Applying the ROC Curve node, we can visualize each ROC curve.
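
For reference, the F1-score mentioned above is the harmonic mean of precision and recall:

F_1 = 2 \cdot \frac{\mathrm{precision} \cdot \mathrm{recall}}{\mathrm{precision} + \mathrm{recall}}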

Predicting employee attrition with machine learning
Fig. 12. Roc curves: Random Forest & Logistic Regression

These measures come from the confusion matrix, showing which predictions were correct (matrix diagonal) and which were not. We can check the confusion matrix out of the RF model.

Predicting employee attrition with machine learning
Fig. 13. Confusion matrix

The Random Forest works on the bagging principle; it is an ensemble of decision trees. The bagging method improves the overall results by combining weak models. How does it combine the results? In the case of a classification problem, it takes the mode of the classes predicted in the bagging process.

The "deployment" workflow

5. Model predictions

Once we have chosen the best model, we apply the saved model to the current employees. We generate a new workflow that outputs the predictions we will visualize in Tableau.

Predicting employee attrition with machine learning
Fig. 14. Prediction of attrition of current employees

You can download the deployment KNIME workflow from the KNIME Hub by following this link

6. Visualization of results

Now we have our dataset with our current employees and their probability of leaving the company. If we were the HR manager of the company, we would require a dashboard in which we could see what to expect regarding future attrition and, hence, adopt the correct strategy to retain the most talented employees.

We will connect Tableau to this dataset and build a dashboard (you can see what it would look like below). It contains analyses of the percentage of predicted attrition, and breakdowns by gender, business travel, department, salary hike, and distance from home. You can also drill down to see the employees aggregated in each of these analyses. As a quick conclusion, male employees who travel frequently, work in the HR department, have a low salary hike, and live far from the workplace have a high probability of leaving the company.

Predicting employee attrition with machine learning
Fig. 15. Tableau dashboard

Conclusions

In this blog article we have detailed the various steps when implementing an advanced analytics use case in HR: employee attrition. We used the open-source tool KNIME to prepare the data, train different models, compare them, and choose the best. With the model predictions, we created a dashboard in Tableau that would help any HR manager to retain the best talent by applying the correct strategies. This step-by-step blog article is just an example of what advanced analytics can do for your business, and of how easy it is to do with the proper tool.

At ClearPeaks we have a team of data scientists who have implemented many use cases for different industries using KNIME as well as other AA tools. If you are wondering how to start leveraging AA to improve your business, contact us and we will help you on your AA journey! Stay tuned for future posts!

By Marc Guirao, Senior Consultant BI, Big Data & Data Science (ClearPeaks)

 

About the author

Predicting employee attrition with machine learning

Marc Guirao is a BI and data science expert. Currently, he works in the Advanced Business Analytics team at ClearPeaks. He holds an MSc in mathematics and has expertise in banking, logistics, and retail industries.

Predicting employee attrition with machine learning

Guided Labeling Blog Series - Episode 4: From Exploration to Exploitation

Guided Labeling Blog Series - Episode 4: From Exploration to ExploitationpaolotamagMon, 06/08/2020 - 10:00

One of the key challenges in using supervised machine learning for real world use cases is that most algorithms and models require a sample of data that is large enough to represent the actual reality your model needs to learn.

These data need to be labeled. These labels will be used as the target variable when your predictive model is trained. In this series we've been looking at different labeling techniques that improve the labeling process and save time and money.

Guided Labeling Model Uncertainty

What happened so far:

  • Episode 1 introduced us to active learning sampling, bringing the human back into the process to help guide the algorithm.
  • Episode 2 discussed the label density approach, which follows the strategy that when labeling a dataset you want to label feature space that has a dense cluster of data points.
  • Episode 3 moved on to the topic of model uncertainty as a rapid way of moving our decision boundary to the correct position using as few labels as possible and taking up as little time of our expensive human-in-the-loop expert.

Today, we explore and exploit the feature space

Using uncertainty sampling we can exploit certain key areas of the feature space, which will ensure an improvement of the decision boundary. Using label density, we explore the feature space to find something new that the model has never seen before and that might entirely change the decision boundary. The idea now is to use both approaches - label density and uncertainty sampling - at the same time to enhance our active learning sampling. To do this we combine the density score and the uncertainty score in a single metric we can call potential (Form. 4).

Guided Labeling Series - 4 - Exploration to Exploitation

Formula 1: The potential score P(x_i) is defined as the sum of the density score D(x_i) and the uncertainty score R(x_i), where ε is a fixed parameter between 0 and 1 which defines the contribution of the two separate scores.

We can now rank the data points that can still be labeled using this potential score. The epsilon parameter defines which of the two strategies contributes most. In the beginning, with only a few labels, model performance will be quite bad. That is why it is not wise to rely on the uncertainty score (R(x), or exploitation). On the contrary, the feature space is still unexplored, so it makes sense to leverage the density score (D(x), or exploration). After labeling for a few iterations, depending on the feature distribution, the exploration strategy will usually become less and less useful, as we will have explored most of the dense areas. At the same time, the model should increase in performance as you keep providing new labels, which means that the exploitation strategy will become more and more useful for finally placing the decision boundary correctly. To make sure the right technique is leveraged at the right time, you can implement a smooth transition so that in the first iterations epsilon is less than 0.5, and once you see an improvement in the performance of the model, epsilon becomes greater than 0.5.
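
Written out explicitly, and consistent with the behaviour of epsilon described above - a small ε favoring exploration (density) and a large ε favoring exploitation (uncertainty) - the potential score can be expressed as follows (the exact weighting used in the original formula image may differ slightly):

P(x_i) = \varepsilon \cdot R(x_i) + (1 - \varepsilon) \cdot D(x_i), \qquad 0 \le \varepsilon \le 1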

That is it then, in active learning we can use the potential score to go from exploration to exploitation: from discovering cases in the feature space that are new and interesting to the expert, to focusing on critical ones where the model needs the human input. The question is, though: When do we stop?

The active learning application might implement a threshold for the potential. If no data point is above a certain threshold, this means we have explored the entire feature space and the model has enough certainty in every prediction (quite unlikely), and therefore we can end the human-in-the-loop cycle. Another way to stop the application is by measuring the performance of the model, provided you have enough new labels to build a test set. If the model has achieved a certain accuracy, we can automatically end the cycle and export the model. Despite these stopping criteria, in most active learning applications the expert is expected to label as much as possible and can freely quit the application and save the output, i.e. the model and the labels gathered so far.
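
A minimal Python sketch of this selection-plus-stopping logic, assuming the density and uncertainty scores for the unlabeled pool have already been computed and normalized to the range 0 to 1 (an illustration of the idea, not the KNIME Active Learning Loop itself):

import numpy as np

def select_next_to_label(density, uncertainty, epsilon, threshold=0.1):
    # density, uncertainty: scores in [0, 1] for every row still in the unlabeled pool.
    # epsilon: exploration/exploitation trade-off, small in early iterations, larger later.
    # threshold: if no potential exceeds it, the human-in-the-loop cycle can stop.
    potential = epsilon * uncertainty + (1.0 - epsilon) * density
    best = int(np.argmax(potential))
    if potential[best] < threshold:
        return None  # stopping criterion: nothing left that is worth labeling
    return best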

Comparing Active Learning Sampling with Random Sampling

Let’s look at our overweight/underweight prediction example again and do an experiment. We want to compare two SVM models: the first is trained using 10 random data points and the second is trained using 10 data points selected by using the exploration vs exploitation active learning approach. The labels are given by applying the heuristic of the body mass index. If applied to all the rows, such a heuristic would create a curve close to a line which the SVM will try to reproduce based on only 10 data points. 

For the random labels approach we simply select 10 random data points, label them “overweight” or “underweight” using the heuristic and then use them to train the SVM. For the active learning sampling approach we will select and label three data points at random and start the active learning loop. We train an SVM and compute the potential score on all the remaining ones. Selecting and labeling the top-ranked row by potential, we retrain the SVM with this additional, newly labeled data point. We repeat this a further six times until we have an SVM trained on 10 rows that were selected using active learning sampling. Which model will now perform better? The SVM with randomly picked labels or the model where the labels were selected using active learning sampling?

Guided Labeling Series - 4 - Exploration to Exploitation

Figure 1. A chart comparing the performance of the two experiments. The color blue indicates the model trained with 10 data points, selected by active learning sampling. This model is performing better than the green model, which was trained with 10 random data points.

As expected, the active learning strategy is more reliable (Fig. 1). The random sampling instead depends entirely on how the 10 data points are randomly distributed in the feature space. In this case they were picked with such bad positioning that the trained SVM is quite distant from the actual decision boundary found by the body mass index. In comparison, active learning sampling produces a decision boundary that almost overlaps with the body mass index heuristic, using only 10 data points (Fig. 2).

Guided Labeling Series - 4 - Exploration to Exploitation

Figure 2. Comparing the two sampling techniques and the resulting models. On the left, the model trained with ten randomly selected data points shows a solid line which fits the samples, but is quite different from the body mass index heuristic (dotted line) we are trying to learn. On the right, active learning sampling defines a tight sample which results in a decision boundary quite close to the desired dotted line.

To compute those results we used KNIME Analytics Platform. You can find a KNIME workflow (Fig. 3) that runs this experiment on the KNIME Hub. It uses the KNIME Active Learning Extension. This workflow generates an interactive view (Fig. 4) which you can use to compare misclassifications and the performance of the two models.

The Active Learning Extension provides a number of nodes that allow us to build workflows realizing the active learning technique in various ways. Among these are nodes to estimate the uncertainty of a model, as well as density-based models that ensure your model explores the entire feature space, making it easy to balance exploration and exploitation of your data. You'll also find a dedicated active learning loop, as well as a labeling view, which can be used to create active learning WebPortal applications. These applications hide the complexity of the underlying workflow and simply allow the user to concentrate on the task of labeling new examples.

Guided Labeling Series - 4 - Exploration to Exploitation

Figure 3. The workflow that runs the experiment. You can download it from the KNIME Hub here. In the top branch, the Partitioning node selects ten random data points to train the first SVM. In the bottom branch, the Active Learning Loop selects ten data points based on the exploration vs. exploitation approach. The predictions of the two models are imported into a component, which generates an interactive view (Fig. 4).

Guided Labeling Series - 4 - Exploration to Exploitation

Figure 4. To visually compare the two techniques, the workflow includes a component that generates an interactive view. The view can be used to explore the misclassifications across the two models. You can clearly see that the SVM model which learns from randomly picked labels (bottom left) classifies quite differently from the actual, ground-truth-based decision boundary (center). Using interactivity, it is possible to select data points via the two different confusion matrices and see them selected in the 2D density plots.

Using KNIME Analytics Platform you can design a web-based application for the expert to label data using active learning on the KNIME WebPortal in combination with Guided Analytics.

Have a look at our blueprint for classifying documents on the KNIME Hub.

Summing up

You now have all the ingredients to start developing your own active learning strategy from exploration to exploitation of unlabeled datasets!

In the next and fifth episode of our series of Guided Labeling Blog Posts we will introduce you to weak supervision, an alternative to active learning for training a supervised model with an unlabeled dataset. Stay tuned to our blog post channel for a full list of blog posts!

Resources:

The Guided Labeling Blog Series

By Paolo Tamagnini (KNIME)

 


Gut Microbiome Analysis with KNIME Analytics Platform

By temesgen-dadi, Mon, 06/15/2020 - 10:00

I like your gut feeling better. Can I have your gut microbes?

Microbiomes live inside us and on us and are real multi-taskers. They break down nutrients that our body couldn’t break down by itself. They train our immune system. And they are first in line in our defense against pathogens. Our health depends on them.

Microbiome Analysis with KNIME Analytics Platform

The analysis of the quality and quantity of microbiomes is therefore an important undertaking. This article takes a look at one of the steps involved in this analysis, called taxonomic profiling. Taxonomic profiling determines which groups, or communities, of microbes are living at a particular body site - for example the gut - and then estimates how many of them there are.

The human gut is home to a lot of bacteria (approx. 10^9!) plus other microorganisms we collectively call microbes. These microbes play a wide range of roles in keeping us healthy. And in order for them to keep us healthy and for us to - well - provide for them, there needs to be a particular composition of different types (species) of microbes so that we get key nutrients in the right amounts. Diseases like Irritable Bowel Syndrome (IBS) and Inflammatory Bowel Disease (IBD) [2, 4, 5] are the result when this mix is sub-optimal.

Researchers have been looking at different treatment strategies (probiotics, prebiotics, symbiotics and antibiotics [5, 8]) to alter microbiome composition and achieve a state that promotes gut health. While these strategies still need to mature before becoming standard treatment, fecal transplant from a healthy donor to a patient with IBD is gaining momentum as an alternative and blanket solution. The ultimate goal: to replace the gut microbial community of the patient with that of the healthy donor.

In order to evaluate whether or not the transplant works, the composition of the gut microbial community needs to be monitored before and after the transplant, for example by taking environmental samples, extracting the genetic material, and performing DNA sequencing. The resulting data enables us to infer which types of organisms are in the sample and the prevalence of each type. This is done by using the subtle differences in nucleotide sequences among genomes/genes of different bacterial species. In this article, I focus on using a particular gene, namely the 16S ribosomal RNA gene, for this purpose.

This blog presents a KNIME workflow that has been created to analyze 16S-rRNA data obtained from the gut of 10 IBD patients at different time points while they undergo fecal transplants. The overall goal is to understand the shift in gut microbiome composition with the help of multiple visualizations. Using an interactive sunburst chart (below), I can visualize which groups of bacteria are common and how prevalent they are in the human gut. Being interactive, users can click on portions of the chart to display the actual sequences that are tied to the selected group and see their count in a table view.

Microbiome Analysis with KNIME - summary

The workflow ultimately produces a JavaScript visualization, which shows the shift in microbial composition of individual patients via a dashboard of stacked bar plots. This visualization makes it very easy to compare the microbial composition of the donor’s gut with the gut microbiome of the receiver at different time points.

The full article showcases how to get the data from the European Nucleotide Archive via REST and FTP services, how to preprocess the data, use an external R package within KNIME, and visualize multiple microbiome compositions so as to compare them. The integration of complex and domain-specific external R packages in KNIME enables the creation of workflows that are not just transparent and easy to understand, but also sufficiently powerful to get the job done.

Additional use cases for this workflow

The workflow can be used for analysing microbial communities from any other source. The use cases range from soil microbiome analysis, to monitor soil fertility, to environmental bioremediation, where microorganisms are used to clean up environmental messes such as oil spills.

Author: Temesgen H. Dadi (KNIME)

Temesgen is part of the Life Science Team at KNIME. You can read more about Life Science at KNIME, and find links to more life science topics, the community, and further blog articles, on the Why KNIME for Life Science webpage.

Read the full blog article here:

Abstract

Microbiomes living inside and on us produce essential enzymes, break down nutrients that our body couldn't break down by itself, train our immune system, and are our first line of defense against pathogens. Our health depends on them. This makes qualitative and quantitative analysis of microbiomes an important undertaking. The first step of such an analysis is to know which groups of microbes are living in a particular body site, such as our gut, and to estimate their respective relative abundances. This step is known as taxonomic profiling. In this blog, I present a step-by-step guide on how to perform taxonomic profiling on microbial communities using the 16S ribosomal RNA gene as a fingerprint. The data I picked comes from a study on the dynamics of the gut microbiome during the process of fecal transplant in 10 inflammatory bowel disease (IBD) patients. 16S-rRNA sequences were collected from fecal samples of IBD patients and their donors at different time points of the fecal transplant. I will be using KNIME Analytics Platform and its R Integration for the whole process. The R Integration of KNIME allows me to use a domain-specific program called DADA2, which is available only as an R package. The blog showcases how to get data directly from the European Nucleotide Archive (https://www.ebi.ac.uk/ena) via REST and FTP services, pre-process the data, use an external R package within KNIME Analytics Platform, and visualize multiple microbiome compositions with the purpose of comparing them.

Human Gut Microbiome

The human gut, irrespective of its health state, is home to approximately 10^9 bacteria and other microorganisms which we collectively refer to as microbes. These microbes are of various sorts and play a wide range of roles in keeping us healthy. But for this (them keeping us healthy and we, well, providing for them) to work, there needs to be a particular composition of different types (species) of microbes. Only then can we get the essential nutrients in just the right amounts, for example. Not having a "good configuration" of the gut microbiome, in other words gut dysbiosis, can cause diseases like Irritable Bowel Syndrome (IBS) and Inflammatory Bowel Disease (IBD) [2, 4, 5].

Researchers have been looking at different treatment strategies to alter the microbiome composition towards a state that promotes gut health. The strategies include probiotics, prebiotics, symbiotics and antibiotics [5, 8]. While these strategies still need to mature before becoming standard treatment, fecal transplant from a healthy donor to a patient with IBD is gaining momentum as an alternative and blanket solution. The ultimate goal, here, is to replace the gut microbial community of the patient with that of the healthy donor.

To evaluate if the transplant worked or not, one needs to monitor the composition of the gut microbial community before and after transplant. One way of doing that is to take environmental samples, extract all the genetic material from the samples, perform DNA sequencing and use the resulting data to infer

1) which types of organisms are in the sample and 

2) what is the prevalence of each type. 

This is done by using the subtle differences in nucleotide sequences among genomes/genes of different bacterial species. In this blog post, I will focus on using a particular gene, namely the 16S ribosomal RNA gene, for this purpose.

The 16S ribosomal RNA (16S rRNA) gene

The 16S rRNA gene has the advantage of being highly conserved across almost all prokaryotic species, which facilitates designing primers that can bind to a specific region within the gene [3]. The primers are used to selectively perform PCR (polymerase chain reaction), which produces multiple copies of (parts of) the 16S gene. The resulting sequences are called amplicon sequences. The 16S gene also contains hypervariable regions that can be used as fingerprints to identify the types of bacteria. These variable regions are numbered V1-V9 and have a well-defined locus within the stretch of the gene. In order to identify microbial species/groups, one can use either the entire length of the 16S gene spanning all the variable regions, or a part of the 16S gene covering two or three hypervariable regions.

Microbiome Analysis with KNIME Analytics Platform

Figure 1. Regions of the 16S rRNA gene. The grey regions indicate hypervariable regions that can be used as fingerprints to identify the types of bacteria.

A KNIME Workflow for Gut Microbiome Analysis of IBD Patients from 16S sequencing data

In this blog post, I present a KNIME workflow created to analyze 16S-rRNA data obtained from the gut of 10 IBD patients at different time points while they undergo fecal transplants. The overall goal is to understand the shift in gut microbiome composition with the help of multiple visualizations.

The workflow uses the DADA2 R package by Callahan et al. [1] to determine the microbial composition from the 16S sequences. This is done via the R Integration of KNIME, which allows usage of such domain-specific applications available as R packages with minimal effort. Such packages are focused on solving a particular scientific problem and are the result of months of research, and being able to use them directly in a KNIME workflow is just great.

A system-wide installation of the DADA2 R package is needed for the workflow to work. Installation instructions can be found here. The main result of DADA2 is a table containing a list of unique amplicon sequences called Amplicon Sequence Variants (ASVs) and their counts. In the workflow, this result goes through further analysis steps in KNIME Analytics Platform to produce a dashboard of visualizations of taxonomic profiles across patients and time points.

Microbiome Analysis with KNIME Analytics Platform

Figure 2. A KNIME workflow for gut microbiome analysis of IBD patients from 16S sequencing data. It can be accessed and downloaded from the KNIME Hub.

In short the workflow does the following:

  1. Downloads 16S amplicon sequencing files (FASTQ format) from the European Nucleotide Archive (ENA) 
  2. Uses the R package DADA2 to quality-check the sequences, create an amplicon sequence variant table, and assign the variants to groups of bacteria 
  3. Creates a taxonomic profile at a desired taxonomic rank
  4. Visualizes the results to demonstrate the change in the composition of the gut microbiota of each patient

Let us dive into each step of the workflow and explain the main ideas behind each part/component of the workflow. You can download this workflow from the KNIME Hub here.

1. Download FASTQ sequences from ENA

The first thing to do is to get the DNA sequencing data from the European Nucleotide Archive (ENA), where it is publicly available under project identifier PRJDB4959. More metadata about the project can be accessed here. I used the "Download FASTQ files from ENA" component available from the KNIME Hub to easily retrieve our example dataset from the source. The dataset contains a total of 40 FASTQ files representing 10 donors and 10 patients at 3 different time points after going through fecal microbiota transplantation. In each FASTQ file there are thousands of short DNA sequences (sequencing reads) obtained via amplicon sequencing. The component outputs a table that contains the paths to the sequence files of each sample, which are downloaded and stored locally (see the table below).

Microbiome Analysis with KNIME Analytics Platform

Figure 3. The output of “Download FASTQ sequences” component showing a partial list of sequences downloaded from EBI. 

2. Create an Amplicon Sequence Variant (ASV) table

Since I have the sequencing reads of each sample as individual FASTQ files, I can now go ahead and start the analysis with DADA2. A typical DADA2 pipeline starts by inspecting the quality profiles of the input sequencing reads. The results are then used as a guide: sequences are truncated at a length where the quality drops for the majority of the sequences, and error correction is performed on the original sequences to account for sequencing errors. The details on how exactly this is done can be found in the DADA2 paper (Callahan et al.), and I leave that to the interested reader. In the workflow, these and the subsequent steps are implemented as a series of R Scripting nodes, each matching a stage of the DADA2 pipeline. Each node mostly performs a single task by calling a DADA2 routine/function. Here is an example code snippet inside the R to R node that filters sequences.

Microbiome Analysis with KNIME Analytics Platform

Figure 4. An example R code snippet that filters and trims sequences 
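
Since the snippet in Figure 4 is shown as an image, here is a minimal sketch of what such a filtering step typically looks like when calling DADA2's filterAndTrim() inside an R node. The file paths and parameter values below are illustrative assumptions, not the exact settings used in the workflow.

```r
library(dada2)

# Illustrative paths; in the workflow these arrive as a table of FASTQ file locations.
fastq_files    <- sort(list.files("fastq", pattern = "\\.fastq\\.gz$", full.names = TRUE))
filtered_files <- file.path("fastq_filtered", basename(fastq_files))

# Filter and trim: truncate reads where the quality typically drops and discard
# reads with too many expected errors or ambiguous bases.
filter_stats <- filterAndTrim(fastq_files, filtered_files,
                              truncLen    = 240,   # truncation length (assumed)
                              maxN        = 0,     # no ambiguous bases allowed
                              maxEE       = 2,     # max expected errors per read (assumed)
                              truncQ      = 2,
                              compress    = TRUE,
                              multithread = TRUE)
head(filter_stats)   # reads in vs. reads passing the filter, per file
```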

It is of course possible to use just a single R to R node with the combined R source code instead of a series of nodes. But I think the latter representation makes both understanding and maintaining the pipeline easier.

Microbiome Analysis with KNIME Analytics Platform

Figure 5. DADA2 pipeline represented by a series of KNIME R Scripting nodes. 

The pipeline can be summarized into 5 steps.

  1. Sequences below a certain threshold length and quality are filtered out. 
  2. By looking at the error profile, noisy sequences are filtered out. Here, probabilistic error correction is done to account for nucleotide differences that are artifacts of the sequencing process.
  3. A table of ASVs and their frequency in each sample is generated.
  4. Chimeric sequences are removed. Chimeric sequences are sequences that do not exist naturally but are created by a faulty PCR process in which sequences from two different origins are artificially concatenated. 
  5. After getting the ASV count table, the next step is to decode each ASV into a (group of) bacteria/taxa known to be associated with it. I will use a database curated and provided by the authors of DADA2. The database contains a list of known amplicon sequences and the taxa they belong to. Ambiguous sequences are assigned to a more generic taxon. For example, 16S sequences that are equally similar to those of species_1 and species_2 will be assigned to a group that covers both species. In this process a given ASV can be assigned to a single bacterial species or to a higher level of taxonomy such as genus or family, depending on the specificity of the ASV. A minimal sketch of the core DADA2 calls behind these steps follows this list.
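
As referenced above, here is a minimal sketch of the core DADA2 calls behind steps 2-5. The file locations, the reference database file name, and the parameters are assumptions for illustration; the actual workflow wraps equivalent calls in its R Scripting nodes.

```r
library(dada2)

# Filtered FASTQ files from the previous filtering step (illustrative path).
filtered_files <- list.files("fastq_filtered", full.names = TRUE)

# Step 2: learn the error model from the data and denoise the reads.
err      <- learnErrors(filtered_files, multithread = TRUE)
dada_out <- dada(filtered_files, err = err, multithread = TRUE)

# Step 3: table of amplicon sequence variants (ASVs) and their counts per sample.
seqtab <- makeSequenceTable(dada_out)

# Step 4: remove chimeric sequences created by a faulty PCR process.
seqtab_nochim <- removeBimeraDenovo(seqtab, method = "consensus", multithread = TRUE)

# Step 5: assign each ASV to a taxon using a reference database curated by the
# DADA2 authors (the file name below is an assumption, e.g. a SILVA training set).
taxa <- assignTaxonomy(seqtab_nochim, "silva_nr_v132_train_set.fa.gz", multithread = TRUE)
```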

At the end I will have two tables:

a) An ASV table where different variants of 16S sequence fragments are represented as rows and samples are represented by columns. The values in the table represent how often a sequence (row) is observed in a sample (column).

Microbiome Analysis with KNIME Analytics Platform

Figure 6. ASV count table. Values in the table show the frequencies of each amplicon sequence variant (rows) in different samples (columns).

b) The assignment of individual ASVs to a taxonomic entry. 

Microbiome Analysis with KNIME Analytics Platform

Figure 7. Assignment of individual ASVs to a taxonomic entry.

Quality control 

Before proceeding to joining tables, aggregating, and visualizing the results, one needs to check how many of the original sequencing reads made it through each stage of the analysis per sample. There is a dedicated functionality of the DADA2 package for this purpose. I exposed the result through the Table View node, which shows the number of sequencing reads that were available originally and how many of them passed the different filtration steps. One should look out for an unreasonable reduction of the read count, as that could mean a bad sample or a wrong combination of parameters in the pipeline.

Microbiome Analysis with KNIME Analytics Platform

Figure 8. Sequencing read statistics. The quality looks good, as there is no unreasonable reduction of read count.

The table containing the analysis statistics looks just fine. Starting from 3000 sequencing reads per sample, I ended up with 2376, 1989, and 2123 reads. For the second row, the chimera removal step took away about 300 reads, which is quite a bit more than for the other two samples displayed here. As noted above, chimeric sequences are sequences that do not exist naturally but are created by a faulty PCR process in which sequences from two different origins are artificially concatenated.

3. Create taxonomic profile at a desired taxonomic rank

Depending on the level of granularity required, or in order to find better-fitting patterns among samples or sample groups, it is important to be able to produce taxonomic profiles at different levels of grouping (taxonomic ranks). In the workflow, it is possible to select among 7 different ranks: Kingdom, Phylum, Class, Order, Family, Genus and Species, from generic to specific. Selecting the rank is usually done by looking at the taxonomic assignment and choosing the most specific rank whose corresponding column does not have too many missing values. In our example, this is either genus or family. The counts of sequences are then grouped by the chosen rank, and relative abundance is calculated to show the percentage of each group of that taxonomic rank.
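
As a plain R illustration of this grouping step (the workflow does it with KNIME nodes), the sketch below aggregates ASV counts at the family rank and converts them into relative abundances per sample. The toy tables and column names are invented for illustration.

```r
# Toy ASV count table: rows = ASVs, columns = samples (values are read counts).
asv_counts <- data.frame(sample_A = c(120, 30, 50),
                         sample_B = c(10, 200, 40),
                         row.names = c("ASV1", "ASV2", "ASV3"))

# Toy taxonomic assignment of each ASV at the chosen rank (here: family).
taxonomy <- data.frame(ASV    = c("ASV1", "ASV2", "ASV3"),
                       Family = c("Lachnospiraceae", "Bacteroidaceae", "Lachnospiraceae"))

# Group the counts by family and sum them per sample.
family_of_row <- taxonomy$Family[match(rownames(asv_counts), taxonomy$ASV)]
family_counts <- aggregate(asv_counts, by = list(Family = family_of_row), FUN = sum)

# Convert the counts into relative abundances (percentage of reads per sample).
rel_abundance <- family_counts
rel_abundance[, -1] <- sweep(family_counts[, -1], 2, colSums(family_counts[, -1]), "/") * 100
rel_abundance
```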

Microbiome Analysis with KNIME Analytics Platform

Figure 9. Relative abundance table at family level of the taxonomy

4. Visualize the results 

It is natural to ask which types of bacteria are found in a human gut. Using an interactive sunburst chart, I can visualise which groups of bacteria are common and how prevalent they are in the human gut (see Figure 10). The charts represent aggregated/averaged relative abundances of bacterial groups at different taxonomic levels from healthy donors (left) and IBD patients before a fecal transplant (right). Clicking on each portion of the charts displays the actual sequences that are tied to the selected group and their counts in a table view.

A simple BLAST of the sequences on NCBI can verify the sanity of the method by checking if the sequences are actually related to the group they are assigned to by the pipeline. If you want to learn more about BLAST and do BLASTing without leaving your KNIME Analytics Platform, check out the BLAST from the PAST blog post by Jeany Prinz. 

Microbiome Analysis with KNIME Analytics Platform

Figure 10. Sunburst charts showing the average composition of the gut microbiome in healthy donors (left) and IBD patients before transplant (right). Selecting a group will show the corresponding sequences representing that group of bacteria in the samples. 

The commensal (commonly occurring) bacterial families are Lachnospiraceae, Bifidobacteriaceae, Ruminococcaceae and Bacteroidaceae. In general, these groups are higher in abundance in the healthy donors. If we take the currently selected group Lachnospiraceae, for example, it is higher in the donors than in the patients. These observations are in line with the literature, which suggests that IBD patients are characterized by a lower abundance of Bacteroidetes and Lachnospiraceae compared to healthy controls [4].

The final step of our workflow produces a JavaScript visualization whereby the shift in microbial composition of individual patients is represented as a dashboard of stacked bar plots. The right-most stacked bar is always the microbial composition of the donor's gut, whereas the other three bars represent the gut microbiome of the recipient at different time points.

First of all, it is interesting to note that the composition of gut microbiomes differs among individuals. This is true even among the healthy donors. Secondly, the patients' gut microbiomes showed some changes towards those of their donors, although in some cases these changes didn't persist over time for all patients. For example, the bacterial family Ruminococcaceae (orange) was present in high abundance in the donor's gut but not so in patient A. But a week after the transplant it became just as abundant in the patient. Similarly, in patient D the bacterial families Streptococcaceae (green) and Enterobacteriaceae (red) were absent in the beginning. But a week after the transplant these groups of bacteria covered a large proportion of the patient's gut microbiome, as they did in the patient's respective donor.

Microbiome Analysis with KNIME Analytics Platform

Figure 11. A dashboard of visualizations of taxonomic profiles across 10 patients + their donors (A-J) through different timepoints. 

Let us take a closer look at patient H. From the bar plot it is clear that in the beginning patient H has close to zero bacteria of the Bifidobacteriaceae family in the gut. In contrast, the donor has an abundance of the same family of bacteria. After the fecal transplant it can be seen that the abundance of this group of bacteria increased significantly. And it is known from the literature [6, 7] that probiotics rich in Bifidobacteria are successfully used in treating patients with inflammatory bowel diseases.

Microbiome Analysis with KNIME Analytics Platform

Figure 12. A closer look at the taxonomic profiles of patient J and his/her donor.

Side Note

I cannot confirm whether the patient's health has improved due to the fecal transplant, since I don't have the full treatment data available. I only have the sequencing data with the patient IDs and time points at my disposal. It is also unclear whether these positive changes are long lasting, as the study is limited to 8-12 weeks. I am also fully aware that this is not nearly enough for drawing scientific conclusions. The focus here is to demonstrate how a KNIME workflow can be developed and used to assist a comparative study of microbial communities at different time points.

 

Summary

We have created a KNIME workflow that retrieves sequencing data from public repositories, analyses it, and creates useful visualisations to investigate the changes in the microbial composition of patients' guts. We have learned how we can use 16S rRNA amplicon sequences for the characterisation of microbial communities in KNIME Analytics Platform. Most importantly, we showed how we can integrate complex and domain-specific external R packages in KNIME to create workflows that are transparent and easy to understand, yet powerful enough to get the job done.

The workflow can be used for analysing microbial communities from any other source. The use cases range from soil microbiome analysis, to monitor soil fertility, to environmental bioremediation, where microorganisms are used to clean up environmental messes such as oil spills.

References

1. Callahan BJ, McMurdie PJ, Rosen MJ, Han AW, Johnson AJ, Holmes SP. DADA2: High-resolution sample inference from Illumina amplicon data. Nat Methods. 2016;13(7):581–583. doi:10.1038/nmeth.3869

2. Ruairi Robertson, 'Why the Gut Microbiome Is Crucial for Your Health', www.healthline.com, June 27, 2017, accessed Feb 2020, https://www.healthline.com/nutrition/gut-microbiome-and-health

3. Wang, Y., & Qian, P. Y. (2009). Conservative fragments in bacterial 16S rRNA genes and primer design for 16S ribosomal DNA amplicons in metagenomic studies. PLoS ONE, 4(10). https://doi.org/10.1371/journal.pone.0007401

4. Kennedy PJ, Cryan JF, Dinan TG, Clarke G. Irritable bowel syndrome: a microbiome-gut-brain axis disorder?. World J Gastroenterol. 2014;20(39):14105–14125. doi:10.3748/wjg.v20.i39.14105

5. Distrutti E, Monaldi L, Ricci P, Fiorucci S. Gut microbiota role in irritable bowel syndrome: New therapeutic strategies. World J Gastroenterol. 2016;22(7):2219–2241. doi:10.3748/wjg.v22.i7.2219

6. Halfvarson J, Brislawn CJ, Lamendella R, et al. Dynamics of the human gut microbiome in inflammatory bowel disease. Nat Microbiol. 2017;2:17004. Published 2017 Feb 13. doi:10.1038/nmicrobiol.2017.4

7. Pozuelo M, Panda S, Santiago A, et al. Reduction of butyrate- and methane-producing microorganisms in patients with Irritable Bowel Syndrome. Sci Rep. 2015;5:12693. Published 2015 Aug 4. doi:10.1038/srep12693

8. McFarland LV, Dublin S. Meta-analysis of probiotics for the treatment of irritable bowel syndrome. World J Gastroenterol. 2008;14(17):2650–2661. doi:10.3748/wjg.14.2650

Data Science in Times of Change: (Some) Re-Assembly Required

By berthold, Tue, 06/16/2020 - 09:00

Most likely, the assumptions behind your data science model or the patterns in your data did not survive the coronavirus pandemic. Here’s how to address the challenges of model drift.

Data Science in Times of Change: (Some) Re-Assembly Required

By Michael Berthold (KNIME). As first published in InfoWorld.

The enormous impact of the current crisis is obvious. What many still haven’t realized, however, is that the impact on ongoing data science production setups can be dramatic, too. Many of the models used for segmentation or forecasting started to fail when traffic and shopping patterns changed, supply chains were interrupted, borders were locked down, and just in general the way people behaved changed fundamentally.

Sometimes, data science systems adapt reasonably quickly when the new data starts to represent the new reality. In other cases, the new reality is so fundamentally different that the new data is not sufficient to train a new system or, worse, the base assumptions built into the system just don’t hold anymore so the entire process from data science creation to productionizing must be revisited.

This post describes different scenarios and a few examples of what happens when old data becomes completely outdated, base assumptions are no longer valid, or patterns in the overall system change. I then highlight some of the challenges data science teams face when updating their production systems and conclude with a set of recommendations for a robust, future-proof data science setup.

Impact Scenario: Complete Change

The most drastic scenario is a complete change of the underlying system that not only requires an update of the data science process itself but also a revision of the assumptions that went into its design in the first place. This requires a full new data science creation and productionization cycle: understanding and incorporating business knowledge, exploring data sources (possibly to replace data that doesn't exist anymore), and selecting and fine-tuning suitable models. Examples include traffic predictions (especially near suddenly closed borders), shopping behaviour under more or less stringent lockdowns, and healthcare-related supply chains.

A subset of the above is the case where the availability of the data changed. A very illustrative example here are weather predictions, where quite a bit of data is collected by commercial passenger aircraft that are equipped with additional sensors. With the grounding of those aircraft, the volume of data has been drastically reduced. Because base assumptions about weather development itself remain the same (ignoring for a moment that other changes in pollution and energy consumption may affect the weather as well), "only" a retraining of the existing models may be sufficient. However, if the missing data represents a significant portion of the information that went into model construction, the data science team is well advised to rerun the model selection and optimization process as well.

Impact Scenario: Partial Change

In many other cases the base assumptions remain the same. For example, recommendation engines will still work very much the same, but some of the dependencies extracted from the data will change. This is not necessarily very different from, say, a new bestseller entering the charts, but the speed and magnitude of change may be bigger: the way health-related supplies, for instance, jumped in demand outpaces how a bestseller rises in the charts. If the data science process has been designed flexibly enough, its built-in change detection mechanism should quickly identify the change and trigger a retraining of the underlying rules. Of course, that presupposes that change detection was in fact built in and that the retrained system achieves sufficient quality levels.

Impact Scenario: No Change

This brief list is not complete without stressing that many concepts remain the same: predictive maintenance is a good example. As long as the usage patterns stay the same, engines will continue to fail in exactly the same ways as before. But the important question here for your data science team is: Are you sure? Is your performance monitoring setup thorough enough that you can be sure you are not losing quality? This is a predominant theme these days anyway: do you even notice when the performance of your data science system changes?!

A little side note on model jump vs. model shift, terms that are also often used in this context but refer to a different aspect: in the first two scenarios above (complete/partial change), the change can happen abruptly (when borders are closed from one day to the next, for example) or gradually over time. Some of the bigger economic impacts will become apparent in customer behaviour only over time. For example, in the case of a SaaS business, customers will not cancel their subscriptions overnight but over the coming months.

What’s the Problem?

In reality, one most often encounters two types of production data science setups. There are the older systems that were built, deployed, and have been running for years without any further refinements, and then there are the newer systems that may have been the result of a consulting project, possibly even a modern automated machine learning (AutoML) type of project. In both cases, if you are fortunate, automatic handling of partial model change has been incorporated into the system so at least some model retraining is handled automatically. But since none of the currently available AutoML tools allow for performance monitoring and automatic retraining and usually one-shot projects don’t worry about that either, you may not even be aware that your data science process has failed.

If you are lucky to have a setup where the data science team has made numerous improvements over the years, chances are higher that automatic model drift detection and retraining are built in as well. However, even then - and especially in the case of a complete model jump - it is far more likely that the existing system cannot easily be recreated to accommodate the new setup, because all those steps are not well documented, making it hard to revisit the assumptions and update the process. Often the process also relies on opaque pieces of code, written by experts who have left the team in the meantime. The only solution? Start an entirely new project.

What’s Needed?

Obviously, if your data science process was set up by an external consulting team, you don't have much of a choice other than to bring them back in. If your data science process is the result of an automated ML/AI service, you may be able to re-engage that service, but especially in the case of a change in business dynamics you should expect to be involved quite a bit - similar to the first time you embarked on this project.

One side note here: be skeptical when someone tries to push for super-cool new methods. In many cases this is not needed; one should rather focus on carefully revisiting the assumptions and data used for the previous data science process. Only in very few cases is this really a "data 0" problem where one tries to learn a new model from very few data points. Even then, one should also explore the option of building on top of the previous models and keeping them involved in some weighted way. Very often, new behavior can very well be represented as a mix of previous models with a sprinkle of new data.
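
To make the "mix of previous models with a sprinkle of new data" idea concrete, here is a minimal, hypothetical sketch in R: predictions from a model trained on plenty of pre-change data and from a model retrained on the few post-change observations are blended with a weight that would be increased as new data accumulates. The data, models, and weight are invented for illustration.

```r
set.seed(1)

# Old model: trained on plenty of pre-change data.
old_data   <- data.frame(x = runif(500))
old_data$y <- 2 * old_data$x + rnorm(500, sd = 0.2)
old_model  <- lm(y ~ x, data = old_data)

# New model: trained on the few observations collected after the change.
new_data   <- data.frame(x = runif(20))
new_data$y <- 3 * new_data$x + 1 + rnorm(20, sd = 0.2)
new_model  <- lm(y ~ x, data = new_data)

# Blend the two predictions; move w toward 1 as more post-change data arrives.
blended_prediction <- function(newdata, w = 0.3) {
  w * predict(new_model, newdata) + (1 - w) * predict(old_model, newdata)
}

blended_prediction(data.frame(x = c(0.2, 0.8)), w = 0.3)
```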

But if your data science development is done in house, now is the time when an integrative and uniform environment such as KNIME comes in very handy. KNIME workflows are 100% backwards compatible, and the underlying assumptions are all visually documented in one environment, allowing well-informed changes and adjustments to be made. Using KNIME Server, you then validate and test the revised workflows and deploy them to production from that same environment, without any need for manual translation to a different production environment.

High-throughput screening, data analysis, processing, and hit identification

By Jordi, Fri, 06/19/2020 - 10:26

High-throughput biochemical and phenotypic screening (HTS) enables scientists to test thousands of samples simultaneously. Using automation, the effects of thousands of compounds can be evaluated on cultured cells, or using biochemical in vitro assays. The goal of HTS is to identify "hit" compounds that match certain properties. As HTS is usually conducted on very large libraries of compounds, the volume of raw data that is produced is usually huge. This calls for an analysis tool that is able to handle large volumes of data easily.

High throughput screening, data analysis, processing and hit identification

In our analysis, we have used a platform that supports data science techniques. These techniques are better able to process and assess very large sets of raw data, in comparison to, say, conducting our analysis with a spreadsheet-based tool. The data science tool also enables us to perform more complex operations.

The motivation for this study was to help laboratories assess all kinds of raw data generated from HTS, regardless of whether these data are chemical, genetic, or pharmacological in origin. We wanted to provide a process that enables the analysis of large volumes of data and quick and interactive visualization of the screening results. The data science tool we chose for the analysis and visualization was KNIME Analytics Platform.

Read full article here:

A workflow for high-throughput screening, data analysis, processing, and hit identification

Authors: Catherine S. Hansel [1], Schayan Yousefian [1], Anna H. Klemm [2] and Jordi Carreras-Puigvert [1]

  1. Science for Life Laboratory, Division of Genome Biology, Department of Medical Biochemistry and Biophysics, Karolinska Institute, Stockholm, Sweden 
  2. Department of Information Technology, Division of Visual Information and Interaction, SciLifeLab BioImage Informatics Facility, Uppsala University, Uppsala, Sweden

Keywords. High-throughput screening, data processing, hit identification 

High-throughput biochemical and phenotypic screening (HTS) is a gold standard technique for drug discovery. Using automation, the effects of thousands of compounds can be evaluated on cultured cells, or using biochemical in vitro assays. By doing so, "hit" compounds can be identified that modulate the readout(s) favourably. Since HTS is typically conducted with large compound libraries under several conditions, the raw data generated is often very large and split over a number of spreadsheets or table-like files. Therefore, we have created a KNIME workflow to help process and assess large sets of raw data generated from HTS. This workflow automatically imports HTS data and processes it to identify hits with tunable criteria. This means that the user is able to choose different thresholds for what is considered a hit compound. Additionally, three commonly used quality control measures - the Z-prime, signal/background (S/B) and CV - are calculated in the workflow and visualized in a comprehensive manner.

Helping labs assess raw data from HTS

The motivation for this study is to help laboratories assess the raw data generated from high-throughput screens (HTS), whether they be chemical, genetic, or pharmacological in origin. This workflow focuses on a chemical phenotypic HTS; the aim is to identify small molecules/compounds that alter the phenotype of a cell in a desired manner. It has three blocks: file-upload, data processing and visualization.

In our experiment, the objective is to find hit compounds that rescue the cell death induced by the expression of a deadly protein. The positive control cells do not die, since the deadly protein is not expressed, whereas the negative control cells do die, since the deadly protein is expressed. Compounds are added to cells expressing the deadly protein, and a hit compound shows greater viability than that of the negative control cells. First the HTS viability data is uploaded, and then it is normalized and the compounds' Z-scores are calculated in the "Normalize" metanode. In this example the data is normalized to the positive control values for visualisation purposes; the data could also have been normalized to the negative control values. In order to find a hit compound one can look at the Z-score of the compounds' effect on cell viability. The Z-score measures the relationship between a particular compound's effect on cell viability and the mean effect of all the compounds (since we assume that most compounds have no effect), measured in terms of standard deviations from the mean. A threshold value can be set, e.g., a hit compound is defined as any compound that lies two standard deviations above the average cell viability value. Please note that the Z-score should only be used as a hit determinant for the initial screen; for validation screens most compounds are likely to have an effect, and therefore the normalised cell viability would be used to determine the hit compounds.
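
For readers who want to see the hit-calling logic written out, here is a minimal R sketch of the Z-score step described above: normalized viability values per compound are converted to Z-scores, and a tunable threshold flags the hits. The data, column names, and threshold of 2 are assumptions for illustration; the workflow itself does this with KNIME nodes.

```r
# Toy normalized cell viability values, one row per compound.
screen <- data.frame(compound  = paste0("CPD_", 1:8),
                     viability = c(0.9, 1.1, 1.0, 4.5, 0.8, 1.2, 1.0, 1.1))

# Z-score: distance of each compound's effect from the mean effect of all compounds,
# measured in standard deviations (assuming most compounds have no effect).
screen$z_score <- (screen$viability - mean(screen$viability)) / sd(screen$viability)

# Tunable hit threshold, e.g. two standard deviations above the mean.
z_threshold <- 2
screen$hit  <- screen$z_score > z_threshold
screen[screen$hit, ]
```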

Quality control & visualization of screening results

This workflow also allows the user to test the robustness of the HTS by performing quality control calculations in the Quality Control metanode. An HTS needs to be robust and reliable in order to avoid false positives/negatives. Moreover, quality control can highlight flaws in the screening protocol that need to be adapted. In the "Process control data" metanode, quality control is assessed by calculating the coefficient of variation (CV), Z-prime and signal-to-background for the positive and negative controls. The CV measures the dispersion of the controls, and the Z-prime measures assay quality by showing the separation between the distributions of the positive and negative controls (values between 0.5-1 are excellent, 0-0.5 acceptable, and under 0 likely unacceptable for phenotypic screens). The signal-to-background is the ratio between the positive and negative control values. It is important that there is a good window between the positive and negative values so that hits are robust. 
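
The three quality control measures can also be written out directly. Below is a minimal R sketch using the standard formulas (CV = sd / mean, Z' = 1 - 3 * (sd_pos + sd_neg) / |mean_pos - mean_neg|, S/B = mean_pos / mean_neg) on invented control values; the workflow computes the same quantities with its CV, Z-Primes and Math Formula nodes.

```r
# Toy raw readouts of the positive and negative control wells on one plate.
pos <- c(10500, 9800, 10200, 10100, 9900)   # e.g. surviving control wells
neg <- c(1200,  1100, 1300,  1250,  1150)   # e.g. deadly-protein-expressing wells

# Coefficient of variation: dispersion of each control group (in percent).
cv_pos <- sd(pos) / mean(pos) * 100
cv_neg <- sd(neg) / mean(neg) * 100

# Z-prime: separation between the control distributions
# (0.5-1 excellent, 0-0.5 acceptable, below 0 likely unacceptable).
z_prime <- 1 - 3 * (sd(pos) + sd(neg)) / abs(mean(pos) - mean(neg))

# Signal-to-background ratio.
s_b <- mean(pos) / mean(neg)

c(CV_pos = cv_pos, CV_neg = cv_neg, Z_prime = z_prime, S_B = s_b)
```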

Finally the workflow enables rapid visualization of the screening results. The quality control measures are shown in a bar chart and box plot, and the screening results are visualized in interactive scatter plots of the normalized cell viability and Z-score: specific points can be selected and then summarized in tabular form. 

In summary, this workflow has three blocks: file-upload, data processing and visualization (Figure 1). Detailed descriptions of these steps are described below. And you can access the workflow and download it to try out yourself from the KNIME Hub.

A workflow for high-throughput screening, data analysis, processing, and hit identification

Figure 1: Overview of the high-throughput screening workflow

File upload: accessing the raw experimental data and the metadata (block 1)

In the first block, the raw data and metadata files are accessed using the List Files and Excel Reader nodes. Each raw data Excel file describes the raw data in each well of a 384-well plate (n.b. this workflow can be applied to any plate size, as long as it is defined in the metadata). In this case the raw data is generated from a plate reader and describes a CellTiter-Glo® assay luminescence readout (Promega, G9681); the luminescence value is proportional to the number of viable cells in culture. The raw data values need to be joined to the metadata, which provides information on the location of the compounds, i.e. the plate and well in which they are located. The metadata file contains columns describing the plate ID, well position, and compound ID, along with any additional relevant information. If the metadata describes the well position in a different format to the raw data files, e.g. A01 vs. A1 vs. A – 01, the well converter file can be used to help join the metadata file to the raw data files.

Data processing: joining the uploaded data, normalization and quality control calculations (block 2)

A workflow for high-throughput screening, data analysis, processing, and hit identification

Figure 2: Data processing: for each raw data file the metadata is joined, the data is normalized and the quality controls are calculated.

This block processes the data by joining the raw data files to the metadata file, normalizing the data, and calculating quality controls. These processes are applied to each raw data file individually and then concatenated using the Loop End node.

First the raw data files are joined to the metadata file. In order to know which plate the compound is found in, the Excel file name, which describes the plate ID, is added as a column within the Excel file (using Variable to Table Column, Regex Split, Column Rename). The raw data files (which now have a column describing the plate ID) are combined with the metadata, using the well converter if necessary. Each plate is now described by an Excel file displaying the raw data, well position, plate ID, and compound ID, along with any additional relevant information. 

Next, the data is normalized using the Normalize Plates (POC) node of the HCS-tools community nodes. Since every raw data file (describing an individual 384 well plate) is analysed iteratively due to the Loop nodes, each compound can be normalized to the positive controls within its plate: in doing so we take into account plate-to-plate variation. Following this, the mean value and standard deviation of the positive and negative controls are calculated respectively using the GroupBy node. These values are concatenated and added as columns in the results file. The Z-score is also calculated for each compound’s cell viability result using the Normalize Plates (Z-score) node.

In order to calculate quality controls for the screen, the positive and negative control values are taken for each plate. Their CV, Z-Prime and signal/background (S/B) values are then calculated using CV, Z-Primes (PC x NC) and the Math Formula nodes respectively.

The processed data (normalized values and quality control values) are then joined together with the raw data and metadata. 

Data visualization: interactive view to visualize processed/normalized data vs. controls (block 3)

A workflow for high-throughput screening, data analysis, processing, and hit identification

Figure 3: Data visualization: visualization of quality control results and processed HTS data.

This block contains a series of configurable nodes to visualize quality control results as well as an interactive table linked to a scatter plot displaying the processed data. 

First, the user can select the Z-score threshold that defines a hit compound by executing and opening the view of the "select Z-score threshold" component. Next, in order to build a descriptive scatter plot of the data, the Rule Engine node assigns group names to the positive controls, negative controls, hit compounds and non-hit compounds, and the Color Manager node assigns particular colours to these groups for the scatter plots. Following this, the "Data Visualization" component contains Bar Chart, Box Plot, two Scatter Plot, and Table View nodes. The Bar Chart and Box Plot describe the quality control values (CV values, Z-prime, S/B), and the Scatter Plots show the normalized cell viability data and the Z-score data. It is possible to interact with the scatter plots: one can select values in the scatter plots, and these values will be described in the table (Figure 4). 

A workflow for high-throughput screening, data analysis, processing, and hit identification

Figure 4: Data visualization component output

The Plate Heatmap Viewer node enables heatmaps to be displayed for all the 384 well plates (Figure 5). Heatmaps can be useful to assess the quality of the assay e.g. it is possible to assess the liquid handling performed in a screen as one can see patterns in cell seeding. In order for the Plate Heatmap Viewer to work, the Expand Well Position node must be applied. This takes a string column containing well positions e.g. C14 and appends two columns called “plateRow” and “plateColumn”.

A workflow for high-throughput screening, data analysis, processing, and hit identification

Figure 5: Heatmap of processed data

Finally, the Excel Writer node converts the processed data into an Excel File.

Summary

This workflow is useful for anyone wanting to process and analyse large sets of raw data generated from HTS. Hit compounds can be identified using tunable criteria and visualized in an interactive scatter plot in which single points can be selected and read in tabular form. Moreover, quality control calculations such as the Z-Prime, signal/background (S/B) and CV are performed and visualised, depicting the overall robustness of the screen.

Resources

The workflow that was described in this blog article is available for download from the KNIME Hub:

Authors

Catherine Hansel

Catherine Hansel

Katie is conducting her PostDoc in Oscar Fernandez-Capetillo's laboratory at the Karolinska Institutet/SciLifeLab. This lab focuses on high-throughput phenotypic screening. She is currently conducting a large screen for compounds that limit amyotrophic lateral sclerosis (ALS) associated poly(PR) dipeptide toxicity and has used this workflow to assess the group's screening results.

 

Schayan Yousefian

Schayan Yousefian graduated from the University of Heidelberg with a Master of Science in Molecular Biosciences. He is currently a Ph.D. candidate at the German Cancer Research Centre (DKFZ). In his Master thesis he investigated the role of intestinal stem cells in the gut-mediated immune response of Drosophila. His research interest centres around single-cell biology and the use of different sequencing techniques to study cell identity and cell type specific functions. Additionally, Schayan has experience in image-based analysis for high-throughput compound screenings and spatially resolved transcriptomics. Therefore, he has a keen interest in exploring computational methods for high-content data analysis.

 

Anna Klemm

Anna H. Klemm works as a BioImage Analyst within the BioImage Informatics Facility, SciLifeLab, Sweden. She did a PhD in cellular biophysics at the University of Erlangen-Nürnberg, followed by a Postdoc at the Max Planck Institute CBG, Dresden. Both during her PhD and the Postdoc she quantified biological processes by applying different microscopy techniques and image analysis. Before starting at SciLifeLab she worked as a BioImage Analyst at the Biomedical Center, LMU Munich.

Picture of Anna Klemm: P. Waites, Uppsala University

Jordi Carreras-Puigvert

Jordi Carreras-Puigvert is currently Assistant Professor in the Oscar Fernandez-Capetillo lab at the Karolinska Institutet/SciLifeLab. He has pioneered the use and implementation of high-throughput screens throughout his career, from exploring the DNA damage response in mouse embryonic stem cells and C. elegans at Leiden University and Leiden University Medical Centre, the Netherlands, to profiling the human NUDIX hydrolases at Karolinska Institutet, and now applying his knowledge to drug discovery against amyotrophic lateral sclerosis and cancer.

Coming Next …

If you enjoyed this, please share this generously and let us know your ideas for future blog posts.

Building a Time Series Analysis Application

By Maarit, Mon, 06/29/2020 - 10:26

How the core concepts of time series fit the process of accessing, cleaning, modeling, forecasting, and reconstructing time series

Building a Time Series Application

Authors: Daniele Tonini (Bocconi University), Maarit Widmann (KNIME), Corey Weisinger (KNIME)

Introduction

A complete time series analysis application covers the steps in a data science cycle from accessing to transforming, modeling, evaluating, and deploying time series data. However, for time series data the specific tasks in these steps differ from those for cross-sectional data. For example, cross-sectional data are collected as a snapshot of many objects at one point in time, whereas time series data are collected by observing the same object over a time period. The regular patterns in time series data have their own specific terminology, and they determine the required preprocessing before moving on to modeling. Time series can be modeled with many types of models, but specific time series models, such as an ARIMA model, make use of the temporal structure between the observations. 

In this article, we introduce the most common tasks in the journey of building a time series application. Finally, we put the theory into practice by building an example application in KNIME Analytics Platform. 

Accessing Time Series

Time series have various sources and applications: daily sales data for demand prediction, yearly macroeconomic data for long term political planning, sensor data from a smart watch for analyzing a workout session, and many more. All these time series differ, for example, in their granularity, regularity, and cleanliness: We can be sure that we have a GDP value for our country for this year, and for the next ten years, too, but we cannot guarantee that the sensor of our smart watch performs stably in any exercise and at any temperature. It could also be that time series data are not available at regular intervals, but can only be collected from random event points, such as disease infections or spontaneous customer visits. What all these kinds of time series data have in common, though, is that they are collected from the same source over time.

Building a Time Series Application

Figure 1: Time series have many different sources, from tiny single objects such as muscles in a human body to larger entities, such as countries. What all data have in common is that they have been collected by observing the same object over time.

Regularizing and Cleaning Time Series

Once we have the time series data, the next step is to make it equally spaced at a suitable granularity, continuous, and clean. The required tasks depend on the original shape of the data and also our analytics purpose. For example, if we’re planning a one-week promotion of a product, we might be interested in more granular data than if we want to gain an overview of the sales of some product. 

Sorting

Time series need to be sorted by time. When you partition the data into training and test sets, remember to preserve the temporal structure between the records by splitting at a point in time, i.e., taking the earlier part of the data for training and the later part for testing. If your data contain more than one record per timestamp, then you need to aggregate them by the timestamp. For example, when you have multiple orders per day and you're interested in the daily sales, you need to sum the sales for each day. Furthermore, if you're interested in the time series at another granularity than what you currently have in the data, for example, monthly sales instead of daily sales, you can further aggregate the data at the preferred granularity.
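
As a small R illustration of this aggregation step (invented data and column names), the sketch below sums multiple orders per day into daily sales and then rolls the daily sales up to monthly totals:

```r
# Toy order-level data: several orders per day.
orders <- data.frame(
  timestamp = as.Date(c("2020-01-01", "2020-01-01", "2020-01-02",
                        "2020-02-01", "2020-02-15", "2020-02-15")),
  sales     = c(10, 5, 8, 12, 7, 3)
)

# Aggregate to daily granularity: one row per day, summed sales.
daily <- aggregate(sales ~ timestamp, data = orders, FUN = sum)

# Aggregate further to monthly granularity.
daily$month <- format(daily$timestamp, "%Y-%m")
monthly     <- aggregate(sales ~ month, data = daily, FUN = sum)
monthly
```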

Missing Values

If some timestamps are missing, you need to introduce them to the time series in order to make it equally spaced. Sometimes the missing records are a part of the dynamics of the time series, for example, a stock market closes on a Friday and opens on a Monday. 

When you introduce the missing timestamps to the data, the corresponding values are of course missing. You can impute these missing values by, for example, linear interpolation or moving average values. Remember, though, that the best technique for imputing missing values depends on the regular dynamics in the data. For example, if you inspect weekly seasonality in daily data, and a value on one Saturday is missing, then the last Saturday’s value is probably the best replacement. If the missing values are not missing at random, like the missing stock market closing prices at weekends, you can replace them by a fixed value, which would be 0 in this case. On the other hand, if the missing values are random and they occur far enough in the past, you can use the data after the missing value, and ignore the older data. 
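
Here is a minimal R sketch of these two steps on invented data: the missing dates are first inserted so that the series becomes equally spaced, and the resulting missing values are then imputed by linear interpolation.

```r
# Toy daily series with two missing dates.
observed <- data.frame(date  = as.Date(c("2020-03-01", "2020-03-02", "2020-03-05")),
                       value = c(10, 12, 18))

# Introduce the missing timestamps to make the series equally spaced.
full_dates <- data.frame(date = seq(min(observed$date), max(observed$date), by = "day"))
series     <- merge(full_dates, observed, by = "date", all.x = TRUE)

# Impute the resulting missing values by linear interpolation.
known <- !is.na(series$value)
series$value <- approx(x    = as.numeric(series$date[known]),
                       y    = series$value[known],
                       xout = as.numeric(series$date))$y
series
```

For seasonal data, the same pattern can be adapted to replace a missing Saturday with the previous Saturday's value instead of an interpolated one, as discussed above.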

Irregular Patterns

One good way of handling rapid fluctuations and outliers is to smooth the data. Several techniques can be used, for example, moving average and exponential smoothing. Clipping the values that lie outside the whiskers of a box plot also smooths the data. Keep in mind that strong seasonality in the data might lead to a widespread box plot, in which case it's better to use a conditional box plot to detect outliers.
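
As an illustration of the moving average approach, the sketch below smooths an invented daily series with a centered 7-day moving average using stats::filter(); the window length is an assumption and should match the dynamics of your data.

```r
set.seed(7)
# Toy daily series with noise and a couple of outliers.
y <- 100 + sin(1:60 / 5) * 10 + rnorm(60, sd = 3)
y[c(20, 45)] <- y[c(20, 45)] + 40   # inject outliers

# Centered 7-day moving average (the ends remain NA).
window   <- 7
y_smooth <- stats::filter(y, rep(1 / window, window), sides = 2)

head(data.frame(raw = y, smoothed = as.numeric(y_smooth)), 10)
```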

However, sometimes the time series is just showing a very irregular phenomenon! In such a case, you can try to make the time series more regular by extracting a subset of it, for example, by only considering the sales of one product instead of the sales of the whole supermarket, or by clustering the data.

Building a Time Series Application

Figure 2: Reshaping the data, handling missing values and outliers, and extracting a subset of the data are examples of cleaning and regularizing time series before moving on to the further steps in time series analysis

Exploring and Transforming Time Series

At this point, we have our time series data in the shape that is suitable for exploring it visually and numerically. The different plots and statistics reveal long and short term patterns and temporal relationships in the time series that we can use to better comprehend the dynamics of it and predict its future development. 

Visual Exploration of Time series

The basic plot for exploring time series is the line plot (Fig. 3) that shows a possible direction, regular and irregular fluctuations, outliers, gaps, or turning points in the time series. If you observe a regular pattern in your time series, for example, yearly seasonality in the sales of beverages, you can then inspect each seasonal cycle (year) separately in a seasonal plot (Fig. 3). In the seasonal plot you can easily see, for example, if July was a stronger sales month this year than last year, or if the monthly sales are increasing year by year.

If you’re interested in what happens within the seasons, for example, what is the median sales in the summer months and how much and to which direction the sales vary each month, you can inspect these kinds of dynamics in a conditional box plot (Fig. 3). Yet another useful plot for exploring time series is the lag plot (Fig. 3). The lag plot shows the relationship between the current values and past values, for example, sales today and sales week before. 

Classical Decomposition of Time Series

Classical decomposition, i.e. decomposing the time series into its trend, seasonalities, and residual, provides a good benchmark for forecasting. The remaining part of the time series, the residual, is supposed to be stationary, and can be forecast by an ARIMA model, for example. Remember, though, that if the residual series is not stationary, some additional transformations might be required, such as first-order differencing, or a log transformation of the original time series.

Firstly, if the time series shows a direction, a trend, the time series can be detrended, for example, by fitting a regression model through the data, or by calculating a moving average value. 

Secondly, if the time series shows a regular fluctuation - a seasonality - the time series can be adjusted for it. You can find the lag where the major seasonality occurs in the autocorrelation plot of the time series. For example, if you observe a peak at lag 7 and you have daily data, then the data have a weekly seasonality. The seasonality can be adjusted for by differencing the data at the lag where the major spike occurs. If you want to adjust for a second seasonality in the data, you can do so by repeating the procedure on the adjusted (differenced) time series.

Finally, when you have reached a stationary time series that is ready to be modeled by, for example, an ARIMA model, you can do a final check with, for example, the Ljung-Box test, which tells you whether significant autocorrelation remains in the series.
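The detrending, seasonal differencing, and final check described above could be sketched as follows; the lag of 12 assumes monthly data with yearly seasonality.

```python
from statsmodels.stats.diagnostic import acorr_ljungbox

# First-order differencing removes a (roughly linear) trend
detrended = monthly_sales.diff(1).dropna()

# Seasonal differencing at the lag of the major spike in the ACF plot
# (12 for yearly seasonality in monthly data, 7 for weekly seasonality in daily data)
adjusted = detrended.diff(12).dropna()

# Ljung-Box test: small p-values indicate that significant autocorrelation remains
print(acorr_ljungbox(adjusted, lags=[12]))
```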

Figure 3: Lag plot, conditional box plot, line plot, seasonal plot and autocorrelation plot are useful for visually exploring the time series

Modeling and Evaluating Time Series

Now we move on to modeling the residual part of the time series, which contains its irregular dynamics. We can do this with ARIMA models, machine learning models, neural networks, and many variations of them. We often apply these models to the residual part of the time series because it's stationary. However, decomposing the time series is not always necessary, because some models, for example the seasonal ARIMA model, also work for modeling non-stationary time series.

In the following we collect a few properties of these different modeling techniques, their similarities and differences, so that you can pick the best one for your use case. Remember also that it’s useful to train multiple models, and even build an ensemble of them!

ARIMA Models

An ARIMA (AutoRegressive Integrated Moving Average) model is a linear regression of the current value on past values (the AR part) and on past forecast errors (the MA part). If the model has a non-zero I part, the data are differenced in order to make them stationary. Basic ARIMA models assume that the time series is stationary, and stationary time series don't have predictable patterns in the long term. The declining accuracy of long term forecasts can be seen in the widening confidence intervals of the forecasts. Having more data is not always better for training ARIMA models: large datasets can make estimating the parameters of an ARIMA model time consuming, as well as exaggerate the difference between the true process and the model process.
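A minimal statsmodels sketch of fitting an ARIMA model and inspecting the widening confidence intervals could look like this; the order (1, 1, 1) is a placeholder, not a recommendation.

```python
from statsmodels.tsa.arima.model import ARIMA

# Fit an ARIMA(p, d, q) model to the series; the order is a placeholder
fitted = ARIMA(monthly_sales, order=(1, 1, 1)).fit()

# Forecast 12 steps ahead; the confidence intervals widen with the horizon,
# reflecting the declining accuracy of long-term forecasts
forecast = fitted.get_forecast(steps=12)
print(forecast.predicted_mean)
print(forecast.conf_int(alpha=0.05))
```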

Machine Learning Models

Machine learning models use the lagged values as predictor columns and ignore the temporal structure between the target column and the predictor columns. Machine learning models can also identify long term patterns and turning points in the data, provided that the training data contain enough history to establish these patterns. In general, the more irregularities the data show, the more data are needed for training the model. When you apply a machine learning model, it's recommended to model the residual. Otherwise, you might build a model that's more complicated than the classical decomposition model but doesn't actually learn anything new on top of it.
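A sketch of this idea with scikit-learn, using ten lagged values of the residual series as predictor columns; the choice of ten lags and the random forest are assumptions made only for illustration.

```python
import pandas as pd
from sklearn.ensemble import RandomForestRegressor

# Build lagged predictor columns from the (stationary) residual series
lagged = pd.concat({f"lag_{k}": residual.shift(k) for k in range(1, 11)}, axis=1)
data = pd.concat([lagged, residual.rename("target")], axis=1).dropna()

X, y = data.drop(columns="target"), data["target"]

# The model treats the lag columns as ordinary predictors and ignores their temporal order
model = RandomForestRegressor(n_estimators=200, random_state=0).fit(X, y)
```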

Tips on Model Selection

Firstly, some phenomena are difficult to forecast, and in such a case it often makes sense to go for a simpler model and not invest resources in modeling something that cannot be forecast accurately.

Secondly, the model’s performance is not the only criterion. If important decisions are based on the results of the model, its interpretability might be more important than a slightly better performance. In that case, a neural network might lose against a simple classical decomposition model even though it forecasts slightly better.

Thirdly, adding explanatory variables to your model might improve the forecast accuracy. However, in such a model the explanatory variables need to be forecast, too, and the increasing complexity of the model is not always worth the better accuracy. Sometimes rough estimates are enough to support the decisions: if shipping amounts are calculated in tens and hundreds, then the forecast demand doesn't need a finer granularity either.

Figure 4: Available data, randomness of the data, forecast horizon, and also the model purpose and the interpretability determine which model is chosen. The line plot in the top left corner shows the forecast accuracy of an LSTM model trained with small training data. The line plots in the bottom left corner show a completely random process, as well as a turning point in the data. The line plot on the right shows the development of a time series that follows an ARIMA (2,1,1) process.

Model Evaluation

After training a model, the next step is to evaluate it. For in-sample forecasting, the test set is the training set itself, so the model is evaluated on the same data that were used for training it. For out-of-sample forecasting, the test set follows the training set in time.

One recommended error metric for evaluating a time series model is the mean absolute percentage error (MAPE), since it provides the error on a universal scale, as a percentage of the actual value. However, if the true value is zero, this metric is not defined, and then other error metrics, like the root mean squared error (RMSE), will do. What is often recommended, though, is NOT to use R-squared. The R-squared metric doesn't fit the context of time series analysis because the focus is on predicting the future systematic variability of the target column instead of modeling all variability in the past.
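The two error metrics can be computed with a few lines of NumPy; note how MAPE breaks down when an actual value is zero.

```python
import numpy as np

def mape(actual, forecast):
    """Mean absolute percentage error; undefined whenever an actual value is zero."""
    actual, forecast = np.asarray(actual, float), np.asarray(forecast, float)
    return np.mean(np.abs((actual - forecast) / actual)) * 100

def rmse(actual, forecast):
    """Root mean squared error, reported in the unit of the target column."""
    actual, forecast = np.asarray(actual, float), np.asarray(forecast, float)
    return np.sqrt(np.mean((actual - forecast) ** 2))
```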

Forecasting and Reconstructing Time Series

We’re almost there! The last step is to forecast future values and reconstruct the signal.

Dynamic Forecasting

If you have a model that cannot provide accurate forecasts in the long term, dynamic deployment often improves the out-of-sample forecast accuracy. In dynamic deployment, only one point in the future is forecast at a time, and the past data are updated by this forecast value to generate the next forecast (Figure 5).
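The loop behind dynamic deployment can be sketched in a few lines; one_step_forecast is a hypothetical stand-in for whatever trained model produces a single one-step-ahead forecast.

```python
# Dynamic (recursive) forecasting: one step at a time, feeding each forecast
# back into the history that generates the next one
history = list(training_series)
forecasts = []
for _ in range(horizon):
    next_value = one_step_forecast(history)  # hypothetical one-step-ahead model call
    forecasts.append(next_value)
    history.append(next_value)
```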

Figure 5: In dynamic deployment only one forecast is generated at a time, and this forecast is added to the past data that are used to generate the next forecast one time point further ahead in time

Restoring Trend and Seasonalities

Finally, if we decompose the time series before forecasting, we need to restore the trend and/or seasonalities to the forecasts. If we adjust the seasonality by differencing the data, we start reconstructing the signal by adding back the values at the lag where the seasonality occurs. For example, if we had daily data y where we applied seasonal differencing at lag 7 (weekly seasonality), restoring this seasonality would require applying the following calculation to the forecast values y_{t+1}, y_{t+2}, ..., y_{t+h}:

y_{t+i} = y'_{t+i} + y_{t+i-7},   for i = 1, 2, ..., h

where y'_{t+i} is the forecast of the differenced series, t is the last time point in the training data, and h is the forecast horizon.

In order to restore a second seasonality, we would repeat the step described above for the restored time series. If we wanted to restore the trend component, we would add the values of the regression model representing the trend back to the restored time series.
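A sketch of restoring a weekly (lag 7) seasonality that was removed by seasonal differencing; history and diff_forecasts are assumed to hold the original training values and the forecasts of the differenced series, respectively.

```python
# Restore the seasonality removed by differencing at lag 7. For the first seven
# forecasts the value seven steps back is an actual observation; after that it is
# an already-restored forecast.
restored = list(history)
n = len(history)
for i, f in enumerate(diff_forecasts):
    restored.append(f + restored[n + i - 7])

restored_forecasts = restored[n:]
```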

Complete Time Series Application in KNIME Analytics Platform 

Finally, let’s take a look at how to put these steps into practice using KNIME Analytics Platform. The workflow Accessing Transforming and Modeling Time Series (available on the KNIME Hub) in Figure 6 shows the steps from accessing to cleaning, visually exploring, decomposing, and modeling time series. For some of these tasks, we use time series components that encapsulate workflows as functionalities specific to time series: aggregating the data at the selected granularity, performing the classical decomposition, and more.

Figure 6: First steps in time series analysis: accessing, transforming, cleaning, visually exploring, and modeling time series. The workflow Accessing Transforming and Modeling Time Series is available on the KNIME Hub.

In this example, we use the Sample - Superstore data provided by Tableau. In our analysis we focus on the orders of all products from 2014 to 2017, altogether 9994 records. We start the preprocessing by reshaping the data into a time series by calculating the total sales per day. Now we only have one value per day, but some days are missing because no orders were submitted on them. Therefore, we introduce these days into the time series and replace the missing sales values with a fixed value of 0. After that, we aggregate the data at the monthly level and consider the average sales in each month in the further analysis.
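Outside KNIME, this reshaping could be sketched with pandas as follows; the column names are taken from the Superstore data, but the exact preprocessing in the workflow may differ.

```python
import pandas as pd

# orders is assumed to be a DataFrame with 'Order Date' (datetime) and 'Sales' columns
daily_sales = orders.groupby(pd.Grouper(key="Order Date", freq="D"))["Sales"].sum()

# Reintroduce days without orders and set their sales to 0
daily_sales = daily_sales.asfreq("D", fill_value=0)

# Aggregate to monthly granularity as the average of the daily sales
monthly_avg_sales = daily_sales.resample("MS").mean()
```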

For visual exploration, we also aggregate the data at a yearly level and find a turning point at the beginning of the year 2015, as the line plot on the right in Figure 7 shows. The line plot on the left shows the yearly seasonality in the data: there are two regular peaks at the end of each year and a lower peak at the beginning of each year. The yearly seasonality is confirmed by the major spike at lag 12 in the ACF plot on the left. We decompose the time series into its trend, seasonality, and residual; these components are shown in the line plot in the middle in Figure 7. The ACF plot on the right shows no significant autocorrelation in the residual series.

Figure 7: Line plots showing the yearly seasonality and the turning point, ACF plots showing the yearly seasonality in monthly data and stationarity in residual series, and a line plot showing the trend, seasonality, and residual components of the decomposed time series

Next, we model the residual series of the monthly average sales with an ARIMA model. After differencing at lag 12, the length of the time series is 36 observations. We look for the best model with the Auto ARIMA Learner component, with a maximum order of 4 for the AR and MA parts and a maximum order of 1 for the I part. The best performing model based on the Akaike information criterion is ARIMA(0, 1, 4), and the resulting MAPE based on in-sample forecasts is 1.153.
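Outside KNIME, a comparable search could be run with the pmdarima package; this is only a sketch of the same idea under the article's search settings, not the implementation behind the Auto ARIMA Learner component.

```python
import pmdarima as pm

# Search ARIMA orders up to (4, 1, 4) on the residual series, ranked by AIC
best_model = pm.auto_arima(
    residual_series,
    max_p=4, max_d=1, max_q=4,
    seasonal=False,
    information_criterion="aic",
)
print(best_model.order)
```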


Finally, we assess the model’s out-of-sample forecast accuracy. The workflow Forecasting and Reconstructing Time Series (available on the KNIME Hub) in Figure 8 shows how to forecast the monthly average sales in 2017 based on the monthly data of the years 2014 to 2016 (24 observations after differencing at lag 12), using the winning ARIMA (0,1,4) model and the dynamic deployment approach. After that, we reconstruct the signal, in this case by restoring the trend and yearly seasonality to the forecast values (12 monthly average sales values). We compare the actual and forecast values and obtain a MAPE of 0.336.

Figure 8: Workflow to forecast the monthly average sales in 2017 by an ARIMA (0,1,4) model using dynamic deployment. After forecasting, the trend and yearly seasonality are restored to the forecast residuals, and the accuracy of the forecasts is calculated. The workflow Forecasting and Reconstructing Time Series is available on the KNIME Hub. 

Summary

Time series, be they sensor data showing the behavior of a tiny object nanosecond after nanosecond, macroeconomic data for the 20th century, or something in between, call for specific analytics techniques in the accessing, manipulating, and modeling steps.

In this article, we have introduced you to the basics of analytics techniques for time series that help you to get started when you’re working with time series data.

Want to learn more?


Cumulocity, IoT, and KNIME Analytics Platform

Cumulocity, IoT, and KNIME Analytics PlatformtarentMon, 07/06/2020 - 10:00

Today, all of our digital devices and sensors are interconnected in the Internet of Things. tarent has built an extension for KNIME Analytics Platform that enables you to connect to Software AG's Cumulocity IoT platform so that you can use the more advanced analytics provided by KNIME on your Cumulocity data.

Author: Dietrich Wettschereck, tarent

The concept behind the Cumulocity platform is to keep IoT projects simple: a single architecture, simplified industrial equipment connections, and no coding required. It connects and manages your devices and assets and can control them remotely. Thanks to the multitude of certified devices and the SDKs available for developing your own device agents, interfacing with the platform is easily accomplished. The integrated web application framework allows easy entry into modern multi-platform visualization options, thus catering to a wide variety of audiences, such as device managers, administrators, and business users.

Cumulocity enables you to perform certain analytics operations on your data, but with the new KNIME Analytics Platform Cumulocity Extension you can perform more complex analyses, including machine learning, analyze multiple devices together, and integrate data from a variety of other sources.

In this article we would like to show you a workflow that demonstrates the Cumulocity nodes. Our example is based on a bike share system in Washington DC called Capital Bikeshare. Each Capital Bikeshare bike is fitted with a sensor that sends the check-in and check-out times of the individual bikes to a central repository. All historical data have been made available for download on Capital Bikeshare’s website. These public data have been downloaded and used for this study.

Introducing the Cumulocity Connector Extension

The nodes in this community extension provide functionality to retrieve information about IoT devices, corresponding measurements, alarms, and events from a given Cumulocity IoT platform instance. These data, possibly in combination with any other data, can be used to create new events and alarms within KNIME and write them back to Cumulocity in order to trigger the corresponding actions there. You can access and download the nodes from the KNIME Hub.

- Base node for the storage of the connection settings. Allows you to set the host, tenant, and credentials (username/password). Required as input by all other nodes.
- Retrieves the device id, type, and name for all devices at the given tenant.
- Retrieves information about a subset of all alarms known for the given tenant. The configuration parameters of this node allow you to set a limit on the number of alarms to be retrieved, or to retrieve only alarms for certain devices or for a certain time period.
- Creates one or more alarms within Cumulocity. The only required parameter is the alarm type; all other parameters are optional, and sensible default values are set. The alarm ‘severity’, for example, is set to “warning” unless otherwise specified. Note that all alarms of a given type are aggregated, so that only the count and time are updated whenever an alarm of the same type is created. So, if you want to keep track of the alarms for two separate devices, you need to assign distinct alarm types to them, such as ‘device_1_overheating’ and ‘device_2_overheating’.
- Similar to the ‘Alarms Retriever’, but for events.
- Similar to the ‘Alarms Creator’, but for events.
- Essentially a helper node for testing purposes. Measurements are typically produced by IoT devices and not by an analytics platform. However, the node is quite useful when testing certain scenarios that rarely occur in reality.
- Similar to the ‘Alarms Retriever’, but for measurements.

The KNIME Workflow

The primary purpose of this demo workflow, which combines Cumulocity with KNIME, is to demonstrate how to use the Cumulocity nodes. They have been integrated into an existing workflow that visualizes, analyzes, and makes predictions for restocking the bike stations managed by Capital Bikeshare.

Capital Bikeshare offers a download of their bike usage data dating as far back as 2010. They also offer a live REST-API. In this workflow, we use data from 2018 and 2019 as training data and evaluate the learned model on the data of the first three months of 2020.

The task is the same as in the original IoT demo workflow (Taming the Internet of Things): predict one of three classes, that is, whether a bike station needs bikes removed, needs bikes added, or requires no action. Predicting three classes is easier than predicting a precise number, and classification methods can be used. We only use the data provided by Capital Bikeshare (with some assumptions on the initial status of the stations at the end of 2017). The primary purpose of the workflow is to show how KNIME and Cumulocity can be used in concert.

Disclaimer: As the aim of this article is to demonstrate how the Cumulocity extension is used, our focus is not on optimizing the machine learning model; we also do not use any external data. A natural extension of this workflow would be, for example, to join weather data to the device data in order to improve the prediction quality, or to use more advanced time series analysis methods for the machine learning part.

If you’d like to learn more about time series analysis, KNIME is offering a free webinar Time Series Panel Discussion on July 13 and an online course Introduction to Time Series starting July 20.

We don’t know which system Capital Bikeshare uses to manage their stations and bikes, but let’s assume that they use Cumulocity and that their devices continuously report their status to their instance of Cumulocity. There are many possible setups for this scenario, but we will assume a very simple configuration: each station represents a single device, and whenever a user picks up a bike from, or returns a bike to, a station, an event is triggered for that station. We know that this is a grossly simplified setup, but it will suffice for our demo purposes.

To facilitate this setup, we converted the downloadable data into a format that can be loaded into Cumulocity. To speed things up a little bit further we also aggregated all events on an hourly basis before loading them into Cumulocity. In a real setup, each event would be written live to the database.

We ended up with data from 584 bike stations and between 90 and over 15,000 events for each station for a total of nearly 3.7 million events for our observation period of 27 months.

Let’s now look at the different parts of the workflow:

Step 1: Read station info

The very first step is to define the connection settings for the Cumulocity connector and to retrieve information about the known devices. The device retriever simply retrieves basic information about all accessible devices; we therefore filter for all ‘bike station’ devices. The downloadable data unfortunately does not provide any information about how many docks are currently occupied at each station. We simply assume that all stations are at 80% of their maximum capacity at the end of 2017, which gives us our initial ‘load ratio’.

Figure 1: The part of the workflow that connects with Cumulocity and retrieves the data from each bike station.

Step 2: Iterate over the known stations and add time series information to each event

We assume total independence of all stations and events. We can thus create the training data for each station regardless of what happens at the other stations. The ‘Events Retriever’ node gives us a number of options to restrict the number of events that are retrieved:

Figure 2: The configuration dialog for the Events Retriever node.

We will, however, only set the parameter ‘Device IDs’ and, at each step of the loop, retrieve all events for a single device. We then use the ‘Lag Column’ node on the ‘load ratio’ column to create a simple time series: we add to each hourly record the load ratio of the station for the last 10 hours. We then look one hour into the future to add the target variable: do we need to add or remove bikes, or is no action required?
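In pandas terms, the lagging and labeling could look roughly like this; the column names and the 0.1/0.9 thresholds for deciding on an action are assumptions made for illustration, not necessarily the exact rule used in the workflow.

```python
import numpy as np
import pandas as pd

# station_events is assumed to hold the hourly records of one station
# with 'timestamp' and 'load_ratio' columns
df = station_events.sort_values("timestamp").copy()

# Add the load ratio of the last 10 hours as predictor columns
for k in range(1, 11):
    df[f"load_ratio_lag_{k}"] = df["load_ratio"].shift(k)

# Look one hour ahead to derive the target variable (thresholds are assumed)
future_ratio = df["load_ratio"].shift(-1)
df["target"] = np.select(
    [future_ratio < 0.1, future_ratio > 0.9],
    ["add bikes", "remove bikes"],
    default="no action",
)
df = df.dropna(subset=[f"load_ratio_lag_{k}" for k in range(1, 11)])
```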

Figure 3: Preprocessing: event data is read from Cumulocity; time series data are created, and a target variable is inferred: will it be necessary to add or remove bikes from the given station in the next hour?

Naturally, this would be the point to add external data, such as information about whether the current day is a workday or a holiday, the current weather conditions, or possibly other measurements retrieved from Cumulocity for other devices.

Step 3: Visualizations

KNIME offers a huge variety of visualization methods. We added a few just to get a feeling for the data. Figure 4 shows a cross-correlation matrix for a single station. It shows that the respective ‘load ratio’ features are highly correlated, as one would expect.

Figure 4: Cross-correlation matrix for the values for Station ‘10th & E St NW’.

Step 4: Machine Learning

We move on to the actual machine learning part of the workflow once we have explored the data and are satisfied that the preprocessing produced correct results. When we look at the distribution of the target variable, we realize that our data set is highly imbalanced. In over 90% of the cases, no action is taken, which is not surprising but poses a problem for most machine learning methods. We simply remove 'boring' events to get a slightly more balanced distribution. 'Boring' events are those where no bikes are added or removed and the current load ratio is between 0.1 and 0.9. The following steps are standard (Figure 5): split the data into a training and a hold-out data set and evaluate the learned model(s) on the hold-out set.
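A sketch of the filtering and the time-ordered split; the 'bikes_changed' column and the exact cutoff date are assumptions used only to illustrate the idea.

```python
# 'Boring' events: no bikes added or removed and a load ratio between 0.1 and 0.9
boring = (df["bikes_changed"] == 0) & df["load_ratio"].between(0.1, 0.9)
events = df[~boring]

# Time-ordered split: train on 2018-2019, hold out the first months of 2020
train = events[events["timestamp"] < "2020-01-01"]
holdout = events[events["timestamp"] >= "2020-01-01"]
```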

Figure 5: A very basic machine learning setup 

We see that the performance on the hold-out dataset is quite acceptable, but again, in a production setting we would put more effort into this part of the workflow and most likely try something like recurrent neural networks on these time series data.

Step 5: Write back alarms to Cumulocity

The final part of the workflow shows how we can use a previously learned model to raise alarms in Cumulocity. Remember that our target variable indicates whether bikes need to be added to or removed from a station within the upcoming hour. So whenever such an event is predicted, we raise an alarm so that Cumulocity can trigger the restocking process for the respective station.

Figure 6 shows how we retrieve the raw data from Cumulocity, pre-process it in the same manner as the training data, and then apply the learned model to it. The part where we raise the alarms is quite simple: whenever the model predicts a restocking event, we raise an alarm. Please note that this still assumes independence of events, which is of course not true. In reality, we would run this part of the workflow at least once every hour on the data that had changed since the last run. Using that setup, the restocking event would be reflected in the data when the workflow is run the next time, and the updated load ratios would be taken into consideration when the model is applied to the new data.

Figure 6: Evaluation of the learned RF-model on independent data

Conclusion

We created a very simple restocking alert system that triggers an alarm in Cumulocity whenever a station is about to run out of bikes or is close to overstocking. We have shown only a tiny part of the capabilities of either system, KNIME or Cumulocity, since our aim was to show how these tools can be employed in concert. The open source Cumulocity Connector extension for KNIME is available from KNIME Analytics Platform 4.1 onwards and can be installed like any other KNIME extension, either by drag & drop from the KNIME Hub or via the File menu in KNIME.

The Cumulocity Connector extension is located on the Community Extensions (Experimental) site. If this site isn’t already enabled as an available software site, go to the File menu and select Preferences. Then click “Install/Update” -> “Available Software Sites” and select KNIME Community Extensions (Experimental) from the list that appears.

Figure 7: The “Available Software Sites” dialog. Click KNIME Community Extensions (Experimental) to access the Cumulocity extension.

Resources:

About the author

Dietrich Wettschereck is Head of Artificial Intelligence at tarent solutions GmbH. He has 30 years of hands-on experience in Data Mining and Machine Learning. He is an experienced team lead and agile programmer; a technical all-rounder with up-to-date skills.
