
KNIME Certification Program

Posted by admin on Mon, 02/18/2019 - 10:00

Authors: Giuseppe di Fatta (University of Reading, UK) and Stefan Helfrich (KNIME)

Are you an expert in KNIME Analytics Platform? There is now an official way to answer this question and share it with the world: You can test your KNIME proficiency with a new certification program developed by a collaboration between academia and industry.

Professional certifications are particularly useful in the employment process to help identify key skills relevant to the job profile sought by employers. They facilitate matching the demand for skills with the supply at an earlier stage and also promote the need for the right skills. They help prospective applicants understand the requirements of the current job market and plan their training and development more effectively. Employers can also use certifications to engage current employees in Continuous Professional Development (CPD) relevant to critical needs. While higher education degrees are evidence of a solid knowledge of a subject area (e.g., BSc Computer Science, MSc Data Science), certification programs tend to focus on very specific expertise and skills in industry-relevant tools and processes. Certification programs can help to ensure the right competence level is clearly identified and communicated. Certification tests are used to assess skills and knowledge for this purpose.

Today, skills in the field of data science, machine learning, and analytics are in more demand than ever. KNIME Analytics Platform is one of the leading platforms. The Data Science with KNIME Software certificates from the KNIME Certification Program are testimony to your proficiency in the open source platform for data-driven innovation: they show your ability to develop, execute, and deploy data analytics projects. Certificate holders will boost their professional credibility; employers will more easily identify the right candidates to gain a competitive advantage.

About the collaborators

KNIME has teamed up with the University of Reading to develop the KNIME Certification Program. The motivation was to draw on the experience and know-how from academia and apply it to build an effective certification program. With research expertise in Data Science, Machine Learning, Big Data Analytics, and High Performance Computing for Computational Science, the Department of Computer Science of the University of Reading, headed by Dr. Giuseppe Di Fatta, was an ideal partner. The University of Reading awarded their first degree in Computer Science exactly 50 years ago in 1969. The Department of Computer Science has many years of experience in teaching Data Analytics, Data Mining, and Machine Learning, and moreover they have adopted KNIME Analytics Platform in teaching Data Analytics and Data Mining for over 10 years at undergraduate level and more recently at postgraduate level as well.

KNIME Certification Program

The certification program will consist of five levels (L1 to L5). Each level highlights a person’s expertise with different aspects of and practical skills in KNIME Software, as well as current data science concepts and know-how, such as data integration, data exploration, visualization, reporting, machine learning, and deployment of analysis workflows. We currently offer certificates for the first two levels of Data Science with KNIME Software: L1 and L2. To cater to every practitioner’s needs, we are in the process of defining different profiles for the higher levels. While we will include profiles (and thus individual certification examinations) for a generalist data science professional, there will also be certificates covering more specific topics such as text mining, big data applications, and management of KNIME Server. The pass mark for the certification tests is 70% (based on the grade boundary typically set in the UK undergraduate degree classification system for first-class honors degrees), and awarded certificates will be valid for two years. In this way, employers can be reassured that an applicant with a KNIME Certificate is up to date with the latest developments in KNIME and in data science.

Examination

  • L1
    • Proficiency in KNIME Analytics Platform for ETL, Data Analytics, and Visualization within the KNIME Certification Program for Data Science
    • Examination: 45 minute multiple-choice questionnaire
  • L2
    • Advanced proficiency in KNIME Analytics Platform and Basic Machine Learning within the KNIME Certification Program for Data Science
    • Examination: 45 minute multiple-choice questionnaire and a data science project (up to 6 hours of work over one week)

How to study?

To prepare yourself for these certification exams, we recommend the following methods of study:

Where can I take the exam?

To celebrate the start of this certification program, the first set of examinations for L1 and L2 will be offered on March 22, 2019, in Berlin. The two certification examinations have undergone testing by a cohort of BSc Computer Science students (see this website for more details) before the formal launch in Berlin.

Sign up to take either the L1 or L2 examinations here.


About the authors:

Dr. Giuseppe Di Fatta, Associate Professor of Computer Science

Giuseppe di Fatta

Giuseppe joined the University of Reading (UK) in 2006 and has been the Head of the Department of Computer Science since 2016. After graduating from the University of Palermo (Italy), he ventured into the academic world at EPFL (Lausanne, CH), ICSI (Berkeley, CA), ICAR-CNR (Italy), and the beautiful University of Konstanz, where he joined the initial KNIME development team and stayed until the first release of KNIME 1.0 in 2006. He has been using KNIME to teach data analytics and data mining for over 10 years. His research interests include data mining algorithms, distributed and parallel computing, and data-driven multidisciplinary applications.

 

Dr. Stefan Helfrich, Academic Alliance Manager

Certification_KNIME

Stefan is responsible for academic relations at KNIME. Before joining KNIME, he worked as a Bioimage Analyst at the University of Konstanz, supporting users of the local light microscopy facility with image and data analysis. Already during that time he realized how crucial it would be for people to build up the right set of skills for the job market, which is also a major motivation for him to teach data literacy skills (using KNIME).


KNIME and Blackjack

Posted by admin on Mon, 02/25/2019 - 10:00

Authors: Jakob Schröter & Marc Bux (KNIME)

Have you ever wondered how far you could push the boundaries of workflow design? Did you ever want to give free rein to your imagination when creating views in KNIME Analytics Platform? Do you also fancy the occasional round of cards? If the answer to any of these questions is “yes”, then this blog post is for you. Today, we’re creating our own game of blackjack in KNIME. We do this by using the Tile View (JavaScript) and CSS Editor nodes as well as the revised visual layout editor introduced in the latest KNIME release.

Tile View (JavaScript) node

 The new Tile View provides an alternate view of the data in a KNIME table. It’s analogous to the interactive Table View, but is particularly well suited to viewing datasets that include images or other graphics.

 

 

 

CSS Editor node

 Sometimes you want to exercise fine control over the styling of the JavaScript views in your KNIME WebPortal applications (or in Composite Views in KNIME Analytics Platform). We support this by allowing you to specify CSS that is applied to style JavaScript views. This node provides a syntax highlighted editor with auto-completion for editing that CSS.

 

 

Are you ready for a round of blackjack? Then place your bets.

The Game Loop

In the sections that follow, we go through the two parts of our game loop: the outer game loop and the inner game loop.

Outer Game Loop

At the heart of most video game programs lies a so-called game loop. It iterates over the game state, adjusting it based on player input and other events. For our blackjack game, we’re not stingy with loops, nesting one loop within another. The outer game loop iterates over multiple matches of blackjack. Each round of play begins with a new starting hand being dealt to the player and bank. More specifically, the player is dealt two cards, whereas the bank is dealt a single card.

The outer game loop iterates over multiple matches of blackjack
Fig. 1 The outer game loop iterates over multiple matches of blackjack

Each card that is dealt has certain properties: a rank, a suit, a numeric value, a recipient, and an associated image. The “Deal One Card” metanode template adds one random card (from an infinite deck of cards) to the list of cards currently in play. It uses the Read Images node to read vectorized card images kindly provided in form of the open source vector playing cards by Byron Knoll and Chris Aguilar. This list of cards represents the current game state, over which an inner game loop iterates. At the end of a blackjack match, a game over panel is displayed. In this game over panel, the player can decide whether they would like to play another round, leading to another iteration of the outer game loop.

The internal game state is represented as a list of cards. The metanode template “Deal One Card” adds one card to the game state.
Fig. 2 The internal game state is represented as a list of cards. The metanode template “Deal One Card” adds one card to the game state.
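Outside of KNIME, the game-state idea from Figure 2 can be sketched in a few lines of Python. This is only a minimal illustration, assuming a made-up card representation (a growing list of dictionaries with rank, suit, value, and recipient); the card images and the metanode logic of the actual workflow are not reproduced here.

```python
import random

RANKS = ["A", "2", "3", "4", "5", "6", "7", "8", "9", "10", "J", "Q", "K"]
SUITS = ["hearts", "diamonds", "spades", "clubs"]

def card_value(rank):
    """Numeric blackjack value of a rank (an ace is simply counted as 11 here)."""
    if rank == "A":
        return 11
    if rank in ("J", "Q", "K"):
        return 10
    return int(rank)

def deal_one_card(game_state, recipient):
    """Append one random card (from an infinite deck) to the list of cards in play."""
    rank, suit = random.choice(RANKS), random.choice(SUITS)
    game_state.append({"rank": rank, "suit": suit,
                       "value": card_value(rank), "recipient": recipient})
    return game_state

# Starting hand: two cards for the player, one for the bank.
state = []
deal_one_card(state, "player")
deal_one_card(state, "player")
deal_one_card(state, "bank")
print(state)
```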

Inner Game Loop

The inner game loop updates the game state of a single match while the player interacts with the game. It displays a game panel view, in which the player is presented with their own and the bank’s hands of cards. The player is also presented with the options at their disposal. Currently, the game is limited to the options of drawing a card (hit) or passing to the bank (stand). Other options that could be added in the future are the actions to take an insurance, double down, split, or surrender.

The inner game loop iterates over a single match of blackjack, adjusting the game state based on player input
Fig. 3 The inner game loop iterates over a single match of blackjack, adjusting the game state based on player input

If the player decides to draw a card (hit), the game deals one card to the player. If the player decides that they are happy with what they have (stand), the bank draws cards until its hand value reaches 17. At the end of the inner loop, the game state is checked and, if a termination condition is met, the match ends. Termination conditions are: blackjack by the player, player busted (too many cards drawn), bank busted, higher total, push (draw), and lower total.
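For readers who prefer pseudocode to screenshots, here is a rough Python sketch of one match, reusing the hypothetical `deal_one_card` helper from the previous snippet; the real workflow implements the same logic with loop nodes, and the player's choice comes from the game panel view rather than a callback. Aces are always counted as 11 here, which the actual game would handle more carefully.

```python
# Assumes deal_one_card(game_state, recipient) from the previous sketch is defined.

def hand_value(game_state, recipient):
    """Sum the card values of one recipient's hand."""
    return sum(c["value"] for c in game_state if c["recipient"] == recipient)

def play_match(get_player_choice):
    """Play one match: the player hits or stands, then the bank draws up to 17."""
    state = []
    deal_one_card(state, "player")
    deal_one_card(state, "player")
    deal_one_card(state, "bank")

    # Inner game loop: keep dealing while the player chooses "hit" and is not busted.
    while hand_value(state, "player") < 21 and get_player_choice(state) == "hit":
        deal_one_card(state, "player")

    player = hand_value(state, "player")
    if player > 21:
        return "player busted"

    # The bank draws until its hand value reaches 17.
    while hand_value(state, "bank") < 17:
        deal_one_card(state, "bank")
    bank = hand_value(state, "bank")

    # Remaining termination conditions.
    if player == 21:
        return "blackjack"
    if bank > 21:
        return "bank busted"
    if player > bank:
        return "higher total"
    if player == bank:
        return "push"
    return "lower total"

# Example: a player who always stands immediately.
print(play_match(lambda state: "stand"))
```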

The Game Panel

The player interface runs inside the web browser via the KNIME WebPortal (part of KNIME Server), so players don’t need to install KNIME Analytics Platform to play. If you don't have KNIME Server, you can still check out the functionality of the workflow and play the game by opening the interactive views of the wrapped metanodes inside KNIME Analytics Platform. This means that navigation through the game is manual rather than automatic as in the WebPortal.

Sticking Together the Game Panel UI

As you might know, multiple JavaScript-enabled views can be shown on one page by grouping them in wrapped metanodes and using the layout editor to define the layout. With the new visual layout editor introduced in KNIME Analytics Platform 3.7, layouting has gotten a lot easier, since what you see is now definitely what you get (have a look into the user manual for details). The game panel is implemented as a wrapped metanode and contains two Tile Views (one for the player, one for the bank) and game buttons for the actions at the player’s disposal. When the workflow is executed in the WebPortal, each iteration over the wrapped metanode “Game Panel” will be shown as a single page.

The wrapped metanode “Game Panel” looks like this:

The wrapped metanode “Game Panel”, which is responsible for displaying the current game state and available player actions
Fig. 4 The wrapped metanode “Game Panel”, which is responsible for displaying the current game state and available player actions

Cards of the Bank and the Player

The new Tile View (JavaScript) node is used to display the cards that make up the current hands of the player and bank. The nodes are configured to only show the images without any labels.

Configuration of the Tile View node
Fig. 5 Configuration of the Tile View node

Using Quickforms for Game Buttons

For the game buttons we used the Single Selection QuickForm node. The player’s choice (e.g., hit or stand) is registered and passed to downstream nodes via flow variables. By default, the Single Selection QuickForm node just displays simple radio buttons...

View provided by the metanode “Game Panel” without adjustment through the CSS Editor node
Fig. 6 View provided by the metanode “Game Panel” without adjustment through the CSS Editor node

...but we want the game to look nice.

Styling the Buttons, Cards and some Animations

The new CSS Editor node allows you to add any custom CSS code and therefore nearly unlimited options for styling the views (have a look into the CSS Styling Guide for more details).

With a bit of CSS, the radio buttons of the Single Selection QuickForm look like this:

View provided by the “Game Panel” metanode  after adjustment through the CSS Editor node
Fig. 7 View provided by the “Game Panel” metanode after adjustment through the CSS Editor node

We used the CSS Editor to add a nice CSS animation for the Tile View so the cards flip into view.

Configuration of the CSS Editor node
Fig. 8 Configuration of the CSS Editor node

The Game

So let’s fire up the WebPortal in our browser and play a round, shall we?

 

Winner, winner, chicken dinner! Ready to play? Then download the workflow from the KNIME public EXAMPLES Server at knime://EXAMPLES/50_Applications/50_Blackjack and run it in the WebPortal.

This little workflow showcases how KNIME can be used for interactive and visually adjustable web applications that are fun to use. It was made possible by recent additions to KNIME in Analytics Platform 3.7.1, namely (1) the styling provided by the CSS Editor node, (2) the easy-to-use layouting of the new visual layout editor, and, most importantly (3) the graphics provided by the Tile View (JavaScript) node.

What other interactive web applications are you implementing in KNIME? We’d love to hear, so hit us.

How to Build Pivot Tables - A Vlog

Posted by admin on Mon, 03/04/2019 - 10:00

Authors: Maarit Widmann & Casiana Rimbu

To pivot or not to pivot, that is the question.

Did you know that a pivot table allows you to quickly summarize your data based on group, pivot, and aggregation columns? This summary might include sums, averages, or other statistics, which the pivot table splits in a meaningful way across different subgroups, drawing attention to useful information.

How to build a pivot table in KNIME Analytics Platform

Fig. 1: A pivot table showing the average sunshine hours for each city in each month. This table was constructed by applying the pivoting function to a dataset that contains at least one column for month (group column), one column for city (pivot) and one column for sunshine hours (aggregation column).
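For comparison, the same kind of summary can be produced in a few lines of Python with pandas; the column names and numbers below are made up for illustration and are not the dataset used in the videos.

```python
import pandas as pd

# Toy data: one row per (month, city) measurement of sunshine hours.
data = pd.DataFrame({
    "Month":    ["January", "January", "February", "February"],
    "City":     ["Berlin", "Rome", "Berlin", "Rome"],
    "Sunshine": [45, 120, 70, 130],
})

# Group column = Month, pivot column = City, aggregation column = Sunshine (average).
pivot = pd.pivot_table(data, index="Month", columns="City",
                       values="Sunshine", aggfunc="mean")
print(pivot)
```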

Would you like to know more about how to use the Pivoting node in KNIME Analytics Platform? This vlog features three videos showing you how to use the Pivoting node and how to apply basic aggregation methods, such as sum and count, statistical aggregation methods, and the aggregation methods available for columns of type Date&Time. We also show how to apply multiple aggregation methods to one or more aggregation columns.

So, settle down in your chair and get started with the Pivoting Trilogy, starring the Pivoting node.

The Pivoting Node

The video below shows you how to build a pivot table that summarizes data using the Pivoting node. We explain the elements of a pivot table, i.e., groups, pivots, the aggregation method, and the aggregation column, and show how to define these settings in the configuration dialog of the Pivoting node.

 

Pivoting with Complex Aggregation Methods

This next video shows some advanced layouts of the pivot table using multiple aggregation columns, statistical aggregation methods, and aggregations for columns of type Date&Time. We show you how to apply multiple aggregation methods in the same pivot table and discuss aggregation methods for columns of type Date&Time, introducing a new aggregation method, Date range (day), as well as mean, standard deviation, and other statistical measures. Finally, we show that some aggregation methods automatically disable the option to include missing values.
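As a rough equivalent outside of KNIME, pandas can also apply several aggregation methods to one or more aggregation columns in a single pivot table; again, the columns and values below are only illustrative.

```python
import pandas as pd

data = pd.DataFrame({
    "Month":       ["January"] * 4 + ["February"] * 4,
    "City":        ["Berlin", "Berlin", "Rome", "Rome"] * 2,
    "Sunshine":    [40, 50, 115, 125, 65, 75, 125, 135],
    "Temperature": [0.5, 1.5, 7.5, 8.7, 1.0, 1.4, 8.6, 9.4],
})

# Two aggregation columns, each summarized with three aggregation methods.
pivot = pd.pivot_table(data, index="Month", columns="City",
                       values=["Sunshine", "Temperature"],
                       aggfunc=["mean", "std", "count"])
print(pivot)
```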

Pivoting with Multiple Columns

The last video shows some advanced layouts of the pivot table with multiple pivoting and/or grouping columns and how to set the Pivoting node in order to achieve them. We also introduce the new Mode and (Unique) concatenate aggregation methods.

Further resources:

  • What’s data aggregation? An explanation is available in this video.
  • Examples for using the Pivoting node EXAMPLES Server: 02_ETL_Data_Manipulation/02_Aggregations/09_Examples_for_Using_the_Pivoting_Node
  • Pivoting in Databases EXAMPLES Server: 01_Data_Access/02_Databases/05_Pivoting_in_Databases

Market Simulation with KNIME: Android vs iOS

Posted by admin on Mon, 03/11/2019 - 10:00

Author: Ted Hartnell (CTO of Scientific Strategy)

What is Market Simulation?

A market simulation is a way to model a real world market. Just as real world markets have products, features, brands, stores, locations, and competitive rivals, so does a market simulation. But what makes a market simulation truly realistic are the customers. Simulations can generate tens of thousands of virtual customers designed to mimic the purchase decisions of real world shoppers. Customers evaluate the differentiation offered by each product.

Market simulations provide a way to understand the economic complexities of a market. They are used by academics, students, and business managers to predict how customers will react to change. The change might include a change in price, a change in product assortment, or the emergence of a new competitor. These predictions lead to improved business strategies that increase market share, revenue, and profitability.

"Market Simulation" has now joined the long list of tools available on KNIME Analytics Platform. In this blog, we take a look at the Market Simulation Community Edition of nodes for KNIME that has been developed by Scientific Strategy.

  • Example workflows for the Community Edition are available from the Scientific Strategy website.
  • The workflow described in this blog post is on the publicly available KNIME EXAMPLES Server, here: EXAMPLES/40_Partners/03_ScientificStrategy/01_Android_vs_iOS

The free Market Simulation software can be downloaded directly from within KNIME Analytics Platform after enabling the “Partner Update Site” in the Install Preferences (File->Preferences->Install/Update->Available Software Sites).

Underlying Science

Market Simulation is built upon the same principles as Conjoint Analysis and mainstream economics. The simulation uses an Agent-Based Model (ABM) to replicate the decision making process of individual customers.

Customers purchase those products that give them the greatest Consumer Surplus, that is, the difference between their Willingness To Pay (WTP) for a product and its price. A customer’s WTP for a product is the sum of the “part-worth” values of its independent features.

Market Simulation on KNIME

The free Market Simulation Community Edition for KNIME is a set of more than 20 nodes dedicated to creating, tuning, and simulating markets. Almost any market can be simulated, although business-to-consumer (B2C) markets are easier to tune.

Scientific Strategy's Market Simulation nodes for KNIME can be used to simulate any market – like this one for Android vs iOS phones.
Fig. 1: Scientific Strategy's Market Simulation nodes for KNIME can be used to simulate any market – like this one for Android vs iOS phones.

KNIME Workflow: Android vs iOS

The competitive battle between Android and iOS phones illustrates how the new Market Simulation nodes work on KNIME.

Apple is the sole supplier of iOS iPhones, while Android phones are manufactured and sold by many suppliers. Apple is also the most profitable supplier of phones, with USD 1000+ iPhones, which are more expensive than comparable Android phones.

While Android suppliers are less profitable and generate less revenue, Android phones have a larger overall market share by quantity sold. Many varieties are available, with the price of Android phones ranging from very cheap to very expensive.

A KNIME workflow that simultaneously models all these market characteristics can be easily assembled with these Market Simulation nodes from Scientific Strategy.

Five of the new Market Simulation nodes have been highlighted within the Android vs iOS workflow.
Fig. 2: Five of the new Market Simulation nodes have been highlighted within the Android vs iOS workflow.

New Market Simulation Nodes

The Android vs iOS workflow creates an Agent-Based Model (ABM) of 10,000 Virtual Customers. The products in the market include:

  • Apple’s iPhone
  • Samsung’s Galaxy
  • LG’s phone
  • Google’s Pixel
  • Oppo’s OnePlus

The two features that customers most value are:

  1. Software (iOS or Android)
  2. Branded Hardware

The prices of the phones range from USD 1,000 (for the iPhone) down to USD 500 (for the OnePlus). The manufacturing cost of the high-end Apple phone has been estimated at USD 800, while the cost of comparable high-end Android phones has been estimated to be only USD 700. While the component costs of the hardware going into each of these phones are about the same, the cost of developing and maintaining the iOS software is higher for Apple. This is because Android software is available for “free”.

Note that this workflow is only meant to illustrate how the KNIME nodes can be used to simulate a market. No specific product SKUs or market channel has been selected for simulation, nor has the model been tuned to make accurate predictions. While the Community Edition nodes do include these capabilities, they have not been illustrated here.

Value of Software

The part-worth value that individual customers have for iOS software is different to their value for Android software. Android software provides users with more options, while iOS software has a better reputation for reliability.

While individual customers place different levels of value on the Android and iOS software running in a new phone, the average value of software across all customers is the same for both.
Fig. 3: While individual customers place different levels of value on the Android and iOS software running in a new phone, the average value of software across all customers is the same for both.

However, the average value for Android and iOS is about the same and has been set in the simulation to USD 500. Because there is no difference in this average value, the software is said to offer no Vertical Differentiation.

Fig. 4: The average part-worth "Value" of iOS and Android software is the same.

The new Customer Distributions node then converts these average values into two distributions. Each distribution contains the individual part-worth values for the 10,000 customers in the market.

Each of the 10,000 customers (C00001 to C10000) have their own individual part-worth value for both types of software.
Fig. 5: Each of the 10,000 customers (C00001 to C10000) have their own individual part-worth value for both types of software.
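A minimal way to mimic this step with NumPy: draw an individual part-worth value for each of the 10,000 virtual customers around the common average of USD 500. The normal distribution and the standard deviation used here are assumptions made for illustration; the actual Customer Distributions node has its own configuration options.

```python
import numpy as np

rng = np.random.default_rng(seed=42)
n_customers = 10_000

# One individual part-worth value per virtual customer for each type of software.
wtp_android = rng.normal(loc=500, scale=100, size=n_customers)
wtp_ios     = rng.normal(loc=500, scale=100, size=n_customers)

# Roughly the same average value for both: no vertical differentiation from software.
print(round(wtp_android.mean(), 1), round(wtp_ios.mean(), 1))
```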

Value of Branded Hardware

The part-worth value customers have for branded hardware approximates the value of software at around USD 500. But in this case, the higher value for Apple’s hardware (USD 700) reflects the superior brand power of Apple. On the other hand, Oppo’s low value (USD 200) reflects the inferiority of its phone’s camera, processor, and memory.

Customers place a higher average value on the Apple brand and a lower average value on Oppo hardware. “Conformity” is an indication of how similar customers perceive all Android hardware.
Fig. 6: Customers place a higher average value on the Apple brand and a lower average value on Oppo hardware. “Conformity” is an indication of how similar customers perceive all Android hardware.

The new Matrix Distributions node works like the Customer Distributions node by generating customer distributions of part-worth values for each type of branded hardware.

But here the Matrix Distributions node also considers the degree of similarity between features. In this case, many customers believe that “all Android phones are the same”. This belief is reflected in the high levels of “Conformity” between the Android phones but not the Apple phone (conformity is set between 0.0 and 1.0). This conformity means that Apple has Horizontal Differentiation with respect to Android phones.

Features to Products

The part-worth values of software and hardware need to be combined into overall Willingness To Pay (WTP) values for the products. This is achieved with two new nodes:

  1. Feature Table To List node
  2. Product Generator node
The Product Generator converts features into products.
Fig. 7: The Product Generator converts features into products.

The upstream Table Creator node describes all the features that make up the products, along with the manufacturing cost and price of each.

Each phone is made up of a branded hardware Feature and a software Platform.
Fig. 8: Each phone is made up of a branded hardware Feature and a software Platform.

The Product Generator node creates a final “Product Array” and “WTP Matrix”. This is all the data needed to run a predictive simulation of the market.

The Product Array summarizes each of the final products.
Fig. 9: The Product Array summarizes each of the final products.
The Willingness To Pay (WTP) Matrix matches each product to each customer.
Fig. 10: The Willingness To Pay (WTP) Matrix matches each product to each customer.

 

Market Simulation

The Simulate Market node is the last new node in the workflow. It takes both the Product Array and WTP Matrix and calculates the Consumer Surplus for each customer (recall that Consumer Surplus equals WTP minus price). The Simulate Market node then predicts which product each customer will purchase (or “No Sale” if the customer finds none of the products attractive).

The Simulate Market node predicts which product each of the 10,000 customers will buy based upon their individual Consumer Surplus.
Fig. 11: The Simulate Market node predicts which product each of the 10,000 customers will buy based upon their individual Consumer Surplus.
The Simulate Market node replicates the decision making process of each Customer. Customers who buy the iPhone pay only the $1000 price but might have paid up to their WTP. The difference is their Consumer Surplus which reflects their overall satisfaction with their purchase.
Fig. 12: The Simulate Market node replicates the decision making process of each customer. Customers who buy the iPhone pay only the USD 1000 price but might have paid up to their WTP. The difference is their Consumer Surplus which reflects their overall satisfaction with their purchase.
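The purchase rule itself is easy to sketch in Python: each customer computes a Consumer Surplus (WTP minus price) for every product and buys the one with the largest surplus, or nothing if no surplus is positive. The WTP numbers below are arbitrary toy values; in the workflow, the real WTP Matrix comes from the Product Generator node.

```python
import numpy as np

products = ["iPhone", "Galaxy", "LG", "Pixel", "OnePlus"]
prices   = np.array([1000, 900, 800, 850, 500])

rng = np.random.default_rng(seed=1)
# Toy WTP Matrix: one row per customer, one column per product.
wtp = rng.normal(loc=900, scale=250, size=(10_000, len(products)))

surplus = wtp - prices                     # Consumer Surplus = WTP minus price
best = surplus.argmax(axis=1)              # product with the largest surplus
no_sale = surplus.max(axis=1) < 0          # none of the products is attractive

choices = np.where(no_sale, "No Sale", np.array(products)[best])
names, counts = np.unique(choices, return_counts=True)
print(dict(zip(names, counts)))            # predicted units sold per product
```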

Simulation Results

The results from the Market Simulation reflect the results from the real-world market:

  1. Android has a bigger overall market share (by quantity sold) than the iPhone
  2. Apple generates the most revenue
  3. Apple is also much more profitable than all Android phones combined
Predicted market share (by quantity) of each phone in the market.
Fig. 13: Predicted market share (by quantity) of each phone in the market.
Market Simulation reflects the same results seen in the real-world, with 54% of customers buying an Android phone and only 42% buying an iOS phone. There are also 3.9% of customers who don’t buy anything.
Fig. 14: Market Simulation reflects the same results seen in the real-world, with 54% of customers buying an Android phone and only 42% buying an iOS phone. There are also 3.9% of customers who don’t buy anything.

Conclusion

Market Simulation works because the differentiation offered by each product has been quantified. On average, customers perceive no difference between Android and iOS software. Hence the software provides no Vertical Differentiation. However, the fact that individual customers disagree whether Android or iOS is better means that the software does provide Horizontal Differentiation.

The differentiation provided by the branded hardware is also a factor. While the Apple brand does offer some Vertical Differentiation, what’s more important is the fact that the Android hardware is undifferentiated (customers believe “all Android phones are the same”). This leads to greater rivalry among the Android manufacturers which drives down the price.

When all types of differentiation are quantified the dynamics of the entire market can be modeled.

Data Sources

The input parameters for this Market Simulation are loosely based upon the range of phones available in the USA. While Chinese-made phones have a large global market share, customers only make purchase decisions from the range of products available in their own geography.

The input data for this analysis was inspired by the CNET article “Why your iPhone and Android phone will cost more in 2019” by Jessica Dolcourt (2-Jan-2019).

Cost estimates were inspired by the analysis by HiSilicon found in the article Economic Research Working Paper No. 41: Intangible assets and value capture in global value chains: the smartphone industry by Jason Dedrick and Kenneth L. Kraemer (Nov-2017).

About the Author

Ted Hartnell is the developer of the free Scientific Strategy Market Simulation (Community Edition) toolkit. Ted is also the founder of an eCommerce Optimization company called Revenue Watch. The company’s flagship product, RADAR, is built upon the same Market Simulation technology as the free Community Edition.

Ted is a speaker at KNIME Spring Summit. Register for the Summit and come and listen to him talk about "Market Simulation with KNIME: An Economic History of Beer"!

Ted has engineering and law degrees from Sydney University, and an MBA from UC Berkeley. Ted’s post-graduate market science research was conducted at Dartmouth College. The first Market Simulation tool Ted developed was at the Haas School of Business in 1999. He has spent much of the last 20 years improving it. He has also worked in the Internet of Things industry for Intel, at a Goldman Sachs subsidiary developing Wall Street’s high-frequency trading platforms, and in price optimization.

About Scientific Strategy

Scientific Strategy

Companies suffer from hyper-competition, brand proliferation, cannibalization, and margin erosion. Market Simulation provides a way to understand market complexity and identify winning strategies.

Scientific Strategy is a trusted KNIME technology partner. Their Market Simulation toolkit was built using Data Analytics and Artificial Intelligence (AI). The Community Edition running on KNIME Analytics Platform is freely available for you to download and use today.

Scientific Strategy

Text Encoding: A Review

Posted by admin on Mon, 03/18/2019 - 10:00

Authors: Rosaria Silipo and Kathrin Melcher

The key to performing any text mining operation, such as topic detection or sentiment analysis, is to transform words into numbers, and sequences of words into sequences of numbers. Once we have numbers, we are back in the well-known game of data analytics, where machine learning algorithms can help us with classifying and clustering.

We will focus here exactly on that part of the analysis that transforms words into numbers and texts into number vectors: text encoding.

For text encoding, there are a few techniques available, each one with its own pros and cons and each one best suited for a particular task. The simplest encoding techniques do not retain word order, while others do. Some encoding techniques are fast and intuitive, but the size of the resulting document vectors grows quickly with the size of the dictionary. Other encoding techniques optimize the vector dimension but lose in interpretability. Let’s check the most frequently used encoding techniques.

1. One-Hot or Frequency Document Vectorization (not ordered)

One commonly used text encoding technique is document vectorization. Here, a dictionary is built from all words available in the document collection, and each word becomes a column in the vector space. Each text then becomes a vector of 0s and 1s. 1 encodes the presence of the word and 0 its absence. This numerical representation of the document is called one-hot document vectorization.

A variation of this one-hot vectorization uses the frequency of each word in the document instead of just its presence/absence. This variation is called frequency-based vectorization.

Document Vectorization

 

While this encoding is easy to interpret and to produce, it has two main disadvantages. It does not retain the word order in the text, and the dimensionality of the final vector space grows rapidly with the word dictionary.

The order of the words in a text is important, for example, to take into account negations or grammar structures. On the other hand, some more primitive NLP techniques and machine learning algorithms might not make use of the word order anyway.

Also, the rapidly growing size of the vector space might become a problem only for large dictionaries. And even in this case, the word number can be limited to a maximum, for example, by cleaning and/or extracting keywords from the document texts.
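The idea is easy to reproduce with scikit-learn's CountVectorizer, shown here on two made-up sentences: the binary variant gives the one-hot document vectors, the default variant gives the frequency-based ones.

```python
from sklearn.feature_extraction.text import CountVectorizer

docs = ["the cat sat on the mat", "the dog chased the cat"]

# One-hot document vectorization: 1 if a word occurs in the document, 0 otherwise.
onehot = CountVectorizer(binary=True)
print(onehot.fit_transform(docs).toarray())
print(onehot.get_feature_names_out())   # the dictionary built from the collection

# Frequency-based variation: the count of each word in each document.
freq = CountVectorizer()
print(freq.fit_transform(docs).toarray())
```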

2. One-Hot Encoding (ordered)

Some machine learning algorithms can build an internal representation of items in a sequence, like ordered words in a sentence. Recurrent Neural Networks (RNNs) and LSTM layers, for example, can exploit the sequence order for better classification results.

In this case, we need to move from a one-hot document vectorization to a one-hot encoding, where the word order is retained. Here, the document text is represented again by a vector of presence/absence of words, but the words are fed sequentially into the model.

One-Hot Encoding

 

When using the one-hot encoding technique, each document is represented by a tensor. Each document tensor consists of a possibly very long sequence of 0/1 vectors, leading to a very large and very sparse representation of the document corpus.
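A sketch of this ordered one-hot encoding with NumPy, using a toy vocabulary: each document becomes a tensor with one 0/1 row per word, in the order in which the words appear.

```python
import numpy as np

vocabulary = ["cat", "chased", "dog", "mat", "on", "sat", "the"]
word_index = {w: i for i, w in enumerate(vocabulary)}

def one_hot_sequence(text):
    """Encode a text as a sequence of one-hot vectors, preserving the word order."""
    words = text.lower().split()
    tensor = np.zeros((len(words), len(vocabulary)), dtype=int)
    for position, word in enumerate(words):
        tensor[position, word_index[word]] = 1
    return tensor

print(one_hot_sequence("the dog chased the cat"))   # shape: (5 words, 7 dictionary entries)
```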

3. Index-Based Encoding

 

Another encoding that preserves the order of the words as they occur in the sentences is the Index-Based Encoding. The idea behind the index-based encoding is to map each word with one index, i.e., a number.

The first step is to create a dictionary that maps words to indexes. Based on this dictionary, each document is represented through a sequence of indexes (numbers), each number encoding one word. The main disadvantage of the Index-Based Encoding is that it introduces a numerical distance between texts that doesn’t really exist.

Index-Based Encoding

 

Notice that index-based encoding produces document representations of different lengths: the sequences of indexes have variable length, whereas the one-hot document vectors described earlier have fixed length.
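The same toy example with index-based encoding: every word is replaced by its dictionary index, so each document becomes a variable-length sequence of integers.

```python
vocabulary = ["cat", "chased", "dog", "mat", "on", "sat", "the"]
# Reserve index 0 for padding; real words start at index 1.
word_index = {w: i + 1 for i, w in enumerate(vocabulary)}

def index_encode(text):
    """Map each word of a text to its integer index."""
    return [word_index[w] for w in text.lower().split()]

print(index_encode("the cat sat on the mat"))   # [7, 1, 6, 5, 7, 4]
print(index_encode("the dog chased the cat"))   # [7, 3, 2, 7, 1] -- a different length
```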

 

4. Word Embedding

The last encoding technique that we want to explore is word embedding. Word embeddings are a family of natural language processing techniques aiming at mapping semantic meaning into a geometric space [1]. This is done by associating a numeric vector to every word in a dictionary, such that the distance between any two vectors would capture part of the semantic relationship between the two associated words. The geometric space formed by these vectors is called an embedding space. The best known word embedding techniques are Word2Vec and GloVe.

Practically, we project each word into a continuous vector space, produced by a dedicated neural network layer. The neural network layer learns to associate a vector representation of each word that is beneficial to its overall task, e.g., the prediction of surrounding words [2].
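In Keras, for example, such a space is produced by an Embedding layer that turns index-encoded words into dense vectors and is trained together with the rest of the network; the vocabulary size and embedding dimension below are arbitrary.

```python
import numpy as np
from tensorflow.keras.layers import Embedding

vocab_size = 10_000   # number of words in the dictionary (index 0 reserved for padding)
embedding_dim = 50    # dimensionality of the embedding space

# Maps each word index to a trainable 50-dimensional vector.
embedding = Embedding(input_dim=vocab_size, output_dim=embedding_dim)

doc = np.array([[7, 1, 6, 5, 7, 4]])   # one index-encoded document of 6 words
print(embedding(doc).shape)            # (1, 6, 50): one 50-dimensional vector per word
```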

Embedding Layer

 

Auxiliary Preprocessing Techniques

Many machine learning algorithms require a fixed length of the input vectors. Usually, a maximum sequence length is defined as the maximum number of words allowed in a document. Documents that are shorter are zero-padded. Documents that are longer are truncated. Zero-padding and truncation are then two useful auxiliary preparation steps for text analysis.

Zero-padding means adding as many zeros as needed to reach the maximum number of words allowed.

Truncation means cutting off all words after the maximum number of words has been reached.
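Keras ships a small utility that performs exactly these two steps; here it is applied to toy index sequences with an assumed maximum length of six words.

```python
from tensorflow.keras.preprocessing.sequence import pad_sequences

sequences = [
    [7, 1, 6, 5, 7, 4],          # exactly 6 words
    [7, 3, 2, 7, 1],             # shorter: will be zero-padded
    [7, 1, 6, 5, 7, 4, 3, 2],    # longer: will be truncated
]

# Append zeros up to 6 entries and cut off everything after the 6th word.
fixed = pad_sequences(sequences, maxlen=6, padding="post", truncating="post")
print(fixed)
```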

Summary

We have explored four commonly used text encoding techniques:

  • Document Vectorization
  • One-Hot Encoding
  • Index-Based Encoding
  • Word Embedding

Document vectorization is the only technique not preserving the word order in the input text. However, it is easy to interpret and easy to generate.

One-Hot encoding is a compromise between preserving the word order in the sequence and maintaining the easy interpretability of the result. The price to pay is a very sparse, very large input tensor.

Index-Based Encoding tries to address both input data size reduction and sequence order preservation by mapping each word to an integer index and grouping the index sequence into a collection type column.

Finally, word embedding projects the index-based encoding or the one-hot encoding into a numerical vector in a new space with smaller dimensionality. The new space is defined by the numerical output of an embedding layer in a deep learning neural network. An additional advantage of this approach is the close mapping of words with similar roles. The disadvantage, of course, is the higher degree of complexity.

We hope that we have provided a sufficiently general and complete description of the currently available text encoding techniques for you to choose the one that best fits your text analytics problem.

References:

[1] Chollet, Francois “Using pre-trained word embeddings in a Keras model”, The Keras Blog, 2016

[2] Brownlee, Jason “How to Use Word Embedding Layers for Deep Learning with Keras”, Machine Learning Mystery, 2017

 

As first published in Data Science Central.

Use Deep Learning to Write Like Shakespeare

Posted by admin on Mon, 03/25/2019 - 10:00

Author: Rosaria Silipo

LSTM recurrent neural networks can be trained to generate free text.
Let’s see how well AI can imitate the Bard.

“Many a true word hath been spoken in jest.”
― William Shakespeare, King Lear

“O, beware, my lord, of jealousy;
It is the green-ey’d monster, which doth mock
The meat it feeds on.”
― William Shakespeare, Othello

“There was a star danced, and under that was I born.”
― William Shakespeare, Much Ado About Nothing

Who can write like Shakespeare? Or even just spell like Shakespeare? Could we teach AI to write like Shakespeare? Or is this a hopeless task? Can an AI neural network describe despair like King Lear, feel jealousy like Othello, or use humor like Benedick? In theory, there is no reason why not if we can just teach it to.

From MIT’s The Complete Works of William Shakespeare website, I downloaded the texts of three well-known Shakespeare masterpieces: “King Lear,” “Othello,” and “Much Ado About Nothing.” I then trained a deep learning recurrent neural network (RNN) with a hidden layer of long short-term memory (LSTM) units on this corpus to produce free text.

Was the neural network able to learn to write like Shakespeare? And if so, how far did it go in imitating the Bard’s style? Was it able to produce a meaningful text for each one of the characters in the play? In an AI plot, would Desdemona meet King Lear and would this trigger Othello’s jealousy? Would tragedy prevail over comedy? Would each character maintain the same speaking style as in the original theater play?

I am sure you have even more questions. So without further ado, let’s see whether our deep learning network could produce poetry or merely play the fool.

Generating free text with LSTM neural networks

Recurrent neural networks (RNN) have been successfully experimented with to generate free text. The most common neural network architecture for free text generation relies on at least one LSTM layer.

To train our first Shakespeare simulator, I used a neural network of only three layers: an input layer, an LSTM layer, and an output layer (Figure 1).

The network was trained at character level. That is, sequences of m characters were generated from the input texts and fed into the network.

Each character was encoded using one-hot encoding. This means that each character was represented by a vector of size n, where n is the size of the character set from the input text corpus.

The full input tensor with size [m, n] was fed into the network. The network was trained to associate the next character at position m+1 to the previous m characters.

All of this leads to the following network:

  • The input layer with n units would accept [m, n] tensors, where n is the size of the character set and m the number of past samples (in this case characters) to use for the prediction. We arbitrarily chose m=100, estimating that 100 past characters might be sufficient for the prediction of character number 101. The character set size n, of course, depends on the input corpus.
  • For the hidden layer, we used 512 LSTM units. A relatively high number of LSTM units is needed to be able to process all of these (past m characters - next character) associations.
  • Finally, the last layer included n softmax activated units, where n is the character set size again. Indeed, this layer is supposed to produce the array of probabilities for each one of the characters in the dictionary. Therefore, n output units, one for each character probability.
Using Deep Learning to write Shakespeare

Figure 1. The deep learning LSTM-based neural network we used to generate free text: n input neurons, 512 hidden LSTM units, and an output layer of n softmax units, where n is the character set size, in this case the number of characters used in the training set.

Notice that in order to avoid overfitting, an intermediate dropout layer was temporarily introduced during training between the LSTM layer and the output dense layer. A dropout layer randomly excludes some units during each iteration of the training phase. The dropout layer was then removed for deployment.
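For readers who prefer code to diagrams, the architecture described above looks roughly like the following Keras sketch; the actual network in this post was assembled with KNIME's Keras integration nodes rather than in Python, and the dropout rate shown is an assumption.

```python
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Input, LSTM, Dropout, Dense

m, n = 100, 62   # m past characters per sample, n characters in the character set

model = Sequential([
    Input(shape=(m, n)),              # one-hot encoded sequence of m characters
    LSTM(512),                        # hidden layer of 512 LSTM units
    Dropout(0.2),                     # used only during training, removed for deployment
    Dense(n, activation="softmax"),   # probability for each character in the set
])

model.compile(loss="categorical_crossentropy", optimizer="adam")
model.summary()
```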

Building, training, and deploying the neural network

The network was trained on the full texts of “King Lear,” “Othello,” and “Much Ado About Nothing,” available from The Complete Works of William Shakespeare website, a total of 13,298 sentences.

The neural network described above was built, trained, and deployed using the GUI-based integration of Keras and TensorFlow provided by KNIME Analytics Platform.

The workflow to build and train the network is shown in Figure 2. The workflow to deploy the network to predict the final text, character by character, is shown in Figure 3.

These workflows have been copied and adapted from the workflows implemented in the blog post by Kathrin Melcher, “Once Upon a Time … by LSTM Network,” where a similar network was trained on texts from Grimm’s fairy tales and deployed to generate free text. Both workflows are available and downloadable for free from the KNIME EXAMPLES server under 04_Analytics/14_Deep_Learning/02_Keras/11_Generate_Fairy_Tales.


In Figure 2, the brown blocks (nodes) in the Define Network Structure section (top left) build the different layers of the neural network. The data are cleaned, standardized, reshaped, and transformed by the nodes in the Preprocessing and Encoding section (lower left). Finally, the training is carried out by the Keras Network Learner node and the network is stored away for deployment.

Notice that if the training set is large, this network can take quite a long time to train. It is possible to speed it up by pointing KNIME Analytics Platform to a Keras installation for GPUs.

The deployment workflow in Figure 3 reads and uses the previously trained network to predict free text, one character after the next.

Using Deep Learning to write Shakespeare

Figure 2. The training workflow trains a neural network (62 inputs -> 512 LSTM -> 62 outputs) to predict the next character in the text, based on the previous 100 characters. Training set consists of the texts of “King Lear,” “Othello,” and “Much Ado About Nothing” by William Shakespeare. Thus, the network should learn to build words and sentences in a Shakespearean style.

Using Deep Learning to write Shakespeare

Figure 3. The deployment workflow takes 100 characters as the start and then generates text character after character in a loop, until 1,000 characters are generated, making the full, final Shakespeare-like text.
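The deployment loop boils down to a few lines of Python: repeatedly predict the next character from the last 100 characters, append it, and slide the window forward. This is only a sketch that assumes a trained Keras model and the character set as inputs; the blog's workflow builds the same loop with KNIME loop nodes, and picking the most likely character is just one of several possible sampling strategies.

```python
import numpy as np

def generate(model, charset, seed_text, m=100, n_new=1000):
    """Generate n_new characters, starting from a seed of at least m characters."""
    char_to_idx = {c: i for i, c in enumerate(charset)}
    text = seed_text
    for _ in range(n_new):
        window = text[-m:]                          # the last m characters
        x = np.zeros((1, m, len(charset)))
        for t, ch in enumerate(window):
            x[0, t, char_to_idx[ch]] = 1            # one-hot encode the window
        probs = model.predict(x, verbose=0)[0]      # probability for each character
        text += charset[int(np.argmax(probs))]      # append the most likely character
    return text
```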

AI-generated Shakespearean text

Here are some results, after training the network for 50 epochs on the entire data set—the three selected theater plays—using 100 past characters for the input tensor and recursively generating 1,000 new characters.

Let’s start with 100 characters from a dialog between Othello and Desdemona (in red) and then see how our network would continue (in plain text).

- Othello
O my fair warrior! 

- Desdemona
My dear Othello! 

- Othello
It gives me wonder great as my content

To see you here before me. O my soul’s joy!
I am a man a worm.

- Gloucester
I know not, sweet:
I found your like a thief from the heart
That pirchas will be well.
The general speaks to be so
Turn a man, I think, besoou.

- Cassio
I pray you, sir, to lie: in this hand is not a tend and talking of it;
I would not be threaten dispatch.
Our good old friend,
Lay comforts the state, seek for him;
I will grife you be faster’d!
And the blessed course of dower:
Net forth enough to do you;
And that the Moor is defective in the letter!
Abhorre, heaven, I will go sor;
And the other too.

- Othello
I have a seet me such a trial of his speech,
That he shall live the Moor in the lies
That with daush’er
Holds it is a most poor man,
Whose welchers of the state,
A man that many call of their life
That have to lost the boy look to’t.

- Regan
Sir, to my sister?

- Oswald
I pray you, have your hand: you may receive it all to his sorrage, and makes the heavens
Cassio lies that in the heart
That I may speak: I’ll wast …

While Desdemona disappears from the dialog, a few minor characters from “King Lear” (Gloucester, Regan, and Oswald) trickle into the scene. We have created other free texts starting from different lines in the three plays, and it is interesting to notice that characters from the tragedies “Othello” and “King Lear” mix often, while characters from the comedy “Much Ado About Nothing” seem to be more self-contained.

The text is generated using words from Shakespearean English and its characteristic complex sentence structure. Not all sentences make perfect sense, but most seem to. Notice also that minor characters are assigned shorter dialogs, while Othello and Cassio, in this example, get the longest speeches—as it should be.

Using AI to write an original scene

Let’s use the first 100 characters (again in red) to trigger a new scene. Here I introduced the keyword “SCENE,” the place, the involved characters, and the name of the first character to speak. The neural network does the rest.

SCENE I.
Venice. A street.
Enter Roderigo and Iago 

- Desdemona
I pray, talk me of Cassio.
Ay, so I hear the write to prive and there,
That she would seen him that present so lich wored old wat,
and the best conscionable than in this revolumance against him:
There’s son against father,
and thy father’s son and the best of our time hath no exalse your counsel watch
The worst is not a tender to the warlike isle,
That sunded must needs take the shame
which the revenges of the self-same malk than the best of our times;
keeps our fortunes fend for bearing to a strength,
Sight in their natures, letting go safely by the rack: I swear ‘tis, to be so
That she will send back my messenger.

- Gloucester
I see the rust in the stocks.

- King Lear
What a trifore be some cartiou, I can tell my way
Than should be assurather, despise my brother;
That I have passed him, tell me I was every think of fear,
That she may be honest yet he hath confess’d in him entertains
and think the next way to stain it,
That the main commet in the least
Would fail her breath,
That she may ...

Creating a new scene proved to be a much harder task than continuing an existing one. Desdemona’s talk starts well, and it is consistent with the original play. However, I am not sure I can follow it to the end.

Using AI to mix Shakespearean and Modern English

But seriously, how useful is an AI engine that produces archaic English text?

Can we somehow improve current English sentences with a touch of Shakespeare? Let’s feed our Shakespeare-trained network some modern text and see what emerges.

So, what kind of everyday, modern text could do with a bit of Shakespearean improvement? A software license agreement? Let’s see how much more interesting the network can make the start of a license agreement (in red).

This License refers to version of the GNU General Public License. Copyright also means copyright-bick,
Remade me any thing to his sword
To his salt and most hidden loose to be so for sings, but not in a libutt of his matter than that shall be sure as will be soldye
As master compary, do not live in traitor.
Bless thy five wits!

-Kent
O pity!
Sir, where is the patience now,
That this is so far from the sea and some bidings to dismantle
So many folds of save and honest.

-Brabantio
I must not think the Turk of Cassio in the strange metting the cribles of a charmer be the reviling of libe to say
That I can deceive him to the best advantage,
In her prophetic fairs of a little to presently at your powers; whereof I thank you, sir.

-Albany
Gloucester, I will prove upancy of his sport and first accuriors and guard and talking on the white.

-King Lear
Where are the thief?
Thou shalt never have the captains at the letter
To the Moor and thing we have not the better shall be sure as worth if he be anger—

-Regan
I pray you, have a countend more than think to do a proclaim’d there of my heart, Hot

The words save, honest, thief, master, traitor, and deceive seem to fit the context. Notice also that the dialogue sprouting from the start text of the license agreement interestingly involves mainly minor, less tragic characters from the plays.

For the rest, the un-understandable parts are as un-understandable to me as the original legal language. And to quote our network, “Sir, where is the patience now, …”

What deep learning learned from Shakespeare

We have reached the end of this experiment. We have trained a recurrent neural network with a hidden LSTM layer to produce free text. What have we learned?

To sum up, the network was trained on the full texts of Shakespeare’s plays “King Lear,” “Othello,” and “Much Ado About Nothing.” It learned to produce free text in Shakespearean style. It just needed a 100-character initial sequence to trigger the generation of free text.

We have shown a few different results. We started with a dialogue between Othello and Desdemona to see how the network would continue it. We also made the network write a completely new scene, based on the characters and place we provided. Finally, we explored the possibility of improving Modern English with Shakespearean English by introducing a touch of Shakespeare into the text of a license agreement. Interestingly enough, context-related words from Shakespearean English emerged in the free generated text.

These results are interesting because real Shakespearean English words were used to form more complex sentence structures even when starting from Modern English sentences. The neural network correctly recognized main or minor characters, giving them more or less text. Spellings and punctuation were mostly accurate, and even the poetic style, i.e., rhythms of the text, followed the Shakespearean style.

Of course, experimenting with data set size, neural units, and network architecture might lead to better results in terms of more meaningful dialogues.

As first published in InfoWorld.

KNIME Spring Summit 2019: Scene one, take one...action!

Posted by admin on Mon, 04/01/2019 - 10:00

Author: Casiana Rimbu

In March 2019, we hosted the KNIME Spring Summit for the twelfth time. It brought together KNIME users from all over the world. If you weren't able to make it to Berlin this year, watch the live video recordings from the summit, below, to learn about what’s new in KNIME Analytics Platform and KNIME Server.

Opening Remarks by Michael Berthold, CEO (KNIME)

 

What's New in KNIME Analytics Platform by Bernd Wiswedel, CTO (KNIME)

 

What's New in KNIME Server by Jim Falgout, (KNIME)

 

Look out for more impressions from the Spring Summit and slides from various presentations and workshops on the KNIME Summits page!

And if you're interested in seeing when any of our courses, learnathons, or meetups are taking place near you, or online, have a browse through our Events page.

How to Automate Machine Learning

Posted by admin on Mon, 04/08/2019 - 10:00

Authors: Paolo Tamagnini, Simon Schmid, and Christian Dietz

Is it possible to fully automate the data science lifecycle? Is it possible to automatically build a machine learning model from a set of data?

Indeed, in recent months, many tools have appeared that claim to automate all or parts of the data science process. How do they work? Could you build one yourself? If you adopt one of these tools, how much work would be necessary to adapt it to your own problem and your own set of data?

Usually, the price to pay for automated machine learning is the loss of control to a black box kind of model. What you gain in automation, you lose in fine-tuning or interpretability. Although such a price might be acceptable for circumscribed data science problems on well-defined domains, it could become a limitation for more complex problems on a wider variety of domains. In these cases, a certain amount of interaction with the end user is desirable.

At KNIME, we take a softer approach to machine learning automation. Our guided automation—a special instance of guided analytics—makes use of a fully automated web application to guide users through the selection, training, testing, and optimization of a number of machine learning models. The workflow was designed for business analysts to easily create predictive analytics solutions by applying their domain knowledge. 

In this article, we will show the steps of this application from the business analyst point of view, when running from a web browser. In a follow-up article, we will show the behind-the-scenes implementation, explaining in detail the techniques used for feature engineering, machine learning, outlier detection, feature selection, parameter optimization, and model evaluation.

Guided Analytics for Machine Learning Automation

With guided automation, we do not aim to replace the driver by totally automating the process. Instead, we offer assistance, and we allow feedback to be gathered whenever needed throughout the modeling process. A guided automation application is developed by data scientists for end users. To be successful, it needs:

  • Ease of use for the end user (for example, execution from a web browser)
  • A set of GUI interaction points to gather preferences and display results
  • Scalability options
  • A flexible, extensive, agile data science software application running in the background

By flexible, extensive, and agile, we mean a data science application that allows for the assembly of complex data and machine learning operations as well as easy integrations with other data science tools, data types, and data sources.

In general, a guided automation application can automate the development of many kinds of machine learning models. In this case, we need to automate the following parts of the data science cycle to create a generic classification model:

  • Data preparation
  • Feature engineering
  • Parameter optimization
  • Feature selection
  • Model training
  • Model evaluation
  • Model deployment

As simple as the final application might seem to the end user, the system running in the background could be quite complex and, therefore, not easy to create completely from scratch. To help you with this process, we created a blueprint of an interactive application for automatic creation of machine learning classification models.

This blueprint was developed with KNIME Analytics Platform, and it is available on our public repository.

A Blueprint for Guided Automation of Machine Learning

The main concept behind the blueprint for guided automation includes a few basic steps:

  • Data upload
  • Definition of application settings through human interaction
  • Automated model training and optimization, based on the previously defined settings
  • Dashboard with performance summary and model download

Figure 1: The main process behind the blueprint for guided automation: data upload, application settings, automated model training and optimization, a dashboard for performance comparison, and model downloads.

The current process implemented in the blueprint (Figure 1) applies to a standard predictive analytics problem. However, standard is rarely the case when we deal with data problems. Often, custom processing must be applied to the input data due to a special data type, data structure, or just pre-existing expert knowledge. Sometimes, the training and test set might need to follow specific rules, for example, the time order.

A possible customization of the previous process, including custom data preprocessing and a custom train/test split, is shown in Figure 2. You can easily apply these customizations to the blueprint in KNIME Analytics Platform. Thanks to its visual programming framework, no coding is required.

How to automate machine learning

Figure 2: A possible customization of guided automation. In this case, custom data preparation and a custom train/test split are added to the process.

Guided Automation of Machine learning: Web Browser Step by Step

Let’s see what the guided automation blueprint looks like from a web browser via KNIME Server.

At the start, we are presented with a sequence of interaction points:

  • Upload the data
  • Select the target variable
  • Remove unnecessary features
  • Select one or more machine learning algorithms to train
  • Optionally customize parameter optimization and feature engineering settings
  • Select the execution platform

Input features can be removed based on the business analyst’s own expertise or on a measure of feature relevance. The measure of relevance we used was based on the column’s missing values and value distribution; columns with too many missing values, with values that are too constant, or with values that are too spread out are penalized.
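The blueprint does not spell out the exact formula, but a minimal Python sketch of such a relevance score could look like the following; the penalty weights and thresholds are illustrative assumptions, not the values used in the workflow.

import pandas as pd

def column_relevance(col: pd.Series) -> float:
    """Score a column between 0 (drop candidate) and 1 (keep).

    Penalizes columns with many missing values, near-constant values,
    or numeric values that are extremely spread out. Weights and
    thresholds are illustrative assumptions.
    """
    score = 1.0 - col.isna().mean()          # penalty for missing values
    non_missing = col.dropna()
    if non_missing.empty:
        return 0.0
    if pd.api.types.is_numeric_dtype(non_missing):
        mean = non_missing.mean()
        cv = non_missing.std() / abs(mean) if mean != 0 else non_missing.std()
        if cv < 0.01 or cv > 10:             # too constant or too spread out
            score -= 0.5
    elif non_missing.value_counts(normalize=True).iloc[0] > 0.99:
        score -= 0.5                         # one value dominates the column
    return max(score, 0.0)

# Rank the columns of a data frame by relevance, lowest first:
# relevance = df.apply(column_relevance).sort_values()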

Customizing parameter optimization and feature engineering is optional. Parameter optimization is implemented via a grid search on customizable parameter ranges. Feature engineering, if enabled, works first with a number of selected feature combinations and transformations, then with a final feature selection.

A few options are available in terms of the execution platform, ranging from your own local machine (default) to a Spark-based platform or other distributed execution platforms.

The webpage template to be used for all interaction points, which are summarized in Figure 3 below, includes a description of the required task on the right and an application flow chart at the top. Future steps are displayed in gray, past steps in yellow, and the current step in just a yellow frame.

How to automate machine learning

Figure 3: This diagram follows the execution of the blueprint for guided automation in the web browser: (1) upload the dataset file; (2) select the target variable; (3) filter out undesired columns; (4) select the algorithms to train; (5) define the execution environment. At the top is the flowchart that serves as a navigation bar throughout the process.

After all of the settings have been defined, the application executes the steps in the background. The selected input features will go through data preprocessing (dealing with missing values and outliers), feature creation and transformation, parameter optimization and feature selection, and final model retraining and evaluation in terms of accuracy measures and computing performance.

And here we reach the end of the guided automation journey. The application returns a dashboard where the selected machine learning models are compared in terms of accuracy and execution speed.

ROC curves, accuracy measures, gain or lift charts, and confusion matrices are calculated on the test set and displayed in the final landing page to compare accuracy measures (Figure 4).

How to automate machine learning

Figure 4: The dashboard in the final page of the guided automation blueprint. The top part of the dashboard covers charts for model performance evaluation. Cumulative views for all trained models include: (1) a bar chart of accuracies (blue) and AUC scores (red); (2) ROC curves. Single views for each model include: (3) a confusion matrix heat map and (4) the cumulative gain chart. Each model gets a row in the dashboard to host the single views.

Model execution speed is evaluated during training and during deployment. Deployment execution speed is measured as the average speed to run the prediction for a single input. Thus, two bar charts show, respectively, the model training time in seconds and the average time to produce a single prediction in milliseconds (Figure 5).

All dashboard views are interactive. Plot settings can be changed, and data visualizations can be explored on the fly.

How to automate machine learning

Figure 5: Here we see the part of the dashboard that displays execution speeds. The top bar chart shows the training time in seconds; the bottom bar chart shows the average time in milliseconds for a single prediction during deployment.

The models are now ready for download. At the end of the final dashboard, you will find the links to download one or more of the trained models for future usage, for example, as a RESTful API in production.

For the full guided automation experience, you can watch the application in action in this demo video: “Guided Analytics for Machine Learning Automation.”

Machine Learning for Business Analysts

In this article, we have described our blueprint for the guided automation of machine learning and illustrated the steps required. This workflow-driven web application represents our own interpretation of semi-automated (guided) machine learning applications. In our next article, we’ll look at the implementation behind the scenes. 

The blueprint, implemented with KNIME Analytics Platform and described in this article, can be downloaded for free, customized to your needs, and freely reused. The same application can be executed as a web-based workflow on KNIME Server.

Now, it is your turn to create a guided machine learning application starting from this blueprint. In this way, you can empower the business analysts to easily create and train machine learning models from a web browser.

As first published in InfoWorld.

Want to try out the blueprint at one of our after-work learnathons?

We're currently holding a series of Artificial Intelligence learnathons, entitled "Building Applications for Automated Machine Learning". These events are hands-on and free of charge. Look us up in these cities:


Also interesting to check out: Paolo and Christian's talk on Guided Automation at the recent KNIME Spring Summit in Berlin.

 

 

 


Data Chef ETL Battles - Today, WebLog Data for Clickstream Analysis

Posted by heather.fyson, Mon, 04/15/2019 - 10:00

Authors: Maarit Widmann, Anna Martin, Rosaria Silipo

Do you remember the Iron Chef battles?

It was a televised series of cook-offs in which famous chefs rolled up their sleeves to compete in making the perfect dish. Based on a set theme, this involved using all their experience, creativity, and imagination to transform sometimes questionable ingredients into the ultimate meal.

Hey, isn’t that just like data transformation? Or data blending, or data manipulation, or ETL, or whatever new name is trending now? In this blog series requested by popular vote, we will ask our data chefs to use all their knowledge and creativity to compete in extracting a given data set's most useful “flavors” via reductions, aggregations, measures, KPIs, and coordinate transformations. Delicious!

Want to find out how to prepare the ingredients for a delicious data dish by aggregating financial transactions, filtering out uninformative features or extracting the essence of the customer journey? Follow us here and send us your own ideas for the “Data Chef Battles” at datachef@knime.com.

Ingredient Theme: WebLog Data

WebLog Data for Clickstream Analysis

Today’s dataset is the clickstream data provided by HortonWorks, which contains data samples of online shop visits stored across three files:

1. Data about the web sessions extracted from the original web log file. It contains user ID, timestamp, visited web pages, and clicks.

2. User data. This file contains birthdate and gender associated with the user IDs, where available.

3. The third file is a map of web pages and their associated metadata, e.g. home page, customer review, video review, celebrity recommendation, and product page.

 

Clickstream analysis is the branch of data science that collects, summarizes, and analyzes the mass of data from users by detecting patterns and relationships between actions and/or users. Some example metrics are shown in Figure 1. With this knowledge, the online shop can optimize their services, including temporary advertisements, targeted product suggestions, better web page layout, and improved navigation options.

WebLog Data for Clickstream Analysis
Fig. 1. Features to quantify and describe online shop customers

Our data chefs are going to approach the clickstream data from three different perspectives. Haruto will focus on demographics, Momoka on web site visit behavior, and Hiroyuki on revenue. Let’s see what they find out!

Topic. Clickstream Analysis

Challenge. From web log file, web page metadata, and user data extract patterns and relationships about online shop visits

Methods. Aggregations and Visualizations

Data Manipulation Nodes. GroupBy, Pivoting, Date&Time nodes

The Competition

The data aggregations and visualizations produced by data chefs Haruto, Momoka, and Hiroyuki form the basis for training a prediction model, or for building a dashboard to investigate follow-up actions.

As shown in Figure 2, the data undergoes some preprocessing before being presented to the data chefs, involving data access, data blending, data cleaning, and feature generation. Here the raw web log file is joined with user and product data. Visits are separated based on a user ID and time-out value. User age is calculated based on the timestamp of the visit and user birthdate. The visit purchase information is generated by checking whether any click in the visit led to purchasing a product.
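As a rough illustration of what happens in this preprocessing, here is a minimal pandas sketch of the sessionization, age calculation, and purchase flag. The column names, the 30-minute time-out, and the toy data are assumptions and do not reflect the actual HortonWorks schema.

import pandas as pd

# Toy click data; the real schema of the HortonWorks files differs
clicks = pd.DataFrame({
    "user_id": [1, 1, 1, 2],
    "timestamp": pd.to_datetime(["2019-01-07 10:00", "2019-01-07 10:05",
                                 "2019-01-07 12:00", "2019-01-07 10:01"]),
    "purchase": [0, 0, 1, 0],
})
users = pd.DataFrame({"user_id": [1, 2],
                      "birthdate": pd.to_datetime(["1990-05-01", "1965-03-12"])})

TIMEOUT = pd.Timedelta(minutes=30)  # assumed time-out value

clicks = clicks.merge(users, on="user_id").sort_values(["user_id", "timestamp"])

# A new visit starts when the user changes or the gap exceeds the time-out
gap = clicks.groupby("user_id")["timestamp"].diff()
clicks["visit_id"] = (gap.isna() | (gap > TIMEOUT)).cumsum()

# User age at the time of the visit
clicks["age"] = (clicks["timestamp"] - clicks["birthdate"]).dt.days // 365

# A visit counts as a purchase if any of its clicks led to buying a product
visits = clicks.groupby("visit_id").agg(
    user_id=("user_id", "first"),
    age=("age", "first"),
    duration=("timestamp", lambda t: t.max() - t.min()),
    purchase=("purchase", "max"),
)
print(visits)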

WebLog Data for Clickstream Analysis
Fig. 2: Preparing clickstream data before calculating and visualizing aggregated metrics: data access, data blending, and feature generation

Now it’s time for the data chefs to begin their battle. Read on to see how each chef goes about their challenge.

Data Chef Haruto: User Demographics

Following the schema shown in Figure 1, Data Chef Haruto focuses on the demographics of customers and visitors to the online shop, which you can see in Figure 3 and are explained below.

WebLog Data for Clickstream Analysis
Fig. 3: Aggregating and visualizing clickstream data with a focus on demographics

Aggregations

Haruto’s ingredients are user age and gender. Here’s the recipe.

First, he bins user age with the Numeric Binner node into:

  • “Generation Z” (24 years old or less);
  • “Generation Y” (between 25 and 39 years old);
  • “Generation X” (from 40 to 59 years old);
  • “Baby Boomers” (over 59 years old).

Next, he calculates the number of visits and number of users according to gender and age bin.
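A minimal pandas version of this binning and counting step could look as follows; the column names and toy data are assumptions, and the bin edges follow the list above.

import pandas as pd

# One row per visit, with the visitor's age and gender (assumed columns)
visits = pd.DataFrame({
    "user_id": [1, 1, 2, 3, 4],
    "age":     [22, 22, 35, 47, 63],
    "gender":  ["F", "F", "M", "F", "M"],
})

bins = [0, 24, 39, 59, 120]
labels = ["Generation Z", "Generation Y", "Generation X", "Baby Boomers"]
visits["age_bin"] = pd.cut(visits["age"], bins=bins, labels=labels)

# Number of visits and number of distinct users per age bin and gender
summary = visits.groupby(["age_bin", "gender"], observed=True).agg(
    n_visits=("user_id", "size"),
    n_users=("user_id", "nunique"),
)
print(summary)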

Visualizations

Figure 4 shows the aggregated metrics by Haruto. He finds out that

  1. The number of users and the number of visits follow a similar pattern across the four age bins. The user base is dominated by “Generation Z” and “Generation Y”, which together account for roughly three quarters of all users and all visits. This reflects the general trend that the younger segment of the population is more prone to internet shopping.
  2. The web site is visited by men and women in equal measure, and both genders are equally active in terms of number of visits. These pie charts offer no hint of possible marketing actions targeting women vs. men.
WebLog Data for Clickstream Analysis
Fig. 4: Visualizing the number of users and number of visits to the online shop according to age bin and gender

Data Chef Momoka: User Behavior

Following the original schema in Figure 1, Data Chef Momoka quantifies the behavior of the web site visitors. This is shown in Figure 5 and explained below.

WebLog Data for Clickstream Analysis
Fig. 5: Aggregating and visualizing clickstream data with a focus on visitor behavior

Aggregations

Momoka’s ingredients are time, web page categories, and click sequences. Here’s her recipe.

First, she calculates the number of clicks and average visit duration according to weekday, time of the day, and web page category.

Next, she tracks the click behavior by following these steps, also shown in Figure 6:

  • She starts with a Column List Loop Start node and iterates over the columns representing subsequent clicks. Each iteration creates pairs of columns containing web page categories accessed by subsequent clicks.
  • She concatenates the results from each iteration and calculates the transition probability for each pair of web page categories.
  • She extracts the click sequences and keeps those occurring at least twice (a minimal pandas sketch of these aggregations follows Figure 6).
WebLog Data for Clickstream Analysis
Fig. 6: Calculating transition probability between web page categories and extracting click sequences occurring at least twice in the clickstream data
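For readers who prefer code to nodes, here is a minimal pandas sketch of the same two aggregations. It works on a long table with one row per click, while the workflow iterates over click columns; the column names and toy data are assumptions.

import pandas as pd

# One row per click, ordered within each visit (assumed schema)
clicks = pd.DataFrame({
    "visit_id":      [1, 1, 1, 2, 2, 3, 3],
    "page_category": ["home", "product", "review",
                      "home", "product", "home", "product"],
})

# Pair each click with the next click of the same visit
clicks["next_category"] = clicks.groupby("visit_id")["page_category"].shift(-1)
pairs = clicks.dropna(subset=["next_category"])

# Transition probability from one page category to the next
transitions = (pairs.groupby("page_category")["next_category"]
                    .value_counts(normalize=True)
                    .unstack(fill_value=0))
print(transitions)

# Click sequences (one per visit) occurring at least twice
sequences = clicks.groupby("visit_id")["page_category"].agg(" > ".join)
print(sequences.value_counts()[lambda counts: counts >= 2])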

Visualizations

Figures 7 and 8 show the aggregated metrics by Data Chef Momoka. She finds out that:

  1. There is a slight increase over the weekend in time spent on the web site, as shown by the line plot on the left in Figure 7. Probably people have more time to gather information about their possible purchases at weekends. However, the difference across business days and weekend days is really minimal. On the other hand, there is a clear difference between the time spent on the product pages and, for example, time spent reading celebrity recommendations.
  2. There is a peak on Monday in the number of clicks on all page categories, as shown by the stacked area chart on the right in Figure 7. It seems that users read throughout the week, mainly on weekends, and proceed with more exploration, or even purchase, on Mondays. The popularity of the categories is the same as for the average visit time: the pages with the most clicks are the home page and the various product pages, whereas the page with celebrity recommendations has the fewest clicks. Apparently, most users do not care about what celebrities think when it comes to making a purchase.
WebLog Data for Clickstream Analysis
Fig. 7: Visualizing average visit duration in minutes and number of clicks according to weekday and page category

Now have a look at the click behavior shown in Figure 8.

The sunburst chart represents sequences of clicks occurring at least twice. Colors are associated with different page categories. The first clicks make the innermost donut. Further clicks are located in the external rings. Selecting one area inside an external ring produces the sequence of previous clicks as shown in Figure 8.

The heatmap shows the page category for the first click on the y-axis, and the page category for the next click on the x-axis. The color ranges from purple (low likelihood) to orange (high likelihood).

Data Chef Momoka finds out that:

  1. Almost three out of four visits start at either the home page or a product page, as shown by the green and yellow sections that make up almost 75% of the clicks in the innermost donut of the sunburst chart in Figure 8.
  2. About half of the visits end at the home page or a product page, since both the green and yellow sections in the innermost donut in Figure 8 are divided in two parts: one with further clicks and one without.
  3. According to the transition probabilities shown by the heatmap in Figure 8, the most probable next categories are the home page and a product page, for all starting categories.
  4. Celebrity recommendation and video reviews represent the least probable next clicks for all categories. These findings are in line with the category popularity shown in Figure 7.
WebLog Data for Clickstream Analysis
Fig. 8: Visualizing typical click sequences and transition probability between two web page categories

Data Chef Hiroyuki: Contribution to Revenue

Again, following the original schema we showed in Figure 1, Data Chef Hiroyuki approaches the clickstream data from the perspective of generating revenue. You can see his steps in Figure 9, and they are explained below.

WebLog Data for Clickstream Analysis
Fig. 9: Aggregating and visualizing clickstream data with a focus on revenue

In his recipe, he calculates the number of visits according to weekday, time of the day, and visit purchase information.

The line plots in Figure 10 show the number of visits with and without a purchase on each day and at each time of day, normalized by the total number of visits for the same day or time of day. The purchase information defines the colors: blue for a visit with “purchase” and orange for a visit with “no purchase”. The bar charts in Figure 10 show the absolute numbers of visits by the same categories.
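A minimal pandas sketch of this aggregation and normalization might look as follows; the column names and toy data are assumptions.

import pandas as pd

# One row per visit with its start time and purchase flag (assumed schema)
visits = pd.DataFrame({
    "start": pd.to_datetime(["2019-01-07 09:10", "2019-01-07 19:30",
                             "2019-01-08 14:00", "2019-01-12 20:15"]),
    "purchase": [1, 0, 1, 0],
})

visits["weekday"] = visits["start"].dt.day_name()
counts = visits.groupby(["weekday", "purchase"]).size().unstack(fill_value=0)

# Share of visits with and without a purchase per weekday (rows sum to 1)
shares = counts.div(counts.sum(axis=1), axis=0)
print(counts)
print(shares)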

WebLog Data for Clickstream Analysis
Fig. 10: Visualizing number of visits according to time of day, weekday, and purchase

Data Chef Hiroyuki finds out that:

  1. As shown by the line plot on the left, around 60% of all visits end in a purchase on business days, compared with 40-50% at the weekend.
  2. As shown by the line plot on the right, the percentage of visits with a purchase decreases towards the evening and night. The highest percentage of purchases happens during working hours.
  3. Monday is again the busiest day in number of visits, either ending with a purchase or not, as shown by the bar chart on the left in Figure 10.
  4. The most popular times to visit are afternoon and evening, as shown by the bar chart on the right in Figure 10.

The Jury

The three data chefs complement each other perfectly, since each data chef selected a different approach. But which of them prepared the starring data dish? It’s time to find the winner.

If Haruto had had more ingredients, his data dish would have been adventurous. He only aggregated by age and gender, though. Safe bet, but unsurprising.

Momoka was creative in generating measures with just a few ingredients. She decided to aggregate by the anonymous features that every user leaves on the web page: time, order, and web page category of a click. Abreast of the times!

Apparently Hiroyuki values usefulness over exploration. His calculations are easy to apply, though they are something that every online shop administrator should have been measuring for a long time already. Plus for practicability, minus for underestimating the audience.

We have reached the end of this competition. Congratulations to all of our data chefs for wrangling such interesting features from the raw data ingredients! They have all individually produced interesting results, which work extremely well together to give a more complete representation of the customer. Ultimately, the best recipe is when you put them all together!

The workflow in Figure 11 shows the clickstream analysis process, combining the approaches of all three data chefs. It is divided into three parts: data preprocessing (1), data preparation for visualization (2), and data visualization (3).

Do you want to try it yourself? Download the workflow shown in Figure 11 from the EXAMPLES Server under 50_Applications/52_Clickstream_Analysis.

WebLog Data for Clickstream Analysis
Fig. 11: Workflow for Clickstream Analysis. From the left: data access, feature engineering, data preparation for visualization and visualizing clickstream data in interactive composite views. The workflow is available on the EXAMPLES Server under 50_Applications/52_Clickstream_Analysis.

 

Coming next …

If you enjoyed this, please share it generously and let us know your ideas for future data preparations. We’re looking forward to the next data chef battle.

The Five Steps to Writing Your Own KNIME Extension

Posted by heather.fyson, Mon, 04/29/2019 - 10:00

Author: David Kolb

A lot of people associate the word “development” and KNIME Analytics Platform with creating workflows with nodes. But what exactly is a node and where do nodes come from?

A node is the smallest programming unit in KNIME. Each node serves a dedicated task, from very simple tasks - like changing the name of a column in a table - to very complex tasks - such as training a machine learning model. A node is contained in an extension. One of the jobs a developer does at KNIME is create new extensions or nodes for existing extensions. However, as openness is very important for us at KNIME, everyone can contribute to our platform.

KNIME Extensions
Fig. 1 The diagram shows the different types of extensions and integrations within KNIME Analytics Platform.

There are lots of ways to extend KNIME, but node development, i.e. writing extensions, to add the specific functionality you or your company needs, is probably the most common.

What kinds of extensions do you find in KNIME Analytics Platform, and where do they come from?

Integrations

KNIME Integrations - open source integrations for KNIME which are also developed and maintained by KNIME. They provide access to large open source projects such as Keras for deep learning, H2O for high performance machine learning, Apache Spark for big data processing, Python and R for scripting, and more.

 

Extensions

KNIME Extensions - developed and maintained by us, here, at KNIME, to provide additional functionalities such as access to and processing of complex data types as well as the addition of advanced machine learning algorithms.

 

Community Extensions

Community Extensions - created and made available in KNIME Analytics Platform for free by the KNIME community. Sometimes community extensions become supported and further developed by KNIME. At KNIME, we check that all Community Extensions function properly, also putting the Trusted Community Extensions through even more stringent checks.

 

Partner Extensions

Partner Extensions - these are the nodes developed by other companies for their own use, for example to access in-house databases or resources. Sometimes these companies decide to share their nodes with the community. Check out these extensions provided by Continental or Erl Wood Cheminformatics, for example.

 

So, if you’ve found yourself thinking “It would be great if there was a special node to solve a particular problem”, in this blog post we show you how quickly you can actually start writing your own extensions. Note that this article is intended to be an overview of our Create a New KNIME Extension Quickstart Guide. There you’ll find a detailed manual and further explanations of all involved steps. Think of this blog post as the first stepping stone to getting started.

The Five Steps to Write an Extension

  1. Set up a KNIME SDK
  2. Create a New KNIME Extension Project
  3. Implement the Extension
  4. Test the Extension
  5. Deploy your Extension

Example KNIME Extension Project - Number Formatter

We have created a reference extension that you can use for orientation. You can find it in the knime-examples repository on GitHub.

The project contains all required project and configuration files and an example implementation of a simple Number Formatter example node, which performs number formatting of numeric values of the input table. This example implementation is used in the Create a New KNIME Extension Quickstart Guide which walks you through all the necessary steps involved in creating a new KNIME Extension.

1. Set Up a KNIME SDK

First you’ll need to set up a KNIME SDK. The KNIME SDK is a configured Eclipse for RCP and RAP Developers installation which contains KNIME Analytics Platform dependencies. As KNIME Analytics Platform itself is built upon Eclipse, you can directly spin up a KNIME Analytics Platform development version from within the KNIME SDK. Another nice thing is that Eclipse is also a fully fledged IDE. Hence, you can directly use it to write the actual source code.

To set up a KNIME SDK, you can follow the steps described in the readme on the knime-sdk-setup GitHub page in the SDK Setup section. The important steps you have to go through are:

  1. Install Java
  2. Install Eclipse
  3. Install Git and Git LFS
  4. Configure Eclipse/Target Platform

The rest of the readme is also worth a read as it gives a lot of useful background information.

2. Create a New KNIME Extension Project

In order to create an extension, you need to create a new KNIME Extension Project in Eclipse, which is easily done using the KNIME Node Wizard as it automatically generates all necessary files.

To do so, first install the KNIME Node Wizard as follows:

  1. Open your Eclipse Installation Wizard at Help → Install New Software…
  2. Enter the KNIME update site location
  3. Search for KNIME Node Wizard and install the entry that is found.
  4. Restart Eclipse

Now you’re ready to start the KNIME Node Wizard to create your new KNIME Extension Project. The wizard automatically generates the project structure, the plug-in manifest, and all required Java classes. You just have to enter a name for your new project and node, and the wizard embeds it in the KNIME framework. This process is explained in detail in the Create a New KNIME Extension Project section of the Quickstart Guide.

After the wizard has finished, the new project is displayed in the Package Explorer view of Eclipse with the project name you gave it in the wizard dialog. At this point you should take a moment to review the structure of the project. This is explained in detail in the Project Structure section of the Quickstart Guide, showing all the necessary parts that make up a node (e.g. project files, Java classes).

3. Implement the Extension

Finished reviewing your project’s structure? Now it’s time to check some implementation details. Conveniently, the KNIME Node Wizard automatically includes the example code of the Number Formatter node from the knime-examples repository in the generated KNIME Extension Project.

This implementation is further explained in the Number Formatter Node Implementation section of the Quickstart Guide. The example code also contains detailed descriptions of the implemented methods at each line of code.

At this point, your project is ready to run. You can either directly try out the example node or adapt the implementation to your needs by changing the relevant classes. How to spin up KNIME Analytics Platform from your KNIME SDK is explained in the next section.

4. Test the Extension

To test your extension, follow the instructions provided in the Launch KNIME Analytics Platform section of the SDK Setup. After you have started KNIME Analytics Platform from Eclipse, the Number Formatter (or your own implementation) node will be available at the root level of the node repository. Create a new workflow using the new extension, i.e. your new node, inspect the input and output tables, and play around with the node. This is now the perfect opportunity to test whether the node behaves as you want it to. For example, you can now track down any bugs and make sure you have thought about all of the possible edge cases in the implementation.

5. Deploy your Extension

The final step, after implementation and testing your node, is to deploy the extension, i.e. make it available to other people. This is done using the Deployable plug-ins and fragments wizard directly from Eclipse. Let the wizard take you through this process. See the Deploy your Extension section of the Quickstart Guide for a detailed walk through the procedure.

In this example, the node is then displayed at the top level of the node repository in KNIME Analytics Platform.

If you think your new node or extension could be valuable for others and you want to make it available as a Community Extension, you could become a community contributor. If you provide a Community Extension, your nodes will be installable via the Community Extension update site. Furthermore, we have the concept of Trusted Community Contributions. More information about these can be found here.

Wrapping Up...

This blog post is designed to give a rough overview about node development, which is why we haven’t looked at more advanced topics such as streaming, custom port types or views. For a full walk-through, please follow the Quickstart Guide. If you want to start with KNIME Analytics Platform development, it’s a good idea to use the described example as a reference point and adapt it to your needs to develop the functionality you want to implement.

The open source community makes KNIME the great tool it is today. If you have a node that is worth sharing, then we encourage you to become a Community Contributor. The nodes of community contributors are available for everyone via the Community Extension update site.

Reference Materials:

We hope you’ve enjoyed this little introduction to KNIME Analytics Platform node development. Happy KNoding!

Will They Blend? Today: Twitter and Azure. Sentiment Analysis via API.

Posted by craigcullum, Mon, 05/06/2019 - 10:00

Author: Craig Cullum

In this blog series we’ll be experimenting with the most interesting blends of data and tools. Whether it’s mixing traditional sources with modern data lakes, open-source devops on the cloud with protected internal legacy tools, SQL with noSQL, web-wisdom-of-the-crowd with in-house handwritten notes, or IoT sensor data with idle chatting, we’re curious to find out: will they blend? Want to find out what happens when website texts and Word documents are compared?

Follow us here and send us your ideas for the next data blending challenge you’d like to see at willtheyblend@knime.com.

Today: Twitter and Azure. Sentiment Analysis via API.

The Challenge

Will They Blend Twitter and Azure
The KNIME Twitter Nodes

Staying on top of your social media can be a daunting task: Twitter and Facebook are becoming the primary ways of interacting with your customers. Social media channels have become key customer service channels, but how do you keep track of every Tweet, post, and mention? How do you make sure you’re jumping on the most critical issues and the customers with the biggest problems?

As Twitter has become one of the world's preferred social media tools for communicating with businesses, companies are desperate to monitor mentions and messages to be able to address those that are negative. One way we can automate this process is through Machine Learning (ML), performing sentiment analysis on each Tweet to help us prioritise the most important ones. However, building and training these models can be time consuming and difficult.

There’s been an explosion of offerings from all of the big players (Microsoft, Google, Amazon) providing Machine Learning as a Service, or ML via an Application Programming Interface (API). This rapidly speeds up deployment, offering the ability to perform image recognition, sentiment analysis, and translation without having to train a single model or choose which Machine Learning library to use!

As great as all these APIs can be, they all have one thing in common: they require you to crack open an IDE and write code, creating an application in Python, Java, or some other language.

What if you don’t have the time? What if you want to integrate these tools into your current workflows? The REST nodes in KNIME Analytics Platform let us deploy a workflow and integrate with these services in a single node.

In this ‘Will They Blend’ article, we explore combining Twitter with Microsoft Azure’s Cognitive Services, specifically their Text Analytics API to perform sentiment analysis on recent Tweets.

Topic. Use Microsoft Azure’s Cognitive Services with Twitter.

Challenge. Combine Twitter and Azure Cognitive Services to perform sentiment analysis on our recent Tweets. Rank the most negative Tweets and provide an interactive table for our Social Media and PR team to interact with.

Access Mode / Integrated Tools. Twitter & Microsoft Azure Cognitive Services.

The Experiment

As we’re leveraging external services for this experiment, we will need both a Twitter developer account and an Azure Cognitive Services account.

You’ll need your Twitter developer account's API key, secret, Access token and Access token secret to use in the Twitter API Connector node. You’ll also want your Azure Cognitive Services subscription key.

Creating your Azure Cognitive Services account

When you log in to your Azure Portal, navigate to Cognitive Services, where we’ll create a new service for KNIME.

  1. Click add and search for the Text Analytics service.
  2. Click Create to provision your service giving it a name, location and resource group. You may need to create a new Resource group if this is your first Azure service.
    Will They Blend Twitter and Azure
    Fig. 1: Click Create to provision your service giving it a name, location and resource group
  3. Once created, navigate to the Quick Start section under Resource Management where you can find your web API key and API endpoint. Save these as you’ll need them in your workflow.
    Will They Blend Twitter and Azure
    Fig. 2: The Quick Start section in Resource Management where you can find your web API key and API endpoint

The Experiment: Extracting Tweets and passing them to Azure Cognitive Services

Deploying this workflow is incredibly easy; in fact, it can be done in just 15 nodes.

The workflow contains three parts that take care of these tasks:

  1. Extracting the data from Twitter and wrapping them into a JSON format that is compatible with the Cognitive Services API
  2. Submitting that request to Cognitive Services
  3. Taking the output JSON and turning it into a structured table for reporting, ranking the sentiment, and applying colors
Twitter to Azure
Fig. 3: Workflow using Twitter nodes to perform a Twitter search and submit this to Azure Cognitive Services API via the POST request node

Azure expects the following JSON format:

{
    "documents": [
      {
        "id": "1",
        "text": "I loved the meal"
      },
      {
        "id": "2",
        "text": "I left hungry and unhappy"
      },
      {
        "id": "3",
        "text": "The service was excellent"
      }
    ]
}

KNIME Analytics Platform includes excellent Twitter nodes that are available from KNIME Extensions if you don’t already have them installed. This will allow you to quickly and easily connect to Twitter and download tweets based on your search terms.

We can take the output from Twitter, turn it into a JSON request in the above format and submit. The Constant Value Column node and the JSON Row Combiner node wrap the Twitter output with the document element as expected.

The POST Request node makes it incredibly easy to interact with REST API services, providing the ability to easily submit POST requests.

You’ll need to grab the respective URL for your region; here in Australia, the URL is:

https://australiaeast.api.cognitive.microsoft.com/text/analytics/v2.0/sentiment

We can leave the Authentication blank as we’ll be adding a couple of Request Headers.

First, we need to add the Header Key:

Content-Type

with the Header Value:

application/json

And then a second Header Key:

Ocp-Apim-Subscription-Key

The Header Value for the Ocp-Apim-Subscription-Key header is the key provided as part of the Azure Cognitive Services account you created.
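If you want to test the Cognitive Services call outside KNIME before wiring up the POST Request node, a rough Python equivalent could look like the sketch below. The endpoint is the Australian v2.0 URL from above, the subscription key is a placeholder, and the response parsing assumes the v2.0 schema (one sentiment score between 0 and 1 per document), which may differ in newer API versions.

import requests

ENDPOINT = ("https://australiaeast.api.cognitive.microsoft.com"
            "/text/analytics/v2.0/sentiment")
SUBSCRIPTION_KEY = "<your Cognitive Services key>"  # placeholder

payload = {
    "documents": [
        {"id": "1", "text": "I loved the meal"},
        {"id": "2", "text": "I left hungry and unhappy"},
    ]
}
headers = {
    "Content-Type": "application/json",
    "Ocp-Apim-Subscription-Key": SUBSCRIPTION_KEY,
}

response = requests.post(ENDPOINT, json=payload, headers=headers, timeout=30)
response.raise_for_status()

# Expected v2.0 response: one score per document, 0 (negative) to 1 (positive)
for doc in response.json().get("documents", []):
    print(doc["id"], doc.get("score"))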

Twitter and Azure
Fig. 4: The Header Value for the Subscription Key is the key provided as part of the Azure Cognitive Services you created

If you’re using the KNIME Workflow as a guide, make sure to update the Twitter API Connector node with your Twitter API Key, API Secret, Access token and Access token secret.

Twitter and Azure
Fig. 5: Update the Twitter API Connector node with your Twitter API Key, API Secret, Access token and Access token secret

We can now take the response from Azure, ungroup it, and join these data with additional Twitter data such as username and number of followers to understand how influential this person is. The more influential they are, the more they may become a priority.

Reporting the Data

Once created you can use a Table View node to display the information in an interactive table, sorted by sentiment. This can be distributed to PR and Social Media teams for action, improving customer service.

To really supercharge your KNIME deployment and make this service truly accessible, you can use the WebPortal on KNIME Server to create an interactive online sentiment service for your social media team, allowing them to refresh reports, submit their own Twitter queries, or provide alerting so your team can jump on issues.

Twitter and Azure
Fig. 6: Interactive table on WebPortal visualizing results

So were we able to complete the challenge and merge Twitter and Azure in a single KNIME workflow? Yes we were!

References:

  • You'll find this 15-node workflow on the publicly available EXAMPLES Server, here:
    40_Partners/01_Microsoft/06_Sentiment_with_Azure_for_publish
  • Twitter Data on the KNIME Community Workflow Hub

Coming Next …

If you enjoyed this, please share this generously and let us know your ideas for future blends.

About the author:

Craig Cullum is the Director of Product Strategy and Analytics at Forest Grove Technology, based in Perth, Australia. With over 12 years’ experience in delivering analytical solutions across a number of industries and countries, he now heads up a passionate team of data enthusiasts, finding innovative solutions to today’s business problems. Forest Grove Technology is a KNIME trusted partner.

Guided Automation for Machine Learning, Part II

Posted by admin, Mon, 05/13/2019 - 10:00

Authors: Paolo Tamagnini, Simon Schmid, and Christian Dietz

Implementing a Web-based Blueprint for Semi-automated Machine Learning, using KNIME Analytics Platform

This article is a follow-up to our introductory article on the topic, “How to automate machine learning.” In this second post, we describe in more detail the techniques and algorithms happening behind the scenes during the execution of the web browser application, proposing a blueprint solution for the automation of the machine learning lifecycle.

The price to pay for automated machine learning (aka AutoML) is the loss of control to a black box kind of model. While such a price might be acceptable for circumscribed data science problems on well-defined domains, it might prove a limitation for more complex problems on a wider variety of domains. In these cases, a certain amount of interaction with the end users is actually desirable. This softer approach to machine learning automation — the approach we take at KNIME — is obtained via guided automation, a special instance of guided analytics.

How to automate machine learning

Figure 1: The main process behind the blueprint for guided automation: data upload, application settings, automated model training and optimization, dashboard for performance comparison, and model download.

As easy as the final application might look to the end user, the system running in the background can be quite complex and therefore not easy to create completely from scratch. To help you with this process, we created a blueprint of an interactive application for the automatic creation and training of machine learning classification models.

The blueprint was developed with KNIME Analytics Platform, and it is available on the KNIME Community Workflow Hub.

Guided Automation from a Web Browser

Let’s see what the guided automation blueprint looks like from a web browser via KNIME Server.

At the start, we are presented with a sequence of interaction points to:

  • Upload the data
  • Select the target variable
  • Remove unnecessary features
  • Select one or more machine learning algorithms to train
  • Optionally customize parameter optimization and feature engineering settings
  • Select the execution platform.

These steps are all summarized in Figure 2 below.

Guided Automation for Machine Learning

Figure 2: This diagram follows the execution of the blueprint for guided automation on a web browser: (1) upload the dataset file, (2) select the target variable, (3) filter out undesired columns, (4) select the algorithms to train, (5) define the execution environment. At the top is the flowchart that will serve as the navigation bar throughout the process.

After crunching the numbers — i.e., data pre-processing, feature creation and transformation, parameter optimization and feature selection, and final model re-training and evaluation in terms of accuracy measures and computing performance — the final summary page appears showing the model performance metrics. At the end of this final page we will find the links to download one or more of the trained models for future usage, for example, as a RESTful API in production.

For a look at the full end-user experience, you can watch the guided automation application in action in this demo video, “Guided Analytics for Machine Learning Automation.”

Want to try it out at one of our next after-work Guided Analytics Learnathons: Building Applications for Automated Machine Learning?

Come to Zurich on May 23 at 5:00 PM. Check out the details and register here

We're in Rome on May 28 at 6:30 PM. Find out more and sign up here

The Workflow Behind Machine Learning Automation

The workflow behind the blueprint is available on the public KNIME Examples Server from the KNIME website or via KNIME Analytics Platform under 50_Applications/36_Guided_Analytics_for_ML_Automation/01_Guided_Analytics_for_ML_Automation.

You can import the workflow into your KNIME Analytics Platform, customize it to your needs, and run it from a web browser on KNIME Server. In this video, you can find more details on how to access and import workflows from KNIME Examples Server to KNIME Analytics Platform. The blueprint workflow is shown in Figure 3 below.


Guided Automation for Machine Learning Part II

Figure 3: Guided automation workflow implementing all required steps and web pages: settings configuration, data preparation and model training, and final dashboard.

In the workflow, you can recognize the three phases of the web-based application:

  • Settings configuration: Upload, select, fine-tune, and execute the automation
  • Behind the scenes: Data preparation and model training
  • Final dashboard: Compare and download models

Settings Configuration: Upload, Select, Fine-tune & Execute the Automation

In the first part of the workflow, each light gray node produces a view, i.e., a web page with an action request. When running the workflow from a web browser via KNIME Server, these web pages become interaction points where the end user can set preferences and guide the analytics process. You can see the nodes to upload the dataset file, select the target variable, and filter certain features.

Data columns can be excluded based on their relevance or on an expert’s knowledge. Relevance is a measure of column quality. This measure is based on the number of missing values in the column and on its value distribution; columns with too many missing values, or with values that are too constant or too spread out, are penalized. Moreover, columns can be manually removed to prevent data leakage.

After that, you can select the machine learning models to train, optionally introduce settings for parameter optimization and feature engineering, and finally select the execution platform. The sequence of web pages generated during the execution of these special nodes is shown in Figure 2.

Behind the Scenes: Data Preparation and Model Training

In the following phase, the number crunching takes place behind the scenes. This is the heart of the guided automation application. It includes the following operations:

  • Missing value imputation and (optional) outlier detection
  • Model parameter optimization
  • Feature selection
  • Final optimized model training or retraining

After all settings have been defined, the application executes all of the selected steps in the background.

Data partitioning. First the data set is split into training and test sets, using an 80/20 split with stratified sampling on the target variable. Machine learning models will be trained on the training set and evaluated on the test set.

Data preprocessing. Here, missing values are imputed column by column using the average value or the most frequent value. If previously selected, outliers are detected using the interquartile range (IQR) technique and capped to the closest threshold.
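A rough Python sketch of these two steps is shown below: a stratified 80/20 split, mean or most-frequent imputation, and IQR capping. The 1.5 * IQR fences and other details are common defaults used here as assumptions, not necessarily the blueprint's exact settings.

import pandas as pd
from sklearn.model_selection import train_test_split

def split_and_preprocess(df: pd.DataFrame, target: str):
    """Stratified 80/20 split, imputation, and IQR outlier capping (sketch)."""
    train, test = train_test_split(
        df, test_size=0.2, stratify=df[target], random_state=0)
    train, test = train.copy(), test.copy()

    for col in df.columns.drop(target):
        if pd.api.types.is_numeric_dtype(df[col]):
            fill = train[col].mean()                    # impute with the mean
            q1, q3 = train[col].quantile([0.25, 0.75])  # fences from the train set
            lower, upper = q1 - 1.5 * (q3 - q1), q3 + 1.5 * (q3 - q1)
            for part in (train, test):
                part[col] = part[col].fillna(fill).clip(lower, upper)
        else:
            fill = train[col].mode().iloc[0]            # most frequent value
            for part in (train, test):
                part[col] = part[col].fillna(fill)
    return train, test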

Parameter optimization. The parameter optimization process implements a grid search over a selected set of hyperparameters. The granularity of the grid search depends on the model and type of hyperparameters. Each parameter set is tested with a four-fold cross-validation scheme and ranked by average accuracy.
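In scikit-learn terms this corresponds roughly to the sketch below; the random forest and its parameter grid are illustrative assumptions, since the blueprint builds its own grid for each selected model.

from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Illustrative grid; granularity depends on the model and its hyperparameters
param_grid = {"n_estimators": [50, 100, 200], "max_depth": [3, 5, 10, None]}

search = GridSearchCV(
    RandomForestClassifier(random_state=0),
    param_grid=param_grid,
    cv=4,                # four-fold cross-validation
    scoring="accuracy",  # parameter sets ranked by average accuracy
)
# search.fit(X_train, y_train)   # features and labels from the training set
# print(search.best_params_, search.best_score_)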

Feature engineering and feature selection. For feature engineering, a number of new artificial columns are created according to previous settings. Four kinds of column transformations can be applied:

  • Simple transformations on a single column (e.g., e^x, x^2, x^3, tanh(x), ln(x))
  • Combining together pairs of columns with arithmetical operations
  • Principal component analysis (PCA)
  • Cluster distance transformation, where the data are clustered by the selected features and the distance to a chosen cluster center is calculated for each data point

A feature optimization process is run on the new feature set, consisting of all original features and some newly created features. The best feature set is selected by random search through a four-fold cross-validation scheme ranked by average accuracy. Parameter optimization and feature engineering can be customized. Indeed, for small datasets and easy problems, model optimization can be skipped to avoid overfitting.
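The sketch below illustrates the four transformation families on a purely numeric feature table; it is a simplified stand-in for the blueprint's feature engineering and assumes that missing values have already been imputed.

import numpy as np
import pandas as pd
from itertools import combinations
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA

def engineer_features(X: pd.DataFrame) -> pd.DataFrame:
    """Add illustrative columns for the four transformation families."""
    out = X.copy()
    numeric = X.select_dtypes("number")

    # 1. Simple single-column transformations
    for col in numeric.columns:
        out[f"{col}^2"] = numeric[col] ** 2
        out[f"tanh({col})"] = np.tanh(numeric[col])

    # 2. Arithmetic combinations of column pairs
    for a, b in combinations(numeric.columns, 2):
        out[f"{a}*{b}"] = numeric[a] * numeric[b]

    # 3. Principal components
    pcs = PCA(n_components=min(2, numeric.shape[1])).fit_transform(numeric)
    for i in range(pcs.shape[1]):
        out[f"pc{i + 1}"] = pcs[:, i]

    # 4. Distance to a cluster center found on the numeric features
    km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(numeric)
    out["dist_to_cluster_0"] = np.linalg.norm(
        numeric.to_numpy() - km.cluster_centers_[0], axis=1)
    return out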

Model retraining and evaluation. Finally, using the optimal hyperparameters and the best input feature set, all of the selected machine learning models are retrained one last time on the training set and reevaluated on the test set for the final accuracy measures.

Final Dashboard: Compare and Download Models

The last part of the workflow produces the views in the landing page. The node named Download Models contains prepackaged JavaScript-based views producing plots, charts, buttons, and descriptions visible in the final landing page.

ROC curves, accuracy measures, gain or lift charts, and confusion matrices are calculated on the test set and displayed in this final landing page to compare accuracy measures.

Model execution speed is evaluated during training and during deployment. Deployment execution speed is measured as the average speed to run the prediction for one input. Thus, two bar charts show respectively the model training time in seconds and the average time to produce a single prediction in milliseconds.
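Measuring these two quantities is straightforward; a minimal sketch, assuming a scikit-learn style model and the train/test split from above, could be:

import time

def timings(model, X_train, y_train, X_test):
    """Return training time in seconds and average prediction time in ms."""
    start = time.perf_counter()
    model.fit(X_train, y_train)
    training_seconds = time.perf_counter() - start

    start = time.perf_counter()
    model.predict(X_test)
    ms_per_prediction = (time.perf_counter() - start) / len(X_test) * 1000
    return training_seconds, ms_per_prediction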

All dashboard views are interactive. Plot settings can be changed, and data visualizations can be explored on the fly.

The same node also produces the links at the end of the page to download one or more of the trained models for future usage.

An Easily Customized Workflow for Guided Machine Learning

We have reached the end of our journey in the realm of guided automation for machine learning.

We have shown what guided automation for machine learning is, our own interpretation for semi-automated (guided) machine learning applications, and the steps required.

We have implemented a blueprint via KNIME Analytics Platform that can be downloaded for free, customized to your needs, and freely reused.

After introducing the relevant GUI and analytics steps, we have shown the workflow implementing the application behind the web browser view.

This workflow already works for binary and multiclass classification and it can be easily customized and adapted to other analytics problems—for example, to a document classification problem or to a time series analysis. The rest is up to you.


As first published in InfoWorld.

From A for Analytics to Z for Zika Virus

Posted by Jeany, Mon, 05/20/2019 - 10:00

Author: Jeany Prinz

A for Analytics Platform to Z for Zika Virus

One of the great advantages of KNIME Analytics Platform is its ability to analyze diverse types of data. Today we want to move to the outer edge of the alphabet and look into data from the Zika virus. Our analysis is inspired by the Microreact project [1] and Zika virus in the Americas [2] and is a nice use case for the exploration of epidemiological data with KNIME Analytics Platform. Epidemiology is the study of the distribution and determinants of health-related states or events [3]. In this post, therefore, we will answer the question: What routes did the Zika virus take as it spread across the globe, and how did its genetic makeup change along the way? To this end, we will investigate and visualize both geolocational and phylogenetic data from the Zika virus. Using generic JavaScript nodes, we will create our own dynamic views and wrap them into a composite interactive view.

Even if you deal with very different data on a day-to-day basis, this blog post is still of high value, as we show how to increase the flexibility of your analysis using interactive generic JavaScript views.

Zika Virus

Zika virus (ZIKV) is an RNA virus with a 10.7 kb (kilo base pairs) genome encoding a single polyprotein; it is transmitted among humans by mosquitoes. It is named after the Zika forest in Uganda, where the virus was first isolated in 1947 from a sentinel rhesus monkey [4]. In humans, ZIKV infection typically causes Zika fever accompanied by maculopapular rash, headache, conjunctivitis and myalgia.

In early 2015, a widespread epidemic of Zika fever spread from Brazil to other parts of South and North America. In February 2016, the World Health Organization declared the outbreak a Public Health Emergency of International Concern as evidence grew that Zika can cause birth defects as well as neurological problems [5].

In order to contain and control the spread of Zika, epidemiologists need to know the paths through which the virus spreads, as well as the ways its genetic makeup changes in different locations. The visual analytics capabilities available in KNIME Analytics Platform offer a great resource to investigate these questions.

A for Analytics to Z for Zika Virus

Figure 1: Workflow to analyze epidemiological data about the Zika virus. Temporal/spatial data as well as phylogenetic information are used as inputs for a wrapped metanode that contains an interactive view.

Composite Interactive View for Epidemiological Data

We created a workflow (Fig. 1) to interactively investigate and visualize geolocational and phylogenetic data in KNIME Analytics Platform. In order to do this, we downloaded two files from the microreact project:

  • microreact-project-zika-virus-data.csv containing temporal (day, year, month) as well as spatial (country, latitude, longitude) information of the reported Zika virus cases in CSV format
  • microreact-project-zika-virus-tree.nwk containing the phylogenetic tree in Newick format

A phylogenetic tree is a diagram that depicts evolutionary relationships among organisms. One way of obtaining those evolutionary relationships is by comparing genomic sequences based on differences in the DNA that naturally accumulate over time. A common representation of the resulting tree structure is the Newick format [6].

A for Analytics to Z for Zika Virus

Figure 2: Inside the wrapped metanode, “Interactive view”: The view is composed of a range slider to filter by year, a detailed JavaScript table, and two generic JavaScript views: a map and a phylogenetic tree.

As Figure 1 demonstrates, the two downloaded files are used as input for a wrapped metanode named “Interactive view”. We used the Color Manager node to color-code according to the attribute, “regional”, which contains the categories Pacific Islands, Brazil, Brazil Microcephaly, Other American and Unknown. The view includes an interactive map based on the geolocation data, a range slider to let you filter by year, and an interactive table with additional information from the input csv file (see Fig. 2). In addition, it contains an interactive visualization of the phylogenetic tree. Figure 3 shows the complete composite view generated in the wrapped metanode, “interactive view”. The user can, for example, filter by year or select a specific Zika strain in the tree view which then gets selected in the map as well as in the table.

Investigation of Zika Virus Data

Looking at the phylogenetic tree in Figure 3, we find that the strain most divergent from the others is found in Asia, whereas strains collected in South America are most closely related [7]. This is also in line with the time the samples were collected, which we can easily explore using the range slider to filter by year (see view in Fig. 3). The first data point in our map is from Malaysia in 1966, followed by Micronesia in 1968. In 2010 and 2013, we find three occurrences in Cambodia, French Polynesia, and Canada (imported from Thailand). From there, the virus spread further south to Haiti. In 2015, ZIKV was reported in Brazil and subsequently in several countries of Central and South America such as Suriname, Guatemala, and Colombia.

The detailed table in the lower part of the view enables the extraction of additional information. This information includes the Zika strain, whether the complete CDS (coding sequence) was extracted, and the size of the sequence in base pairs (bp) that was available for the phylogenetic analysis. If we select a data point in the table, it is automatically selected in the map and the tree as well, and vice versa. We can also check for detailed information about a data point by selecting it in the map (see Fig. 3).

A for Analytics to Z for Zika Virus

Figure 3: Composite view containing the interactive map, a phylogenetic tree, a range slider, and an interactive table with additional information.

The interactivity allows us to easily investigate how the virus spread and explore the data in detail. For the dedicated KNIME JavaScript plots and tables, this interactivity is easily achieved by combining views that operate on the same table in a wrapped metanode. For the custom JavaScript views (in our case the map and the phylogenetic tree) it is worth diving a bit further into the details to see how this can be done.

Interactive Generic JavaScript Views

Figure 4 shows the configuration of the generic Interactive tree JavaScript node. The code uses Jason Davies’ Newick format parser from 2011 (https://github.com/jasondavies/newick.js).

A for Analytics to Z for Zika Virus

Figure 4: Configuration of the generic Interactive tree JS view. The CSS style can be included on the left; the JavaScript code as well as the Dependencies are on the right.

In the configuration you can see the block of code that accesses the input table and uses a set of predefined libraries to generate the view. The dependencies can be found in the upper part of the window. Our tree node is built with d3.js; more information about that can be found in the blog post "From d3 example to interactive KNIME view in 10 minutes".

For the map node we integrate the leaflet.js library via this link: https://unpkg.com/leaflet@1.3.1/dist/leaflet. Hence, to be able to display the map you need an internet connection.
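To give a feel for what such a custom view body can look like, here is a minimal, hypothetical sketch of a Generic JavaScript View that draws the geolocation rows with leaflet.js. The column names "Latitude" and "Longitude", the marker styling, and the tile server are assumptions for illustration and are not taken from the actual workflow; the leaflet library is assumed to be registered as a dependency as described above.

// Minimal sketch of a Generic JavaScript View body using leaflet.js (illustrative only)
var body = document.getElementsByTagName('body')[0];
var mapDiv = document.createElement('div');
mapDiv.setAttribute('id', 'map');
mapDiv.style.width = '100%';
mapDiv.style.height = '600px';
body.appendChild(mapDiv);

// Create the map and add an OpenStreetMap tile layer (requires an internet connection)
var map = L.map('map').setView([0, 0], 2);
L.tileLayer('https://{s}.tile.openstreetmap.org/{z}/{x}/{y}.png', {
  attribution: '&copy; OpenStreetMap contributors'
}).addTo(map);

// knimeDataTable is the global object populated with the node's input table
var columnNames = knimeDataTable.getColumnNames();
var lat = knimeDataTable.getColumn(columnNames.indexOf('Latitude'));   // assumed column name
var lon = knimeDataTable.getColumn(columnNames.indexOf('Longitude'));  // assumed column name
for (var i = 0; i < lat.length; i++) {
  L.circleMarker([lat[i], lon[i]], { radius: 5 }).addTo(map);
}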

Interactive Generic JavaScript Views - Functionality

Here, we briefly describe the functionality that helped us to create our interactive generic JavaScript views where the user was able to filter and select the data (for details, please have a look into the full source code in the example workflow). Note that you can use this information and apply it to many different scenarios - to retrieve, select, and filter data.

Do you have data available at the input port? How can the data be accessed and retrieved?

  • knimeDataTable.getColumnNames() - accesses the data and retrieves the column names in a string array, with knimeDataTable being a global JavaScript object that is created and populated automatically
  • knimeDataTable.getColumn(columnID) - accesses and retrieves an array of all values contained in the column with the given ID (see the node description for details and further methods)

How can you select and filter data?

  • knimeService is a second global object that enables you to, for example, support selection and filtering in your view

To register a subscriber to selection events, call the following method:

  • knimeService.subscribeToSelection(tableId, callback)

You can also subscribe to a filter event through:

  • knimeService.subscribeToFilter(tableId, callback)

A callback is the function to be called when a selection or filter event occurs, e.g.

  • filterChanged = function(data) { […] } where “data” is an object that contains information about the currently applied filter

To unsubscribe from a filter or selection, call:

  • knimeService.unsubscribeFilter(tableId, callback) or knimeService.unsubscribeSelection(tableId, callback)

To change the selection, use any of these three convenient methods:

  • knimeService.addRowsToSelection(tableId, rows, callback)
  • knimeService.removeRowsFromSelection(tableId, rows, callback)
  • knimeService.setSelectedRows(tableId, selectedRows, callback)

These functions helped us to achieve the interactivity in our generic JavaScript views which in turn allowed us to create a view where the user can investigate epidemiological data from the Zika virus.
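As an illustration of how these pieces fit together, here is a minimal sketch of the wiring inside a generic JavaScript view. The redrawView stub and the way the table ID is obtained are placeholders/assumptions; the actual workflow's code is more elaborate (see the full source in the example workflow).

// Sketch: subscribing to selection and filter events and publishing a selection.
// redrawView() is a stub standing in for the actual d3/leaflet re-rendering code.
var tableId = 'id-of-the-input-table'; // placeholder: see the node description for how to obtain it

function redrawView() {
  // re-render the custom visualization with the current selection/filter applied
}

var selectionChanged = function (data) {
  // "data" describes which rows were added to or removed from the selection
  redrawView();
};

var filterChanged = function (data) {
  // "data" contains information about the currently applied filter (e.g. the year range)
  redrawView();
};

knimeService.subscribeToSelection(tableId, selectionChanged);
knimeService.subscribeToFilter(tableId, filterChanged);

// When the user clicks a point in the map or a node in the tree, publish the selection
// so that the other views (table, map, tree) highlight the same row:
function onPointClicked(rowKey) {
  knimeService.setSelectedRows(tableId, [rowKey]);
}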

Summary

Generating an interactive composite view allowed us to interactively explore how the Zika virus spread over time and to relate that to sequence similarities in the genome of the virus. This workflow is a nice example of how it is possible to handle epidemiological data with KNIME Analytics Platform and how to generate custom JavaScript views.

You can investigate the results using the Analytics Platform or deploy the workflow to KNIME Server and enjoy the view on the WebPortal. The latter gives e.g. a Zika virus expert the chance to interact with the data without having to know how to use KNIME Analytics Platform.

The KNIME workflow described in the blog post is available on the publicly available EXAMPLES Server here: 03_Visualization/04_Geolocation/09_Geo_pylogenetic_analysis_Zika

Thank you to Oleg Yasnev for contributing to this article.

References

1. "Microbiology Society Journals | Microreact: visualizing and sharing ...."
Accessed 1/31/2019

2. "Zika virus in the Americas: Early epidemiological and genetic findings ...."
Accessed 1/31/2019

3. "WHO | Epidemiology - World Health Organization."
Accessed 1/31/2019

4. "Zika virus. I. Isolations and serological specificity. - NCBI."
Accessed 1/31/2019

5. "The Emergence of Zika Virus as a Global Health Security ... - NCBI - NIH."
Accessed 1/31/2019

6. "Newick format - Wikipedia."
Accessed 1/31/2019

7. "Zika virus in the Americas: Early epidemiological and ... - Science."
Accessed 1/31/2019

From Modeling to Scoring: Confusion Matrix and Class Statistics

From Modeling to Scoring: Confusion Matrix and Class Statistics | Maarit | Mon, 05/27/2019 - 10:00

Author: Maarit Widmann

Wheeling like a hamster in the data science cycle?
Don’t know when to stop training your model?


Model evaluation is an important part of a data science project and it’s exactly this part that quantifies how good your model is, how much it has improved from the previous version, how much better it is than your colleague’s model, and how much room for improvement there still is.

In this series of blog posts, we review different scoring metrics: for classification, numeric prediction, unbalanced datasets, and other similar more or less challenging model evaluation problems.

Today: Confusion Matrix and Class Statistics

This first blog post lauds the confusion matrix - a compact representation of the model performance, and the source of many scoring metrics for classification models.

A classification model assigns data to two or more classes. Sometimes, detecting one or the other class is equally important and bears no additional cost. For example, we might want to distinguish equally between white and red wine. At other times, detecting members of one class is more important than detecting members of the other class: an extra investigation of a non-threatening flight passenger is tolerable as long as all criminal flight passengers are found.

Class distribution is also important when you’re quantifying performances of classification models. In disease detection, for example, the number of disease carriers can be minor in comparison with the class of healthy people.

The first step in evaluating a classification model of any nature is to check its confusion matrix. Indeed, a number of model statistics and accuracy measures are built on top of this confusion matrix.

Email Classification: spam vs. useful

Let’s take the case of the email classification problem. The goal is to classify incoming emails in two classes: spam vs. useful (“normal”) email. For that, we use the Spambase Data Set provided by UCI Machine Learning Repository. This dataset contains 4601 emails described through 57 features, such as text length and presence of specific words like “buy”, “subscribe”, and “win”. The “Spam” column provides two possible labels for the emails: “spam” and “normal”.

Figure 1 shows a workflow that covers the steps to build a classification model: reading and preprocessing the data, partitioning into a training set and a test set, training the model, making predictions by the model, and evaluating the prediction results.

The workflow shown below is downloadable from the EXAMPLES Server under 04_Analytics/10_Scoring/01_Evaluating_Classification_Model_Performance and on the KNIME Workflow Hub page.

Fig. 1: Workflow building, applying and evaluating a supervised classification model: data reading and preprocessing, partitioning, model training, prediction, and model evaluation. This workflow predicts whether emails are “spam” or “normal”. Download it from KNIME Workflow Hub or the EXAMPLES Server under 04_Analytics/10_Scoring/01_Evaluating_Classification_Model_Performance

The last step in building a classification model is model scoring, which is based on comparing the actual and predicted target column values in the test set. The whole scoring process of a model consists of a match count: how many data rows have been correctly classified and how many data rows have been incorrectly classified by the model. These counts are summarized in the confusion matrix.

In the email classification example we need to answer several different questions:

  • How many of the actual spam emails were predicted as spam?
  • How many as normal?
  • Were some normal emails predicted as spam?
  • How many normal emails were predicted correctly?

These numbers are shown in the confusion matrix. And the class statistics are calculated on top of the confusion matrix. The confusion matrix and class statistics are displayed in the interactive view of the Scorer (JavaScript) node as shown in Figure 2.

Fig. 2: Confusion matrix and class statistics in the interactive view of the Scorer (JavaScript) node.

Confusion Matrix

Let’s see now what these numbers are in a confusion matrix.

The confusion matrix was initially introduced to evaluate results from binomial classification. Thus, the first thing to do is to take one of the two classes as the class of interest, i.e. the positive class. The other value in the target column is then automatically considered the negative class. This assignment is arbitrary; just keep in mind that some class statistics will show different values depending on the selected positive class. Here we chose the spam emails as the positive class and the normal emails as the negative class.

The confusion matrix in Figure 3 reports the count of:

  • The data rows (emails) belonging to the positive class (spam) and correctly classified as such. These are called True Positives (TP). The number of true positives is placed in the top left cell of the confusion matrix.
  • The data rows (emails) belonging to the positive class (spam) and incorrectly classified as negative (normal emails). These are called False Negatives (FN). The number of false negatives is placed in the top right cell of the confusion matrix.
  • The data rows (emails) belonging to the negative class (normal) and incorrectly classified as positive (spam emails). These are called False Positives (FP). The number of false positives is placed in the lower left cell of the confusion matrix.
  • The data rows (emails) belonging to the negative class (normal) and correctly classified as such. These are called True Negatives (TN). The number of true negatives is placed in the lower right cell of the confusion matrix.

Therefore, the correct predictions lie on the main diagonal of the matrix (gray background), while the incorrect predictions lie on the off-diagonal cells (red background):

Fig. 3: A confusion matrix showing actual and predicted positive and negative classes in the test set.

Measures for Class Statistics

Now, using the four counts in the confusion matrix, we can calculate a few class statistics measures to quantify the model performance.

The class statistics, as the name implies, summarize the model performance for the positive and negative classes separately. This is why their values and interpretation change with a different definition of the positive class, and why they are often expressed using two measures.

Sensitivity and Specificity

Fig. 4: Sensitivity and specificity values and their formulas, which are based on the values in the confusion matrix, for a classification model predicting emails as “spam” or “normal”

Sensitivity measures how apt the model is at detecting events in the positive class. So, given that spam emails are the positive class, sensitivity quantifies how many of the actual spam emails are correctly predicted as spam.

Sensitivity = TP / (TP + FN)

We divide the number of true positives by the number of all positive events in the dataset: the positive class events predicted correctly (TP) and the positive class events predicted incorrectly (FN). The model in this example reaches the sensitivity value of 0.882. This means that about 88 % of the spam emails in the dataset were correctly predicted as spam.

Specificity measures how good the model is at detecting events in the negative class, in this case, how reliably normal emails are recognized as normal rather than labeled as spam.

Specificity = TN / (TN + FP)

We divide the number of true negatives by the number of all negative events in the dataset: the negative class events predicted incorrectly (FP) and the negative class events predicted correctly (TN). The model reaches the specificity value of 0.964, so less than 4 % of all normal emails are predicted incorrectly as spam.

Recall, Precision and F-Measure

Fig. 5: Recall and precision values and their formulas, which are based on the values shown in the confusion matrix, for a classification model predicting emails as “spam” or “normal”

Similarly to sensitivity, recall measures how good the model is in detecting positive events. Therefore, the formula for recall is the same as for sensitivity.

Precision measures how exact the assignment to the positive class is, that is, how accurate the spam prediction is.

Precision = TP / (TP + FP)

We divide the number of true positives by the number of all events assigned to the positive class, i.e. the sum of true positives and false positives. The precision value for the model is 0.941. Therefore, about 94 % of the emails predicted as spam were actually spam emails.

Recall and precision are often reported pairwise because these metrics report the relevance of the model from two perspectives: recall reflects the type II error (positive events that are missed), while precision reflects the type I error (negative events that are falsely flagged as positive).

Recall and precision often trade off against each other: if we use a stricter spam filter, we reduce the number of dangerous emails in the inbox, but increase the number of normal emails that have to be retrieved from the spam folder afterwards. The opposite, i.e. a less strict spam filter, would force us to do a second manual filtering of the inbox, where some spam emails occasionally land.

Alternatively, recall and precision can be reported by a measure that combines them. One example is called F-measure, which is the harmonic mean of recall and precision:

F-measure = 2 × (Precision × Recall) / (Precision + Recall)
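To make the formulas above concrete, here is a small JavaScript helper that derives these class statistics from the four confusion-matrix counts. The example counts at the end are made up for illustration and are not the counts of the spam model.

// Class statistics from the four confusion-matrix counts (illustrative only)
function classStatistics(tp, fn, fp, tn) {
  var sensitivity = tp / (tp + fn);   // identical to recall
  var specificity = tn / (tn + fp);
  var precision   = tp / (tp + fp);
  var fMeasure    = 2 * (precision * sensitivity) / (precision + sensitivity);
  var accuracy    = (tp + tn) / (tp + fn + fp + tn);
  return { sensitivity: sensitivity, specificity: specificity, recall: sensitivity,
           precision: precision, fMeasure: fMeasure, accuracy: accuracy };
}

// Example with made-up counts: 120 TP, 16 FN, 8 FP, 210 TN
console.log(classStatistics(120, 16, 8, 210));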

Multinomial Classification Model

In case of a multinomial classification model, the target column has three or more values. The emails could be labeled as “spam”, “ad”, and “normal”, for example.

Similarly to a binomial classification model, the target class values are assigned to the positive and the negative class. Here we define spam as the positive class and the normal and ad emails as the negative class. Now, the confusion matrix looks as shown in Figure 6.

Fig. 6: Confusion matrix showing the distribution of predictions to true positives, false negatives, false positives, and true negatives for a classification model predicting emails into three classes “spam”, “ad”, and “normal”

To calculate the class statistics, we have to re-define the true positives, false negatives, false positives, and true negatives using the values in a multinomial confusion matrix:

  • The cell identified by the row and column for the positive class contains the True Positives, i.e. where the actual and predicted class is spam
  • Cells identified by the row for the positive class and columns for the negative class contain the False Negatives, where the actual class is spam, and the predicted class is normal or ad
  • Cells identified by rows for the negative class and the column for the positive class contain the False Positives, where the actual class is normal or ad, and the predicted class is spam
  • Cells outside the row and column for the positive class contain the True Negatives, where the actual class is ad or normal, and the predicted class is ad or normal. An incorrect prediction inside the negative class is still considered as a true negative

Now, these four statistics can be used to calculate class statistics using the formulas introduced in the previous section.
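The short sketch below illustrates how these four counts can be read off a multinomial confusion matrix for a chosen positive class; the 3x3 matrix, the class order, and the counts are purely illustrative.

// Derive TP, FN, FP, TN for one positive class from an n-by-n confusion matrix
// (rows = actual classes, columns = predicted classes). Illustrative data only.
function oneVsRestCounts(matrix, positiveIndex) {
  var tp = 0, fn = 0, fp = 0, tn = 0;
  for (var i = 0; i < matrix.length; i++) {
    for (var j = 0; j < matrix[i].length; j++) {
      if (i === positiveIndex && j === positiveIndex) { tp += matrix[i][j]; }
      else if (i === positiveIndex) { fn += matrix[i][j]; }
      else if (j === positiveIndex) { fp += matrix[i][j]; }
      else { tn += matrix[i][j]; } // mix-ups inside the negative class still count as TN
    }
  }
  return { tp: tp, fn: fn, fp: fp, tn: tn };
}

// Classes in order ["spam", "ad", "normal"], positive class = "spam" (index 0)
var confusion = [[50,  3,  4],
                 [ 2, 30,  6],
                 [ 1,  5, 99]];
console.log(oneVsRestCounts(confusion, 0));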

Summary

In this article, we’ve laid the first stone for the metrics used in model performance evaluation: the confusion matrix.

Indeed, a confusion matrix shows the performance of a classification model: how many positive and negative events are predicted correctly or incorrectly. These counts are the basis for the calculation of more general class statistics metrics. Here, we reported those most commonly used: sensitivity and specificity, recall and precision, and the F-measure.

Confusion matrix and class statistics have been defined for binomial classification problems. However, we have shown how they can be easily extended to address multinomial classification problems.

Embedding KNIME in a Manufacturing Environment

Embedding KNIME in a Manufacturing Environment | admin | Mon, 06/03/2019 - 10:00

Author: Brendan Doherty, Seagate Technology

In this blog post I will discuss some of the processes and steps that were taken on the journey to embed KNIME in a High Volume Manufacturing Environment within Seagate Technology.

Seagate Technology is one of the world's largest manufacturers of electronic data storage technologies and solutions. Seagate Technology creates products and services that include network attached storage, high performance computing, data protection appliances, internal hard drives, backup and recovery services, flash storage, and related solutions. It is a vertically integrated company with manufacturing plants in many locations worldwide. The read/write heads for the HDDs are manufactured in two locations, one of which is in Derry City, Northern Ireland. These devices are highly complex, have a long manufacturing cycle time, and generate a lot of data during their fabrication. The plant has many different groups located at the site, all of which use data from a wide variety of sources on a daily basis.

The Operations Technology Advanced Analytics Group (OTAAG) within Seagate Technology recognized the potential of applying KNIME Analytics Platform to the data centric world in which the employees of Seagate work, in order to help make better data driven decisions and automate many repetitive manual data tasks at a variety of Seagate sites around the world (US, Asia, Europe). After having used KNIME for a range of projects for about a year, I attended the KNIME Spring Summit in Berlin. I was enthused by this experience and by seeing how other companies were using KNIME, so I set off to develop a pathway to advocate the use of KNIME and embed it within the plant and other Seagate sites that use KNIME. Initially there were only a few users, all based within the OTAAG group, but now, after a lot of hard work and persistence, all of the major and many of the smaller groups in the factory are showing varying levels of usage of KNIME for a wide variety of tasks. My aim now is to encourage cross-pollination and synergy between groups, where KNIME users exchange concepts and knowledge, to the benefit of both them and the company (see Fig. 1).

Fig. 1. Encourage cross-pollination of ideas for KNIME users. Training a number of users in each group allows for the generation of new ideas and collaboration

Requirement: a tool for all levels of users to achieve data engineering needs

The diagram in Figure 2, below, highlights some of the steps that I implemented to enable the uptake of KNIME in the plant. Prior to this point there was a definite interest and appetite to learn more about Data Engineering and Data Science across a range of groups, but due to varying levels of user experience and ability this ambition had not been achieved.

Having programmed in many languages in the past before using KNIME, I immediately saw the opportunity to advocate the use of KNIME as a tool for all levels of users to engage with and enable them to achieve their data engineering needs.

The first part of the pathway was to deliver hands-on training using KNIME to many waves of users across a range of groups. By using specific factory-oriented examples, which catered to all levels of users, people could immediately see the benefits and opportunities of using KNIME Analytics Platform. The training was pitched at a pace and level such that non-native programmers could easily follow and understand the examples and not feel overwhelmed by the experience.

Fig. 2. Pathway taken for KNIME Data Analytics and Automation in the Manufacturing Plant. Modules and concepts used to enable the uptake of KNIME in the factory

Once the training was complete I encouraged and helped people to then complete a use case based on their own job function in order to cement learning. An enthusiastic approach is critical here and I worked with many users on a 1:1 basis in order to get projects kicked off and driven to completion. In the long run all this work benefited both the users and the company.

Newly learned skill set quickly employed

I found that most users are able to get up and running quickly after the hands on training, in many cases within a week or two. One interesting example of this was a project from an intern who had limited programming skills when they started to work in the company. The user was able to get up to speed in the use of KNIME very quickly and then saw an opportunity when they could employ their newly learned skill set. The intern created a solution, which highlighted tools that were not following complex dispatch system rules, thereby having some knock on effects in the manufacturing line. By highlighting affected tools and taking corrective action the speed at which material was moved improved and therefore the manufacturing cycle time was decreased. In a short space of time the company was able to get a return on investment from the intern, and the intern also achieved a return on investment by developing a new skill set and creating a real world solution along with gaining invaluable experience in the process.

Tool deployed to automate time-intensive task

Another good example of how an employee was able to quickly deploy KNIME to create a business solution is from the Engineering department. Once the engineer had completed some hands-on KNIME training, they were able to see the potential to automate a very time intensive task for their group. This task involved a rotating list of engineers having to trawl through a variety of Google documents and databases on a daily basis in order to put together a document that identified which products were on hold in the manufacturing line. Our factory is highly integrated with Google tools, and so the Google API integration with KNIME [1] has been of real benefit in many cases. This automated solution now saves around 90 hours of engineering time per month. It also reduces cycle time by letting staff in this department tackle products in a hold state as soon as they come into the office each day, instead of having to wait on a report to be compiled and sent out.

Early adopter use case sessions & Citizen Data Science program improve engagement

A shared area and a Google+ site where KNIME users could communicate and share useful documents and suggestions were set up, all in a bid to improve engagement and communication with the ever growing KNIME community in the factory.

Once I felt we had enough critical mass of projects and users I then facilitated early adopter use case sessions whereby users who have benefited from using KNIME presented to their colleagues on the solutions they developed. This word of mouth advertising helped gather traction from people who had previously not engaged in using KNIME.

At KNIME Spring Summit 2018 some of my colleagues (Allan Luk and Eric Lin) presented on the Citizen Data Science program, which they were rolling out in Seagate Technology, an important element of this is using KNIME for Guided Analytics. This initiative really dovetailed well with embedding KNIME in the factory.

Empowering people to discover and apply Machine Learning

Now that there are many people onsite using KNIME, this has also worked well with the Citizen Data Science program. It has empowered people who previously may have not considered the discovery and application of machine learning techniques to dip their toes into the world of Data Science!

Seagate Technology presentation at KNIME Summit: Have a look at the slides Brendan Doherty presented together with his colleague, Scott Morrison, during the KNIME for Business session at KNIME Spring Summit 2019 in Berlin.

About the Author:

Brendan Doherty, Seagate Technology

Brendan Doherty is a Data Scientist within the Operations Technology Advanced Analytics Group (OTAAG) in Seagate Technology and is based in their site in Derry, Ireland. With over 12 years of experience in Data Analytics, Business Intelligence and Automation, he works to deliver innovative solutions, along with training and evangelizing KNIME within Seagate Technology.

Seagate is the global leader in data storage solutions, developing amazing products that enable people and businesses around the world to create, share and preserve their most critical memories and business data.

References

1. Find out more about Google API Integration in KNIME:


Will They Blend: Experiments in Data & Tool Blending

Will They Blend: Experiments in Data & Tool Blending | Lukasa | Tue, 06/11/2019 - 09:00

In this blog series we’ll be experimenting with the most interesting blends of data and tools. Whether it’s mixing traditional sources with modern data lakes, open-source devops on the cloud with protected internal legacy tools, SQL with noSQL, web-wisdom-of-the-crowd with in-house handwritten notes, or IoT sensor data with idle chatting, we’re curious to find out: will they blend? Want to find out what happens when IBM Watson meets Google News, Hadoop Hive meets Excel, R meets Python, or MS Word meets MongoDB?

Follow us here and send us your ideas for the next data blending challenge you’d like to see at willtheyblend@knime.com.

Today: ML Algorithms meet Domain Experts

Author: Lukas Altenkamp, Gemmacon

In the past decade the explosion of collected data, the strong increase in computing resources, and the evolution of machine learning (ML) algorithms have led to applications that can perform single activities - such as image recognition - at human or even superhuman performance. For every data scientist, the value of this technology is beyond doubt, and potential applications can be found in every part of industry, in almost every company department even. On the other hand, ML is based on finding patterns in a dataset and applying them to unseen data. Results are based on probability rather than intuition, and accepting these results can be very difficult or even impossible.

The domain expert

In a lot of company departments that could potentially use ML applications, these tasks have usually been carried out by employees - domain experts - often with impressive accuracy, thanks to their years of experience, intuition, and excellent knowledge of the data, the problem, and sometimes even undocumented additional information. The switch to using automated algorithms to carry out this task improves speed dramatically, but an improvement in quality is not always guaranteed. This can result in the domain experts mistrusting and ultimately rejecting these kinds of applications, their fear of being replaced by machines notwithstanding.

Who is responsible?

The organizational structures of many (large) companies rely on clearly defined responsibilities – what happens if a prediction of an ML algorithm was wrong and this has led to additional costs? Who is responsible for that? The developer? The domain expert? No one? Having clear procedures is not only recommended but usually mandatory; this often calls for processes needing to be redesigned and involves considerable discussion between members of multiple departments – a bad premise for introducing ML applications quickly.

Building Trust in Machine Learning

We know what it’s like to build trust in colleagues. We believe we understand how they think, and our trust is reinforced following positive experiences with the results they produce. This makes us feel safe trusting their work. ML methods, in contrast, are so complex that an understanding of the results is only possible for the simplest algorithms. Visualizations and explanations of the methodology often provide just a rough picture of how the results have been ascertained, and some algorithms are intrinsically so sophisticated that simple reasoning is impossible!

For a domain expert without any background in data science it can hardly be anything other than a black box. So, when will the expert trust the algorithm as naturally as we trust electricity and that the light will go on at a press of the switch?

In our approach we try to combine elements of Guided Analytics to address this problem and involve the domain expert in the process. In this way domain expertise and algorithmic performance can complement each other.

This has several advantages:

  • The domain expert retains control over the process – they can decide if the algorithm can be trusted, and inspect, review, and change results
  • Responsibilities are clear: the result is approved by an employee
  • Domain knowledge that is not documented in the data can still be incorporated
  • The domain expert is not replaced by the automated software but is given a tool that supports them in performing their task faster and better. With time and positive experiences, trust in the algorithms and acceptance of the methods can be developed

A Workflow for Semi-automated Data Blending

To show how such a process can be implemented, based on an example, we created a workflow that blends two different datasets. The datasets are from a fictitious online shop, which separates order data into purchase and customer data. Each dataset consists of unique columns and overlapping columns, which appear in both datasets. However, they do not match perfectly: some entries can be incorrect, misspelled, or formatted differently, meaning that the standard Joiner node would fail in most cases. Tab. 1 shows the structure of the dataset.

Tab. 1: Overview of the columns of the two data sources

The workflow that we use in this example is shown below. The four boxes highlight the different steps of the workflow:

  • Load data and view
  • Compute similarity measures, choose accuracy, and perform automatic matching
  • Manual inspection
  • Show and save the results
Fig. 1: The Semi-Automated ML Workflow

The workflow can be basically split into two parts: the first part represents the ML algorithm to match corresponding rows. In our case we use simple numeric and string distance metrics to compute the total distance (difference) between the entries of the overlapping columns of each data row of the first source with each data row of the second table. The closest pairs are matched and joined. The second part of the workflow enables user interaction and invites the domain expert to inspect, review, and change the result of the algorithm. This workflow generalizes well, as the algorithmic part serves simply as an example and can be exchanged by any ML algorithm; the interactive views can be adapted easily to different use cases.

Interaction via the KNIME WebPortal

Domain expert perspective

To begin with, we want to discuss the workflow from the domain expert’s perspective – via the KNIME WebPortal.

  1. Introduction

The workflow starts with an introductory, informative view. The first view shows the data, as well as some additional information about the problem. Since there are multiple overlapping columns, the domain expert can choose which ones are used for the matching algorithm, e.g. remove particularly erroneous columns.

  2. Define a threshold for manual verification

After these introductory steps, the matching algorithm computes distance (difference) measures between each pair of rows from the two input tables.

The Automated vs. Manual Matching page on the KNIME WebPortal in Figure 2 shows one of the key pages: a histogram showing the distribution of the distance for the best matches from each row of the first input source. This distribution gives the domain expert a quick overview of how the algorithm has performed. Is the algorithmic matching generally confident? Are there outliers?

Using the slider, the domain expert can define a threshold. Below the threshold, the matching is performed automatically; above it, each match has to be verified by the expert. Depending on trust, time, and their own expertise, the domain expert is fully flexible in how they control the algorithm.

Fig. 2: Defining a threshold
  3. Manual inspection

After defining the threshold, the workflow loops iteratively through all questionable matches and prints out the data point from the first source (purchase data) together with possible matches (from the customer data) in descending order. Quantities such as distance or precision can be shown too, and provide valuable additional information to the domain expert.

Fig. 3: Manual inspection
  4. Results

The results are shown after each of the datapoints has been manually verified or changed. Optionally the joined table can also be downloaded.

Fig. 4: The resulting table plus statistics

The Matching Algorithm

We would like to take you behind the curtain of the matching algorithm! It is based on computing distances between the entries of the overlapping columns, which can be numeric, string, or other types such as date and time (not in this workflow).

Distance measures

A specific distance definition is needed for each type. For string columns, the Levenshtein distance can be used, for example, to compare two strings. This distance is normalized to the maximum distance appearing in the comparison, so that the values are in the range [0,1]. This distance is computed for every pair of rows and every string column.

Note: The Levenshtein distance is one of the most famous string metrics for measuring the difference between two strings. It is the minimum number of single-character operations (i.e. deletions, insertions, or substitutions) required to transform one of the strings into the other.

For numeric columns the computation is slightly different. Here one can easily calculate the numeric difference between the two numbers and normalize it to the maximum difference of that column, so that the values are in the range [0,1] as well. The total distance can be obtained by averaging over all columns.

Adapt weights

Domain knowledge can be introduced into the distance measures, for example by modifying the weights of each column. The distance values for each column are in the range [0,1], so that each column contributes the same weight to the total distance. However, if, due to some other piece of information, you expect some columns to match better than others, you can adapt the weights and increase the influence of specific columns on the total distance. An easy implementation is found in the Column Expression node, where we just doubled the distance of the ID column – this encodes the information that we expect the ID to have fewer errors than the other columns.
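As a rough sketch of this idea (not the actual node configuration), the following JavaScript combines a Levenshtein-based string distance with a normalized numeric difference into a weighted total distance. The column names, example values, weights, and the normalization by string length (rather than by the maximum distance appearing in the comparison) are simplifying assumptions for illustration.

// Sketch: weighted total distance between a purchase row and a customer row (illustrative only)
function levenshtein(a, b) {
  var d = [];
  for (var i = 0; i <= a.length; i++) { d[i] = [i]; }
  for (var j = 0; j <= b.length; j++) { d[0][j] = j; }
  for (var i = 1; i <= a.length; i++) {
    for (var j = 1; j <= b.length; j++) {
      var cost = a[i - 1] === b[j - 1] ? 0 : 1;
      d[i][j] = Math.min(d[i - 1][j] + 1,          // deletion
                         d[i][j - 1] + 1,          // insertion
                         d[i - 1][j - 1] + cost);  // substitution
    }
  }
  return d[a.length][b.length];
}

// Normalize the string distance by the longer string, the numeric distance by the column range
function stringDistance(a, b) {
  var maxLen = Math.max(a.length, b.length);
  return maxLen === 0 ? 0 : levenshtein(a, b) / maxLen;
}
function numericDistance(a, b, columnRange) {
  return columnRange === 0 ? 0 : Math.abs(a - b) / columnRange;
}

// Weighted average over the overlapping columns; the ID column gets double weight
function totalDistance(purchaseRow, customerRow, amountRange) {
  var parts = [
    { d: stringDistance(purchaseRow.id, customerRow.id), w: 2 },
    { d: stringDistance(purchaseRow.name, customerRow.name), w: 1 },
    { d: numericDistance(purchaseRow.amount, customerRow.amount, amountRange), w: 1 }
  ];
  var sum = 0, weights = 0;
  parts.forEach(function (p) { sum += p.w * p.d; weights += p.w; });
  return sum / weights;
}

console.log(totalDistance(
  { id: 'C-1042', name: 'Jon Smith',  amount: 149.9 },
  { id: 'C-1042', name: 'John Smith', amount: 150.0 },
  500  // assumed maximum difference of the amount column
));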

Conclusion

The workflow shows a way to blend the power of ML algorithms with the domain knowledge of the end user - the domain expert. Using both sources can improve the quality of the results and puts the domain expert back in full control, making the ML application a powerful and supportive tool rather than a black box where decisions are made but not transparently reasoned.

The workflow can be easily adapted to other ML problems than matching different data sources.

References

  • Workflow: You can download and try it out from the KNIME Hub or from the publicly available KNIME EXAMPLES Server: 40_Partners\04_GEMMACON\01_Semi_Automated_ML

  • Useful docs: KNIME WebPortal User Guide

Guided Analytics Webinar: Come behind the scenes of Guided Analytics at our upcoming webinar on June 18 at 6:00 PM CEST. Find out more and register here (it's free!)

-----------------------

Coming Next …

If you enjoyed this, please share this generously and let us know your ideas for future blog posts.

About the Author


After his Ph.D. in particle physics, Lukas joined Gemmacon as a data scientist and consultant with a passion for transforming business problems into data science projects. Besides the development of data pipelines, the implementation of machine learning algorithms, and the design of visualizations, a key ingredient is close collaboration with domain experts in order to integrate their knowledge into the solutions.

GEMMACON is a KNIME trusted partner. The company is known for providing simple, effective solutions and passion for digitalization within the automotive industry. As a consulting company the focus is on process, project and quality management. The innovative Quality Analytics approach raises efficiency, understanding and visualization of data to a whole new level. GEMMACON creates the maximum added value for its customers through the reduction of warranty and goodwill costs, the growth of service revenue, and increased customer satisfaction.

From Modeling to Scoring: Finding an Optimal Classification Threshold based on Cost and Profit

From Modeling to Scoring: Finding an Optimal Classification Threshold based on Cost and Profit | Maarit | Mon, 06/17/2019 - 10:00

Authors: Maarit Widmann (KNIME) and Alfredo Roccato (Data Science Trainer and Consultant)

Wheeling like a hamster in the data science cycle? Don’t know when to stop training your model?

Model evaluation is an important part of a data science project and it’s exactly this part that quantifies how good your model is, how much it has improved from the previous version, how much better it is than your colleague’s model, and how much room for improvement there still is.

In this series of blog posts, we review different scoring metrics: for classification, numeric prediction, unbalanced datasets, and other similar more or less challenging model evaluation problems.


Today: Penalizing and Rewarding Classification Results with a Profit Matrix

Confusion matrix and class statistics summarize the performance of a classification model: the actual and predicted target class distribution, accuracy of the assignment into the positive class, and the ability to detect the positive class events. However, these statistics do not consider the cost of a mistake, that is, a prediction into the wrong target class.

If the target class distribution is unbalanced, predicting events correctly into the minority class requires high model performance, whereas predicting events into the majority class can easily happen by chance. Wouldn't it be useful to take this into account, and weight the results differently when evaluating the model performance?

Ultimately, the final goal of the classification determines whether it makes sense to introduce a cost to certain types of classification results. Cost is useful when incorrect predictions into one target class have more serious consequences than incorrect predictions into the other class(es). Or, put another way, correct predictions into one class have more favourable consequences than correct predictions into the other class(es). For example, not detecting a criminal passenger at the airport security control has more serious consequences than mistakenly classifying a non-threatening passenger as dangerous. Therefore, these two types of incorrect predictions should be weighted differently. No cost is needed if all target classes are equally interesting or important, and the consequences of a wrong prediction into one target class is as bad as it is for the other classes. This is the case when we predict the color of a wine, for example, or the gender of a customer.

From Model Accuracy to Expected Profit

In addition to accuracy statistics, the performance of a classification model can be measured by expected profit. The profit is measured in a concrete unit defined by the final goal of the classification.

When we use classification results in practice, we assign each predicted class a different treatment: Criminal passengers are arrested, non-threatening passengers are let through. Risky customers are not extended credit, creditworthy customers are! And so on. The most desirable classification results produce profit, such as the security of an airport, or the money that a credit institute makes. We measure this profit in a predefined unit such as the number of days without a terror alarm, or euros. The most undesirable results bring about cost - a terror alarm at the airport, or money lost by a bank - and we measure the cost in the same unit as the profit.

Here, we assess the accuracy and expected profit of a classification model that predicts the creditworthiness of credit applicants. In a credit scoring application, predicting individual customer behavior has a consequence in terms of profit (or loss). Refusing good credit can cause loss of profit margins (commercial risk). Approving credit for high risk applicants can lead to bad debts (credit risk).

Optimizing Classification Threshold

A classification model predicts a positive class score for each event in the data. By default the events are assigned to the positive class if their score is higher than 0.5, and otherwise to the negative class. If we change the classification threshold, we change the assignment to the positive and negative class. Consequently, the values of accuracy and expected profit change as well.

Data

In this example, we use the well-known German Credit Data Set, as taken from the University of California Archive for Machine Learning and Intelligent Systems.

The dataset is composed of 1000 customers. The input variables are the individual characteristics of the subjects, like socio-demographic, financial and personal, as well as those related to the loan, such as the loan amount, the purpose of the subscription, and wealth indicators. The target is the evaluation of the credit applicant's creditability by the bank (2 = risky, and 1 = creditworthy).

In this dataset, 700 applicants (70%) are classified as creditworthy and 300 (30%) as risky.

We refer to the risky customers as the positive class and the creditworthy customers as the negative class.

Workflow to Produce Expected Profit for Different Classification Thresholds

The workflow shown in Figure 1 starts with data access and preprocessing. To assess the predictive capabilities of the model, the initial dataset is divided into two tables of equal size, respectively named the training set and the validation set. Next, a logistic regression model is trained on the training set to predict the applicants’ creditworthiness.

Inside the “Profit by threshold” metanode, applicants in the validation set are assigned to the two creditability classes “risky” and “creditworthy” based on the positive class scores that are predicted by the logistic regression model, and a classification threshold. The classification is repeated multiple times, starting with a low value of the threshold and increasing it for each iteration. The output table of the metanode contains the accuracy statistics and expected profit as obtained using the different threshold values and a predefined profit matrix.

Finally, the model performance statistics for different threshold values are shown in an interactive composite view as produced by the “Profit Views” component.

You can download this workflow from the EXAMPLES Server and from the KNIME Hub.

Fig. 1: Workflow to train a classification model and to produce accuracy statistics and expected profit based on the predicted positive class scores, predefined profit matrix, and varying values of the classification threshold. An optimal threshold value can be empirically defined from the interactive composite view and table output that show the accuracy and expected profit by different threshold values. The workflow is downloadable from the EXAMPLES Server and on the KNIME Hub

Profit Matrix

To evaluate misclassification in terms of expected profit, a profit matrix is required that assigns cost to undesirable outcomes.

We introduce a negative cost (-1) to the False Negatives - risky applicants who are approved a credit - and a positive profit (0.35) to the True Negatives - creditworthy applicants who are approved a credit. The profit matrix in Table 1 shows the cost and profit values for these classification results.

Table 1: Profit matrix that introduces a profit to the classification results: a cost to approved bad credits, and a profit to approved good credits

The values of cost and profit introduced in Table 1 are based on the following hypothesis [1]: Let’s assume that a correct decision by the bank would result in 35% profit at the end of a specific period, say 3-5 years. If the opposite were true, i.e. the bank predicts that the applicant is creditworthy, but it turns out to be bad credit, then the loss is 100%.

Calculating Expected Profit

The following formulas are used to report the model performance in terms of expected profit:

Profit matrix

where p is the share of the positive (risky) class events of all data.

Profit matrix

where n is the number of credit applicants.

More generally, assuming that the class with negative risk potential is defined as the positive class, an average profit for a classification model with a profit matrix can be calculated using the following formula:

profit matrix

where n is the number of events in the data.

In this example, we have 500 credit applicants in the validation set with an average loan of 10 000 €. 70% of the applicants are creditworthy and 30% are risky. Let’s first calculate a baseline for the profit statistics without using any classification model:

profit matrix

If we approve a credit for all of the applicants, the expected loss is 225,000 €.

Next, let’s calculate what the expected profit is when we evaluate the creditworthiness using a classification model and we weigh the outcomes with the profit matrix.

The minimum threshold for the positive class to achieve non-zero profit [2] can be calculated from the cost matrix as

profit matrix

This value can be adjusted empirically as described below.

The workflow shown in Figure 2 iterates over different thresholds applied to the positive class scores that have been predicted by a classification model, here a logistic regression model. The threshold values range from 0 to 1 with a step size of 0.01. The workflow produces the overall accuracy for each value of the threshold by comparing the actual (unaltered in each iteration) and predicted (altered in each iteration) target class values. In order to calculate the expected profit, classification results from each iteration are weighted by the values in the profit matrix. In the output table of this workflow, every row corresponds to a value of the classification threshold; furthermore, the model accuracy statistics, average profit per applicant, average amount per applicant, and total average amount are shown for each classification threshold.

Fig. 2: Producing the accuracy and expected profit for different classification threshold values from 0 to 1 with a step size of 0.01. Input data contain the actual target class values, positive class scores predicted by a classification model, and profit matrix values.
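As a plain-code illustration of what this loop does (outside of KNIME, and with made-up scores instead of the logistic regression output), the sketch below sweeps the threshold from 0 to 1 in steps of 0.01, classifies each applicant by comparing the predicted risky-class score with the threshold, and weights the outcomes with the profit matrix from Table 1.

// Sketch of the threshold sweep; "risky" is the positive class. Data are illustrative.
var PROFIT_TN = 0.35;  // creditworthy applicant correctly approved
var COST_FN = -1.0;    // risky applicant incorrectly approved

// each entry: predicted positive-class ("risky") score and actual class (made-up values)
var applicants = [
  { score: 0.15, actual: 'creditworthy' },
  { score: 0.45, actual: 'risky' },
  { score: 0.72, actual: 'risky' },
  { score: 0.30, actual: 'creditworthy' }
];

var results = [];
for (var t = 0; t <= 1.0001; t += 0.01) {
  var correct = 0, profit = 0;
  applicants.forEach(function (a) {
    var predicted = a.score > t ? 'risky' : 'creditworthy';
    if (predicted === a.actual) { correct += 1; }
    // credit is only granted to applicants predicted as creditworthy
    if (predicted === 'creditworthy') {
      profit += (a.actual === 'creditworthy') ? PROFIT_TN : COST_FN;
    }
  });
  results.push({
    threshold: Math.round(t * 100) / 100,
    accuracy: correct / applicants.length,
    avgProfitPerApplicant: profit / applicants.length
  });
}
console.log(results);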

 

Results

The interactive composite view in Figure 3 shows how the values of four different model performance indicators develop as the value of the classification threshold increases from 0 to 1. The performance indicators are: 1. Overall accuracy (line plot in the top left corner), 2. Total average amount (line plot in the top right corner), 3. Average profit per applicant (line plot in the bottom left corner), and 4. Average amount per applicant (line plot in the bottom right corner).

Fig. 3: An interactive composite view to show the development of 1. Overall accuracy, 2. Total average amount, 3. Average profit per applicant, and 4. Average amount per applicant when the classification threshold increases from 0 to 1

 

Based on an empirical evaluation, the optimal threshold is 0.51 in terms of overall accuracy, and 0.27 in terms of expected profit. Table 2 represents the performance of the logistic regression model using the default and optimized threshold values in terms of overall accuracy and average profit per applicant:

Table 2. Expected profit and overall accuracy when creditworthiness is not predicted at all, and when it is predicted using the default and optimized classification thresholds

An average profit of 0.113 per applicant corresponds to an average amount of 1,130 € and, based on 500 applicants, a total average amount of 565,000 €.

The undeniable advantage of using a model is justified by the evidence: 565,000 € versus -225,000 €.

References

1 Wang, C., & Zhuravlev, M. An analysis of profit and customer satisfaction in consumer finance. Case Studies In Business, Industry And Government Statistics, 2(2), pages 147-156, 2014.

2 C. Elkan. The foundations of cost-sensitive learning. In Proceedings of the Seventeenth International Joint Conference on Artificial Intelligence, pages 973-978, 2001.

KNIME Analytics Platform 4.0: Components are for Sharing

KNIME Analytics Platform 4.0: Components are for Sharing | michael.berthold | Thu, 06/27/2019 - 10:00

The most visible difference is the new components. They let you bundle functionality for sharing and reuse.

It may be useful to briefly take a look back at what we had until now: Metanodes and Wrapped Metanodes. The former were a simple way to structure and organize your workflows better by collapsing part of them into little gray boxes. “Wrapping” a Metanode enabled you to turn this collapsed set of nodes into more of a real KNIME node by explicitly defining internal and external dependencies. Being able to use your (or your colleagues’) Wrapped Metanodes within your own workflow allowed you to build upon existing expertise.

In recent releases we have added more and more functionality, allowing you to add more abstractions to Wrapped Metanodes and thereby create sophisticated dialogs and composite views.

With the ability to share not only workflows but also such useful “workflow pieces” with the community using the KNIME Hub, we wanted to make this difference of “workflow organization” vs. “building block creation” more obvious. Plus, “Wrapped Metanodes” is a bit of a mouthful to say, so we’ve grabbed the opportunity to get rid of this terminology. Starting with version 4.0, they are called “Components” and we can talk about building and sharing components. We have also deprecated the Quickform Nodes (which used to handle configuration as well as inputs and were often perceived as overly complex), and have replaced them with dedicated nodes for component configuration, input/output, and visualization widgets.

In short:

Metanodes allow you to organize your workflows better: you can take part of a larger workflow and collapse it into a gray box that hides that part of the workflow’s functionality. It also makes it easier for others to understand what your workflow does, as you can structure it a bit more hierarchically.

Components really are KNIME nodes that you create with a KNIME workflow. They encapsulate and abstract functionality, can have their own dialog, and can have their own sophisticated, interactive views. Components can be reused in your own workflows but also shared with others: via KNIME Server or the KNIME Hub. They can also represent web pages in an Analytical Application deployed to others via KNIME Server.

Configuration, Component I/O, and Widget nodes inside a component allow you to explicitly model what parameters can be adjusted when using a component inside your own workflow (the component’s configuration), which data get passed in (or out) of the component, and which visualizations are used to compose the component’s view.

On the EXAMPLES mount point in KNIME Analytics Platform (Fig. 1) you can already find various types of new components to use: some are simply sophisticated methods for time series analysis, others offer preconfigured visualizations for specific data types, and yet others focus on specific application areas such as cohort analyses for financial data analysis. And, of course, there are components that allow you to build your own Guided Analytics workflows by reusing components with interactive views for data loading, feature selection, and automated machine learning.

Fig. 1: EXAMPLES mount point in KNIME Analytics Platform.

Check out the rest of the release features here, as well as in the detailed changelog. We’re also hosting a webinar on July 25 highlighting all the new stuff! The best way to check it out, though, is by trying it for yourself. Open KNIME and go to File -> Update KNIME to get the latest version, or download and install it here.

Will They Blend: KNIME meets OrientDB

Will They Blend: KNIME meets OrientDB | Redfield | Mon, 07/08/2019 - 10:00

In this blog series we’ll be experimenting with the most interesting blends of data and tools. Whether it’s mixing traditional sources with modern data lakes, open-source devops on the cloud with protected internal legacy tools, SQL with noSQL, web-wisdom-of-the-crowd with in-house handwritten notes, or IoT sensor data with idle chatting, we’re curious to find out: will they blend? Want to find out what happens when IBM Watson meets Google News, Hadoop Hive meets Excel, R meets Python, or MS Word meets MongoDB?

Follow us here and send us your ideas for the next data blending challenge you’d like to see at willtheyblend@knime.com.

Today: SQL and NoSQL - KNIME meets OrientDB. Will They Blend?

Author: Artem Ryasik, Redfield


The Challenge

Today’s challenge is to work through some simple use cases - using ETL to pull data from text files and place them in a database, extracting and manipulating graph data, and visualizing the results in OrientDB, the open source NoSQL database management system. We want to export table data from an SQLite database and a CSV file and place them in a graph storage - mixing together SQL and NoSQL. Will they blend?

The Dataset

Our dataset is ESCO data - the European Standard Classification of Occupations. ESCO is the multilingual classification of European Skills, Competences, Qualifications and Occupations. The information in this dataset covers different areas of the job market. The ESCO classification identifies and categorizes skills, competences, qualifications, and occupations. Similar occupations are combined into related sections defined by special codes. These codes form a taxonomy in which the leaves of the tree are occupations and the branches refer to different job areas. As the ESCO data also contain information about the skills that are necessary for the occupations, it seems pretty obvious that a good way to present this data is as a graph!

OrientDB

We want to use OrientDB for graph data storage. It's a multi-model open source NoSQL database management system that combines the power of graphs with document, key/value, reactive, object-oriented, and geospatial models into a single scalable, high-performance operational database.

So why do we want to use OrientDB even though KNIME supports working with JDBC-compatible databases? The reason is that the concept behind the JDBC driver is not a good match when working with graphs; it is not compatible with graph traversal. But as we want to use a graph to visualize our results after taking the data from an SQLite database and a CSV file, we need a way to connect our data with OrientDB - which is a graph database.

In order to do so, we have developed five OrientDB nodes that use native OrientDB Java API:

Fig. 1 Looking up the OrientDB nodes on the KNIME Hub
  • OrientDB Function: this node is used for calling server functions. OrientDB supports storing user-defined functions written in SQL or JavaScript, which allows the user to execute complex operations and queries without writing the same script every time.
  • OrientDB Query: supports executing idempotent operations, i.e. those that do not change data in the database. This node is therefore used to extract information from the database. OrientDB has its own SQL dialect, which not only supports the basic functionality of any other SQL dialect but also provides special graph traversal algorithms.
  • OrientDB Connection: this node creates a connection to a remote or local OrientDB server. Here the user can specify the location and port of the database and its name, and provide a login and password or use KNIME credentials. Once the connection has been created successfully, it can be propagated to the other OrientDB nodes.
  • OrientDB Execute: this node handles batch requests, a feature that is very handy when you need to upload a large amount of data to the database. The user can either specify the batch script directly or create it from a template.
  • OrientDB Command: enables executing non-idempotent operations, i.e. those that can change data in the database. Consequently, this node is used to insert, update, and delete data. It has three modes that make working with it quite flexible; we will discuss these modes further down in the post. A minimal sketch contrasting a query and a command follows this list.
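
To make the difference between the Query and Command nodes concrete, here is a minimal sketch in OrientDB’s SQL dialect, using class and property names from the ESCO schema defined later in this post; the exact statements used in the workflow may differ. An idempotent query that only reads data, and therefore belongs in a Query node:

select preferredLabel, code from ISCO where code like '2%'

And a non-idempotent command that changes data, and therefore belongs in a Command node:

create vertex ISCO set code = '0', preferredLabel = 'Armed forces occupations'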

Tip: Look up the node you want on the KNIME Hub and then drag it into your workflow to start using it right away.

Fig. 2 Drag the node right into your workflow and start using it immediately.

The Experiment

In our example, we use all these nodes and cover most of their modes and configurations in order to perform ETL from a relational database and a .csv file to graphs. We’ll also show you how you can export these data to KNIME and use network nodes for analysis and visualization.

Topic. Blending SQL and NoSQL; OrientDB integration with KNIME

Challenge. Perform ETL to OrientDB with KNIME, extract data from OrientDB and analyze it with KNIME

Access Mode / Integrated Extensions. OrientDB nodes, KNIME Network Mining Extension, JavaScript Views extension.

OrientDB is a very flexible database and supports three schema modes:

  • Schema-Full: here you have a schema that defines the types of the classes and their properties
  • Schema-Less: in this mode, the users can add as many new attributes as they want, on the fly
  • Schema-Hybrid: here, both the above modes can be combined

It is also possible to use an OOP approach (Object-Oriented-Programming) in order to create different classes of vertices and edges.

In this post, we are going to have three classes of vertices and four classes of edges, inherited from the generic V and E classes for vertices and edges, respectively. We will define a simple schema for these classes in which all properties are of type String.

Here is the script to create the schema:

create class ISCO extends V;
create property ISCO.conceptUri String;
create property ISCO.code String;
create property ISCO.preferredLabel String;
create property ISCO.conceptType String;
create property ISCO.description String;

create class Occupation extends V;
create property Occupation.conceptUri String;
create property Occupation.conceptType String;
create property Occupation.preferredLabel String;
create property Occupation.altLabels String;
create property Occupation.description String;

create class Skill extends V;
create property Skill.conceptType String;
create property Skill.conceptUri String;
create property Skill.skillType String;
create property Skill.reuseLevel String;
create property Skill.preferredLabel String;
create property Skill.altLabels String;
create property Skill.description String;

create class IS_A extends E;
create class REQUIRES_OPTIONAL extends E;
create class REQUIRES_ESSENTIAL extends E;
create class SIMILAR_TO extends E;

We created the following vertex classes:

  • ISCO (International Standard Classification of Occupations) class, which represents the ISCO codes used to categorize occupations by area
  • Occupation class, which defines a specific job in the ISCO classification system
  • Skill class, which represents a skill required by an occupation.

Now let’s talk about edge classes. We have the following:

  • IS_A class, which is used to create the ISCO taxonomy tree and connect it to the occupations
  • REQUIRES_ESSENTIAL and REQUIRES_OPTIONAL classes, which specify the skills that are essential or optional for a specific occupation
  • SIMILAR_TO class, which will be used for defining similar occupations

Fig. 3 Schema for the database

Building the Workflow

Now that we have created a schema, we can start building the first workflow to fill the database. In this workflow (Figure 4) we read data from three tables stored in an SQLite database – ISCO, Occupation and Skill – and one csv file that defines the skill requirements for the occupations. This workflow can be downloaded from the KNIME Hub (see the References section below).

Fig. 4 Overview of the ETL workflow

The SQLite database already contains the ISCO codes, occupations, and skills as tables. To analyze these data directly, you would need to extract all the tables into KNIME and perform several expensive joins between the three tables. These operations take a lot of time and computing resources, so since we want to analyze the connections between several entities, it is more efficient to store such data as a graph. Reading data from an SQLite database is pretty straightforward; we used the Database Reader node, which automatically returns the data as a KNIME table. We can now propagate this table to the OrientDB nodes.

Once the data are read into KNIME tables, we can connect to OrientDB and upload them with the help of the Connection, Command, and Execute nodes. First we need to create the connection. The Connection node settings are pretty simple (Figure 5).

You specify the following:

  • Connection pool size
  • Database URL
  • Port
  • Name
  • Credentials: either a basic login/password pair or KNIME credentials, which can be created for the workflow

Fig. 5 Entering the Connection node settings

Once you have created the connection, it can be propagated to the other nodes.

Uploading ISCO codes

The next step will be to upload ISCO codes. For this task, we will use the Command node, which has three modes (Figure 6):

  • «Use column with command» – the user provides a column containing query strings, which are executed row by row
  • «Use SQL statement» – in this mode the user types the query into a text box. It is also possible to include flow variables in the query body
  • «Write table to class» – the user chooses a vertex class that already exists in the database and then selects which columns should be uploaded to the properties of that class. Column names should be the same as the property names in the class, otherwise new properties are created. This mode can also be used for updating information – to do this, the user has to activate:
    • «Use upsert» mode. The upsert operation checks whether an object with the same unique index already exists in the database. The included columns are used for this search, so the user does not have to specify them separately. If the object exists, the new values from the selected columns are written; if it does not, a new object is created. A hedged sketch of the equivalent OrientDB SQL statement follows this list.
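
As a hedged illustration of what «Write table to class» with «Use upsert» does conceptually, the effect for a single input row is roughly that of the OrientDB SQL statement below. This is only a sketch of the idea, not the literal statement generated by the node, and the values are placeholders:

update ISCO set preferredLabel = 'Armed forces occupations', description = 'ISCO major group 0' upsert where code = '0'

If a vertex with code '0' already exists, its properties are updated; otherwise a new ISCO vertex is created. For the upsert to work reliably, the property used in the where clause should be covered by a unique index.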

Fig. 6 Overview of the Command node. We are using the “Write table to class” mode and creating the ISCO vertices, whose properties are those included in the filter section.

Note: A nice feature of the Command node is that it returns the upload result as JSON. It is also fault tolerant: if a record cannot be uploaded or updated for any reason, the corresponding message appears in the JSON output as well, without causing any runtime errors.

Fig. 7 Execute node settings. We use the «batch_query» column for the body of the script, which contains a single query. «Generate by template» mode is active.

Now we want to take this JSON output and convert it to a KNIME table. These data are then used for creating edges, so that we can build the taxonomy of the ISCO codes, e.g. going from 0 to 01, to 011, to 0110. Once we have extracted the @rids of the created ISCO vertices (Figure 6), we create the edge-creation queries with the help of the String Manipulation node. This query creation is performed in the «Create ISCO Taxonomy» component, which has an output table containing a column with the queries that will be used in the Execute node (Figure 7).

Here we specify the name of the column with the query as the body of the script and activate the mode called «Generate by template» (Figure 8). This way the node automatically puts all the values from the selected column into the script body. We use the default batch size here, as the script is not large, and we do not provide any return construct, since we do not expect any results – our aim is simply to create the edges.
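
For illustration, each value in the «batch_query» column could look roughly like the statement below, which connects a child ISCO code to its parent with an IS_A edge; the Execute node then runs these statements as a batch. The @rid values are hypothetical placeholders for those returned by the Command node, and the actual strings built in the «Create ISCO Taxonomy» component may differ:

create edge IS_A from #25:1 to #25:0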

Fig. 8 Execute node settings. Now the script contains a command with the wildcards from the input table

We perform the same operations to upload the Occupation and Skill vertices:

  • Read and preprocess the files;
  • Use the Command node to upload them;
  • Extract the output to get @rids in order to use them for other edge creation

The next step is to create edges between the Occupation and ISCO vertices. The idea is to connect each Occupation to a leaf of the ISCO tree (the vertices with a 4-digit ISCO code), i.e. to the lowest hierarchy level. For this purpose, we join the ISCO table with the Occupation table by ISCO code. We then use the Execute node again to create the edges, but in this case we write a template of the batch script in the text box, as shown in Figure 8. The string is similar to the one created for the previous Execute node, but uses columns as wildcards for the Occupation @rid and ISCO @rid values.

In order to create edges between the Occupation vertices and the Skill vertices we need to read an additional table. The «Occupation-skill links reader» component takes care of this. This file also contains information about whether the skill is essential or optional for a specific job.

Joining

Now we need to join this table with the output of the Command nodes that uploaded the Occupation and Skill vertices. We join conceptUri with skillUri and occupationUri. These two joins are fairly expensive operations; fortunately, we only need to do them once. After that, we create a script template again with the help of the String Manipulation node and propagate it to the Execute node.

That was the first part of the post – ETL. Now let’s consider some use cases for how these data can be extracted from OrientDB and used for analysis with some KNIME extensions.

Extracting Data from OrientDB and Analyzing with KNIME

This workflow has two branches, along which we are going to extract and visualize the sub-graphs. This graph analysis is handled by the KNIME Network Mining Extension. This extension contains nodes that allow you to create, write, and read graphs; there are also several nodes for graph visualization (https://www.knime.com/network-mining). See Figure 9 where these parts of the workflow are enclosed in blue boxes.

You can download this workflow from the KNIME Hub (see the References section below).

Fig. 9 Overview of the network analysis workflow

First, we connect to the same database with the Connection node.
Next, we create two independent queries to search for the skills that are essential for the «technical director» and «flying director» occupations (see Figure 10). To run these queries, we use the Query node.
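
As a rough sketch, such a query could use OrientDB’s graph traversal syntax to expand the REQUIRES_ESSENTIAL edges of the selected occupation; the exact query used in the workflow is the one shown in Figure 10 and may differ from this example:

select expand(out('REQUIRES_ESSENTIAL')) from Occupation where preferredLabel = 'technical director'

This returns the Skill vertices connected to the «technical director» vertex by REQUIRES_ESSENTIAL edges – exactly the kind of read-only, idempotent operation the Query node is meant for.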

Fig. 10 Query node settings: the query extracts information about the skills for the «technical director» occupation

Configuring this node is similar to configuring the Command and Execute nodes. The Connection node can automatically fetch metadata from the database – classes and their properties – and the user can use this metadata and flow variables as wildcards in the query.

TIP: Choose the schema type depending on the result of the query. There are three main modes:

  • Dynamic schema - general schema mode, which creates an output table according to the query results
  • Class schema - returns a table according to the selected class
  • JSON - returns the result as a table with JSON values

Since we want to extract information from different classes and specify which columns should be included in the result, we are going to use dynamic schema mode.

Preparing Extracted Data for Visualization

In order to prepare the extracted data for visualization, we need to create a network context with the Network Creator node. No settings are needed for this node; we can simply start adding the extracted and processed query results, i.e. the skills essential for the selected occupations.

We can now put them into the network we created using the Object Inserter nodes. In the settings, we need to specify IDs and labels for the nodes and edges. Since we extracted the data from the database in pairwise format, we can specify the start and end vertices, as well as the edge type that connects them, in the settings of the Object Inserter node (see Figure 11).

Fig. 11 The settings of the Object Inserter nodes. Here we specify the IDs and labels for the start and end vertices and the edges between them

To add the result of the second query, all you need to do is propagate the network to another Object Inserter node and provide it with the table from the second query.

Network Visualizations

There are two nodes you can use for network visualizations:

The most important settings in these nodes are the layouts and label setup. Both nodes provide a Plot View with interactive elements, meaning you can change the visualization parameters on the fly, and after that generate an image in PNG and SVG formats.

OrientDB Function Node

The lower branch of the workflow takes the first record from the second query («flying director» skills) and, with the help of a server-defined function, returns all the occupations that require the same skills as «flying director». Server functions can be written in SQL or JavaScript. We are going to use the following JavaScript function:

return orient.getDatabase().query(
    "select expand(outE(?).inV().inE(?).outV()) from " + fromRID,
    edgeType, edgeType);

This function has two arguments – the @rid of the start vertex and the type of edge used for the traversal. The Function node automatically fetches the list of functions defined by the user in the database. Once the user has selected a function, its arguments must be provided. This can be done in two ways – using wildcards from the table columns or flow variables. In the first case the function is called for every record in the column, in the second case just once (see Figure 12).

The node returns the result as a JSON table, since it is impossible to predict the output structure. We therefore need to convert it into a KNIME table, post-process it, and get rid of the duplicate @rids. After that, we take just the first 10 records from the top and, for each of them, create a query that adds an edge of the SIMILAR_TO class connecting it to the original «flying director» vertex. Next, we visualize the result in another network instance (Figure 13).
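
Each of these queries is again a simple edge-creation statement, conceptually along the lines of the following sketch, where the @rids are hypothetical placeholders:

create edge SIMILAR_TO from #27:4 to #27:9

Here #27:4 stands for the «flying director» vertex and #27:9 for one of the ten similar occupations returned by the function.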

Fig. 12 Function node settings. We have just one function saved on the server and use two flow variables as its arguments

Fig. 13 The visualization of the edges of the SIMILAR_TO class created for the «flying director» Occupation vertex (in the center)

 

The Results

In this experiment we wanted to see if we could blend SQL and NoSQL in a workflow so as to be able to extract and manipulate graph data stored in an OrientDB database. We used our OrientDB nodes to do this, making it easy to blend KNIME's capabilities for ETL, data analysis, and visualization with the power of the graph data representation provided by OrientDB.

One of the benefits of using a graph as storage is that many search operations can be reduced to finding a path, i.e. a graph traversal. In the second part of the blog post we considered the use case of finding occupations that require similar skills. Using SQLite or any other relational database, this operation would usually require at least three joins – and these joins would be required every single time we carried out such a search. By moving to a graph-based database, all of this information is already stored as edges. This approach saves a lot of time.

So yes, we were able to blend SQL and NoSQL and KNIME met OrientDB!

Future Work

Graph data representation provides huge benefits in areas such as crime investigation and fraud analysis, social network analysis, life science applications, logistics, and many more. We plan to write more posts about integrating KNIME and OrientDB and to review other interesting use cases. Furthermore, we continue to work on developing the OrientDB nodes for KNIME, so new features will be covered in future articles.

References

  • Summary of Redfield's OrientDB nodes on GitHub
  • Links to these example workflows on the KNIME Hub
  • How to Install the OrientDB nodes
    • Go to the File menu in KNIME Analytics Platform 4.0 and select Install KNIME Extensions
    • In the dialog that appears, expand KNIME Partner Extensions and select OrientDB Connection Nodes

Fig. 14 Installing the OrientDB nodes from the KNIME Partner Extensions site

 

------------------------------------

About Redfield

Redfield is a KNIME Trusted Partner. The company has been fully focused on providing advanced analytics and business intelligence since 2003. We implement KNIME Analytics Platform for our clients and provide training, planning, development, and guidance within this framework. Our technical expertise, advanced processes, and strong commitment enable our customers to achieve acute data-driven insights via superior business intelligence, machine learning, and deep learning. We are based in Stockholm, Sweden.

The KNIME Hub - Share and Collaborate

paolotamag, Mon, 07/15/2019 - 10:00

Authors: Paolo Tamagnini & Christian Dietz

Where To Get Answers to Your Data Science Questions?

When I start a new data science project with KNIME Analytics Platform, there are always a few questions I need to ask myself before I even pull in a single node to my blank workbench.

  • “Can I train this kind of a model in KNIME?”
  • “Which KNIME nodes will I need for this task?”
  • “Has anyone else put together a use case like this with KNIME before?”
  • “Can I download any KNIME workflows as inspiration?” 

To answer all these questions, all I need to do is ask the KNIME Hub. The KNIME Hub has been available at hub.knime.com since March 2019 but many new features have now been added with the release of KNIME Analytics Platform 4.0.

Before we delve into the Hub’s more complex features, let’s have a look at some more basic examples first.

My particular focus at KNIME is Guided Analytics, especially for machine learning automation. So if I am looking for a specific machine learning model, let’s say “XGBoost” or “Logistic Regression”, I can type the name into the search box and the KNIME Hub shows me a list of all the relevant nodes (Fig. 1) to scroll through and inspect. The search covers nodes, extensions, components, and workflows – not only our own KNIME example workflows and components, but also the workflows and components built by you, the community.

Figure 1. The KNIME Hub search results listing the most relevant nodes for the query “XGBoost”

I can use the All, Nodes, and Workflows tabs to narrow down or widen my search. For example, if I am more interested in finding a complete analysis than a single key piece, I might want to know whether someone else has already used KNIME for a certain use case, such as “sentiment analysis” or “fraud detection”. In this case I am not just looking for single nodes but for an entire KNIME workflow. Here, I type in my search term, click the Workflows tab, and the Hub shows me a list of all the workflows that match my query, “Fraud Detection” (Fig. 2).

Figure 2. The KNIME Hub lists workflows matching the query “Fraud Detection”

Those are just simple queries; now let’s see more precisely what the KNIME Hub can do.

Let’s say I want to start a new project where I need to measure the performance of a predictive model. I open hub.knime.com on my web browser and I type “Model Performance” in the search box. I then select the first hit in the list, which is the workflow: “Evaluating Classification Model Performance” (Fig. 3).

Figure 3. The KNIME Hub showing the workflow Maarit uploaded for explaining how you can inspect the performance of a model

The webpage for this workflow shows me a lot of useful information, such as the layout of the workflow with all of its nodes and branches. I can find more information about the author and her KNIME Forum profile picture, review the associated license, and find a short link – handy for quickly sharing the web page with my coworkers.

If I want to use this workflow as a jump-start solution that I could then tailor to my data, I can simply download and open it in my KNIME Analytics Platform. On Windows operating systems, I can even open the workflow directly by clicking the Open Workflow button. This automatically downloads and opens the workflow in KNIME (Fig. 4).

Figure 4: Video showing how - on Windows - you can open the workflow directly by clicking the Open workflow button. This automatically downloads and opens the workflow in KNIME Analytics Platform. On other operating systems, click Download workflow to open it in KNIME Analytics Platform.

However, sometimes you might have questions about how to use the workflow.

For example, I might wonder: “Why was the ROC Curve node used? How does this node work?”

Scrolling through the list of the nodes used in the workflow, below the workflow image, I can select this node and open the web page describing the node (Fig. 5).

Figure 5. The KNIME Hub is able to display all the info available about a node via a web page

The node description, the same description you’ll find in KNIME Analytics Platform, is shown, along with linked external resources such as academic papers, blog posts, and videos.

So, in addition to reading the technical information about the specific node (its ports, functions, implemented algorithm, ...), I can scroll to see a list of workflows where it is being used and also see what other nodes are used in combination with it. I can see in the list that the ROC Curve is used in the workflow “Evaluating Classification Model Performance”.

No matter how much information is available, sometimes I might have more questions and need to talk to the workflow author directly, in this case Maarit, to ask a particular question. I can do this by scrolling down to the bottom of the page, where I can comment below the workflow to start a discussion and ask my questions (Fig. 6). The discussion will also be referenced on the KNIME Forum where experienced KNIME users will find and answer your questions.

Figure 6. Two users (Maarit and I) discussing the workflow “Evaluating Classification Model Performance” directly on the KNIME Hub

There are still more KNIME Hub features you need to see.

Let’s say I opened a workflow and now want to use the ROC Curve node. I go to my Node Repository in KNIME and look for the node, but can’t find it. This is probably because my installation lacks the required KNIME extension. We have noticed over the past years that finding the right extension for a given node and installing it can be cumbersome and time consuming.

You can now drag nodes from the KNIME Hub instead. In fact you can drag any node image displayed on KNIME Hub to your KNIME Analytics Platform via the web browser. The node is added to the workflow just like it would have been from the Node Repository. If an extension is required, this is automatically detected and a window appears asking you to install it (Fig. 7).

Figure 7. This video shows how you can simply drag and drop a node from the KNIME Hub to KNIME Analytics Platform to use it on your own data. Any node image you find on hub.knime.com can be used to drag and drop the pictured node

The KNIME Hub for Collaboration

Often when you're working on a big project you end up collaborating with different data scientists. The data science team needs to agree on a number of things; the KNIME Hub gives you the opportunity to share the nodes and workflows you are proposing to the rest of the team via simple shortened links. And now with the KNIME Analytics Platform 4.0 release you can also share your own workflows and components. 

Sharing your own workflow on the KNIME Hub is quite easy. First, you need to update KNIME Analytics Platform and connect to the new My-KNIME-Hub mount point in your KNIME Explorer. Double click My-KNIME-Hub and a new dialog appears in your browser window. You can now log in, using the same account you already use on the KNIME Forum, or register a new KNIME.com account. After logging in, you can upload your workflows to the KNIME Hub by simply dragging and dropping them from your LOCAL workspace to My-KNIME-Hub, just like the video shows (Fig. 8). Right click an uploaded workflow and select “Open > in KNIME Hub” to see the associated public web page.

Figure 8. This video shows how you can share your workflow via the KNIME Hub. Authenticate first with your KNIME account or register. You can then simply drag and drop your workflows via the new My-KNIME-Hub mountpoint

To edit things like the title, description, keywords for the search engine, and external links, you need to change the workflow’s metadata before uploading. To do so, select the workflow you want to edit in the KNIME Explorer. You’ll see the Description panel displaying the current information. You can edit this so-called metadata by simply interacting with the panel, typing in the information and saving it. Once the workflow is updated in your My-KNIME-Hub personal space, you will see the web page showing the same information you just typed in (Fig. 9).

Figure 9: This video shows how to edit the metadata of a workflow via the Description panel. The information is then displayed on the public workflow web page on KNIME Hub

In KNIME Analytics Platform 4.0, you might notice an additional box on the Welcome page. This is the KNIME Hub Search panel. Access this panel via “View > KNIME Hub Search” and search for nodes and workflows directly from KNIME Analytics Platform (Fig. 10).

Figure 10: This video shows how to query for nodes and workflows via the KNIME Hub Search panel directly from KNIME Analytics Platform.

Conclusion

Stay tuned, as more features of the KNIME Hub will come out in the coming months! In the meantime, go ahead: search, download, share, and comment on your data science projects on the KNIME Hub!
