Channel: KNIME news, usage, and development

Setting up the KNIME Python extension. Revisited for Python 3.0 and 2.0


As part of the v3.4 release of KNIME Analytics Platform, we rewrote the Python extensions and added support for Python 3 as well as Python 2. Aside from the Python 3 support, the new nodes aren’t terribly different from a user perspective, but the changes to the backend give us more flexibility for future improvements to the integration. This blog post provides some advice on how to set up a Python environment that will work well with KNIME as well as how to tell KNIME about that environment.

The Python Environment

We recommend using the Anaconda Python distribution from Continuum Analytics. There are many reasons to like Anaconda, but the important things here are that it can be installed without administrator rights, supports all three major operating systems, and provides all of the packages needed for working with KNIME “out of the box”.

Get started by installing Anaconda from the link above. You’ll need to choose which version of Python you prefer (we recommend that you use Python 3 if possible) but this just affects your default Python environment; you can create environments with other Python versions without doing a new install. For example, if I install Anaconda3 I can still create Python 2 environments.
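If you create a dedicated environment for KNIME, it is worth checking up front that the packages the Python nodes rely on can actually be imported from it. The following is a minimal sketch of such a check, assuming pandas and NumPy are the packages you need (adjust the list to your own setup); it is an illustration, not part of the official setup instructions.

    # Minimal check: run this with the interpreter of the environment you plan
    # to point KNIME at. The package list below is an assumption; extend it as needed.
    import sys

    required = ["pandas", "numpy"]

    print("Python version:", sys.version)
    for name in required:
        try:
            module = __import__(name)
            print(name, module.__version__)
        except ImportError:
            print(name, "is MISSING - install it in this environment before using it with KNIME")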



The KNIME Model Process Factory

Posted by knime_admin on Mon, 05/08/2017 - 11:06

Authors: Iris Adä & Phil Winters

The benefits of using predictive analytics are now a given. In addition, the data scientist who delivers them is highly regarded, but our daily work is full of contrasts. On the one hand, you can work with data, tools and techniques to really dive in and understand data and what it can do for you. On the other hand, there is usually quite a bit of administrative work around accessing data, massaging data and then putting that new insight into production - and keeping it there.

In fact, many surveys say that at least 80% of any data science project is associated with those administrative tasks. One popular urban legend says that, within a commercial organization trying to leverage analytics, the full-time job of one data scientist can be described as building and maintaining a maximum of four (yes, 4) models in production - regardless of the brilliance of the toolset used. There is a desperate need to automate and scale the modelling process, not just because it would be good for business (after all, if you could use 29,000 models instead of just 4, you would want to!) but also because otherwise we data scientists are in for a tedious life.

At the recent KNIME Spring Summit in Berlin, one of the most well received presentations was that of the KNIME Model Process Factory, designed to provide you with a flexible, extensible and scalable application for running and monitoring very large numbers of model processes in an efficient way.

The KNIME Model Factory is composed of a white paper, an overall workflow, tables that manage all activities, and a series of workflows, examples and data for learning to use the Factory.

Video 1. The Model Factory in Action! Here is the orchestrating workflow triggering dependent workflows during execution.

A few highlights include:

  • Workflow Orchestration. A workflow acts as the art director of the whole process by organizing, monitoring, triggering, and automating - that is, by orchestrating - all workflows involved in the model process factory.
  • Model Monitoring. The KNIME Model Factory includes a number of workflows for initializing, loading, transforming, modeling, scoring, evaluating, deploying, monitoring, and retraining data analytics models.
  • Reuse Best Practices. The workflows and the whitepaper also show common best practices for packaging sub-workflows for quick, controlled, and safe reuse by other workflows.
  • Call Remote Workflows. The whole orchestration factory relies heavily on calling remote workflows; that is on the Call Remote Workflow node.
  • Triggering Model Retraining. An important part of model monitoring is knowing exactly when to start the retraining procedure. A few workflows in the KNIME Model Process Factory are dedicated to checking whether model performance has fallen below a specified accuracy threshold and to triggering retraining, if needed.
  • Full Working Examples. As usual, we provide full working example workflows - including data - to show how to handle typical modelling process tasks and conditions.

Anyone using KNIME can take advantage of the KNIME Model Factory. It is available via the KNIME EXAMPLES server under 50_Applications/26_Model_Process_Management* and can run on KNIME Analytics Platform, which means it is open source and free. Major benefits can be realized in terms of automation and interfacing by using the KNIME Model Factory with KNIME Server.

There is a tremendous amount of information available - far too much for a blog entry - and we would encourage you to look at the white paper.


* The link will open the workflow directly in KNIME Analytics Platform (requirements: Windows; KNIME Analytics Platform must be installed with the Installer version 3.2.0 or higher)

Data Chef ETL Battles. What can be prepared with today’s data?

Posted by rs on Mon, 05/22/2017 - 11:14

Do you remember the Iron Chef battles?

It was a televised series of cook-offs in which famous chefs rolled up their sleeves to compete in making the perfect dish. Based on a set theme, this involved using all their experience, creativity and imagination to transform sometimes questionable ingredients into the ultimate meal.

Hey, isn’t that just like data transformation? Or data blending, or data manipulation, or ETL, or whatever new name is trending now? In this new blog series requested by popular vote, we will ask two data chefs to use all their knowledge and creativity to compete in extracting a given data set's most useful “flavors” via reductions, aggregations, measures, KPIs, and coordinate transformations. Delicious!

Want to find out how to prepare the ingredients for a delicious data dish by aggregating financial transactions, filtering out uninformative features or extracting the essence of the customer journey? Follow us here and send us your own ideas for the “Data Chef Battles” at datachef@knime.com.

Ingredient Theme: Customer Transactions. Money vs. Loyalty.

Author: Rosaria Silipo
Data Chefs: Haruto and Momoka

Ingredient Theme: Customer Transactions

Today’s dataset is a classic customer transactions dataset. It is a small subset of a bigger dataset that contains all of the contracts concluded with 9 customers between 2008 and now.

The business we are analyzing is a subscription-based business. The term “contracts” refers to 1-year subscriptions for 4 different company products.

Customers are identified by a unique customer key (“Cust_ID”), products by a unique product key (“product”), and transactions by a unique transaction key (“Contract ID”). Each row in the dataset represents a 1-year subscription contract, with the buying customer, the bought product, the number of product items, the amount paid, the payment means (card or not card), the subscription start and end date, and the customer’s country of residence.

The subscription start and end dates usually span one year, which is the standard duration for a subscription. However, a customer can hold multiple subscriptions for different products at the same time, with license coverages overlapping in time.

What could we extract from these data? Finding out more about customer habits would be useful. What kind of information can we collect from the contracts that would describe the customer? Let’s see what today’s data chefs are able to prepare!

Topic. Customer Intelligence.

Challenge. From raw transactions, calculate each customer's total payment amount and loyalty index.

Methods. Aggregations and Time Intervals.

Data Manipulation Nodes. GroupBy, Pivoting, Time Difference nodes.

The Competition

There are many different ways to describe a customer based on their series of transactions. Some describe the customer's buying power, others their loyalty over time, and others their buying behavior. All approaches are valid. They simply produce different "flavors" of information, which can be combined to get the full picture of the customer.

Data Chef Haruto: Customer Buying Power

Haruto has decided that for this experiment, money is the most informative feature. The “amount” column contains information about money for each contract data row; “amount” is the price paid by the customer for that subscription.

In figure 1, the upper branch of the workflow, enclosed in the square labelled "Money", is from Data Chef Haruto.

Buying Power as the total amount paid throughout the years

The simplest and most direct way to describe a customer's buying power is to just sum up all values in the "amount" column. This gives us the full monetary worth of the customer from the first contract to today's date. The isolated GroupBy node at the top of the branch performs exactly this aggregation, grouping on "Cust_ID" and calculating sum("amount") for each detected group.

Buying Power as the total amount paid year after year

A second, perhaps more sophisticated, approach is to calculate the total amount of money generated each year. For this, Haruto used a Date Field Extractor node to extract the year from the contract date. Then he calculated the sum of the values in the "amount" column for each year and each "Cust_ID". Here, this aggregation is performed by a Pivoting node rather than a GroupBy node. The Pivoting node produces the same aggregation (sum over groups) as the GroupBy node (see the sketch after this list), but:

  • Groups are identified by values in at least 2 columns
  • The output data table is organized in a matrix-like style, showing values from one or more groups as column headers – in our case the years – and values from the other group(s) as RowID – in our case the Cust_IDs.
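To make the difference between the two aggregations concrete, here is a minimal pandas sketch of the same logic outside KNIME. The column names Cust_ID, year and amount are assumptions for illustration only; the workflow itself uses the GroupBy and Pivoting nodes.

    import pandas as pd

    # Toy transaction table; column names and values are assumptions for illustration.
    trx = pd.DataFrame({
        "Cust_ID": ["Cust_1", "Cust_1", "Cust_3", "Cust_3"],
        "year":    [2009,     2010,     2009,     2009],
        "amount":  [100.0,    250.0,    400.0,    150.0],
    })

    # GroupBy-style aggregation: one total per customer (overall buying power).
    total_per_customer = trx.groupby("Cust_ID")["amount"].sum()

    # Pivoting-style aggregation: customers as rows, years as columns (buying power per year).
    # Missing combinations become NaN, which we replace with 0, like the Missing Value node does.
    per_year = trx.pivot_table(index="Cust_ID", columns="year",
                               values="amount", aggfunc="sum").fillna(0)

    print(total_per_customer)
    print(per_year)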

The advantage of this second approach is the additional detail it provides about customer spending behavior over time.

The two resulting features can be joined on Cust_ID with a Joiner node. The final data table describes each customer through the total paid amount for all of the years and the amount paid year after year.

The Pivoting node will necessarily generate empty data cells for those years in which a customer did not buy any of the company's products. In this case, though, a missing value corresponds to a money value of 0. We can fix that using a Missing Value node to replace all empty data cells with 0.

Figure 1. Final Workflow 02_Customer_Trx_Money_vs_Loyalty. The upper part, named "Money", describes customers' buying power. The lower part, labelled "Loyalty", associates a loyalty index with each customer. This workflow is available on the KNIME EXAMPLES Server under 02_ETL_Data_Manipulation/06_Date_and_Time_Manipulation/02_Customers_Trx_Money_vs_Loyalty*



Data Chef Momoka: Customer Loyalty

Momoka has a more idealistic view of the world and decided to describe the customers in terms of their loyalty rather than money. Again, there are many ways to spell “loyalty”.

The lower branch of the workflow, enclosed in the square labelled "Loyalty", is provided by Data Chef Momoka.

Loyalty as the number of days between the first and the last subscription start date

The easiest way to describe loyalty is probably by the number of days the customer has held a subscription. This number of days can be calculated in a number of different ways.

  • As number of days between the start of first and start of last subscription. This could be achieved by calculating the range in column “start_date” with a GroupBy node. However, this does not cover the full extension of the last subscription.
  • As number of days between start of first subscription and end of last subscription. This could be obtained by sorting the data by Cust_ID and “start_date” and by extracting the first “start_date” and the last “end_date” for each customer; then by calculating the number of days in between with a Time Difference node. However, we must remember that this approach is not bullet-proof either, as it does not take into account possible periods of time without any subscription.
  • As total number of days covered by subscriptions on a given product. In this case, a GroupBy node grouping on “Cust_ID” and “product” and calculating the number of days between first “start_date” and last “end_date”, as described above, could have worked. However, this would not take into account subscriptions to different products overlapping in time.
  • As the total number of days covered by subscriptions to one product or the other. This leads to a more detailed time alignment procedure, contained in the Time Alignment metanode and shown in figure 2. The Time Alignment metanode is located in the second branch of the Loyalty part of the workflow. A Time Difference node follows the Time Alignment metanode and calculates the number of days between the "start_date" and the "end_date" of each of these coverage periods for each customer. The final GroupBy node sums up all those numbers of days for each customer. This feature is named "effective #days".

Figure 2. Content of the Time Alignment metanode. For each "Cust_ID", the number of days between the current subscription/row "start_date" and the previous subscription/row "end_date" is calculated. If this number of days is > 0, the current subscription/row is just an extension of the previous one. If it is < 0, the current row marks the start of a new subscription period.



The absolute number of days is already an interesting loyalty feature. Momoka, though, decided to express it as a ratio in [0,1] of the effective number of days over the total number of days between the very first subscription and the current date (Feb 1, 2017). To do that, the GroupBy and Time Difference nodes in the upper branch of the "Loyalty" part of the workflow in figure 1 calculate the total number of days between the "start_date" of the earliest subscription in the data set and the current date (Feb 1, 2017). The loyalty index is then obtained as:

“effective #days” / “total # days”
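As an illustration of the same logic outside KNIME, the sketch below merges overlapping subscription periods per customer and computes the loyalty ratio against the reference date of Feb 1, 2017. The data structure and values are assumptions; the workflow performs these steps with the Time Alignment metanode, the Time Difference node and the GroupBy node.

    from datetime import date

    # Toy subscription periods per customer as (start_date, end_date) tuples; assumed data.
    subscriptions = {
        "Cust_1": [(date(2015, 1, 1), date(2016, 1, 1)),
                   (date(2015, 6, 1), date(2016, 6, 1)),   # overlaps the first period
                   (date(2016, 9, 1), date(2017, 9, 1))],  # starts after a gap
    }
    REFERENCE_DATE = date(2017, 2, 1)

    def loyalty_index(periods, reference=REFERENCE_DATE):
        """Ratio of effectively covered days to total days since the first subscription."""
        periods = sorted(periods)
        merged = [list(periods[0])]
        for start, end in periods[1:]:
            if start <= merged[-1][1]:                # overlap: extend the current coverage period
                merged[-1][1] = max(merged[-1][1], end)
            else:                                     # gap: a new coverage period starts
                merged.append([start, end])
        # Cap coverage at the reference date so the ratio stays in [0, 1].
        effective_days = sum((min(end, reference) - start).days for start, end in merged)
        total_days = (reference - periods[0][0]).days
        return effective_days / total_days

    for cust, periods in subscriptions.items():
        print(cust, round(loyalty_index(periods), 2))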

The final workflow can be admired in figure 1 and can be found on the EXAMPLES server in: 02_ETL_Data_Manipulation/06_Date_and_Time_Manipulation/02_Customers_Trx_Money_vs_Loyalty*

The Jury

The final part of the workflow joins the money-describing features with the loyalty index to feed a JavaScript Scatter Plot node.

The interactive scatter plot visualizes all 9 customers in a money vs. loyalty space. On the y-axis we find the loyalty index and on the x-axis the total amount of money derived from the customer's contracts across all years. Here we manually selected the top 2 customers, which happen to have Cust_IDs "Cust_1" and "Cust_3". The following 2 nodes automatically extract the data rows for these two selected customers. The Radar Plot Appender node at the end produces a radar plot of the amount of money paid each year by each customer.

In the resulting table (Fig. 3) we see that "Cust_3" has bought subscriptions for more than $6,000 over the years, mainly between 2009 and 2012. Therefore, the corresponding loyalty index is only 0.5. On the other hand, "Cust_1" has bought subscriptions for less money, yet spread them more evenly across the years, producing a higher loyalty index of 0.67 (Fig. 3).

Figure 3. Resulting Data Table, where selected customers are described in terms of loyalty, buying power, and purchase distribution across all years.

We have reached the end of this competition. Congratulations to both our data chefs for wrangling such interesting features from the raw data ingredients! Oishii!

Coming next …

If you enjoyed this, please share it generously and let us know your ideas for future data preparations.

We’re looking forward to the next data chef battle. The theme ingredient there will be a time series dataset describing energy consumption.

 


* The link will open the workflow directly in KNIME Analytics Platform (requirements: Windows; KNIME Analytics Platform must be installed with the Installer version 3.2.0 or higher)


The Wisdom of the KNIME Crowd: the KNIME Workflow Coach

Posted by admin on Wed, 06/07/2017 - 10:23

Everyone who has heard of KNIME Analytics Platform knows that KNIME has nodes. Thousands of them! The resources under the Learning Hub, as well as the hundreds of public examples within KNIME Analytics Platform, are all designed to get you up to speed with KNIME and its nodes. But those who know best how to use KNIME nodes are KNIME users themselves. What if we could capture all their insight and experience in understanding which nodes to use when, and in what order, and give you a recommendation? Well, that is exactly what the KNIME Workflow Coach does.

It gathers the usage data of all KNIME users who have registered to have their data collected anonymously and makes recommendations to you, the user, based on that data. Since a picture is worth a thousand words, let's take a brief look at the Workflow Coach in the following short video:

Video. The KNIME Workflow Coach in Action

In this video, we open the Workflow Coach and use it to build a simple machine learning workflow. And indeed, since KNIME users know how to build machine learning workflows, the wisdom of the community gives us the correct sequence of nodes we need to be successful.

In the second example, we see how the Workflow Coach draws on community experience to recommend the sequence of steps required for a more complex task such as text mining.

All users of KNIME have access to the Workflow Coach and the wisdom of the KNIME crowd.

And organizations with KNIME Server have an extra advantage. You can set up the Workflow Coach to use the consolidated usage wisdom of your internal KNIME experts on KNIME Server rather than that of the KNIME population at large.

So next time you are working in KNIME, try out the KNIME Workflow Coach!


Topic Extraction: Optimizing the Number of Topics with the Elbow Method

Posted by knime_admin on Mon, 06/19/2017 - 10:56

Authors: Andisa Dewi and Kilian Thiel

In a social networking era where a massive amount of unstructured data is generated every day, unsupervised topic modeling has become a very important task in the field of text mining. Topic modeling allows you to quickly summarize a set of documents to see which topics appear often; at that point, human input can be helpful to make sense of the topic content. As in any other unsupervised learning approach, determining the optimal number of topics in a dataset is a frequent problem in the topic modeling field.

In this blog post we will show a step-by-step example of how to determine the optimal number of topics using clustering and how to extract the topics from a collection of text documents, using the KNIME Text Processing extension.

You might have read one or more blog posts from the Will They Blend series. This blog post series discussed blending data from varied data sources. In this article today, we’re going to turn that idea on its head. We collected 190 documents from RSS feeds of news websites and blogs for one day (06.01.2017). We know that the documents are divided largely into two categories, sports and barbeques. In this blog post we want to separate the sports documents and barbeque documents by topic and determine which topics were most popular on that particular day. So, the question is will they unblend?

Note that the workflow associated with this post is available for download in the attachment section of this post, as well as on the KNIME Example Server.

Figure 1. Topic extraction workflow. This workflow can be downloaded from the KNIME EXAMPLES Server under 08_Other_Analytics_Types/01_Text_Processing/17_Topic_Extraction_with_the_Elbow_Method*



Reading Text Data

The first step starts with a "Table Reader" node, reading a table that already contains news texts from various news websites in text document format. An alternative would be to fetch the news directly from user-defined news websites by feeding their RSS URLs to the “RSS Feed Reader” node, which would then output the news text directly in text document format. The output of the "Table Reader" node is a data table with one column containing the document cells.

Figure 2. Reading textual data from file or from RSS News feeds.

Text Preprocessing I

The raw data is subsequently preprocessed by various nodes provided by the KNIME Text Processing extension. All of the preprocessing steps are packed in the “Preprocessing” wrapped meta node.

First, all of the words in the documents are POS tagged using the “Stanford tagger” node. Then a “Stanford Lemmatizer”(*) is applied to extract the lemma of each word. After that, punctuation marks are removed by the "Punctuation Erasure" node; numbers and stop words are filtered, and all terms are converted to lowercase.

(*) A lemmatizer has a similar function to a stemmer, namely to reduce words to their base form. However, a lemmatizer goes further and uses morphological analysis to remove the inflectional ending or plural form of a word and convert it to its base form. For example, the lemma of the word "saw" can be either "see" or "saw", depending on the position of the word in the sentence and on whether it is a verb or a noun.
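As a small illustration of this difference, the sketch below uses NLTK's WordNet lemmatizer (a stand-in assumption; the workflow itself uses the Stanford Lemmatizer node) to show how the lemma depends on the part of speech.

    # POS-aware lemmatization sketch with NLTK, used here only to illustrate the idea.
    import nltk
    from nltk.stem import WordNetLemmatizer

    nltk.download("wordnet", quiet=True)   # one-time download of the WordNet data

    lemmatizer = WordNetLemmatizer()

    print(lemmatizer.lemmatize("saw", pos="v"))        # as a verb: 'see'
    print(lemmatizer.lemmatize("saw", pos="n"))        # as a noun: 'saw'
    print(lemmatizer.lemmatize("barbecues", pos="n"))  # plural reduced to 'barbecue'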

Figure 3. Basic text preprocessing: lemmatization, filtering of numbers, stop words etc. and case conversion.

Finding the optimal number of topics

Once the data have been cleaned and filtered, the "Topic Extractor" node can be applied to the documents. This node uses an implementation of the LDA (Latent Dirichlet Allocation) model, which requires the user to define beforehand the number of topics that should be extracted. This would be relatively easy if the user already knew how many topics they wanted to extract from the data. But oftentimes, especially with unstructured data such as text, it can be quite hard to estimate upfront how many topics there are.

There are a few methods you can choose from to determine what a good number of topics would be. In this workflow, we use the "Elbow" method to cluster the data and find the optimal number of clusters. The assumption is that the optimal number of clusters corresponds to a good number of topics.

Text Preprocessing II

Figure 4. Filtering of words based on frequency in corpus.

To use clustering, we need to preprocess the data again, this time by extracting the terms that we want to use as features for the document vectors. All of the preprocessing steps are packed into another wrapped meta node called “Preprocessing”. Basically the steps involve creating a bag of words (BoW) of the input data. It can be useful to remove terms that occur very rarely in the whole document collection as they don’t have a huge impact on the feature space, especially if the dimension of the BoW is very large. The document vectors are created next. For more details on how these steps are performed, please have a look at this post on Sentiment Analysis.

Note: After creating the document vectors, if the dimension of the feature space is still too large, it can be useful to apply PCA to reduce the dimensionality while keeping the loss of important information minimal.
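A minimal sketch of such a reduction with scikit-learn is shown below; the array of document vectors is randomly generated here purely for illustration, and the 95% variance target is an assumption.

    # Reduce high-dimensional document vectors with PCA while retaining most of the variance.
    import numpy as np
    from sklearn.decomposition import PCA

    rng = np.random.default_rng(0)
    doc_vectors = rng.random((190, 2000))    # toy stand-in: 190 documents, 2000 BoW features

    pca = PCA(n_components=0.95)             # keep enough components to explain 95% of the variance
    reduced = pca.fit_transform(doc_vectors)

    print(doc_vectors.shape, "->", reduced.shape)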

The Elbow Method

Figure 5. Loop to compute k-means clusterings based on different values of k.



Now that we have converted our data into document vectors, we can start to cluster them using the "k-Means" node. The idea of the Elbow method is basically to run k-means clustering on the input data for a range of values of the number of clusters k (e.g. from 1 to 20), and for each value of k to calculate the within-cluster sum of squared errors (SSE), i.e. the sum of the squared distances of all data points to their respective cluster centers. Then, the SSE value for each k is plotted in a scatter chart. The best number of clusters is the one at which the SSE drops sharply, producing an angle, the "elbow", in the plot.

Figure 6. Calculation of sum of squared errors for a clustering with a given k.

As we already mentioned, the "k-Means" node is applied to cluster the data into k clusters. The node has two output data ports: the first port contains a data table with all the document vectors and the cluster IDs to which they are assigned, and the second port contains the vectors of all the cluster centers. Next, we use the "Joiner" node to join both output data tables on their cluster IDs. The goal is to get both the document vector and the vector of its respective cluster center in each row, so as to make the calculation easier. After that, the "Java Snippet" node is used to calculate the squared distance between the vector and its cluster center vector in each row. The SSE value for this particular number of clusters k is the sum of all the squared distances, calculated by the "GroupBy" node.

In order to look for an elbow in a scatter chart, this calculation is run over a range of values of k, in our case 1 to 20. We achieve this by using loop nodes. The k value of the current iteration is provided as a flow variable and controls the "k" setting of the k-Means node.

Figure 7. Plot of the sum of squared errors for all clusterings.

The "Scatter Plot" node is used to generate a scatter chart of the number of clusters k against the SSE value. You will see that the error decreases as k gets larger. This makes sense, because the more clusters there are, the smaller the distances between the data points and their cluster centers. The idea of the Elbow method is to choose the number of clusters at which the SSE decreases abruptly, which produces a so-called "elbow" in the graph. In the plot above you can see that the first sharp drop is after k=6. Therefore, a choice of 7 clusters would appear to be optimal. The optimal number of clusters is determined automatically in the workflow by taking the differences between consecutive SSE values and sorting the step with the largest difference to the top. The related k is then provided as a flow variable to the Topic Extractor node.
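The same loop can be sketched in a few lines of scikit-learn, where the within-cluster sum of squared errors is exposed as the inertia_ attribute of a fitted KMeans model and the largest drop between consecutive SSE values suggests the number of clusters. This is only an illustration of the method; the workflow itself uses the k-Means, Java Snippet and GroupBy nodes inside a loop, and the toy data below is an assumption.

    # Minimal sketch of the Elbow method with scikit-learn.
    import numpy as np
    from sklearn.cluster import KMeans

    rng = np.random.default_rng(42)
    doc_vectors = rng.random((190, 50))                # toy stand-in for the document vectors

    sse = {}
    for k in range(1, 21):                             # k = 1 ... 20, as in the loop nodes
        km = KMeans(n_clusters=k, n_init=10, random_state=42).fit(doc_vectors)
        sse[k] = km.inertia_                           # within-cluster sum of squared errors

    # Pick the k right after the largest drop in SSE (the "elbow").
    drops = {k: sse[k] - sse[k + 1] for k in range(1, 20)}
    best_k = max(drops, key=drops.get) + 1
    print("suggested number of clusters/topics:", best_k)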

Note that the Elbow method is heuristic and might not always work for all data sets. If there is not a clear elbow to be found in the plot, try using a different approach, for example the Silhouette Coefficient.

Extracting the Topics

Once we have determined the possible optimal number of topics for the documents, the Topic Extractor node can be executed. The node assigns a topic to each document and generates keywords for each topic (you can specify just how many keywords should be generated for each topic in the node dialog). The number of topics to extract is provided via a flow variable.

Figure 8. Topic extraction with the Topic Extractor node.

The extracted words for each topic are listed below:

  • Topic 1: lions, tour, continue, zealand, world, round, player, cricket, gatland, england
  • Topic 2: recipe, barbecue, bbq, chicken, lamb, kamado, pork, sauce, smoked, easy
  • Topic 3: player, sport, tennis, court, margaret, derby, murray, mangan, world, declaration
  • Topic 4: league, cup, manchester, club, season, united, city, goal, team, juventus
  • Topic 5: bbq, sauce, recipe, turkey, brine, chicken, thanksgiving, breast, hunky, mad
  • Topic 6: football, wenger, league, arsène, manager, arsenal, woods, tiger, fan, david
  • Topic 7: home, nba, cavaliers, lebron, final, james, los, golden, mets, team

Topic 1 is clearly about English sports. Topic 2 is about smoked bbq with chicken, lamb and pork. Topic 3 is about tennis, and Topics 4 and 6 are about football (soccer). Topic 5 is again about bbq, but focuses more on turkey and Thanksgiving. Topic 7 seems to be about American sports, mainly NBA basketball. We have therefore found two topics about barbeques and five topics about sports.

We can now apply the “Tag Cloud” node to visualize the topics' most popular terms in an appealing manner. To do that, the keywords/terms generated by the “Topic Extractor” node have to be counted by their occurrences over the whole corpus. The steps are illustrated below.

Figure 9. Counting of extracted topic words in corpus.

First of all, the “Dictionary Tagger” node is applied to tag only the topic terms. The goal is to filter out all terms that are not related to topics. This is done by using the “Modifiable Term Filter” node, which keeps those terms that have been tagged before and thus set to unmodifiable. After that, the BoW is created and the occurrences of each term are counted using the “GroupBy” node. Then for each topic a tag cloud is created, and the number of occurrences is reflected in the font size of each term.

The figures below show the tag clouds for the topics English Sports, Thanksgiving barbeque, and football.

Figure 10. Tag cloud above showing topic English Sports.

Figure 11. Tag cloud above showing the Thanksgiving barbeque topic.

Figure 12. Tag cloud above showing the football topic.

It can be the case that the most popular terms are not really keywords, but rather some sort of stop words. To avoid this, the “IDF” node was used to calculate the IDF value for each term. All terms with low IDF values have been filtered out to make sure that only important terms will represent a particular topic.
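The idea is that idf(t) = log(N / df(t)) is close to zero for terms that appear in nearly every document. The sketch below shows this filter on a toy corpus; the documents and the threshold are assumptions for illustration, and the workflow itself uses the IDF node.

    # IDF-based filtering of uninformative terms on a toy corpus.
    import math

    documents = [
        ["bbq", "sauce", "recipe", "the"],
        ["league", "cup", "goal", "the"],
        ["tennis", "court", "player", "the"],
    ]

    n_docs = len(documents)
    vocabulary = {term for doc in documents for term in doc}

    # idf(t) = log(N / df(t)): terms appearing in (almost) every document score low.
    idf = {t: math.log(n_docs / sum(t in doc for doc in documents)) for t in vocabulary}

    threshold = 0.1
    keywords = {t for t, value in idf.items() if value > threshold}
    print(sorted(keywords))   # "the" is filtered out, the topic words remain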

The result of our workflow is satisfying as we have managed to cluster most of the data correctly and matched the extracted topics with the actual topics in the data.

The workflow used for this blog post can be downloaded from the KNIME EXAMPLES server under 08_Other_Analytics_Types/01_Text_Processing/17_Topic_Extraction_with_the_Elbow_Method*.

 


* The link will open the workflow directly in KNIME Analytics Platform (requirements: Windows; KNIME Analytics Platform must be installed with the Installer version 3.2.0 or higher)

Will They Blend? Experiments in Data & Tool Blending. Today: OCR on Xerox Copies meets Semantic Web. Have Evolutionary Theories changed?

Posted by Dario Cannone on Mon, 07/03/2017 - 11:04

In this blog series we’ll be experimenting with the most interesting blends of data and tools. Whether it’s mixing traditional sources with modern data lakes, open-source devops on the cloud with protected internal legacy tools, SQL with noSQL, web-wisdom-of-the-crowd with in-house handwritten notes, or IoT sensor data with idle chatting, we’re curious to find out: will they blend? Want to find out what happens when IBM Watson meets Google News, Hadoop Hive meets Excel, R meets Python, or MS Word meets MongoDB?

Follow us here and send us your ideas for the next data blending challenge you’d like to see at willtheyblend@knime.com.

Today: OCR on Xerox Copies meets Semantic Web. Have Evolutionary Theories changed?

Author: Dario Cannone, Data Analyst at Miriade S.p.A., Italy

The Challenge

Scientific theories are not static over time. As more research studies are completed, new concepts are introduced, new terms are created and new techniques are invented. This is of course also true for evolutionary theories. That is, evolutionary theories themselves have evolved over time!

In today's challenge we are going to show how the theory of evolution has evolved from Darwin's first formulation to the most recent discoveries.

The foundation stone of evolutionary biology is considered to be the book "On the Origin of Species" (1859) by Charles Darwin, which contains the first revolutionary formulation of the theory of evolutionary biology. Even though the book produced a revolution in the approach to species evolution at the time, many of the concepts illustrated there might now seem incomplete or even obsolete. Notice that it was published in 1859, when nothing was known about DNA and very little about genetics.

In the early 20th century, indeed, the Modern Synthesis theory reconciled some aspects of Darwin’s theory with more recent research findings on evolution.

The goal of this blog post is to represent the original theory of evolution as well as the Modern Synthesis theory by means of their main keywords. Changes in the used keywords will reflect changes in the presented theory.

Scanned Xerox copies of Darwin’s book abound on the web, like for example at http://darwin-online.org.uk/converted/pdf/1861_OriginNY_F382.pdf. How can we make the contents of such copies available to KNIME? This is where Optical Character Recognition (OCR) comes into play.

On the other side, to find a summary of the current evolutionary concepts we could just query Wikipedia, or better DBPedia, using semantic web SPARQL queries.

Xerox copies on one side, read via OCR, and semantic web queries on the other side. Will they blend?

Topic. Changes in the theory of evolution.

Challenge. Blend a Xerox copy of a book with semantic web queries.

Access Mode. Image reading, OCR library, SPARQL queries.

The Experiment

Reading the Content of a PNG Xerox Copy via Optical Character Recognition (OCR)

Darwin's book is only available in printed form. The first part of our workflow tries to extract the content of the book from its scanned copy. This is only possible using Optical Character Recognition (OCR) software.

We are in luck! KNIME Analytics Platform integrates the Tesseract OCR software as the KNIME Image Processing - Tess4J Integration. This package is available under KNIME Community Contributions - Image Processing and Analysis extension (see how to install KNIME Extensions). Let’s continue step by step, after the package installation.

  • The Image Reader node reads the locally stored image files of the pages of the book “On the Origin of Species”.
  • The read images are sent to the Tess4j node, which runs the Tesseract OCR library and outputs the recognized texts as Strings, one text String for each processed PNG page file.
  • Each page text is then joined with the corresponding page image, converted from ImgPlusValue to PNG format in a KNIME PNGImageCell data cell. The goal of this step is to allow for visual inspection of the processed pages later on via the Image Viewer node.
  • Notice that only the “Recap” and the “Conclusions” sections of Darwin’s book are processed. That should indeed be enough to extract a reasonable number of representative keywords.

This part of the workflow is shown in the upper branch of the final workflow in figure 1, the part labelled as “Optical Character Recognition of Xerox Copy”.
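For readers who want to try the OCR step outside KNIME, here is a minimal sketch using the pytesseract wrapper around Tesseract. This is a stand-in assumption, not the workflow itself, which uses the Tess4J integration; the file names are hypothetical and Tesseract must be installed on the system.

    # OCR sketch: one recognized text string per scanned page image.
    from PIL import Image
    import pytesseract

    pages = ["origin_recap_p1.png", "origin_recap_p2.png"]   # hypothetical scanned page files

    texts = []
    for path in pages:
        image = Image.open(path)
        texts.append(pytesseract.image_to_string(image))

    print(texts[0][:500])   # first 500 characters of the first recognized page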

Figure 1. Final workflow where OCR (upper branch) meets Semantic Web Queries (lower branch) to show the change in keywords of original and later formulations of the theory of evolution. This workflow is available on the KNIME EXAMPLES server under 99_Community/01_Image_Processing/02_Integrations/03_Tess4J/02_OCR_meets_SemanticWeb*



Querying the Semantic Web: DBPedia

On the other side, we want to query DBPedia for the descriptions of the modern evolutionary theories. This part can be seen in the lower branch of the workflow in figure 1, the one named “Querying the Semantic Web: DBPedia”.

  • First, we establish a connection to the DBpedia SPARQL endpoint: http://dbpedia.org/sparql.
  • Then we make three queries, on the pages "Modern evolutionary synthesis", "Extended evolutionary synthesis" and "Evolutionary developmental biology" respectively (a minimal query sketch follows this list).
  • The results from the 3 queries are collected with a Concatenate node.
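As an illustration, the sketch below runs one of the three queries with the SPARQLWrapper library; the library and the exact query text are assumptions, since the workflow itself uses the KNIME Semantic Web nodes.

    # Fetch the English abstract of one DBpedia page via the SPARQL endpoint.
    from SPARQLWrapper import SPARQLWrapper, JSON

    sparql = SPARQLWrapper("http://dbpedia.org/sparql")
    sparql.setQuery("""
        PREFIX dbo: <http://dbpedia.org/ontology/>
        SELECT ?abstract WHERE {
            <http://dbpedia.org/resource/Modern_evolutionary_synthesis> dbo:abstract ?abstract .
            FILTER (lang(?abstract) = "en")
        }
    """)
    sparql.setReturnFormat(JSON)

    results = sparql.query().convert()
    for binding in results["results"]["bindings"]:
        print(binding["abstract"]["value"][:300])   # first 300 characters of the abstract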

Blending Keywords from the Xerox Copy and from the Semantic Web Queries

We have now texts from the book and texts from DBPedia. We want to distill all of them to just a few representative keywords and blend them together.

  • Both branches of the workflow pass through a "Document Creation and Preprocessing I" wrapped metanode. Here standard text processing functions are applied, such as case conversion, punctuation removal, number filtering and POS tagging.
  • In the following wrapped metanode, named “Preprocessing II”, the terms are extracted and their relative frequencies are computed. The two lists of terms are then joined together. The column “presence” marks the terms common to both datasets with a 2 and the terms found in only one dataset with a 1.
  • The two tag clouds are created, one from the terms in Darwin’s book and the other from the terms in the DBpedia search results. Words common to both datasets are colored in red.
  • Finally, we can isolate those innovative terms, used in the description of the new evolutionary theories but not in Darwin’s original theory. This is done with a “Reference Row Filter” node and displayed with a “Javascript Table View”.

The final workflow is available on the EXAMPLES server under:
99_Community/01_Image_Processing/02_Integrations/03_Tess4J/02_OCR_meets_SemanticWeb*

The Results

Figure 2 shows the tag cloud with the terms from Darwin’s book “On the Origin of the Species”, while figure 3 shows the tag cloud from the terms found in the results from DBPedia queries.

Natural Selection is a central concept in Darwin’s evolutionary theory and this is confirmed by the central position of the two words, “natural” and “selection”, in the tag cloud in figure 2. The two words are in red. This means that the same terms are also found in modern evolutionary theories, even though in a less central position inside the tag cloud (Fig. 3).

Interestingly enough, the word "evolution" is not found in Darwin's book. Although this term was soon associated with Darwin's theories and became popular, Darwin himself preferred the concept of "descent with modification" and, even more, "natural selection", as we have remarked earlier.

Words like "species" and "varieties" also play a central role in Darwin's theory. Indeed, the whole theory sprang from the observation of the variety of species on Earth. By contrast, words like "modern", "synthesis", and "evolution" are the cornerstones of the modern evolutionary theories.

Figure 2. Word Cloud generated from the “Recap” and “Conclusions” sections in the book “On the Origin of the Species” by Charles Darwin



Figure 3. Word Cloud generated from the results of the DBPedia queries: "Modern evolutionary synthesis", "Extended evolutionary synthesis", and "Evolutionary developmental biology"



What has been learned from the publication time of the book “On the Origin of Species” to the current time? One thing that Darwin for sure could not know is genetics!

If we look for the word “gene” in the table view of the Javascript Table View node, we surely find “genetics”, “gene”, and a number of other related words! Remember that this table view displays the terms found in the description of modern evolutionary theories but not in Darwin’s original book (Fig. 4).

Figure 4. List of “genetics” related words present in the modern evolutionary theories (as derived from the queries to DBPedia) and not present yet in Darwin’s original book (as derived from the OCR processing of the scanned PNG images of the book pages). Such words are all listed in the table view of the Javascript Table View node in the final workflow.

From the results of this small experiment we understand that the evolutionary theory has itself evolved from Darwin's seminal work to the modern research studies.

Also, in this experiment, we successfully blended data from a scanned Xerox copy with data from DBPedia. We used the Tesseract OCR integration to extract the book content and SPARQL queries to extract the DBPedia descriptions ... and yes, they blend!

Coming Next …

If you enjoyed this, please share it generously and let us know your ideas for future blends.

We're looking forward to the next challenge, in which we will tackle the spreadsheet world by trying to blend an Excel sheet with a Google Sheet. Will they blend?

 


* The link will open the workflow directly in KNIME Analytics Platform (requirements: Windows; KNIME Analytics Platform must be installed with the Installer version 3.2.0 or higher)


Empower Your Own Experts! Continental Wins the Digital Leader Award

Posted by rs on Mon, 07/17/2017 - 10:21

Continental, a leading automotive supplier, recently won the Digital Leader Award 2017 in the category "Empower People" for bringing big data and analytics closer to its employees with KNIME. Arne Beckhaus is the man behind this project. We are lucky enough to welcome him today for an interview on the KNIME blog.

Rosaria: Arne, congratulations on winning the Digital Leader Award. We are very pleased to hear that your project builds on KNIME. Can you tell us more about it?

Arne: Thanks for the invitation. The essence of our project is to bring data analytics skills to non-IT employees in our business units. So we are talking about colleagues from purchasing, logistics, and even HR who have neither a programming nor a data science background, but have interesting problems to solve. For them, we implemented an internal training program about data analytics and big data that is completely based on KNIME products. Participating users often bring their data problems to the training and, if the problem is too complex, we support them by drafting an initial workflow. In this way, our users have the chance to solve their analytics problem and be trained in KNIME at the same time, thus optimizing learning speed and skill development.

Rosaria: What range of problems are you tackling with this approach?

Arne: Typically, our business managers and employees talk about big data as soon as data volumes exceed the capabilities of spreadsheet logic. My experience is that only a fifth of the real world problems actually require cluster computing in the league of Spark and Hadoop, which KNIME addresses via the KNIME Big Data Extension. For the remaining 80% of our real-world problems, we can make use of the great selection of standard KNIME nodes.

Rosaria: How do you make sure your users build valid models without a proper background in data science?

Arne: There is a large difference between building a real-time scoring predictor model on gazillions of data points and the everyday data processing challenges of our thousands of business users. Most of our time, including data scientists' time, is spent on what the IT people call ETL operations: data cleaning, data blending, aggregations, data reorganization, and so on. An example: a colleague from logistics used to work with a manual Excel-based filtering and visualization process, which took 10 minutes per part number analyzed. This work could not be prepared proactively in advance for the thousands of part numbers in the systems; it could only be executed on demand. In parallel to this colleague's training, we developed together a KNIME workflow to automatically prioritize an entire business unit's critical spots in the supply chain on a weekly basis. This expertise-based solution generates valuable insights without any model training. So even though only typical ETL nodes are used in KNIME, I prefer to call these domain-knowledge-based workflows 'deterministic analytics'.

Rosaria: By the way, with a Custom Node Repository in the KNIME server you could even restrict the analytics nodes available to your users.

Arne: Thanks for the tip. Looks like another great feature of KNIME’s customizability and open nature. This was actually one of the main reasons to select KNIME Analytics Platform for our project. The innovation speed, extensibility, openness, and smart surrounding commercial offers convinced us from day one.

Rosaria: Can you tell us more about your training approach? What is special about it?

Arne: First of all, we only use domain-related examples. So lots of automotive examples instead of B2C e-commerce, pharmaceutical, etc. That makes it easier for our users to relate the material to their real-life challenges. Since we run this training internally with our own resources, we can also offer easy-to-consume (and easy-to-schedule) training cycles. A typical training wave consists of one 3-hour module per week over 5 consecutive weeks.

Rosaria: What are your key learnings from the project?

Arne: Reviewing our progress so far, I think there are three key success factors.

  • Most important has been the definition of our target group: business users without programming skills. We enable them to utilize their domain knowledge for self-service analytics.
  • Then, the combination of training and pilot projects has given us immediate results.
  • Last but not least, we created awareness for the 80/20 rule in data analytics: Most of the effort lies in data integration and this can be done by the data owners themselves. I would even go beyond that and say there is a second 80/20 rule: from a business user perspective, 80% of our problems don’t require advanced analytics but can be solved by deterministic data workflows, e.g. by prioritizing thousands of cases according to a user-defined criticality KPI.

Rosaria: Thanks a lot Arne for these insights and the choice of KNIME as your chosen analytics platform. Congratulations again for winning the Digital Leader Award! Any final words from your side?

Arne: Thanks for the opportunity to share our findings. Please keep up the great work at KNIME! As a final remark, I would like to inspire others to follow our approach: You will be rewarded by enthusiastic employees who can unleash their creativity in working with data while at the same time leading to better business decisions on all levels of the organization!

Will They Blend? Experiments in Data & Tool Blending. Today: A Recipe for Delicious Data: Mashing Google and Excel Sheets

By amartin, Mon, 07/24/2017 - 10:47

In this blog series we’ll be experimenting with the most interesting blends of data and tools. Whether it’s mixing traditional sources with modern data lakes, open-source devops on the cloud with protected internal legacy tools, SQL with noSQL, web-wisdom-of-the-crowd with in-house handwritten notes, or IoT sensor data with idle chatting, we’re curious to find out: will they blend? Want to find out what happens when IBM Watson meets Google News, Hadoop Hive meets Excel, R meets Python, or MS Word meets MongoDB?

Follow us here and send us your ideas for the next data blending challenge you’d like to see at willtheyblend@knime.com.

Today: A Recipe for Delicious Data: Mashing Google and Excel Sheets

The Challenge

Don’t be confused! This is not one of the data chef battles, but a “Will they blend?” experiment - which, just by chance, happens to be on a restaurant theme again.

A local restaurant has been running its business relatively successfully for a few years now. It is a small business: an Excel sheet was enough for the full accounting in 2016. To simplify collaboration, the restaurant owner decided to start using Google Sheets at the beginning of 2017. Now she faces the same task every month: calculating the monthly and YTD revenues (2017, in Google Sheets) and comparing them with the corresponding prior-year values (2016, in Microsoft Excel).

The technical challenge at the center of this experiment is definitely not a trivial matter: mashing the data from the Excel and Google spreadsheets into something delicious… and digestible. Will they blend?

Topic. Monthly and YTD revenue figures for a small local business.

Challenge. Blend together Microsoft Excel and Google Sheets.

Access Mode. Excel Reader node and Google Sheets REST API for public and private documents.

Before diving into the technicalities, let’s spend a few words on the data. Both the Excel file and the Google spreadsheet contain the restaurant bill transactions with:

  • date and time (DATE_TIME),
  • table number (TABLE),
  • bill amount (SUM).

The Experiment

The Excel Spreadsheet can be read easily with an Excel Reader node (see blog post: “Will they blend? Hadoop Hive meets Excel”).

The bigger problem here is accessing the Google Sheet. It is possible to access a Google Sheet document via the dedicated Google Sheets REST API services. We send a Request specifying the Google Spreadsheet ID and the range of data cells to be read, and we get the corresponding values back in the Response.

Of course, together with the request parameters, some kind of recognition token is required. This can be a simple API key for a public Google Spreadsheet or an OAuth2 authorization token for a private Google Spreadsheet. In this experiment, we will discuss both access scenarios: to a public Google document and a private Google document.

Accessing a PUBLIC Google Spreadsheet Document via REST API

A Google Sheet document is public when its link sharing option is on. It can be on for anyone, for anyone with the link, for anyone in your domain, and for anyone in your domain with the link.

A public Google Sheet can be accessed with a GET Request through the Google Sheets REST API. A valid URL has the following format:

https://sheets.googleapis.com/v4/spreadsheets/SPREADSHEET_ID/values/SPREADSHEET_NAME!RANGE?key=API_KEY&majorDimension=MAJOR_DIMENSION

where:

  • SPREADSHEET_ID is the unique identification string for that spreadsheet. It can be found between the "/d/" and the "/edit" in the URL of the spreadsheet.
     
  • SPREADSHEET_NAME is the name of the spreadsheet.
     
  • RANGE is the data cell range to be expressed in the A1 notation.
     
  • API_KEY is a personal identification key for each Google user. In case of a public Google Sheet, the REST Request does not need to be authorized, but the user needs to be identified by means of a unique identifier, such as the Google API key. A Google API key can be acquired through the following steps:
    • Activate the Google Sheets API service in the Google API Console. To do so, open the Library page in the Google API console, find the Google Sheets API, click it to enable the service.
    • In the same page, open the Credentials page on the left, and get an API key via the menu option Create Credentials -> API key.
  • MAJOR_DIMENSION is an optional component specifying whether to operate on the rows or the columns of the spreadsheet. For faster response processing, we set this parameter to “COLUMN”.

For more options (for example, subset of fields or multiple ranges) please check the Google documentation.

The Google Sheet API then returns the data within the specified range of the selected spreadsheet using a JSON representation.
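Outside KNIME, the same call can be sketched in a few lines of Python with the requests package. Everything below (spreadsheet ID, sheet name, cell range, and API key) is a placeholder you would replace with your own values; the request itself and the JSON parsing mirror what the workflow does with KNIME nodes:

  import requests

  # All values below are placeholders - fill in your own spreadsheet ID,
  # sheet name, cell range, and Google API key.
  SPREADSHEET_ID = "YOUR_SPREADSHEET_ID"
  SHEET_NAME = "2017"
  CELL_RANGE = "A1:C5761"
  API_KEY = "YOUR_API_KEY"

  url = ("https://sheets.googleapis.com/v4/spreadsheets/"
         f"{SPREADSHEET_ID}/values/{SHEET_NAME}!{CELL_RANGE}")

  # 'key' identifies the user; 'majorDimension' asks for column-wise values
  # (the documented enum value is COLUMNS).
  response = requests.get(url, params={"key": API_KEY,
                                       "majorDimension": "COLUMNS"})
  response.raise_for_status()

  payload = response.json()
  # 'values' holds the cell contents; with column-wise values each entry is one
  # column, whose first element is the header (DATE_TIME, TABLE, SUM).
  for column in payload.get("values", []):
      print(column[0], "->", len(column) - 1, "data cells")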

For example, a GET Request to access the data in the cell range A1:C5761 of the sheet named “2017” (with the spreadsheet ID and API key omitted below) could be:

https://sheets.googleapis.com/v4/spreadsheets//values/2017!A1:C5761?key=&majorDimension=COLUMN

In KNIME Analytics Platform, a GET Request to a REST service is sent via the GET Request Node. The Request string shown above is customized for the selected spreadsheet and data cell range, fed into the GET Request node, and used as URL Column setting in the Connection Settings tab. The other node settings are kept at the default values, which means no authentication, no request headers, basic response header only (Status and Content-type).

By storing the data cell range, the spreadsheet ID, and the API key in three different flow variables, we can build the right GET Request String on the fly with a String Manipulation node, for a different document, sheet, and data cell range at each run.

Even better. The String Manipulation node and the three String Input Quickform nodes were encapsulated into a wrapped metanode. Consequently, the settings in the QuickForm tab of the metanode configuration window set the values for the three flow variables necessary to customize the GET Request string (Fig. 1).

Note. You only need to insert the Spreadsheet ID, the API Key, and the data cell range to customize the GET Request appropriately, when executing the metanode. There is no need to open the metanode and the nodes inside it!

The metanode output is the Request String and feeds the GET Request node.

Figure 1. QuickForm tab of the wrapped metanode named “GET Request Wizard”. The three settings set the values of the three flow variables used to customize the Request String to access a public Google Sheet document. You only need to insert the Spreadsheet ID, the API Key, and the data cell range to customize the GET Request appropriately when executing the metanode.

The final workflow is shown in figure 2 and can be downloaded from the KNIME EXAMPLES server from 01_Data_Access/05_REST_Web_Services/04_Public_Google_Sheet-meets-Excel*.

The upper branch reads the restaurant transactions from 2016 from an Excel spreadsheet. The lower branch, labelled “2017 Restaurant Transactions”, retrieves the 2017 data from the Google Sheet. And, just as we want it to, it builds the GET Request in the “Google Sheets Wizard” wrapped metanode and sends it to the Google Spreadsheet REST API service via the GET Request node.

The service Response is parsed and imported into a KNIME table by means of a JSON To Table node inside the “JSON Processing” metanode.

The rest of the workflow calculates the revenues by month as total sum and as Year To Date (YTD) cumulative sum. While the monthly total sum is calculated with a GroupBy node, the YTD cumulative sum is calculated with a Moving Aggregation node.
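For readers who prefer to see the aggregation logic as code, here is a rough pandas equivalent of those two steps; the DataFrame below is a made-up stand-in for the blended transaction table with DATE_TIME and SUM columns:

  import pandas as pd

  # Made-up stand-in for the blended transaction table (one row per bill).
  df = pd.DataFrame({
      "DATE_TIME": pd.to_datetime(["2017-01-03 12:30", "2017-01-17 20:15",
                                   "2017-02-02 19:45"]),
      "SUM": [42.5, 61.0, 35.2],
  })

  df["YEAR"] = df["DATE_TIME"].dt.year
  df["MONTH"] = df["DATE_TIME"].dt.month

  # Monthly total revenue (what the GroupBy node computes) ...
  monthly = df.groupby(["YEAR", "MONTH"], as_index=False)["SUM"].sum()

  # ... and the YTD cumulative sum per year (what the Moving Aggregation node computes).
  monthly["YTD"] = monthly.groupby("YEAR")["SUM"].cumsum()
  print(monthly)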

Figure 2. This workflow blends together data from an Excel spreadsheet and data from a public Google Sheet document. This workflow is available on the KNIME EXAMPLES server under 01_Data_Access/05_REST_Web_Services/04_Public_Google_Sheet-meets-Excel*.


(click on the image to see it in full size)

Accessing a PRIVATE Google Sheets Document via REST API

When accessing a private Google document, things become more complicated. A simple API key is not enough anymore. We need to acquire an access token through the OAuth 2.0 protocol:

1. Register the application with the Google Spreadsheet API service

  1. Activate the Google Sheets API in the Google API Console. To do so, open the Library page in the Google API console, find the Google Sheets API, click it to enable the service.
  2. Register the application and create the credentials for authorization on the Credentials page, via the menu option Create credentials -> OAuth client ID, choosing “Other” as the application type. This results in a new OAuth 2.0 client ID with two new parameters: a Client ID and a Client Secret. Please note down these two parameters.

2. Send a Request to Google's OAuth 2.0 server to get an authorization code

Note. For security reasons this is still a manual step, since it requires logging in to your private Google account.

The Request String should have the following format:

https://accounts.google.com/o/oauth2/v2/auth?client_id=CLIENT_ID&redirect_uri=REDIRECT_URI&response_type=RESPONSE_TYPE&scope=SCOPE

where:

  • CLIENT_ID is the Client ID parameter that we obtained when we registered our application for an OAuth 2.0 authorization (see section above).
     
  • REDIRECT_URI defines the channel for the Google authorization response. Since we’re going to manually copy and paste the authorization code, the value in the Request String was set to urn:ietf:wg:oauth:2.0:oob (more options are also available).
     
  • RESPONSE_TYPE specifies the type of response. It should be set to code for installed applications.
     
  • SCOPE regulates the access degree to the document by managing the set of permitted operations. Several scopes might be set in the same Request String separated by an “&” symbol.

Here is an example of a Request String for requesting an authorization code to access a private document:

https://accounts.google.com/o/oauth2/v2/auth?client_id=&redirect_uri=urn:ietf:wg:oauth:2.0:oob&response_type=code
&scope=https://www.googleapis.com/auth/spreadsheets.readonly&https://www.googleapis.com/auth/spreadsheets

Now we paste the Request String in the browser URL box. We are then redirected to a Google consent form to login and authorize the application to access the data. If the authorization is successful, we are provided with an authorization code, which we will use in the next step. 

3. Exchange authorization code for access token

To exchange the authorization code obtained in the previous step for an access token, we need to send a POST Request to Google's OAuth 2.0 server. To send a POST Request to a REST API service we can use the POST Request node. In particular, we will be using a POST Request node with the following configuration settings.

In the Connection Settings tab we set the request URL:

https://www.googleapis.com/oauth2/v4/token

In the Request Headers tab we add the following (Header Key, Value) pairs, both of the type Constant:

Host: www.googleapis.com
Content-Type: application/x-www-form-urlencoded

In the Request Body tab we specify the POST request parameters by means of a String without white spaces:

code=AUTHORIZATION_CODE&client_id=CLIENT_ID&client_secret=CLIENT_SECRET&
redirect_uri=REDIRECT_URI&grant_type=GRANT_TYPE

where:

  • AUTHORIZATION_CODE was obtained in the previous set of manual steps.
     
  • CLIENT_ID is the Client ID parameter obtained in the previous steps when authorizing the application on the Credentials page of the Google API Console.
     
  • CLIENT_SECRET is the Client Secret also obtained in the previous steps when authorizing the application on the Credentials page of the Google API.
     
  • REDIRECT_URI defines the channel for the Google authorization Response. Since we’re going to manually copy and paste the authorization code, the value in the Request String was set to urn:ietf:wg:oauth:2.0:oob (more options are also available).
     
  • GRANT_TYPE defines the OAuth 2.0 access process. Here the value must be set to authorization_code.

The parameter String for our POST Request then looks something like this:

code=4/KfUg4b7ZUQ1lZeHGlephvwSLQDHMBFOMeFoKzFJ3Vs4&client_id=&client_secret=&redirect_uri=urn:ietf:wg:oauth:2.0:oob&grant_type=authorization_code

The Response to this POST Request is a JSON object which holds a short-lived access token and a refresh token.
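For illustration, the same exchange could also be performed outside KNIME with a short Python sketch; the authorization code, Client ID, and Client Secret are placeholders to be filled in with the values from the previous steps:

  import requests

  # Placeholders - fill in the values obtained in the previous steps.
  body = {
      "code": "AUTHORIZATION_CODE",
      "client_id": "CLIENT_ID",
      "client_secret": "CLIENT_SECRET",
      "redirect_uri": "urn:ietf:wg:oauth:2.0:oob",
      "grant_type": "authorization_code",
  }

  # requests encodes the body as application/x-www-form-urlencoded,
  # matching the Content-Type header set in the POST Request node.
  response = requests.post("https://www.googleapis.com/oauth2/v4/token", data=body)
  response.raise_for_status()

  tokens = response.json()
  access_token = tokens["access_token"]        # short-lived access token
  refresh_token = tokens.get("refresh_token")  # reusable to get new access tokens
  print(access_token, refresh_token)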

The refresh token can be used to obtain a new access token without generating a new authorization code. When using the refresh token to generate a new access token, the parameter String in the POST Request becomes:

refresh_token=REFRESH_TOKEN&client_id=CLIENT_ID&client_secret=CLIENT_SECRET&
redirect_uri=REDIRECT_URI&grant_type=GRANT_TYPE

where:

  • REFRESH_TOKEN is the refresh token we obtained together with the access token
     
  • GRANT_TYPE must be set to refresh_token.

Let’s create a wizard wrapped metanode to produce the right parameter String for the POST Request to generate a new access token. A String Manipulation node and three String Input Quickform nodes are encapsulated into a wrapped metanode. The three Quickform nodes set the Authorization Code (or Refresh Token), the Client ID, and the Client Secret. The configuration window of the wrapped metanode, named “POST Request Wizard”, is shown in figure 3.

Note. You only need to insert the Authorization Code (or Refresh Token), the Client ID, and the Client Secret to customize the parameter String for the POST Request that generates a new access token. There is no need to open the metanode and the nodes inside it!

The metanode output contains the appropriate parameter String and feeds the POST Request node.

Figure 3. QuickForm tab of the wrapped metanode named “POST Request Wizard”. The three settings set the values of the three flow variables used to customize the parameter String to exchange an authorization code (or a refresh token) for an access token to access a private Google Sheet document. You only need to insert the Authorization Code /Refresh token, the Client ID, and the Client Secret to customize the parameter String for the POST Request.

The created parameter string will then be used as a request body in the POST Request node. In the Request Body tab, we check the “Use column’s content as body” radio button and we set that column as the value.

The output of the POST Request node contains the Response from the Google OAuth 2.0 service. The Response, which is JSON structured, contains the desired access token (finally!). The JSON structure is converted into a KNIME table inside the JSON Processing metanode, by means of a JSON to Table node. The output of the JSON Processing metanode is the access token.

4. Use access token in GET Request to extract data from Google Sheet

As in the case of a public Google Sheet document, data can be accessed with a GET Request node. Similarly, we need to build a Request String, but with the access token instead of the API key:

https://sheets.googleapis.com/v4/spreadsheets/SPREADSHEET_ID/values/SPREADSHEET_NAME!RANGE?access_token=ACCESS_TOKEN&majorDimension=MAJOR_DIMENSION

In our case, the sample Request String will look something like this:

https://sheets.googleapis.com/v4/spreadsheets//values/2017!A1:C5761?access_token=&majorDimension=COLUMN

The GET Request is then prepared in the “GET Request Wizard (Private Access)” wrapped metanode in a similar way as described in the previous section for the “GET Request Wizard (Public Access)” wrapped metanode.
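Again purely for illustration, the authorized GET Request could be reproduced in Python as follows; the spreadsheet ID and the access token are placeholders:

  import requests

  ACCESS_TOKEN = "ACCESS_TOKEN_FROM_THE_PREVIOUS_STEP"  # placeholder

  url = ("https://sheets.googleapis.com/v4/spreadsheets/"
         "YOUR_SPREADSHEET_ID/values/2017!A1:C5761")

  # The OAuth 2.0 access token replaces the API key used for public sheets.
  response = requests.get(url, params={"access_token": ACCESS_TOKEN,
                                       "majorDimension": "COLUMNS"})
  response.raise_for_status()
  values = response.json().get("values", [])
  print(len(values), "columns retrieved")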

The final workflow is shown in figure 4 and can be downloaded from the KNIME EXAMPLES server from 01_Data_Access/05_REST_Web_Services/05_Private_Google_Sheet-meets-Excel*.

The upper branch reads the restaurant transactions from 2016 from an Excel Sheet. The lower branch, labelled “2017 Restaurant Transactions”, retrieves the 2017 data from a private Google Sheet. And indeed, it sends a POST Request to get the access token and then a GET Request to the Google Spreadsheet REST API service.

The Response to the last GET Request is parsed and imported into a KNIME table by means of a JSON To Table node inside the second “JSON Processing” metanode.

The rest of the workflow calculates the revenues by month as total sum and Year To Date (YTD) cumulative sum. While the monthly total sum is calculated with a GroupBy node, the YTD cumulative sum is calculated with a Moving Aggregation node.

Figure 4. This workflow blends together data from an Excel spreadsheet and from a private Google Sheet document. This workflow is available on the KNIME EXAMPLES server under 01_Data_Access/05_REST_Web_Services/05_Private_Google_Sheet-meets-Excel*.


(click on the image to see it in full size)

The Results

The two bar charts below show the restaurant revenues in euros as a total monthly sum and as a Year To Date (YTD) monthly cumulative sum respectively, for both groups of transactions in 2016 (light orange, from the Excel file) and in 2017 (darker orange, from the Google Sheet document).

We are happy to see that the small restaurant we used as an example has increased its business sales in 2017 with respect to the same months in 2016.

We are also happy to see that Google Sheets and Microsoft Excel spreadsheets really do blend!

In this experiment, indeed, we retrieved and blended data from an Excel spreadsheet and a Google Sheet document. We actually ran two experiments: one retrieving data from a Google Sheet document with public access and the other retrieving the same data from a Google Sheet document with private access. In both experiments our Excel spreadsheets and Google Sheets documents blended easily, to produce a delicious dish for our restaurant business.

Figure 5. Total monthly revenues for our restaurant in year 2016 (light orange on the left) and in year 2017 (darker orange on the right). Business in 2017 seems to be better than in 2016.

Figure 6. Cumulative YTD revenues for our restaurant in year 2016 (light orange on the left) and in year 2017 (darker orange on the right). Also here, business in 2017 seems to be better than in 2016.

So, if you are asked by your friend running a local business whether you can blend data from Excel spreadsheets and Google Sheet documents, the answer is: Yes, they blend!

Coming Next …

If you enjoyed this, please share it generously and let us know your ideas for future blends.

We’re looking forward to the next challenge. There we will tackle the world of CRM applications, trying to bring together Salesforce and SugarCRM. Will they blend?

 


* The link will open the workflow directly in KNIME Analytics Platform (requirements: Windows; KNIME Analytics Platform must be installed with the Installer version 3.2.0 or higher)

Setting up the KNIME Python extension. Revisited for Python 3.0 and 2.0

By greglandrum, Mon, 07/31/2017 - 09:57

As part of the v3.4 release of KNIME Analytics Platform, we rewrote the Python extensions and added support for Python 3 as well as Python 2. Aside from the Python 3 support, the new nodes aren’t terribly different from a user perspective, but the changes to the backend give us more flexibility for future improvements to the integration. This blog post provides some advice on how to set up a Python environment that will work well with KNIME as well as how to tell KNIME about that environment.

The Python Environment

We recommend using the Anaconda Python distribution from Continuum Analytics. There are many reasons to like Anaconda, but the important things here are that it can be installed without administrator rights, supports all three major operating systems, and provides all of the packages needed for working with KNIME “out of the box”.

Get started by installing Anaconda from the link above. You’ll need to choose which version of Python you prefer (we recommend that you use Python 3 if possible) but this just affects your default Python environment; you can create environments with other Python versions without doing a new install. For example, if I install Anaconda3 I can still create Python 2 environments.

Once you’ve got Anaconda installed, open a shell (linux), terminal (Mac), or command prompt (Windows) and create a new Python environment for use inside of KNIME:

  conda create -y -n py35_knime python=3.5 pandas jedi

If there are additional packages you’d like to install, go ahead and add them to the end of that command line. If you’d like to install Python 2.7 instead of 3.5, just change the version number in the command.

In order to use this new Python environment from inside of KNIME, you need to create a script (shell script on linux and the Mac, bat file on Windows) to launch it.

If you are using linux or the Mac, here’s an example shell script for the Python environment defined above:

  #! /bin/bash
  # start by making sure that the anaconda directory is on the PATH
  # so that the source activate command works.
  # This isn't necessary if you already know that
  # the anaconda bin dir is on the PATH
  export PATH="PATH_WHERE_YOU_INSTALLED_ANACONDA/bin:$PATH"

  source activate py35_knime
  python "$@" 1>&1 2>&2

You will need to edit that to replace PATH_WHERE_YOU_INSTALLED_ANACONDA with wherever you installed Anaconda. I named this script py35.sh, made it executable (“chmod gou+x py35.sh”), and put it in my home directory.

If you are using Windows, here’s a sample bat file:

  @REM Adapt the directory in the PATH to your system
  @SET PATH=C:\tools\Anaconda3\Scripts;%PATH%
  @CALL activate py35_knime || ECHO Activating py35_knime failed
  @python %*

You will need to edit that to replace C:\tools\Anaconda3 with wherever you installed Anaconda. I named the file py35.bat and put it in my home directory.

You now have everything required to use Python in KNIME. Congrats!

Configuring KNIME

Once you have a working Python environment you need to tell KNIME how to find it. Start by making sure that you have the new Python nodes - KNIME Python Integration (Labs), which supports Python 2 & 3 - installed in KNIME Analytics Platform. Once you have these installed (and have restarted KNIME, if necessary), configure Python using the Preferences page KNIME → Python (Labs):

Figure 1. KNIME Python Preferences page. Here you can set the path to the executable script that launches your Python environment.

On this page you need to provide the path to the script/bat file you created to start Python. If you like, you can have configurations for both Python 2 and Python 3 (as I do above). Just select the one that you would like to have as the default.

If you’ve completed the steps above and after you click “Apply” KNIME shows the correct version number for Python in the dialog, you’re ready to go. Enjoy using the powerful combination of KNIME Analytics Platform and Python!
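If you want a quick smoke test, here is a minimal script you could paste into a Python Script (1⇒1) node. It assumes that the node exposes the input as a pandas DataFrame named input_table and expects the result in a DataFrame named output_table, so it will not run as a standalone script outside the node:

  # Runs inside a KNIME Python Script (1=>1) node, which is assumed to provide
  # the input as a pandas DataFrame named input_table and to read the result
  # back from a DataFrame named output_table.
  df = input_table.copy()

  # Add a trivial derived column, just to confirm that pandas works end to end.
  numeric_cols = df.select_dtypes(include="number").columns
  df["row_sum"] = df[numeric_cols].sum(axis=1)

  output_table = df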

Note. For those using older versions of KNIME or the old Python nodes:

The instructions here for setting up a conda environment for using Python inside of KNIME and creating the shell script/batch file for invoking that environment will also work for older versions of KNIME. In that case you can only use Python 2 and need to be sure to include protobuf as one of the packages in your conda create command.

 

Wrapping up

This post showed you how to install an Anaconda Python environment that can be used with the KNIME Python integration and then how to configure KNIME Analytics Platform to use that environment. In a future post we’ll show some interesting things that you can do with the combination of KNIME and Python.

Will They Blend? Experiments in Data & Tool Blending. Today: Blending Databases. A Database Jam Session

By rs, Mon, 04/10/2017 - 15:19

In this blog series we’ll be experimenting with the most interesting blends of data and tools. Whether it’s mixing traditional sources with modern data lakes, open-source devops on the cloud with protected internal legacy tools, SQL with noSQL, web-wisdom-of-the-crowd with in-house handwritten notes, or IoT sensor data with idle chatting, we’re curious to find out: will they blend? Want to find out what happens when IBM Watson meets Google News, Hadoop Hive meets Excel, R meets Python, or MS Word meets MongoDB?

Follow us here and send us your ideas for the next data blending challenge you’d like to see at willtheyblend@knime.com.

Today: Blending Databases. A Database Jam Session

The Challenge

Today we will push the limits by attempting to blend data from not just 2 or 3, but 6 databases!

These 6 SQL and noSQL databases are among the top 10 most used databases, as listed on most database comparison web sites (see DB-Engines Ranking, The 10 most popular DB Engines …, Top 5 best databases). Whatever database you are using in your current data science project, there is a very high probability that it will be in our list today. So, keep reading!

What kind of use case is going to need so many databases? Well, actually it’s not an uncommon situation. For this experiment, we borrowed the use case and data sets used in the Basic and Advanced Course on KNIME Analytics Platform. In this use case, a company wants to use past customer data (behavioral, contractual, etc…) to find out which customers are more likely to buy a second product. This use case includes 6 datasets, all related to the same pool of customers.

  1. Customer Demographics (Oracle). This dataset includes age, gender, and all other classic demographic information about customers, straight from your CRM system. Each customer is identified by a unique customer key. One of the features in this dataset is named “Target” and describes whether the customer, when invited, bought an additional product. 1 = he/she bought the product; 0 = he/she did not buy the product. This dataset has been stored in an Oracle database.
  2. Customer Sentiment (MS SQL Server). Customer sentiment about the company has been evaluated with some customer experience software and reported in this dataset. Each customer key is paired with customer appreciation, which ranges on a scale from 1 to 5. This dataset is stored on a Microsoft SQL Server
  3. Sentiment Mapping (MariaDB). This dataset contains the full mapping between the appreciation ranking numbers in dataset # 2 and their word descriptions. 1 means “very negative”, 5 means “very positive”, and 2, 3, and 4 cover all nuances in between. For this dataset we have chosen a relatively new and very popular storage software: a MariaDB database.
  4. Web Activity from the company’s previous web tracking system (MySQL). A summary index of customer activity on the company web site used to be stored in this dataset. The web tracking system associated with this dataset has been declared obsolete and phased out a few weeks ago. This dataset still exists, but is not being updated anymore. A MySQL database was used to store these data.
  5. Web Activity from the company’s new web tracking system (MongoDB). A few weeks ago the original web tracking system was replaced by a newer system. This new system still tracks customers’ web activity on the company web site and still produces a web activity index for each customer. To store the results, this system relies on a new noSQL database: MongoDB. No migration of the old web activity indices has been attempted, because migrations are costly in terms of money, time, and resources. The idea is that eventually the new system will cover all customers and the old system will be completely abandoned. Till then, though, indices from the new system and indices from the old system will have to be merged together at execution time.
  6. Customer Products (PostgreSQL). For this experiment, only customers who already bought one product are considered. This dataset contains the one product owned by each customer and it is stored in a PostgreSQL database.

The goal of this experiment is to retrieve the data from all of these data sources, blend them together, and train a model to predict the likelihood of a customer buying a second product.

The blending challenge of this experiment is indeed an extensive one. We want to collect data from all of the following databases: MySQL, MongoDB, Oracle, MariaDB, MS SQL Server, and PostgreSQL. Six databases in total: five relational databases and one noSQL database.

Will they all blend?

Topic. Next Best Offer (NBO). Predict likelihood of customer to buy a second product.

Challenge. Blend together data from six commonly used, SQL and noSQL databases.

Access Mode. Dedicated connector nodes or generic connector node with JDBC driver.

The Experiment

Let’s start by connecting to all of these databases and retrieving the data we are interested in.

Relational Databases

Data retrieval from all relational SQL-powered databases follows a single pattern:

  1. Define Credentials.
  • Credentials can be defined at the workflow level (right-click the workflow in the KNIME Explorer panel and select Workflow Credentials). Credentials provided this way are encrypted.
  • Alternatively, credentials can be defined in the workflow using a Credentials Input node. The Credentials Input node protects the username and password with an encryption scheme.
  • Credentials can also be provided explicitly as username and password in the configuration window of the connector node. A word of caution here. This solution offers no encryption unless a Master Key is defined in the Preferences page.
  2. Connect to Database.

    With the available credentials we can now connect to the database. To do that, we will use a connector node. There are two types of connector nodes in KNIME Analytics Platform.
  • Dedicated connector nodes. Some databases, with redistributable JDBC driver files, have dedicated connector nodes hosted in the Node Repository panel. Of our 6 databases, MySQL, PostgreSQL, and SQL Server enjoy the privilege of dedicated connector nodes. Dedicated connector nodes encapsulate the JDBC driver file and other settings for that particular database, making the configuration window leaner and clearer.
  • Generic connector node. If a dedicated connector node is not available for a given database, we can resort to the generic Database Connector node. In this case, the JDBC driver file has to be uploaded to KNIME Analytics Platform via the Preferences -> KNIME -> Databases page. Once the JDBC driver file has been uploaded, it will also appear in the drop-down menu in the configuration window of the Database Connector node. Provided the appropriate JDBC driver is selected and the database hostname and credentials have been set, the Database Connector node is ready to connect to the selected database. Since a dedicated connector node was missing, we used the Database Connector node to connect to the Oracle and MariaDB databases.
  3. Select Table and Import Data.

    Once a connection to the database has been established, a Database Table Selector node builds the necessary SQL query to extract the required data from the database. A Database Connection Table Reader node then executes the SQL query, effectively importing the data into KNIME Analytics Platform.

It is comforting that this approach – connect to database, select table, and extract data – works for all relational databases. It is equally comforting that the Database Connector node can reach out to any database. This means that with this schema and the right JDBC driver file we can connect to all existing databases, including vintage versions or those of rare vendors.

NoSQL Databases

Connecting to a NoSQL database, such as MongoDB, follows a different node sequence pattern.

In KNIME Labs, a MongoDB sub-category hosts a few nodes that allow you to perform basic database operations on a MongoDB database. In particular, the MongoDB Reader node connects to a MongoDB database and extracts data according to the query defined in its configuration window.

Credentials here are required within the configuration window and it is not possible to provide them via the Credentials Input node or the Workflow Credentials option.

Data retrieved from a MongoDB database are encapsulated in a JSON structure. No problem. This is nothing that the JSON to Table node cannot handle. At the output of the JSON to Table node, the data retrieved from the MongoDB database are then made available for the next KNIME nodes.
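Just to make the comparison concrete, reading a collection outside KNIME could look like the following pymongo sketch; the host, database, and collection names are invented for illustration:

  import pandas as pd
  from pymongo import MongoClient

  # Invented connection details - adapt host, port, database, and collection names.
  client = MongoClient("mongodb://localhost:27017/")
  collection = client["customers_db"]["web_activity"]

  # Each document becomes one row; the projection drops MongoDB's internal _id field.
  docs = list(collection.find({}, {"_id": 0}))
  web_activity = pd.DataFrame(docs)
  print(web_activity.head())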

Figure 1. This is the part of the workflow that blends data from six different databases: MySQL, MongoDB, SQL Server, Oracle, MariaDB, and PostgreSQL.


(click on the image to see it in full size)

Train a Predictive Model

Most of our six datasets contain information about all of the customers. Only the web activity datasets do not cover all customers on their own; together, however, they do. The old web activity dataset is therefore concatenated with the new web activity dataset. After that, all data coming from the different data sources are adjusted, renamed, converted, and joined so that one row represents one customer, where the customer is identified by its unique customer key.

Note. Notice the usage of a GroupBy node to perform a deduplication operation. Indeed, grouping data rows on all features allows for removal of identical rows.
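The same idea, sketched in pandas for illustration on toy data:

  import pandas as pd

  # Toy example: the second row is an exact duplicate of the first.
  df = pd.DataFrame({"customer_key": [1, 1, 2],
                     "web_activity": [0.4, 0.4, 0.9]})

  # Grouping on all columns - or simply dropping duplicates - removes identical rows.
  deduplicated = df.drop_duplicates()
  print(deduplicated)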

The resulting dataset is then partitioned and used to train a machine learning model. As machine learning model, we chose a random forest with 100 trees and we trained it to predict the value in the “Target” column. “Target” is a binary feature representing whether a customer bought a second product. So training the model to predict the value in “Target” means that we are training the model to produce the likelihood of a customer buying a second product, given all that we already know about her/him.

The model is then applied to the test set and its performance evaluated with a Scorer node. The model accuracy was calculated to be around 77%.

Measuring the Influence of Input Features

A very frequent task in data analytics projects is to determine the influence of the input features on the trained model. There are many solutions to that, which also depend on the kind of predictive model that has been adopted.

A classic solution that works with all predictive algorithms is the backward feature elimination procedure (or its analogous forward feature construction).

Backward Feature Elimination starts with all N input features and progressively removes one to see how this affects the model performance. The input feature whose removal lowers the model’s performance the least is left out. This step is repeated until the model’s performance is worsened considerably. The subset of input features producing a high accuracy (or a low error) represents the subset of most influential input features. Of course, the definition of high accuracy (or low error) is arbitrary. It could mean the highest accuracy or a high enough accuracy for our purposes.

The metanode, named “Backward Feature Elimination” and available in the Node Repository under KNIME Labs/Wide Data/Feature Selection, implements exactly this procedure. The final node in the loop, named “Feature Selection Filter”, produces a summary of the model performance, for all steps where the input feature with lowest influence had been removed.

Remember that the Backward Feature Elimination procedure becomes slower as the number of input features grows. It works well with a limited number of input features, but avoid using it to investigate hundreds of them.
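To make the procedure concrete, here is a compact Python/scikit-learn sketch of backward feature elimination; it is an illustration on toy data, not the KNIME metanode itself:

  import numpy as np
  from sklearn.datasets import make_classification
  from sklearn.ensemble import RandomForestClassifier
  from sklearn.model_selection import cross_val_score

  # Toy stand-in for the blended customer table.
  X, y = make_classification(n_samples=500, n_features=8, random_state=0)
  features = list(range(X.shape[1]))

  while len(features) > 1:
      scores = {}
      for f in features:
          remaining = [c for c in features if c != f]
          model = RandomForestClassifier(n_estimators=100, random_state=0)
          # Accuracy of the model trained without feature f.
          scores[f] = cross_val_score(model, X[:, remaining], y, cv=3).mean()
      # Remove the feature whose absence hurts accuracy the least.
      worst = max(scores, key=scores.get)
      print(f"dropping feature {worst}, accuracy without it: {scores[worst]:.3f}")
      features.remove(worst)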

In addition, a random forest offers a higher degree of interpretability with respect to other machine learning models. One of the output ports of the Random Forest Learner node provides the number of times an input feature has been a candidate for a split and the number of times it has actually been chosen for the split, for levels 0, 1, and 2 across all trees in the forest. For each input feature, we subsequently defined a heuristic measure of influence, borrowed from the KNIME whitepaper “Seven Techniques for Data Dimensionality Reduction”, as:

  influence index = sum(# splits) / sum(# candidates)

The input features with highest influence indices are the most influential ones on the model performance.
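Computing the index itself is straightforward; here is a small pandas sketch in which the table and its column names are invented stand-ins for the attribute statistics output of the Random Forest Learner node:

  import pandas as pd

  # Hypothetical attribute statistics, one row per input feature.
  stats = pd.DataFrame({
      "feature":            ["Age", "Income", "Gender"],
      "splits_level_0":     [40, 35, 2],
      "splits_level_1":     [55, 48, 5],
      "splits_level_2":     [60, 52, 8],
      "candidates_level_0": [70, 72, 65],
      "candidates_level_1": [80, 78, 75],
      "candidates_level_2": [85, 83, 80],
  })

  split_cols = [c for c in stats.columns if c.startswith("splits")]
  candidate_cols = [c for c in stats.columns if c.startswith("candidates")]

  # influence index = sum(# splits) / sum(# candidates), per feature
  stats["influence_index"] = (stats[split_cols].sum(axis=1)
                              / stats[candidate_cols].sum(axis=1))
  print(stats.sort_values("influence_index", ascending=False))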

Figure 2. Content of the metanode “Backward Feature Elimination” adapted for a random forest predictive model.


(click on the image to see it in full size)

The final workflow is shown in figure 3 and it is downloadable from the KNIME EXAMPLES server under: 01_Data_Access/02_Databases/08_Database_Jam_Session*.

In figure 3 you can see the five parts of our workflow: Credentials Definition, Database Connections and Data Retrieval, Data Blending to reach one single data table, Predictive Model Training, and Influence Measure of Input Features.

Figure 3. This workflow blends data from 6 different databases: MySQL, MongoDB, SQL Server, Oracle, MariaDB, and PostgreSQL. The blended dataset is used to train a model to predict customer’s likelihood to buy a second product. The last nodes measure input features’ influence on the final predictions.


(click on the image to see it in full size)

The Results

Yes, data from all of these databases do blend!

In this experiment, we blended data from six SQL and noSQL databases – Oracle, MongoDB, MySQL, MariaDB, SQL Server, and PostgreSQL – to reach one single data table summarizing all available information about our customers.

In this same experiment, we also trained a random forest model to predict the likelihood of a customer buying a second product.

Finally, we measured each input feature’s influence on the final predictions, using a Backward Feature Elimination procedure and a heuristic influence measure based on the numbers of splits and candidates in the random forest. Results from both procedures are shown in figures 4 and 5. Both figures show the prominent role of Age and Estimated Yearly Income and the negligible role of Gender when predicting whether a customer will buy a second product.

Figure 4. Bar Rendering of the influence indices calculated for all input features.


(click on the image to see it in full size)

Figure 5. Accuracy for subsets of input features from the configuration window of the Feature Selection Filter node.


(click on the image to see it in full size)

This whole predictive and influence analysis was made possible purely because of the data blending operation involving the many different database sources. The main result is therefore another yes! Data can be retrieved from different databases and they all blend!

The data and use case for this post are from the basic and advanced course on KNIME Analytics Platform. The course, naturally, covers much more and goes into far more detail than what we have had the chance to show here.

Note. Just in case you got intrigued and want to know more about the courses that KNIME offers, you can refer to the course web page on the KNIME web site. There you can find the course schedule and a description of the course content. In particular, the slides for the basic and advanced course can now be downloaded for free from https://www.knime.org/course-materials-download-registration-page.

Coming Next …

If you enjoyed this, please share it generously and let us know your ideas for future blends.

We’re looking forward to the next challenge. There we will tackle Teradata databases and KNIME table files. Will they blend?

 


* The link will open the workflow directly in KNIME Analytics Platform (requirements: Windows; KNIME Analytics Platform must be installed with the Installer version 3.2.0 or higher)

Will They Blend? Experiments in Data & Tool Blending. Today: Teradata Aster meets KNIME Table. What is that chest pain?

By knime_admin, Mon, 04/24/2017 - 11:41

In this blog series we’ll be experimenting with the most interesting blends of data and tools. Whether it’s mixing traditional sources with modern data lakes, open-source devops on the cloud with protected internal legacy tools, SQL with noSQL, web-wisdom-of-the-crowd with in-house handwritten notes, or IoT sensor data with idle chatting, we’re curious to find out: will they blend? Want to find out what happens when IBM Watson meets Google News, Hadoop Hive meets Excel, R meets Python, or MS Word meets MongoDB?

Follow us here and send us your ideas for the next data blending challenge you’d like to see at willtheyblend@knime.com.

Today: Teradata Aster meets KNIME Table. What is that chest pain?

Author: Kate Phillips, Data Scientist, Analytics Business Consulting Organization, Teradata

 

The Challenge

Today’s challenge is related to the healthcare industry. You know that little pain in the chest you sometimes feel and you do not know whether to run to the hospital or just wait until it goes away? Would it be possible to recognize as early as possible just how serious an indication of heart disease that little pain is?

The goal of this experiment is to build a model to predict whether or not a particular patient with that chest pain has indeed heart disease.

To investigate this topic, we will use open-source data obtained from the University of California Irvine Machine Learning Repository, which can be downloaded from http://archive.ics.uci.edu/ml/machine-learning-databases/heart-disease/. Of all datasets contained in this repository, we will use the processed Switzerland, Cleveland, and VA data sets and the reprocessed Hungarian data set.

These data were collected from 920 cardiac patients: 725 men and 193 women aged between 28 and 77 years old; 294 from the Hungarian Institute of Cardiology, 123 from the University Hospitals in Zurich and Basel, Switzerland, 200 from the V.A. Medical Center in Long Beach, California, and 303 from the Cleveland Clinic in Ohio.

Each patient is represented through a number of demographic and anamnestic values, angina descriptive fields, and electrocardiographic measures (http://archive.ics.uci.edu/ml/machine-learning-databases/heart-disease/heart-disease.names).

In the dataset each patient condition is classified into 5 levels according to the severity of his/her heart disease. We simplified this classification system by transforming it into a binary class system: 1 means heart disease was diagnosed, 0 means no heart disease was found.

This is not the first time that we are running this experiment. Not so long ago, we built a Naïve Bayes model in KNIME on the same data to solve the same problem. Today we want to build a logistic regression model and see if we get any improvement on the Naïve Bayes model’s performance.

Original patient data are stored in a Teradata database. The predictions from the old Naïve Bayes model are stored in a KNIME Table.

Teradata Aster is a proprietary database system that may be in use at your company/organization. It is designed to enable multi-genre advanced data transformation on massive amounts of data. If your company/organization is a Teradata Aster customer, you can obtain the JDBC driver that interfaces with KNIME by contacting your company’s/organization’s Teradata Aster account executive.

Table format is a KNIME proprietary format to store data efficiently, in terms of size and retrieval speed, and completely, i.e. also including the structure metadata. This leads to smaller local files, faster reading, and minimal configuration settings. In fact, the Table Reader node, which reads such Table files, only needs the file path and retrieves all other necessary information from the metadata saved in the file itself. Files saved in KNIME Table format carry the extension “.table”.

Teradata Aster on one side, KNIME Table formatted file on the other side. The question, as usual, is: Will they blend? Let’s find out.

Topic. Predicting heart disease. Is this chest pain innocuous or serious?

Challenge. Blend data from Teradata Aster system with data from a KNIME .table file. Build a predictive model to establish presence or absence of heart disease.

Access Mode. Database Connector node with Teradata JDBC driver to retrieve data from Teradata database. Table Reader node to read KNIME Table formatted files.

The Experiment

Accessing the Teradata Aster database

  1. First of all, we needed the appropriate JDBC driver to interface Teradata Aster with KNIME. If your company/organization is a Teradata Aster customer, the noarch-aster-jdbc-driver.jar file can be obtained by contacting your Teradata Aster account executive.
  2. Once we downloaded the noarch-aster-jdbc-driver.jar file, we imported it into the list of available JDBC drivers in KNIME Analytics Platform.
    1. Open KNIME Analytics Platform and select File -> Preferences -> KNIME -> Databases -> Add File.
    2. Navigate to the location where you saved the noarch-aster-jdbc-driver.jar file.
    3. Select the .jar file, then click Open -> OK.
  3. In a KNIME workflow, we configured a Database Connector node with the Teradata Aster database URL (jdbc:ncluster), the newly added JDBC driver (com.asterdata.ncluster.jdbc.core.NClusterJDBCDriver), and the credentials for the same Teradata Aster database.
  4. The Database Connector node was then connected directly to a Database Reader node. Since we are quite expert SQL coders, we implemented the data pre-processing in the Database Reader node in the form of SQL code. The SQL code selects the [schema].heartdx_prelim table, creates an ID variable called rownum and a new binary response variable named dxpresent (disease = 1, no disease = 0), and recodes missing values (represented by -9 and, for some variables, 0) as true NULL values. The SQL code is shown below:
     
    DROP TABLE IF EXISTS [schema].knime_pract;
    
    CREATE TABLE [schema].knime_pract
    DISTRIBUTE BY HASH(age)
    COMPRESS LOW AS (
    	SELECT
    	 (ROW_NUMBER() OVER (ORDER BY age)) AS rownum,
    	 age,
    		 gender,
    	 chestpaintype,
    	 CASE
    		 WHEN restbps = 0 THEN null
    		 ELSE restbps
    		 END AS restbps,
    		 CASE
    		 WHEN chol = 0 THEN null
    		 ELSE chol
    		 END AS chol,
    		 fastbloodsug,
    		 restecg,
    		 maxheartrate,
    		 exindang,
    		 oldpeak,
    		 slope,
    		 numvessels,
    		 defect,
    		 dxlevel,
    		 CASE
    		 WHEN dxlevel IN ('1', '2', '3', '4') THEN '1'
    		 ELSE '0'
    		 END AS dxpresent
    	FROM [schema].heartdx_prelim
    );
    
    SELECT * FROM [schema].knime_pract;

If you are not an expert SQL coder, you can always use the KNIME in-database processing nodes available in the Node Repository in the Database/Manipulation sub-category.

Building the Predictive Model to recognize possible heart disease

  1. At this point, we have imported the data from the Teradata Aster database into our KNIME workflow. More pre-processing, however, was still needed:
    1. Filtering out empty or almost empty columns; that is, columns with too many missing values (please see this article for tips on dimensionality reduction: https://www.knime.org/blog/seven-techniques-for-data-dimensionality-reduction)
    2. On the remaining columns, performing missing value imputation in a Missing Value node by replacing missing numeric values (both long and double) with the median and missing string values with the most frequently occurring value
    3. Partitioning the dataset into training and test set using a Partitioning node (80% vs. 20%)
  2. After this pre-processing has been executed, we can build the predictive model. This time we chose a logistic regression model. A Logistic Regression Learner node and a Regression Predictor node were introduced into the workflow.
    1. Column named dxpresent was used as the target variable in the Logistic Regression Learner node.
    2. The box “Append columns…” was checked in the Regression Predictor node. This last option is necessary to produce the prediction values which will later be fed into an ROC Curve node to compare the two models’ performances.

Reading data from the KNIME .table file

Here we just needed to write the Table file path into a Table Reader node. Et voilà, we got the predictions previously produced by the Naïve Bayes model.

Note. The file path is indeed all you need! All other necessary information about the data structure is stored in the .table file itself.

 

Blending Predictions

  1. After training the logistic regression model, we used a Joiner node to connect its predictions to the older predictions from the Naïve Bayes model.
  2. We then connected an ROC Curve node, to display the false positives and true positives, through P(dxpresent=1), for both models.

The final workflow is shown in figure 1 and it is available for download on the EXAMPLES server under 01_Data_Access/02_Databases/09_Teradata_Aster_meets_KNIME_Table*.

Figure 1. This workflow retrieves data from a Teradata Aster database, builds a predictive model (Logistic Regression) to recognize the presence of a heart disease, blends this model’s predictions with the predictions of an older model (Naïve Bayes) stored in a KNIME Table file, and then compares the two models’ performances through an ROC curve.

 


(click on the image to see it in full size)

The Results

The ROC curves resulting from the workflow are shown in figure 2. The red curve refers to the logistic regression, the blue curve to the old Naïve Bayes model. We can see that the Naïve Bayes model, though being older, is not obsolete. In fact, it shows an area under the curve (0.95) comparable to the area under the curve of the newer logistic regression (0.93).

Figure 2. ROC curves of the Logistic Regression (in red) and of the old Naive Bayes model (in blue)

 

To conclude, let’s spend a few words on the interpretation of the logistic regression model. After opening the view of the Logistic Regression Learner node, we get the table in figure 3. There, from the coefficient values, you can see that gender and chest pain type #2 (atypical angina) are the main drivers for the prediction, although patients with chest pain type #2 are less likely to have heart disease than those with other types of chest pain.

Are men more affected than women by heart disease? Does this describe a general a priori probability or is it just the a priori probability of this data set? It would be interesting here to see how many men with chest pain type #2 have heart disease and how many do not, and the same for women. We can investigate this with a GroupBy node following the Missing Value node. We configure the GroupBy node to group data rows by gender, chest pain type, and dxpresent; our desired aggregation is the count of the number of rows (on the rownum column).

After execution, we find that 60 out of 193 women have chest pain type #2; of the women with this chest pain type, 4 have heart disease while 56 do not. In other words, our data set shows that women with chest pain type #2 have only a 4/60 = 6.7% chance of having heart disease. For the men, we find that 113 out of 725 men have chest pain type #2; of the men with this type of chest pain, 20 have heart disease while 93 do not. According to our data, men with chest pain type #2 have a 20/113 = 17.7% chance of having heart disease.
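The same counts can be reproduced with a few lines of pandas; the DataFrame below is a toy stand-in for the pre-processed patient table with gender, chestpaintype, and dxpresent columns:

  import pandas as pd

  # Toy stand-in for the pre-processed patient table from the workflow.
  patients = pd.DataFrame({
      "gender":        [1, 1, 0, 0, 1, 0],
      "chestpaintype": [2, 2, 2, 2, 3, 2],
      "dxpresent":     [1, 0, 0, 0, 1, 1],
  })

  # Counts per gender, chest pain type, and diagnosis - the GroupBy node's job.
  counts = (patients
            .groupby(["gender", "chestpaintype", "dxpresent"])
            .size()
            .rename("n")
            .reset_index())

  # Share of diagnosed patients within each (gender, chest pain type) group.
  counts["share"] = counts["n"] / counts.groupby(["gender", "chestpaintype"])["n"].transform("sum")
  print(counts[counts["chestpaintype"] == 2])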

Figure 3. Coefficients of the Logistic Regression model to predict presence/absence of heart disease

 

In this experiment we have successfully blended predictions from a logistic regression model trained on data stored in a Teradata database with older predictions from a Naïve Bayes model stored in a KNIME Table formatted file.

Again, the most important conclusion of this experiment is: Yes, they blend!

Coming Next …

If you enjoyed this, please share this generously and let us know your ideas for future blends.

We’re looking forward to the next challenge. What about blending SQL dialects? For example one of the many Hadoop Hive SQL dialects and Spark SQL. Will they blend?

 


The authors of the databases have requested that any publications resulting from the use of the data include the names of the principal investigator responsible for the collection of the data at each institution. They would be:

  1. Hungarian Institute of Cardiology. Budapest: Andras Janosi, M.D.
  2. University Hospital, Zurich, Switzerland: William Steinbrunn, M.D.
  3. University Hospital, Basel, Switzerland: Matthias Pfisterer, M.D.
  4. V.A. Medical Center, Long Beach and Cleveland Clinic Foundation: Robert Detrano, M.D., Ph.D.

* The link will open the workflow directly in KNIME Analytics Platform (requirements: Windows; KNIME Analytics Platform must be installed with the Installer version 3.2.0 or higher)

Data Chef ETL Battles. What can be prepared with today’s data? Ingredient Theme: Energy Consumption Time Series

By rs, Mon, 08/14/2017 - 11:55

Do you remember the Iron Chef battles?

It was a televised series of cook-offs in which famous chefs rolled up their sleeves to compete in making the perfect dish. Based on a set theme, this involved using all their experience, creativity, and imagination to transform sometimes questionable ingredients into the ultimate meal.

Hey, isn’t that just like data transformation? Or data blending, or data manipulation, or ETL, or whatever new name is trending now? In this new blog series requested by popular vote, we will ask two data chefs to use all their knowledge and creativity to compete in extracting a given data set's most useful “flavors” via reductions, aggregations, measures, KPIs, and coordinate transformations. Delicious!

Want to find out how to prepare the ingredients for a delicious data dish by aggregating financial transactions, filtering out uninformative features or extracting the essence of the customer journey? Follow us here and send us your own ideas for the “Data Chef Battles” at datachef@knime.com.

Ingredient Theme: Energy Consumption Time Series. Behavioral Measures over Time and Seasonality Index from Auto-Correlation.

Author: Rosaria Silipo
Data Chefs: Haruto and Momoka

Ingredient Theme: Energy Consumption Time Series

Let’s talk today about electricity and its consumption. One of the hardest problems in the energy industry is matching supply and demand. On the one hand, over-production of energy can be a waste of resources; on the other hand, under-production can leave people without the basic commodities of modern life. The prediction of the electrical energy demand at each point in time is therefore a very important chapter in data analytics.

For this reason, a couple of years ago energy companies started to monitor the electricity consumption of each household, store, or other entity, by means of smart meters. A pilot project was launched in 2009 by the Irish Commission for Energy Regulation (CER).

The Smart Metering Electricity Customer Behaviour Trials (CBTs) took place during 2009 and 2010 with over 5,000 Irish homes and businesses participating. The purpose of the trials was to assess the impact on consumers’ electricity consumption, in order to inform the cost-benefit analysis for a national rollout. Electric Ireland residential and business customers and Bord Gáis Energy business customers who participated in the trials had an electricity smart meter installed in their homes or on their premises and agreed to take part in research to help establish how smart metering can help shape energy usage behaviors across a variety of demographics, lifestyles, and home sizes. The trials produced positive results. The reports are available from CER (Commission for Energy Regulation) along with further information on the Smart Metering Project. In order to get a copy of the data set, fill out this request form and email it to ISSDA.

The data set is just a very long time series: one column contains the smart meter ID, one column the time, and one column the amount of electricity used in the previous 30 minutes. The time is expressed as the number of minutes elapsed since 01.01.2009 : 00.00 and has to be transformed back into one of the classic date/time formats, for example dd.MM.yyyy : HH.mm. The original sampling rate, at which the used energy is measured, is one sample every 30 minutes.

The first data transformations, common to all data chefs, involve the date/time conversion and the extraction of year, month, day of month, day of week, hour, and minute from the raw date.
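As a rough illustration of this step (not the actual KNIME nodes used in the workflow), the conversion from a minute offset to a timestamp and the extraction of the date/time fields could look like this in Python; the column values and file layout are made up for the example.

```python
import pandas as pd

# Toy input: minutes elapsed since 01.01.2009 00:00 (made-up values)
df = pd.DataFrame({"minutes": [0, 30, 60, 525600]})

df["timestamp"] = pd.Timestamp(2009, 1, 1) + pd.to_timedelta(df["minutes"], unit="min")
df["date_str"] = df["timestamp"].dt.strftime("%d.%m.%Y : %H.%M")

# Same fields as extracted in the KNIME workflow
df["year"] = df["timestamp"].dt.year
df["month"] = df["timestamp"].dt.month
df["day_of_month"] = df["timestamp"].dt.day
df["day_of_week"] = df["timestamp"].dt.dayofweek   # Monday = 0
df["hour"] = df["timestamp"].dt.hour
df["minute"] = df["timestamp"].dt.minute
print(df)
```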

Topic. Energy Consumption Time Series

Challenge. From time series to behavioral measures and seasonality

Methods. Aggregations at multiple levels, Correlation

Data Manipulation Nodes. GroupBy, Pivoting, Linear Correlation, Lag Column

The Competition

What can we do, in general, with a time series? Often the final goal is the prediction of future values, based on current and past values. Just how much of the past do we need? Also, a time series can follow very different shapes. Does the shape have any meaning? Can we summarize the time series evolution by describing the electricity related habits of the household? Is there any seasonality that we can take into account? Is it possible to predict future values for groups of similar time series? In this case, how do we measure similarity across time series? Well, let’s start this challenge and see what our data chefs have prepared for today’s data!

Data Chef Haruto: Behavior Measures over Time

Haruto has decided to remain in the time space and to analyze the electrical behavior of the energy consumers, as measured by their smart meters. In particular, he explored the energy consumption on weekends and business days, on each day of the week, on each hour of the day, and for different time frames during the day.

In order to do that, he first calculated the average energy consumption by day of week, by hour, by time frame during the day, and for weekends vs. business days. The average values already show who is using the largest amount of energy. He then transformed these average values into percentages, to understand when each entity uses how much of its energy.

In figure 1, the upper branch of the workflow - embedded in the square labelled “Usage Measures” - is from Data Chef Haruto.

Figure 1. Final Workflow 03_ETL_Energy_autocorr_stats. The upper part named "Usage Measures" describes the entity’s energy consumption behavior. The lower part labelled "Auto-correlation Matrix" calculates the auto-correlation matrix of the energy consumption time series for a selected meter ID. This workflow is available on the KNIME EXAMPLES Server under 02_ETL_Data_Manipulation/06_Date_and_Time_Manipulation/03_ETL_Energy_autocorr_stats*

The first 2 metanodes, named “Daily Values” and “Hourly Values”, respectively, calculate (Fig. 2):

  • The average daily/hourly energy usage by meter ID (GroupBy node)
  • The average energy usage by meter ID vs. day of week/hour of day (Pivoting node)
  • The average energy usage during weekends vs. business days / day time frames (Rule Engine + Pivoting node)

After that, a series of Math Formula nodes in the metanodes named “Intra-day segments (%)” and “Week Day (%)” puts the average values into context by reporting them as the percentage of energy used during intra-day segments and during week days.
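For readers who want to see these aggregations spelled out as code rather than nodes, here is a minimal pandas sketch of the idea; the column names ("meter_id", "day_of_week", "energy") and the file name are assumptions for illustration and do not match the workflow exactly.

```python
import pandas as pd

# Assumed columns: meter_id, day_of_week, energy (kWh used in a 30-minute slot)
df = pd.read_csv("smart_meter_data.csv")

# GroupBy node: average energy per meter ID
avg_per_meter = df.groupby("meter_id")["energy"].mean()

# Pivoting node: average energy per meter ID vs. day of week
avg_by_dow = df.pivot_table(index="meter_id", columns="day_of_week",
                            values="energy", aggfunc="mean")

# Math Formula nodes: each day's percentage of the weekly total
pct_by_dow = avg_by_dow.div(avg_by_dow.sum(axis=1), axis=0) * 100
print(pct_by_dow.head())
```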

Figure 2. Content of the “Daily Values” metanode, which calculates the daily energy consumption by meter ID: on average, on average per day of week, and on average over weekends vs. business days

Data Chef Momoka: Auto-Correlation Matrix

Momoka decided to look for seasonality patterns and, to do that, to check each time series’ auto-correlation matrix.

In figure 1, the lower branch of the workflow - embedded in the square labelled “Auto-correlation Matrix” - is the result of Data Chef Momoka’s work.

First, the data is shaped as a pivot table with the average energy consumption of each meter ID vs. date and hour. The metanode “Pivoting”, indeed, produces the energy consumption time series for all meter IDs, sampled every hour and sorted by time. The subsequent metanode, named “Select Meter ID”, allows you to select a single time series via its meter ID value.

In order to calculate the auto-correlation matrix, we need:

  • normalized values for a meaningful comparison of the correlation indices
  • past values to calculate the correlation of the current sample with its past N samples

In the metanode “Normalize & Lag”, the time series values are normalized into [0,1] and the N past samples are introduced. Normalization is achieved with a Normalizer node, while the N past samples are produced by a Lag Column node. The Lag Column node makes N copies of the selected column and shifts their values by 1, 2, …, N steps forward. Since the column values are sorted by time, this attaches the N past samples of the time series to the current one.

The auto-correlation matrix of the current samples with their past N samples is then calculated using a Linear Correlation node. The correlation matrix will show a few highly correlated columns, for example x(t) and x(t-2). In particular, if the auto-correlation function shows local maxima at recurrent steps in the past, such as between x(t) and x(t - i*24) with i = 1, 2, …, this might be a sign of a seasonality pattern.

The “Find Seasonality” metanode searches for such local maxima in the correlation functions. It detects the smallest seasonality period as the position of the first local maximum of the correlation function, found via its first derivative (Fig. 3).

Figure 3. Content of “Find Seasonality” metanode, which finds the local maxima in the auto-correlation function through its first derivative values
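To make the mechanics more concrete, here is a small Python sketch of the same procedure: normalize the series, correlate it with its own lagged copies, and locate the first local maximum via a sign change of the first derivative. It only illustrates the idea, with a made-up file name; it is not the code behind the metanodes.

```python
import numpy as np
import pandas as pd

# Illustrative input: hourly energy values for one meter ID, sorted by time
x = pd.Series(np.loadtxt("meter_1038_hourly.csv"))

# Normalizer node: scale into [0, 1]
x = (x - x.min()) / (x.max() - x.min())

# Lag Column + Linear Correlation nodes: correlation of x(t) with x(t - k)
N = 100
autocorr = np.array([x.autocorr(lag=k) for k in range(1, N + 1)])

# Find Seasonality metanode: first local maximum of the auto-correlation
# function, found where its first derivative changes sign from + to -
diff = np.diff(autocorr)
period = next(k + 1 for k in range(1, len(diff))
              if diff[k - 1] > 0 and diff[k] <= 0)
print("Smallest seasonality period:", period, "hours")   # ~24 for a daily cycle
```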

The final workflow with the contributions of both data chefs can be admired in Figure 1 and can be found on the EXAMPLES server in: 02_ETL_Data_Manipulation/06_Date_and_Time_Manipulation/03_ETL_Energy_autocorr_stats*

The Jury

Let’s now see the final results on a specific smart meter. The jury randomly chose meter ID 1038.

According to the behavioral features provided by data chef Haruto, the entity connected to meter ID 1038 uses 232 kWh per day on average, more or less the same amount every day of the week, with not much difference between weekends and business days. Moving to the hourly scale, meter ID 1038 uses ~10 kWh per hour on average, most of it during the day and roughly equally distributed over morning and afternoon.

Indeed, the line plot provided by data chef Momoka for the energy usage time series of meter ID 1038 (Fig. 4) shows a clear day vs. night cycle, where the energy used during the day is definitely dominant. The plot also shows no difference in electricity usage across the days of the week.

Figure 4. Line Plot of energy consumption time-series for meter ID 1038. Notice the day/night rhythm.

This cyclic trend justifies the auto-correlation based findings of data chef Momoka.  The signal autocorrelation map (Fig. 5) shows a 24-hour cycle, with local maxima in the auto-correlation function at x(t) and x(t-24), x(t) and x(t-48), x(t) and x(t-72) and so on.  The smallest seasonality period was calculated to be 24 hours.

Note. The stronger the cyclic behavior of the auto-correlation matrix, the more meaningful the seasonality pattern. In figure 5, the time-series seasonality is clearly visible through the cyclic trend of its auto-correlation matrix.

Figure 5. Auto-correlation matrix of energy consumption time-series for meter ID 1038. You can see the cyclic trend of the auto-correlation matrix and the auto-correlation local maxima at -24, -48 and so on.

Again, the final workflow with the contributions of both data chefs can be found on the EXAMPLES server in: 02_ETL_Data_Manipulation/06_Date_and_Time_Manipulation/03_ETL_Energy_autocorr_stats*

Note. The example workflow on the EXAMPLES server works only on a subset of the original dataset. This is because the original dataset must be obtained by filling out the request form and emailing it to ISSDA. Therefore, the auto-correlation map, and in general all other numbers shown in this post, will be different when produced by the example workflow on the reduced dataset!

We have reached the end of this competition. Congratulations to both our data chefs for wrangling such interesting features from the raw data ingredients! Oishii!

Coming next …

If you enjoyed this, please share it generously and let us know your ideas for future data preparations.

We’re looking forward to the next data chef battle. The theme ingredient there will be ClickStream data.

 


* The link will open the workflow directly in KNIME Analytics Platform (requirements: Windows; KNIME Analytics Platform must be installed with the Installer version 3.2.0 or higher)

Distributed executors in the next major version of KNIME Server

Posted by thor on Mon, 08/07/2017 - 11:25

If you are a KNIME Server customer, you probably noticed that the changelog file for the KNIME Server 4.5 release was rather short compared to the one in previous releases. This by no means implies that we were lazy! Together with introducing new features and improving existing ones, we also started working on the next generation of KNIME Server. You can see a preview of what is to come in the so-called distributed executors. In this article I will explain what a distributed executor is and how it can be useful to you. I will also provide some technical details for the geeks among you, and finally I will give you a rough timeline for the distributed executors' final release.

State of the KNIME Server

Currently the installation of a KNIME Server is straightforward because all components reside on the same physical machine. These components are: the workflow repository; the executor, which runs workflow jobs; and the server itself, which provides the interface to the outside world (via WebPortal, REST, or EJBs for access from the KNIME Analytics Platform). Figure 1 shows this setup.

Figure 1. Architecture diagram of the current KNIME Server

 

Having all components on the same machine has the main advantage that communication between them is fast and reliable. The downside, though, is - obviously - scalability. Depending on the workflows you have created and on the amounts of data you are processing, the single executor on a single physical machine can become a bottleneck. Of course you could use a bigger machine, but there are clear limits to this approach. Therefore we decided to introduce distributed executors!

Scale me!

The general concept of distributed executors is easy to understand: you run multiple executors on independent hardware. If the existing executors are all busy executing jobs and cannot accept new ones, you simply add more executors to handle the waiting jobs. Figure 2 shows this general idea.

Figure 2. Architecture diagram of KNIME Server 5

 

Technically, this setup is much more challenging. First of all, you can no longer use the file system to exchange information (mainly workflows) between the server and the executors. This means all communication must be performed over a network, ideally using a standard protocol such as HTTP. The existing REST interface already available in the KNIME Server is then a natural candidate. Indeed, in the designed architecture, the distributed executors rely heavily on it.

However, using only HTTP for communication would require the server to know exactly which executors are available and where to reach them at any moment. The server would also need to queue requests in case no free executor is available. Since this is a problem common to many applications, solutions are already available, one of which is the so-called "message queue". The concept of a message queue is quite simple: a sender puts a message into the queue and the system ensures that it is delivered to the right recipients (think of it as a post office).

Roger Rabbit

One such message queueing system is RabbitMQ. Although it is written in the rather exotic programming language Erlang, it is straightforward to install and manage (and also fast, for that matter). It can run on the same system as the KNIME Server or on any other system that is reachable by both the server and the executors. You then have to tell both parties where to find the queue, and that's it. The setup of the queues themselves is done by the application; there is no configuration required in RabbitMQ.
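To give a feeling for how little code a message queue involves, here is a minimal Python example using the pika client for RabbitMQ. It only illustrates the publish/consume pattern; it is not KNIME's actual implementation, and the queue name and message format are made up.

```python
import pika

# Connect to a RabbitMQ broker (assumed to run on localhost) and declare a queue
connection = pika.BlockingConnection(pika.ConnectionParameters(host="localhost"))
channel = connection.channel()
channel.queue_declare(queue="knime-jobs")

# "Server" side: publish a message asking for workflow X to be loaded for user Y
channel.basic_publish(exchange="", routing_key="knime-jobs",
                      body='{"workflow": "X", "user": "Y"}')

# "Executor" side: consume messages from the same queue
def on_message(ch, method, properties, body):
    print("Received job:", body)
    ch.basic_ack(delivery_tag=method.delivery_tag)

channel.basic_consume(queue="knime-jobs", on_message_callback=on_message)
channel.start_consuming()
```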

Let's have a closer look at what happens if a user wants to execute a workflow on the server using the distributed executors.

  • First the server receives the request (from either of the available interfaces) and then creates a message saying that workflow X should be loaded for user Y. This message is sent to RabbitMQ.
  • This message is now forwarded to one of the executors. The executor can then decide whether to accept the message or not. Reasons for rejection could be too much load on the executor or missing capabilities for the job. If the message is rejected, it is offered to the next executor in a round-robin fashion.
  • If an executor has accepted the message, it loads the workflow, acknowledges the load message, and finally sends back a status message to the server via RabbitMQ.
  • This last message is processed by a dedicated queue for this job only, so that messages belonging to a particular job are not distributed to all executors. If the executor had died between accepting the message and acknowledging it, the message would be put back in the queue and distributed to another executor.

If an executor requires data from the server, e.g. the workflow itself or data files required by a workflow job, then the REST interface is used. The message to load a workflow also contains the server's address; therefore the executor knows where to get the data from. The main reason for this split between message queue and REST is that message queueing systems are optimized for small messages, while workflows or data files can in principle be quite large. Figure 3 shows a sketch of the final setup.
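The accept/reject and acknowledge semantics described above can also be sketched in a few lines. Again, this is purely illustrative and not KNIME's code: the load limit and the load_workflow helper are hypothetical placeholders.

```python
import json

MAX_JOBS = 4          # hypothetical capacity limit of this executor
running_jobs = 0

def on_job_message(ch, method, properties, body):
    """Executor-side callback: accept a job if there is capacity, otherwise
    reject it so RabbitMQ offers it to the next executor (round-robin)."""
    global running_jobs
    job = json.loads(body)
    if running_jobs >= MAX_JOBS:
        ch.basic_nack(delivery_tag=method.delivery_tag, requeue=True)
        return
    running_jobs += 1
    load_workflow(job["workflow"], job["user"])    # hypothetical helper
    # Acknowledge only after loading; if the executor dies before this point,
    # the unacknowledged message is requeued and handed to another executor.
    ch.basic_ack(delivery_tag=method.delivery_tag)
```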

Figure 3. Sketch of the communication between server and executors

 

Try it!

Since the changes required to decouple the executors from the server are rather large, we have not yet finished implementing this new generation of KNIME Server. However, a large part of the distributed executor functionality is already available. Since quite a few customers are interested, we have prepared a preview of KNIME Server 5 that you can try right now!

If you are an existing customer, go to the commercial products download page for the 2017-07 release. Under the downloads section for KNIME Server 4.5, you will find the links to the downloads and documentation for the KNIME Server 5 preview. The documentation also contains a list with available and still missing functionalities. If you are not a KNIME Server user yet but are interested in trying out the distributed executors, just contact us.

Our current plan is to release another preview for the KNIME Server 5 with its distributed executors around October and have the final version ready for the traditional December release. Next summer KNIME Server 5 will replace the KNIME Server 4.x release line.


Seven Modes of Deployment

Posted by gnu on Mon, 08/21/2017 - 09:00

Here's a familiar predicament: you have the data you want to analyze, and you have a trained model to analyze them. Now what? How do you deploy your model to analyze your data?

In this video we will look at seven ways of deploying a model with KNIME Analytics Platform and KNIME Server. This list has been prepared with an eye toward where the output of the deployment workflow goes:

  • to a file or database
  • to JSON via REST API
  • to a dashboard via KNIME's WebPortal
  • to a report and to email
  • to SQL execution via SQL recoding
  • to Java byte code execution via Java recoding
  • to an external application

Once you know these options, you will also know which one best satisfies your needs.
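As a taste of the REST option in the list above, calling a workflow that has been deployed as a REST service boils down to a single HTTP request. The snippet below is only a sketch: the server URL, repository path, payload fields, and credentials are placeholders you would replace with your own, and the exact endpoint format depends on your KNIME Server version.

```python
import requests

# Placeholder server URL and workflow path; adjust to your own KNIME Server setup
url = ("https://your-knime-server/knime/rest/v4/repository/"
       "Deployment/Churn_Predictor:execution")

payload = {"customer-data": [{"age": 42, "contract": "monthly"}]}   # illustrative input
response = requests.post(url, json=payload, auth=("user", "password"))
print(response.status_code, response.json())
```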

The workflows used in the video can be found on the KNIME EXAMPLES server under 50_Applications/27_Deployment_Options*.

 


* The link will open the workflow directly in KNIME Analytics Platform (requirements: Windows; KNIME Analytics Platform must be installed with the Installer version 3.2.0 or higher)

Scaling Analytics with KNIME Big Data Extension

Posted by Vincenzo on Fri, 08/25/2017 - 12:06

We built a workflow to train a model. It works fast enough on our local, maybe not so powerful, machine. So far.

The data set is growing. Each month a considerable number of new records is added. Each month the training workflow becomes slower. Shall we start to think of scalability? Shall we consider big data platforms? Could my neat and elegant KNIME workflow be replicated on a big data platform? Indeed it can.

The KNIME Big Data Extension offers nodes to build and configure workflows that run on the big data platform of your choice. A particularly nice feature of the KNIME Big Data Extension is the node GUI: the configuration window of each Big Data node has been built to be as similar as possible to the configuration window of the corresponding KNIME node. The configuration window of a Spark Joiner node looks exactly the same as the configuration window of a Joiner node.

Thus, it is not only possible to replicate your original workflow on a Big Data platform, it is also extremely easy, since you do not need to learn new scripting languages or tool-specific instructions. The KNIME Big Data Extension brings the ease of use of KNIME to the scalability of Big Data.
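To appreciate what the node GUI saves you from writing, here is roughly what a single Spark Joiner node corresponds to in PySpark code; the input paths and join column are illustrative assumptions, and the KNIME node produces the equivalent Spark operation for you.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("joiner-example").getOrCreate()

# Illustrative input paths and join column
customers = spark.read.parquet("hdfs:///data/customers")
contracts = spark.read.parquet("hdfs:///data/contracts")

# What an inner join configured in a Spark Joiner node boils down to
joined = customers.join(contracts, on="customer_id", how="inner")
joined.show(5)
```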

This video shows how we replicated an existing classical analytics workflow on a Big Data Platform.

The workflows used in the video can be found on the KNIME EXAMPLES server under 50_Applications/28_Predicting_Departure_Delays/02_Scaling_Analytics_w_BigData*.


* The link will open the workflow directly in KNIME Analytics Platform (requirements: Windows; KNIME Analytics Platform must be installed with the Installer version 3.2.0 or higher)

A Touch of Azure in Your KNIME Workflow

Posted by rs on Mon, 09/04/2017 - 12:02

The latest release of KNIME Analytics Platform 3.4 has produced many new features, nodes, integrations, and example workflows. This is all to give you a better all-round experience in data science, enterprise operations, usability, learning, and scalability.

Now, when we talk about scalability, the cloud often comes to mind. When we talk about the cloud, Microsoft Azure often comes to mind. That is the reason why KNIME has been integrating some of the Azure products and services.

The novelty of this latest release is the example material. If you currently access (or want to access in the future) some Microsoft products on the cloud from your KNIME workflow, you can start by having a look at the 11_Partners/01_Microsoft folder on the EXAMPLES server and at the following link on the KNIME Node Guide: https://www.knime.com/nodeguide/partners/microsoft.

A little note for the neophytes among us. The KNIME EXAMPLES server is a public KNIME server hosting a constantly growing number of example workflows (see the YouTube video “KNIME EXAMPLES Server”). If you are new to a topic, let’s say “churn prediction”, and you are looking for a quick starting point, you could access the EXAMPLES server in the top left corner inside the KNIME workbench, download the example workflow in 50_Applications/18_Churn_Prediction (50_Applications/18_Churn_Prediction/01_Training_a_Churn_Predictor*), and adapt it to your data and specific business problem. It is very easy and one of the most loved features of the KNIME Analytics Platform.

A special section (11_Partners/01_Microsoft) on the EXAMPLES server is dedicated to accessing and using Microsoft products. Let’s see what we find in there!

Figure 1. Five tutorial workflows on Microsoft products available for download on the KNIME EXAMPLES Server and released in July 2017 with KNIME Analytics Platform 3.4. These tutorial workflows show how to access MS SQL Server, HDInsight Hive, and BlobStorage and how to run SQL queries for In-Database processing, train models using the HDInsight Spark ML library, and generically execute Microsoft R code.

 
  1. MS SQL Server. The first example workflow in the folder (11_Partners/01_Microsoft/01_SQL_Server_InDB_Processing(Azure)*) shows how to run a few in-database operations on a MS SQL Server. Connection to the database happens through a dedicated connector node, named SQL Server Connector. In-database processing operations are implemented via dedicated SQL nodes, using a GUI in their configuration window, or via a SQL Query node for free SQL code.

Note. The SQL Server Connector node would connect equally well to Microsoft DW as it does to MS SQL Server, provided that the right JDBC driver is installed. So this example workflow actually works for SQL Server and DW at the same time.

 

  2. BlobStorage. The second example workflow (11_Partners/01_Microsoft/02_SQLServer_BlobStorage_andKNIMEModels*) offers a data blending example. It takes data from MS SQL Server, like in the first example workflow above, and data from BlobStorage. The Azure BlobStore Connection node connects to a BlobStorage installation on the Azure cloud; the Azure BlobStore File Picker node allows for exploration of the BlobStorage repository and for the selection of one file. The full file path is then passed via flow variable to a more classic File Reader node.

 

Figure 2. Tutorial workflow 02.SQLServer_BlobStorage_and KNIMEModels shows how to connect to Azure BlobStorage data repository

  3. HDInsight Hive. Example workflow 11_Partners/01_Microsoft/03_HDI_Hive_KNIME* accesses Hive in an Azure HDInsight installation. Similarly to the SQL Server example, a dedicated Hive Connector node connects to the Hive database and a number of in-database processing nodes prepare the data to train a Decision Tree Learner node, which is a KNIME native node.
  4. HDInsight Spark. 11_Partners/01_Microsoft/04_HDI_Hive_Spark* conceptually implements exactly the same workflow as 03.HDI_Hive_KNIME, but here the mix and match effect is obtained by inserting Spark based model training nodes rather than KNIME native nodes. Data transfer from Hive to Spark and RDD creation happen through the Hive To Spark node. Data transfer and RDD creation can also happen through a variety of other available nodes, like Database to Spark, JSON To Spark, and Parquet To Spark. In this example, we also train a Spark Decision Tree. Many more data mining algorithms are available from the Spark Machine Learning library. Just type “Spark” in the search box above the Node Repository to fully explore the KNIME Spark integration.

Figure 3. Tutorial workflow 04.HDI_Hive_Spark shows the Hive Connector node, a dedicated connector for Hive access, and a number of Spark based nodes to run Machine Learning algorithms on an HDInsight Spark platform.

  5. Microsoft R. We have reached the last example workflow (11_Partners/01_Microsoft/05_Predict_DepartureDelays_with_MicrosoftR*) of the batch produced with the latest release of KNIME Analytics Platform 3.4. The KNIME R integration fully supports Microsoft R. It provides a full set of nodes to write R scripts, train and/or apply R models, and produce R graphs. It is sufficient to point the R executable to a Microsoft R installation in the KNIME Preferences page. The following video explains step by step how to configure and use the Microsoft R integration within KNIME Analytics Platform.

 

Video 1. YouTube tutorial video on the KNIME Microsoft R integration.

 

The 05.Predict_DepartureDelays_with_MicrosoftR workflow trains a model to predict whether a flight will be delayed at departure when leaving from ORD airport. It shows this same operation 3 times.

  • “Pure Microsoft R” uses only R code written in the R Snippet node.
  • “Hybrid Microsoft R & KNIME” mixes R code based nodes with KNIME GUI based nodes
  • “Pure KNIME Analytics Platform” runs the whole procedure again using only KNIME nodes

Indeed, not all data scientists are equally expert in R. For the R coding wizards, the standalone implementation using the R Snippet node as R editor is probably the preferred option. For those less expert in R, however, the other two implementations might be more suitable. Indeed, breaking up the R code into smaller, easier pieces and mixing and matching it with KNIME nodes might make the whole R coding experience more pleasant for most of us.

Figure 4. Tutorial workflow 05_Predict_DepartureDelays_with_MicrosoftR trains a model on the airline dataset in 3 ways. Purely using an R script; mixing and matching R code segments and KNIME nodes; purely using KNIME nodes.

With the ease of use and the extensive coverage of data science algorithms of KNIME Analytics Platform, with the collaboration and production features of the KNIME Server, with the versatility of the Azure platform, with the completeness of the Microsoft data repository offer, the work of a data scientist can become quicker, more efficient, and ultimately more productive.

A touch of Azure in your workflow might just be a welcome addition!

More information is available at the following links:


* The link will open the workflow directly in KNIME Analytics Platform (requirements: Windows; KNIME Analytics Platform must be installed with the Installer version 3.2.0 or higher)

Learning Deep Learning. A tutorial on KNIME Deeplearning4J Integration

Posted by jonfuller on Mon, 09/11/2017 - 09:10

Introduction

The aim of this blog post is to highlight some of the key features of the KNIME Deeplearning4J (DL4J) integration, and help newcomers to either Deep Learning or KNIME to be able to take their first steps with Deep Learning in KNIME Analytics Platform.

Useful Links

If you’re new to KNIME, here is a link to get familiar with the KNIME Analytics Platform:
https://www.knime.com/knime-online-self-training

If you’re new to Deep Learning, there are plenty of resources on the web, but these two worked well for me:
https://deeplearning4j.org/neuralnet-overview
http://playground.tensorflow.org/

If you are new to the KNIME nodes for deep learning, you can read more in the relevant section of the Node Guide:
https://www.knime.com/nodeguide/analytics/deep-learning

With a little bit of patience, you can run the example provided in this blog post on your laptop, since it uses a small dataset and only a few neural net layers. However, Deep Learning is a poster child for using GPUs to accelerate expensive computations. Fortunately DL4J includes GPU acceleration, which can be enabled within the KNIME Analytics Platform.

If you don’t happen to have a good GPU available, a particularly easy way to get access to one is to use a GPU-enabled KNIME Cloud Analytics Platform, which is the cloud version of KNIME Analytics Platform.

In the addendum at the end of this post we explain how to enable KNIME Analytics Platform to run deep learning on GPUs either on your machine or on the cloud for better performance.

Getting started

We will use the MNIST dataset. The MNIST dataset consists of handwritten digits, from 0 to 9, as 28x28 pixel grayscale images. There is a training set of 60,000 images and a test set of 10,000 images. The data are available from:
http://yann.lecun.com/exdb/mnist/

Our workflow downloads the datasets, un-compresses them, and converts them to two CSV files: one for the training set, one for the test set. We then read in the CSV files, convert the image content to KNIME image cells, and then use the KNIME DL4J nodes to build a variety of classifiers to predict which number is present in each image.

We aim at an accuracy of >95%, in line with the error rates listed in the original article by LeCun et al.

The workflows built throughout this blog post are available on the KNIME EXAMPLES Server under 04_Analytics/14_Deep_Learning/14_MNIST-DL4J-Intro*. This workflow group consists of:

  • The workflow named DL4J-MNIST-LeNet-Digit-Classifier
  • The “data” workflow group to contain the downloaded files
  • The “metanodes” workflow group which contains the three metanode templates used in the example.

The workflow named DL4J-MNIST-LeNet-Digit-Classifier (Fig. 1) actually consists of 2 workflows: the top one uses a simpler net architecture, while the lower one uses a 5 layer net architecture.

Figure 1. Workflows created during this blog post and available on the KNIME EXAMPLES server under 04_Analytics/14_Deep_Learning/14_MNIST-DL4J-Intro. This workflow actually consists of 2 workflows: the top one using a simpler deep learning architecture; the lower one using a 5 layer deep learning network.

We ran the whole experiment using a KNIME Cloud Analytics Platform running on an Azure NC6 instance. Of course, you can equally run it on an Amazon Cloud p2.xlarge instance or your local machine.

Beware. This workflow could take around an hour to run, depending on whether you have a fast GPU, and how powerful your machine is!

Required Installations:

Tools:

  1. KNIME Analytics Platform 3.3.1 (or greater) on your machine
    OR
    KNIME Cloud Analytics Platform on Azure Cloud
    OR
    KNIME Cloud Analytics Platform on AWS Cloud
  2. Python 2.7.x configured for use with KNIME Analytics Platform:
    https://www.knime.org/blog/how-to-setup-the-python-extension

Extensions:

  • KNIME Deeplearning4J extension from KNIME Labs Extensions/KNIME Deeplearning4J Integration (64 bit only).
  • KNIME Image Processing extension from the KNIME Community Contributions - Image Processing and Analysis
  • KNIME Image Processing - Deep Learning 4J Integration
  • Vernalis KNIME Nodes from KNIME Community Contributions - Cheminformatics
  • KNIME File Handling Nodes and KNIME Python Integration from KNIME & Extensions

If you are running KNIME Analytics Platform on your machine:

Optionally, if you have GPUs:

Importing the image data

Often when working with images, it is possible to read them directly in KNIME Analytics Platform from standard formats like PNG, JPG, TIFF. Unfortunately for us, the MNIST dataset is only available in a non-standard binary format. Luckily, it is straightforward to download the dataset and convert the files to a CSV format that can be easily read into KNIME.

The data import is handled by the ‘Download dataset and convert to CSV’ metanode. Here the data files are downloaded from the LeCun website and written to the folder named “data” contained in the workflow group 14_MNIST-DL4J-Intro that you have downloaded from the EXAMPLES server.

In order to extract the pixel values for each image, we use a Python Source node to read the binary files and output to two CSV files (mnist_test.csv, mnist_train.csv). We have implemented an IF statement that only downloads and converts the files if the mnist_test.csv and mnist_train.csv files do not already exist; there’s no sense doing that download twice!
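For the curious, the conversion script inside the Python Source node does something along the lines of the following sketch, which reads the IDX-formatted MNIST files and writes them out as a CSV file. The exact code in the workflow may differ (for example, the workflow un-compresses the files first); the file names below are the standard names from the LeCun download page.

```python
import gzip
import struct
import numpy as np
import pandas as pd

def read_idx_images(path):
    # IDX image file: magic, n_images, n_rows, n_cols (big-endian int32), then uint8 pixels
    with gzip.open(path, "rb") as f:
        _, n, rows, cols = struct.unpack(">IIII", f.read(16))
        pixels = np.frombuffer(f.read(), dtype=np.uint8)
    return pixels.reshape(n, rows * cols)

def read_idx_labels(path):
    # IDX label file: magic, n_labels (big-endian int32), then uint8 labels
    with gzip.open(path, "rb") as f:
        _, n = struct.unpack(">II", f.read(8))
        return np.frombuffer(f.read(), dtype=np.uint8)

images = read_idx_images("train-images-idx3-ubyte.gz")
labels = read_idx_labels("train-labels-idx1-ubyte.gz")

df = pd.DataFrame(images)
df.insert(0, "Target", labels)
df.to_csv("mnist_train.csv", index=False)
```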

Figure 2. Content of the metanode named "Download dataset and convert to CSV". This metanode downloads the data files and writes them to a local "data" folder. Then through a Python Source node converts the binary pixel content into a CSV file.

Now we have numerical columns representing images. In the wrapped metanode named “Normalize images (train)” (Fig. 3), a File Reader reads the numerical columns and normalizes them with a Normalizer node.

The conversion back into binary images is obtained via the Data Row to Image node from the KNIME Image Processing extension. The Deeplearning4J (DL4J) integration in KNIME can handle numbers, strings, collections, or images (when the KNIME Image Processing - Deep Learning 4J extension is installed from the stable community update site) as input features.

It is important to randomize the order of the input rows, in order not to bias the model training with the input sequence structure. For that, we used the ‘Shuffle’ node.

Figure 3. Content of the "Normalize images (train)" wrapped metanode. Notice the execution in streaming mode and the transformation output port for the Normalization model.

The normalization model produced by the Normalizer node is exported from the Wrapped Metanode. We do this so that we can re-apply the same normalization to the test dataset in the wrapped metanode named “Normalize images (test)”.

First Try. A simple network

In addition to the typical KNIME Learner/Predictor schema, the DL4J Learner node requires a network architecture as input for the learning process (Fig. 4 vs. Fig. 5).

Figure 4. Classic Learner/Predictor schema in KNIME Analytics Platform. First the Learner, then the Predictor. That is all you need.

Figure 5. Deep Learning Learner/Predictor schema. The Learner node also requires a neural network architecture as input.

There are two ways to define a network architecture:

  1. Select one from some well-known pre-built network architectures available under KNIME Labs/Deep Learning/Networks in the Node Repository
  2. Build your own neural architecture from scratch.

Figure 6. Deep Learning/Networks sub-category contains a number of pre-built commonly used deep learning architectures.

Since we are experimenting, we will build our own network and we’ll start with a toy network.

We start with the DL4J Model Initializer node. We don’t need to set any options for this node. Next we introduce the Dense Layer node. This time we need to set some options, but let’s stick with the defaults for now, which create the output layer with only one output unit to represent the numbers from 0 to 9, the ReLu activation function, random weight initialization according to the XAVIER strategy, and a low learning rate of 0.1. We have created the simplest possible (and not very deep!) neural network.

Now we can link our training set and the simple neural network architecture to the ‘DL4J Feedforward Learner (Classification)’ node. This learner node needs configuration.

The configuration window of the DL4J Feedforward Learner (Classification) node is somewhat complex, since it requires settings in 5 configuration tabs: Learning Parameter, Global Parameter, Data Parameter, Output Layer Parameter, and Column Selection. In general, there are many options to set; the Deeplearning4J website has some nice hints to help people get started.

The first 2 tabs, “Learning Parameter” and “Global Parameter”, define the learning parameters used to train our network. Since we are just getting started at this stage, we accept the default options: Stochastic Gradient Descent for optimizing the network and Nesterovs as the updater, with a momentum of 0.9. We don’t set any global parameters, which would override the parameters set in the individual network layer node configuration dialogs. We’ll work on tuning the learning parameters in the LeNet workflow that we’ll get to shortly.

The third tab, named “Data Parameter”, defines how data is used to train the model. Here we set Batch Size to 128, Epochs to 15, and Image Size to 28,28,1. Batch size defines the number of images that are passed through the network and used to calculate the error before the network is updated. Larger batch sizes mean a longer wait between updates, but also give the possibility of learning more information with each iteration. Epochs is the number of full passes over the dataset; the choice made here can help guard against under/over-fitting of the data. The image size is the size of the image in pixels (x,y,z).
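To put these numbers into perspective: with 60,000 training images and a batch size of 128, one epoch corresponds to roughly 60,000 / 128 ≈ 469 weight updates, so 15 epochs amount to about 7,000 updates in total.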

Figure 7. The "Data Parameter" tab in the configuration window of the DL4J Feedforward Learner (Classification) node. Batch Size defines the number of images used for each network parameter update and is set to 128 as a trade off between accuracy and speed. Epochs are set to 3 since in this case we want the example to run for only a short time. Image Size defines the size of image on the x, y, z axis in number of pixels.

The ‘Column Selection’ tab contains all information about input columns and target column. The target column is set to column ‘Target’, which contains the number represented in the image. The image column named “AggregatedValues” is used as the input feature.

Figure 8. "Column Selection" tab in the configuration window of the DL4J Feedforward Learner (Classification) node. This tab sets the target column and the input columns.

Finally connecting up the corresponding Predictor and the Scorer nodes we can test the model quality (see upper workflow in figure 1).

To train our first not-so-deep learning model, we need to execute the DL4J Feedforward Learner (Classification) node. The execution of this node can take some time (probably more than 10 minutes). However, it is possible to monitor the learning progress, and even to terminate it early if a suitable model has already been reached. Right-clicking the DL4J Feedforward Learner (Classification) node and selecting ‘View: Learning Status’ from the context menu displays a window with the current training epoch and the corresponding loss (= error) calculated on the whole training set (Fig. 9). If the loss is low enough for our purpose, or if we have simply become impatient, we can hit the “Stop Learning” button to stop the training process.

Once the calculation is complete you can execute the Scorer node to evaluate the model accuracy (Fig. 10).

Figure 9. Learning Status window for the DL4J Feedforward Learner (Classification) node. This window is opened by right-clicking the node and selecting the option “View: Learning Status”. Here you can monitor the learning progress of your deep learning architecture. You can also stop it at any moment by hitting the “Stop Learning” button.

Figure 10. Confusion Matrix and Accuracy of the single-layer neural network trained on the number data set to recognize numbers in images. Notice the disappointing ~35% Accuracy. A single-layer network is not enough?

Did you notice that the accuracy is just a little above 35%? That was a little disappointing! But not entirely unexpected. We didn’t spend any time optimizing the input parameters, since we’re not aiming to find the optimal network architecture, but rather to see how easy it is to reproduce one of the better known complex architectures. It is well known that deep learning networks often require several layers and careful optimization of input parameters. So, in order to go a bit deeper, in the next section we’re going to take the LeNet network that comes pre-packaged in the Node Repository, and use that.

Something closer to what is described by LeCun et al.

We can quickly import a well-known architecture that has been shown to work well for this problem by dragging and dropping the ‘LeNet’ metanode from the Node Repository into the workspace. Double-clicking the “LeNet” metanode lets us take a look at the network topology. We see that there are now five layers defining the network (Fig. 11).

The process of building the network architecture is triggered again by a “DL4J Model Initializer” node, requiring no settings. We then add a Convolution Layer (which applies a convolution between some filter with defined size to each pixel in the image), a Pooling Layer (pooling layers reduce the spatial size of the network - in this case halving the resolution at each application), then again a Convolution Layer, a Pooling Layer, and a Dense Layer (neurons in a dense layer have full connections to all outputs of the previous layer). The result is a 5-layer neural network with mixed types of layers.

Figure 11. LeNet neural network architecture as built in the "LeNet" metanode.

Finally we make a few more changes in order to closely match the parameters originally described in the article by LeCun et al. That means setting the learning rate to 0.001 in the DL4J Feedforward Learner (Classification) node. The Output Layer parameters are a 0.1 learning rate, XAVIER weight initialisation, and the Negative Log Likelihood loss function.

Evaluating the results we can clearly see that adding the layers and tweaking the parameters has made a huge difference in the results. We can now predict the digits with 98.71% accuracy!

Figure 12. Confusion Matrix and Accuracy of a neural network shaped according to the LeNet architecture, that is, introducing 5 hidden mixed type layers in the network architecture. The network is trained again on the number data set to recognize numbers in images. Now we get almost 99% Accuracy. This is much closer to the performance obtained by LeCun et al.

Summary

Deep Learning is a very hot topic in machine learning at the moment, and there are many, many possible use cases. However, you’ll need to spend some time to find the right network topology for your use case and the right parameters for your model. Luckily the KNIME Analytics Platform interface for DL4J makes setting those models up straightforward.

What’s more, the integration with KNIME Image Processing allows you to apply Deep Learning to image analysis, and using the power of GPUs in the cloud, it might not take as long as you think to get started.

To enable KNIME Analytics Platform to run deep learning using GPUs, follow the instructions reported in the final part of the addendum.

Perhaps, more importantly than that, it is also easy to deploy those models using the WebPortal functionality of the KNIME Server, but that discussion is for another blog post…

Tweet me: @JonathanCFuller

References

  1. http://yann.lecun.com/exdb/mnist/
  2. http://yann.lecun.com/exdb/publis/pdf/lecun-98.pdf

Workflows:


Addendum

Configuring KNIME Analytics Platform to run Deeplearning4J on image data, optionally with GPU support and on the Cloud

With KNIME Cloud Analytics Platform pre-configured and available via the AWS and Azure Marketplaces it is straightforward to run DL4J workflows on a GPU.

Note that to see a speedup in your analysis you’ll need to have a modern GPU designed for Deep Learning, which is exactly what the Nvidia K80 GPUs available on Azure NC6 instances and AWS P2 instances are.

If you have a modern GPU on your local machine, you can also run the GPU enabled workflows using a KNIME Analytics Platform installation on your local machine.

In this addendum we offer a step by step guide on what to install and what to enable to run deep learning on a KNIME Analytics Platform, optionally using GPU acceleration and a cloud installation.

Step 1: Install KNIME Analytics Platform

  1. On a Cloud Instance
    Launching KNIME Cloud Analytics Platform differs slightly in each of the marketplaces, but the following documents show how to launch an instance.
    1. Azure NC6 instances
      Note that at the time of writing NC6 instances are only available in the following regions: East US, South Central US, West US 2. See here for latest availability: https://azure.microsoft.com/en-us/regions/services.
      Note also that the KNIME DL4J integration currently only supports single GPU machines, so choose NC6 instances rather than the instances with multiple GPUs.
    2. AWS P2.xlarge instances.
  2. On your Machine.
    Follow these instructions for:

Step 2: Install required Extensions

When you load a workflow, KNIME Analytics Platform checks that all required extensions are installed. If any are missing, you will be prompted to install them (Fig. 13). Choosing Yes will start the installation process for the missing extensions. We recommend installing all missing extensions for this workflow (Fig. 14).

Note. Missing nodes in a workflow are visualized via a red and white grid icon. If your workflow still has missing nodes and you are prompted to save it, you should choose ‘Don’t save’.

Figure 13. Prompt to install missing KNIME extensions.

Figure 14. Recommended KNIME Extensions to install to run the workflow described in this blog post

If you are not prompted or if you have saved the workflow with missing nodes, you can still install the missing extensions manually. Following the instructions in this YouTube video https://youtu.be/8HMx3mjJXiw, install:

  • KNIME Deeplearning4J extension from KNIME Labs Extensions/KNIME Deeplearning4J Integration (64 bit only).
  • KNIME Image Processing extension from the whole KNIME Community Contributions - Image Processing and Analysis
  • Vernalis KNIME Nodes from KNIME Community Contributions - Cheminformatics
  • KNIME File Handling Nodes and KNIME Python Integration from KNIME & Extensions

Note that if your extension is already installed you will probably not see it in the list of installable packages.

After installation, you should have the following categories in the Node Repository: Deep Learning under KNIME Labs, KNIME Image Processing and Vernalis under Community Nodes, Python under Scripting, File Handling under IO.

In order to use the KNIME Python integration, you need Python 2.7.x configured for use with KNIME Analytics Platform. Follow the instructions described in this blog post: https://www.knime.org/blog/how-to-setup-the-python-extension

Step 3: Install KNIME Image Processing support for DL4J Integration (not required for KNIME Cloud Analytics Platform)

If you rely on existing installations of KNIME extensions, it might happen that you are still missing the specific KNIME Image Processing support for the DL4J Integration, since this is quite a new component.

To manually install the KNIME Image Processing support for DL4J Integration extension, follow these steps.

  1. File > Preferences, and enable the ‘Stable Community Contributions Update Site’

Figure 15. Enable the "Stable Community Contributions Update Site"; it is not enabled by default.

  2. File > Install KNIME Extensions…

Figure 16. "Install KNIME Extensions ..." option from File Menu

  3. Install KNIME Image Processing – Deeplearning4J Integration

Figure 17. Install KNIME Image Processing - Deeplearning4J Integration

Step 4: Enable GPU acceleration (Optional):

  1. Configure CUDA support
    Once you’ve logged into your newly launched KNIME Cloud Analytics Platform instance, you’ll need to install the CUDA libraries required for the DL4J GPU support. That is described here:
    http://docs.nvidia.com/cuda/cuda-installation-guide-microsoft-windows/index.html#axzz4ZcwJvqYi
    Use this download link:
    https://developer.nvidia.com/compute/cuda/8.0/Prod2/local_installers/cuda_8.0.61_win10-exe
  2. Enable KNIME Analytics Platform to use GPUs
    Finally you’ll need to enable the Deeplearning4J Integration GPU support. This can be found under File > Preferences, by searching for Deeplearning4J Integration. You’ll need to check the box ‘Use GPU for calculations?’. Don’t restart your Analytics Platform just yet; there is one more step required.

Figure 18. Enable usage of GPU for deep learning in the Preferences page

Step 5: Restart KNIME Analytics Platform and execute the workflow!

 


* The link will open the workflow directly in KNIME Analytics Platform (requirements: Windows; KNIME Analytics Platform must be installed with the Installer version 3.2.0 or higher)

The Seven Steps to Model Management

Posted by berthold on Mon, 09/18/2017 - 14:20

We all know that just building a model is not the end of the line. However, deploying the model to put it into production is often not the end of the story either, although a complex story in itself (see our previous blog post on “The 7 Ways of Deployment”). Data scientists are increasingly often also tasked with the challenge of regularly monitoring, fine tuning, updating, retraining, replacing, and jump-starting models - and sometimes even hundreds or thousands of models at once.

In the following, we describe, in increasing complexity, different flavors of model management starting with the management of single models through to building an entire model factory.

Step 1. Models in Action: Deployment

We need to start with actually putting the model into production, e.g. how do we use the result of our training procedure to score new incoming data. We will not dive into this issue here, as it was covered in a separate blog post already. To briefly recap: we have many options such as scoring within the same system that was used for training, exporting models in standardized formats, such as PMML, or pushing models into other systems, such as scoring models converted to SQL within a database or compiling models for processing in an entirely different runtime environment. From the model management perspective, we just need to be able to support all required options.

It is important to point out that in reality very often the model alone is not very helpful unless at least part of the data processing (transformation/integration) is a part of the “model” in production. This is where many deployment options show surprising weaknesses in that they only support deployment of the predictive model alone.

To get a visual analogy started that we will use throughout this post, let us depict what this simple standard process looks like:

 

Step 2. Models under Observation: Monitoring

Next up is a topic that is critical for any type of model management: continuously making sure our model keeps performing as it should. We can do this on statically collected data from the past, but that only allows us to ensure the model does not suddenly change. More often, we will monitor recently collected data, which allows us to measure whether the model is starting to become outdated because reality has changed (this is often referred to as model drift, which is ironic since it is reality that drifts, not the model). Sometimes it is also advisable to include manually annotated data in this monitoring data set, to test border cases or simply to make sure the model is not making gross mistakes.

In the end, this model evaluation step results in a score for our model, measuring some form of accuracy. What we do with that score is another story: we can simply alert the user that something is off, of course. Real model management will automatically update the model, which we will discuss in the next section.

Step 3. Models Revisited: Updating and Retraining

Now it is getting more interesting and much more like actually managing something: how do we regularly perform model updates to ensure that we incorporate the new reality when our monitoring stage reports increasing errors? We have a few options here. We can trigger automatic model updating, retraining, or complete replacement. Usually we will allow for a certain tolerance before doing something, as illustrated below:

Some model management setups simply train a new model and then deploy it. However, since training can take significant resources and time, the more sensible approach is to make this switch dependent on performance and ensure that it is worth replacing the existing model. In that case an evaluation procedure will take the previous model (often called the champion) and the newly (re)trained model (the challenger), score them and decide whether the new model should be deployed or the old one be kept in place. In some cases, we may only want to go through the hassle of model deployment when the new model significantly outperforms the old one, too!
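Put into (deliberately simplified) code, the monitor / retrain / champion-challenger logic looks something like the sketch below. The thresholds and the accuracy, retrain, and deploy helpers are hypothetical placeholders for whatever scoring, training, and deployment steps your setup uses.

```python
TOLERANCE = 0.05         # how much the score may drop before we react at all
REQUIRED_MARGIN = 0.01   # how much better the challenger must be to replace the champion

def manage_model(champion, baseline_score, recent_data, training_data):
    """One pass of the model management loop (illustrative only)."""
    current = accuracy(champion, recent_data)            # monitoring step (hypothetical helper)
    if current >= baseline_score - TOLERANCE:
        return champion                                  # still good enough, keep it

    challenger = retrain(training_data)                  # retraining step (hypothetical helper)
    if accuracy(challenger, recent_data) > current + REQUIRED_MARGIN:
        deploy(challenger)                               # champion/challenger decision (hypothetical helper)
        return challenger
    return champion                                      # not worth replacing yet
```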

Note that all of the options described above will struggle with seasonality if we do not take precautions elsewhere in our management system. If we are predicting sales quotes of clothing, seasons will affect those predictions most dramatically. But if we then monitor and retrain on, say, a monthly basis we will, year after year, train our models to adjust to the current season. In a scenario such as this, the user could manually set up a mix of seasonal models that are weighted differently, depending on the season. We can attempt to automatically detect seasonality but usually these are known effects and therefore can be injected manually.

Another note on preexisting knowledge: sometimes models need to guarantee specific behavior for certain cases (border cases or just making sure that standard procedures are in place). Injecting expert knowledge into model learning is one aspect but in most cases, simply having a separate rule model in place that can override the output of the trained model is the more transparent solution.

Some models can be updated, e.g. we can feed in new data points and adjust the model to also incorporate them into the overall model structure. A word of warning, though: many of these algorithms tend to be forgetful, that is, data from a long time ago will play less and less of a role for the model parameters. This is sometimes desirable but even then, it is hard to properly adjust the rate of forgetting.

It is less complex to simply retrain a model, that is, build a new model from scratch. Then we can use an appropriate data sampling (and scoring) strategy to make sure the new model is trained on the right mix of past and more recent data.

To continue our little visualization exercise, let us summarize those steps of the model management process in a diagram as well:

 

Step 4. More Models: From a Bunch…

Now we are reaching the point where it gets interesting: we want to continuously monitor and update/retrain an entire set of models.

Obviously, we can simply handle this like the previous case, just with more than one model. Note, however, that new issues arise that are connected to the interface and the actual management: how do we communicate the status of many models to the user and let her interact with them (for instance, forcing a retraining to take place even though the performance threshold was not passed)? And who controls the execution of all those processes?

Let us start with the latter – most tools allow their internals to be exposed as services, so we can envision a separate program making sure our individual model management processes are being called properly. Here at KNIME we use a management workflow to do that work – it simply calls out to the individual process workflows and makes sure they execute in order.

For the controller dashboard we can either build a separate application or, again, use KNIME software. With KNIME’s WebPortal and the built-in reporting capabilities, we can not only orchestrate the modeling workflows but also supervise and summarize their outputs.

In most setups there will be some sort of configuration file to make sure that some process workflows are called daily and others weekly, and perhaps even to control from the outside which retraining and evaluation strategies are being used – but we are getting ahead of ourselves…
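The shape of such a configuration could look roughly like the dictionary below; the model names, schedules, and strategy labels are invented purely to show the idea:

```python
# Possible shape of a configuration driving the model processes (illustrative).
MODEL_PROCESSES = {
    "churn_model": {
        "schedule": "daily",
        "retrain_strategy": "from_scratch",
        "evaluation": "accuracy_with_2pct_margin",
    },
    "used_car_prices": {
        "schedule": "weekly",
        "retrain_strategy": "incremental_update",
        "evaluation": "rmse_champion_challenger",
    },
}
```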

Step 5. …to Families

Handling bunches of models gets even more interesting when we can group them into different model families. We can then treat models similarly if they predict very similar behavior (say, they are all supposed to predict future prices of used cars). One issue in particular is interesting to cover in this context: if models are fairly similar, we can initialize a new model from the other models rather than starting from scratch or training the new model only on isolated past data. For initialization we can use either the most similar model (determined by some measure of similarity of the objects under observation – used cars, in our example) or a mix of models.
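A sketch of the "most similar model" variant: each family member is described by a small profile vector, and the new model process starts from the member whose profile is closest. The profile contents and the data structure are assumptions for illustration.

```python
import numpy as np

def most_similar_model(new_profile, family):
    """Pick the family member whose object profile is closest to the new one.

    `family` maps model ids to (profile_vector, trained_model) pairs; the
    profile could describe, e.g., a used-car segment (age, mileage, price band).
    """
    def distance(model_id):
        profile, _ = family[model_id]
        return np.linalg.norm(np.asarray(profile) - np.asarray(new_profile))

    best_id = min(family, key=distance)
    return family[best_id][1]  # use this model to initialize the new one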

Again, let us visualize this setup:

As you can see, we cluster our models into groups, or model families, and can use a shared initialization strategy for new models added to a specific family. In all probability, some of the overall management (frequency of running those model processes, for instance) can also be shared across families.

Step 6. Dynasties: Model Factories

Now we are only one step away from generalizing this setup to a generic Model Factory. If we look at the diagram above, we see that the steps are rather similar – if we abstract the interfaces between them sufficiently, we should be able to mix and match at will. This allows newly added models to reuse load, transformation, (re)training, evaluation, and deployment strategies and to combine them in arbitrary ways. For each model we then simply need to define which specific process step is used in each stage of this generic model management pipeline. The following diagram shows how this works:

We may have only two different ways to deploy a model, e.g. as a PMML document or as a web service, but a dozen different ways to access data that we want to combine with five different ways of training a model. If we had to split this into different families of model processes, we would end up with over a hundred variations (2 × 12 × 5 = 120). Using Model Factories, we need to define only the individual pieces (“process steps”) and combine them in flexible ways, defined in a configuration file for example. If somebody later wanted to alter the data access or the preferred model deployment, we would only need to adjust that particular process step rather than fix all processes that use it.
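The mix-and-match idea in miniature: register a few interchangeable implementations per stage and compose a pipeline from a per-model configuration. The registries and step names below are toy stand-ins, not real KNIME components.

```python
# Registries of interchangeable process-step implementations (illustrative).
LOADERS = {"csv": lambda cfg: "rows from csv", "database": lambda cfg: "rows from db"}
TRAINERS = {"from_scratch": lambda data: "new model", "incremental": lambda data: "updated model"}
DEPLOYERS = {"pmml": lambda model: "pmml document", "webservice": lambda model: "service endpoint"}

def run_pipeline(cfg):
    """Compose one model process from the steps named in its configuration."""
    data = LOADERS[cfg["loader"]](cfg)
    model = TRAINERS[cfg["trainer"]](data)
    return DEPLOYERS[cfg["deployer"]](model)

print(run_pipeline({"loader": "csv", "trainer": "from_scratch", "deployer": "pmml"}))
```

Changing the preferred deployment then means swapping a single entry in the configuration, not rewriting every pipeline that uses it.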

Note that it often makes sense to split the evaluation step in two: the part that computes the score of a model and the part that decides what to do with that score. The latter can include different strategies for handling champion/challenger scenarios and is independent of how the actual score is computed. In our KNIME implementation, described in the next step, we chose to follow this split as well.
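Kept separate, the two halves might look like this minimal sketch; the use of mean squared error and the 5% improvement threshold are assumptions:

```python
from sklearn.metrics import mean_squared_error

def score_model(model, X_eval, y_eval):
    """Part 1: compute a score; knows nothing about the deployment policy."""
    return mean_squared_error(y_eval, model.predict(X_eval))

def decide(champion_score, challenger_score, min_improvement=0.05):
    """Part 2: champion/challenger policy; knows nothing about how the score was computed."""
    if challenger_score < champion_score * (1 - min_improvement):  # lower MSE is better
        return "deploy_challenger"
    return "keep_champion"
```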

Step 7. Putting Things to Work: The KNIME Model Factory

Using KNIME workflows as an orchestration vehicle, putting a Model Factory to work is actually straightforward: configuration setups define which incarnation of each process step is used for each model pipeline. For each model, we can automatically compare past and current performance and trigger retraining or updating, depending on its configuration.

Our recent white paper (“The KNIME Model Factory”) describes this in detail. KNIME workflows represent the process steps and the process pipeline, and also define the UI for data scientists, allowing model processes to be edited, added, and modified using the KNIME WebPortal. You can download the workflow orchestrating all of this, as well as an example setup of workflows modeling the process steps, from our EXAMPLES Server under 50_Applications/26_Model_Process_Management. If you want to put it into production, running the workflow on a KNIME Server also gives you an overview dashboard to monitor, override, and add models using the KNIME WebPortal.
