
KNIME for Open-Source BioImage Analysis - A Tutorial

Christian Dietz and Michael R. Berthold

Christian Dietz
University of Konstanz, Chair for Bioinformatics and Information Mining, Universitaetsstrasse 10, 78464 Konstanz, Germany, e-mail: [email protected]

Michael R. Berthold
University of Konstanz, Chair for Bioinformatics and Information Mining, Universitaetsstrasse 10, 78464 Konstanz, Germany, e-mail: [email protected]

Abstract The open analytics platform KNIME is a modular environment that enables easy visual assembly and interactive execution of workflows. KNIME is already widely used in various areas of research, for instance in cheminformatics or classical data analysis. This tutorial introduces the KNIME Image Processing Extension, which adds the capability to process and analyze huge amounts of images. In combination with other KNIME extensions, KNIME Image Processing opens up new possibilities for inter-domain analysis of image data in an understandable and reproducible way.

1 Introduction

As imaging techniques constantly improve, research involves recording increasing numbers of images every day, making images key to life science research. Advanced microscopy allows the acquisition of multidimensional images almost without any user interaction and can therefore generate a plethora of heterogeneous image data. However, to make sense of the generated image data and finally draw conclusions, an exhaustive analysis of the images has to be conducted. In addition to classical image processing techniques, more sophisticated algorithms from the fields of machine learning and data mining are increasingly being applied (Eliceiri et al, 2012). The extracted information is then further analysed with established statistical analysis techniques. For instance, detecting objects within images (i.e. segmentation) and the detailed statistical evaluation of the collected results are

essential stages of a typical image analysis process (Saha et al, 2013; Ljosa et al, 2012; Aligeti et al, 2014). To fully exploit the outcome, an appropriate visualization of the information or a linkage to information sources from other domains may be necessary to gain new insights.

A large number of monolithic and highly task-oriented software solutions have been proposed to tackle the problems that occur in each step of bio-image analysis tasks (Eliceiri et al, 2012). As a result, researchers are required to choose from a set of stand-alone tools, which have to be orchestrated to solve the given task. Typically, two approaches are used to link these kinds of tools: one is to transfer the data manually between the tools, while the other involves writing a customized program or script to automate a particular process. However, these approaches typically lead to a number of critical problems. Transferring the data manually involves a human being and is therefore time-consuming and does not scale with the number of acquired images. Customized scripts are prone to errors. Furthermore, results calculated with these highly problem-specific scripts frequently cannot be reproduced or reused by others.

A straightforward, but infeasible solution to the described problems is to build a single monolithic platform that covers the complete range of functionalities required by a bio-image analysis workflow. However, future demands are as yet unknown, and therefore a closed, proprietary software solution does not scale with the new requirements that evolve with technological advance.
Therefore, the open-source community has realized the great need for, and benefit of, closer cooperation by fostering interoperability among individual projects and open, extensible platforms. Following this approach, the open-source analytics platform KNIME (Berthold et al, 2008) provides the ability to seamlessly integrate a diverse and powerful collection of existing software tools and libraries. KNIME is a user-friendly and comprehensive open-source data integration, processing, analysis, and exploration platform designed to handle large amounts of heterogeneous data. It has been developed since 2006 and is used by professionals in industry and academia. As an integration platform, KNIME directly combines the advantages of several different tools and domains. The integrated tools are encapsulated in KNIME nodes, the basic processing units in KNIME, which in turn can be combined to form so-called workflows. KNIME workflows not only inherently document the entire analysis process, but they can also be exported and easily made available to others, who can subsequently reproduce the results or use the workflows as a starting point for their own analysis.

To guarantee reproducibility, KNIME makes sure that whenever any of the modules changes in any way, for example through a change in the version of an integrated tool, the previous version of that module is carefully deprecated but remains part of the platform. Hence, workflows published years ago still run with the most recent releases of KNIME. Once a workflow has been created, it can be applied to hundreds of thousands of images and other large data sets, even on small-scale devices, thanks to the intelligent caching technology of KNIME. This makes KNIME well-suited for high-throughput screenings, in which the analysis results can also be quite large.

The KNIME Image Processing Extension enhances KNIME by providing algorithms and data structures to process and analyse images. To avoid reinventing the wheel, KNIME Image Processing uses and integrates state-of-the-art libraries such as ImageJ1 (Schindelin et al, 2012) and ImageJ2 [1], SCIFIO [2], OMERO (Allan et al, 2012), ClearVolume (Royer et al, 2015), ImgLib2 (Pietzsch et al, 2012), CellProfiler (Kamentsky et al, 2011), TrackMate [3] and others. Not only can these well-known image processing tools exchange data and therefore be used in combination; it is also possible to link their output to other extensions from completely different domains. For example, once interesting hits have been identified in the image data, the respective molecules can be explored with one of the many KNIME cheminformatics extensions, for instance the KNIME RDKit extension [4].

An image processing and analysis workflow typically consists of a subset of several consecutive steps: loading images, (pre-)processing, segmentation, tracking, feature extraction, model learning and the subsequent visualization and statistical analysis of the information gathered in the previous steps. Different problems can arise in each of these steps, depending on the image analysis task itself. However, by combining KNIME Image Processing nodes with nodes from other available KNIME extensions, it is easy to orchestrate comprehensible workflows, which can span multiple domains, to solve these issues in KNIME without writing a single line of code. In Section 2 the main concepts of KNIME and KNIME Image Processing are introduced.
Taking this as a basis, Section 3 goes on to explain an example image processing workflow step by step.

2 Basic Concepts

This section explains how KNIME and its extensions are downloaded and installed. Next, the KNIME User Interface is described, while the last part of this section covers the most fundamental concepts of KNIME Image Processing, which are important for understanding the image processing workflow explained in Section 3.

2.1 Download and Installation

The open analytics platform KNIME can be downloaded and installed from the KNIME website [5]. KNIME ships with an installer for Windows and Mac systems. Linux users simply have to extract KNIME.

[4] /rdkit.
[5] http://www.knime.org.

As KNIME is a

based system, there are several extensions that are not part of the basic KNIME installation. These extensions are easily installed via so-called update sites. KNIME Image Processing [6], for example, is installed from the Trusted Community Contributions site. For details on how to install additional plugins, please see http://tech.knime.org/community.

[6] http://knime.imagej.net.

2.2 KNIME User Interface

Fig. 1 KNIME User Interface

Figure 1 shows the KNIME User Interface. The KNIME Explorer (A) lists the various locations where workflows can be stored or uploaded. By default, two locations are available: (i) the KNIME Example server, on which several example workflows can be found, and (ii) the LOCAL workspace, which was selected on the first start-up of KNIME. A new workflow can be created with File > New > New KNIME Workflow. This new, empty workflow is accessed via the LOCAL workspace. Workflows in KNIME are essentially graphs that connect nodes (atomic processing units in KNIME) and visually model the individual processing steps of a certain task. A double-click on the workflow in the KNIME Explorer (A) opens it in the Workflow Editor (C). The user is now able to drag and drop nodes from the Node Repository (B) onto the canvas of the workflow editor, to compose complex yet clear workflows, for example to process and analyse images. The nodes can then be connected by drawing a line from the output port of one node to the input port of the next, enabling the data to be passed from node to node. Additionally, each KNIME node provides a Node

Description (D) that explains which input data the node requires, what its parameters mean, what the node does with the incoming data, and what it outputs. The Node Repository (B) contains all of the KNIME nodes that are part of the currently installed KNIME extensions. The default KNIME Open Analytics Platform installation provides a basic set of nodes for data manipulation, data mining, a selection of data views, node control, time series analytics and basic IO and Database nodes. KNIME nodes for image analysis can be added by installing more KNIME extensions, as described in Section 2.1. The KNIME Console (E) view displays error and warning messages in order to provide feedback to the user. Finally, the Outline (F) view provides an overview of the whole workflow even if only a small part is visible in the workflow editor, and the Favorite Nodes (G) view provides quick access to personal favorite, frequently used and recently used nodes.

2.3 Handling of Images and Labelings in KNIME

Fig. 2 A typical KNIME table with five columns. Each column of the table has a certain data-type, e.g. numbers, text, molecules or images.

A workflow usually starts with a node that represents a data source, e.g. connecting to a database, reading a text file, or reading images. The data are transported between the connected nodes, typically organized in data tables consisting of columns of certain (extensible) data-types and an arbitrary number of rows. A typical data table is depicted in Figure 2, with each column of the table holding objects of one type, e.g. numbers, text or molecules. KNIME Image Processing adds two new column types to the mix: images and labelings. Labelings represent the segmentation of an image, i.e. the partitioning of an image into segments. As opposed to images, labelings store one or more labels for each pixel instead of numeric values. A label associates each pixel with an object, class value, track number or any other information.

Contrary to what might be assumed initially, images and labelings stored in a single cell of a data table can be of arbitrary dimensionality. For example, a table cell may contain a multi-channel video or z-stack. To accomplish n-dimensional image processing, KNIME uses ImgLib2 as its underlying programming framework.

2.4 Image Processing Specific Dialog Components

2.4.1 Dimension Selection

Fig. 3 The configuration dialog of the Image Normalizer node.

In order to provide the user with the flexibility to choose how images and labelings with more than just two dimensions are to be processed, most of the nodes provided in the KNIME Image Processing Extension offer a so-called Dimension

Selection dialog (see Figure 3). This dialog enables users to select the dimensions on which an algorithm will operate. For instance, in the case of a simple z-stack, the Image Normalizer node can be configured to apply normalization either to each X,Y plane independently or to the entire X,Y,Z cube, by selecting X,Y or X,Y,Z in the dimension selection.

2.4.2 Column Selection

Many KNIME Image Processing nodes whose input is an image or labeling operate on a row-by-row basis. This means that, given an input image, another image or labeling is calculated based on the algorithm implemented in the node. The user can determine the layout of the output table of these nodes with a dialog component called Column Selection. Generally, a user has three options: the resulting column with images or labelings can be appended to the incoming table, it can replace the existing column, or an entirely new table can be created.

2.5 Visualization of Images and Labelings

Fig. 4 The KNIME Image Processing Image Viewer allows users to inspect the images in more detail. Users can browse through the various dimensions of an image, inspect the values of the pixels and obtain information about important meta-data.
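
To make the dimension selection of Section 2.4.1 concrete, the following minimal Python sketch contrasts normalizing each X,Y plane of a z-stack independently with normalizing the whole X,Y,Z cube. It is an illustration only and not the implementation of the Image Normalizer node: the function names, the nested-list image layout and the target range [0, 1] are assumptions made for this example.

```python
def rescale(values, vmin, vmax):
    """Map values linearly so that vmin -> 0.0 and vmax -> 1.0."""
    span = vmax - vmin
    if span == 0:
        return [0.0 for _ in values]
    return [(v - vmin) / span for v in values]


def normalize_stack(stack, dims):
    """Normalize a z-stack given as a list of planes (lists of rows).

    dims == "XY":  every plane uses its own minimum and maximum.
    dims == "XYZ": one global minimum and maximum for the whole cube.
    (Hypothetical helper; the names are assumptions for this sketch.)
    """
    if dims == "XY":
        result = []
        for plane in stack:
            flat = [v for row in plane for v in row]
            lo, hi = min(flat), max(flat)
            result.append([rescale(row, lo, hi) for row in plane])
        return result
    if dims == "XYZ":
        flat = [v for plane in stack for row in plane for v in row]
        lo, hi = min(flat), max(flat)
        return [[rescale(row, lo, hi) for row in plane] for plane in stack]
    raise ValueError("dims must be 'XY' or 'XYZ'")


stack = [
    [[0, 10], [20, 40]],        # plane z = 0 (dark)
    [[100, 150], [175, 200]],   # plane z = 1 (bright)
]
per_plane = normalize_stack(stack, "XY")
whole_cube = normalize_stack(stack, "XYZ")
```

With X,Y selected, both planes are stretched to the full range, so the dark plane ends up as bright as the bright one; with X,Y,Z selected, the relative brightness between planes is preserved.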

KNIME Image Processing enables users to explore images and labelings in more detail, which is especially useful if an image or labeling comprises more than two dimensions (e.g. z-stacks or videos). The user can access this view via right-click > Open Image Viewer (see Figure 4) on a KNIME Image Processing node. Another, more specific view is the Interactive Segmentation View node. It can be used to validate segmentation, classification or tracking results, as it offers an overlay view for images and labelings. Additional visualization plugins can be installed to extend KNIME Image Processing. For instance, the ClearVolume Integration offers fast, GPU-accelerated 5D volume rendering and can easily be used within KNIME.

3 Step by Step to Phenotype Classification

In this section we walk step by step through an image processing workflow. The workflow classifies cells as either positive or negative according to their phenotype (see Figure 5) [7].

Fig. 5 Two images from the publicly available high-content screening image data provided in (Ljosa et al, 2012). The left image contains positive cells and the right, negative cells.

The cells in this example stem from images from the publicly available high-content screening image data provided in (Ljosa et al, 2012) (human cytoplasm-nucleus translocation assay, available from the Broad Bioimage Benchmark Collection). The images were taken from stably transfected osteosarcoma cells seeded in a 96-well plate and contain information about the translocation of the Forkhead (FKHR-EGFP) fusion protein from the cytoplasm (Channel 2) to the nucleus (Channel 1) [8].

[7] The entire Phenotype Classification workflow is available for download at http://knime.imagej.net/aaec.
[8] For detailed information see http://www.broadinstitute.org/bbbc/BBBC013/. Please note: The BMP images available on the website are already split into the individual channels.

The example workflow is depicted in Figure 6. The individual parts of the workflow are organized in so-called meta nodes to reduce the complexity of the workflow. Meta nodes are nodes that contain subworkflows, i.e. in the workflow they look like a single node and yet they can contain many nodes and even more meta nodes. This provides a series of advantages, such as enabling the user to design much larger, more complex workflows and to encapsulate specific actions.

Fig. 6 The workflow discussed in this tutorial.

A double-click on a meta node allows the user to have a closer look at what is inside. In the following, the content of each meta node will be explained in more detail. However, it is important to note that the individual parts of the workflow are easily replaced by other nodes that may be more suitable for other image processing tasks. Besides KNIME Image Processing, integrations with, for example, R, Python, Weka and especially KNIME itself offer a wide range of functionality for more complex visualizations and advanced machine learning and data mining techniques.

3.1 Loading Images

Fig. 7 Detailed look inside the Loading Images meta node.

To date, the proprietary file formats of microscope image analysis software have made it difficult for open-source platforms to load images generically. However, SCIFIO, with its integration of the BioFormats (Linkert et al, 2010) library, can convert approximately 125 file formats used by various microscope manufacturers, such as

Zeiss LSM, Metamorph Stack, Leica LCS or DICOM, into a KNIME-compatible format. In KNIME, this functionality can be accessed via the Image Reader node, which integrates these libraries. A user can either select the images in the Image Reader configuration dialog (right-click > Configure) or provide URLs to the images as an input table coming from another node, e.g. the List Files node. The resulting workflow of the latter approach is illustrated in Figure 7.

The List Files node is used to list the URLs of all images in a certain folder. Connecting the node to the Image Reader node enables the user to configure the Image Reader node (right-click on Image Reader > Configure) such that it loads all images into KNIME from these URLs. In this configuration, on the tab Additional Options, the setting File name column in optional column has to be set to the column of the incoming table that contains the URLs to the requested images.

3.2 Preprocessing Images

Fig. 8 Detailed look into the Preprocessing meta node. The images are split into Nuclei and Cytoplasm channels and renamed accordingly.

KNIME Image Processing offers a range of general (pre-)processing techniques to enhance image quality: standard linear and non-linear filters are available, as are morphological and binary operations, pixel-wise image arithmetic, edge detectors, background subtraction algorithms, projections and nodes for manipulating the dimensionality, such as splitting and merging images. Additionally, the ImageJ Macro node, which is part of the KNIME Image Processing - ImageJ Integration [9], allows the execution of arbitrary ImageJ1 macros on large amounts of image data.

The image preprocessing used in this tutorial is implemented in the Preprocessing meta node (see Figure 6). First of all, the channels of the images are split, which results in a table with two columns, Nuclei and Cytoplasm (see Figure 8). Next, each channel is preprocessed individually.
[9] For details and installation instructions see https://tech.knime.org/community/imagej.

The images in the Nuclei column suffer

from non-uniform illumination, which makes it difficult to apply automated segmentation methods. Therefore, the quality of the images in the Nuclei column is enhanced in the Background Subtraction of Images in Nuclei Channel meta node. Here, a very simple background subtraction technique was chosen, mainly to demonstrate how individual KNIME nodes are easily combined to recreate an existing processing or analysis technique, or to create a completely new one, without any programming.

Fig. 9 Detailed look into the Background Subtraction of Images in Nuclei Channel meta node.

Figure 9 depicts the workflow implementing the algorithm: the images in the Nuclei column are filtered with a very large kernel (sigma = 100.0) using the Gaussian Convolution node, and the output is appended as an additional column to the input table. The mean value of the pixel intensities is calculated for each of the filtered images using the Image Feature node.

Fig. 10 Configuration dialog of the Image Calculator node.

Finally, the resulting background-corrected images are obtained by subtracting the sum of the filtered image and the mean of the filtered image at each pixel position from the original image using the Image Calculator node.

The configuration of the Image Calculator node is shown in Figure 10. The last node in the Background Subtraction of Images in Nuclei Channel meta node is the Image Converter. This node can be used to normalize and scale the intensity values of the images to a certain range. In this tutorial we normalize and scale the values between 0 and 255 (UnsignedByteType) to reduce the amount of required memory.

Finally, both the background-corrected images from the Nuclei column and the images from the Cytoplasm column are filtered with a small Gaussian kernel (sigma = 2.0) in the Preprocessing meta node. The results are appended to the table.

3.3 Segmentation

In order to classify the cells into positive and negative ones according to their phenotypes, both the nuclei and their cytoplasm have to be segmented. The subworkflow for this segmentation is encapsulated in the Segmentation meta node and is shown in Figure 11.

Fig. 11 Workflow to segment the nuclei and cytoplasm, respectively.

The images in the column Nuclei are segmented using the well-known Otsu (Otsu, 1975) thresholding algorithm, which is implemented in the Global Thresholder node. The output of the node is an image consisting only of black and white pixels. Each pixel of this binary image indicates whether it belongs to a nucleus or to the background of the image. In order to split potentially touching objects into individual nuclei, the ImageJ1 Watershed Macro is executed on the binary images using the ImageJ1 Macro node, which is part of the KNIME Image Processing - ImageJ Integration.
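
The global thresholding step can be sketched in a few lines of plain Python. The code below is a minimal illustration of Otsu's method (choosing the threshold that maximizes the between-class variance of the intensity histogram); it is not the code of the Global Thresholder node, and the function name, the 8-bit intensity range and the convention that values above the threshold count as foreground are assumptions of this example.

```python
def otsu_threshold(pixels, levels=256):
    """Return the threshold t that maximizes the between-class variance.

    pixels: list of integer intensities in [0, levels).
    Values <= t are treated as background, values > t as foreground
    (an assumed convention for this sketch).
    """
    hist = [0] * levels
    for p in pixels:
        hist[p] += 1
    total = sum(hist)
    sum_all = sum(i * h for i, h in enumerate(hist))

    best_t, best_var = 0, -1.0
    w_bg = 0    # number of pixels with intensity <= t
    sum_bg = 0  # sum of the intensities of those pixels
    for t in range(levels):
        w_bg += hist[t]
        sum_bg += t * hist[t]
        w_fg = total - w_bg
        if w_bg == 0:
            continue  # no background pixels yet
        if w_fg == 0:
            break     # all pixels are background from here on
        mean_bg = sum_bg / w_bg
        mean_fg = (sum_all - sum_bg) / w_fg
        var_between = w_bg * w_fg * (mean_bg - mean_fg) ** 2
        if var_between > best_var:
            best_t, best_var = t, var_between
    return best_t


image = [
    [10, 12, 200],
    [11, 210, 205],
    [9, 10, 198],
]
flat = [v for row in image for v in row]
t = otsu_threshold(flat)
binary = [[1 if v > t else 0 for v in row] for row in image]
```

The resulting `binary` grid plays the role of the black and white image described above: 1 marks pixels assigned to a nucleus, 0 marks background.
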
The subsequent Connected Component Analysis node derives a labeling from the binary images. The labeling records for each pixel which individual nucleus it belongs to, rather than merely whether the pixel belongs to any nucleus or to the background. The result is appended

to the table. Thanks to the connected Labeling Filter node, objects that are either too small or too big can be removed from the labeling by manually defining the expected size of the nuclei. In this workflow we set the minimum size of nuclei to 50. The remaining nuclei now serve as seeding points for the segmentation of the cytoplasm. Starting at each nucleus in parallel, the region growing algorithm implemented in the Voronoi Segmentation node extends the seeding segments until no more pixels can be added to the individual segments. This is the case if a pixel has already been added to another segment or the intensity value of a pixel is lower than a manually defined threshold. The Voronoi Segmentation was configured to return the segmentation of the cytoplasm without the seeds, obtained with a threshold of 25 and Fill Holes activated.

Figure 12 shows the resulting segmentation of the images in the Cytoplasm channel, using the Interactive Segmentation View node.

Fig. 12 Results of the Voronoi Segmentation.

For other segmentation tasks, KNIME Image Processing offers a wide range of simple and more advanced segmentation techniques. Besides established algorithms such as Graph Cuts or Local Thresholding, the KNIME Image Processing - Supervised Image Segmentation (SUISE) extension comprises nodes for supervised pixel and segment classification [10].

[10] For details see the example workflows on n.
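
The step from a binary image to a size-filtered labeling, as described in this section, can be sketched as follows. This is an illustrative pure-Python version, not the code behind the Connected Component Analysis or Labeling Filter nodes: the function names are hypothetical, and the use of 4-connectivity and of label 0 for background are assumptions of this example.

```python
from collections import deque


def label_components(binary):
    """Assign a distinct positive label to each 4-connected foreground region.

    Returns (labels, sizes): a grid of labels (0 = background) and a
    dict mapping each label to its pixel count.
    """
    height, width = len(binary), len(binary[0])
    labels = [[0] * width for _ in range(height)]
    sizes = {}
    next_label = 1
    for y in range(height):
        for x in range(width):
            if not binary[y][x] or labels[y][x] != 0:
                continue
            # Breadth-first flood fill starting from an unlabeled pixel.
            labels[y][x] = next_label
            queue = deque([(y, x)])
            size = 0
            while queue:
                cy, cx = queue.popleft()
                size += 1
                for ny, nx in ((cy - 1, cx), (cy + 1, cx),
                               (cy, cx - 1), (cy, cx + 1)):
                    if (0 <= ny < height and 0 <= nx < width
                            and binary[ny][nx] and labels[ny][nx] == 0):
                        labels[ny][nx] = next_label
                        queue.append((ny, nx))
            sizes[next_label] = size
            next_label += 1
    return labels, sizes


def filter_by_size(labels, sizes, min_size):
    """Set all segments smaller than min_size pixels to background."""
    keep = {lab for lab, size in sizes.items() if size >= min_size}
    return [[lab if lab in keep else 0 for lab in row] for row in labels]


binary = [
    [1, 1, 0, 0, 0],
    [1, 1, 0, 0, 1],
    [0, 0, 0, 0, 0],
    [0, 1, 1, 1, 0],
]
labels, sizes = label_components(binary)
filtered = filter_by_size(labels, sizes, min_size=3)
```

In the tutorial workflow the minimum size is 50; the value 3 is used here only because the toy image is tiny.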

3.4 Feature Extraction

After certain objects have been identified and segmented, features can be calculated either from the derived labelings alone or from the labelings in combination with their source images. Features numerically describe the individual objects and are instrumental in drawing conclusions from the acquired images. These features are therefore part of most image processing and analysis tasks.

Fig. 13 Detailed look into the Feature Extraction meta node. First Order Statistics, Geometric and Haralick Features are extracted individually for the Cytoplasm and the Nuclei channel.

KNIME Image Processing provides several feature implementations, for example simple first order statistics of the intensity values of a segment (mean intensity, standard deviation, kurtosis etc.) or geometric properties of a segment (roundness, size, convexity, Zernike Moments (Khotanzad and Hong, 1990), Fourier shape descriptors, etc.), as well as more complex texture measurements (Haralick (Haralick et al, 1973), Tamura (Tamura et al, 1978), etc.).

Figure 14 shows the output table of the Joiner node in Figure 13, which combines the results from the preceding Image Segment Feature nodes by joining the rows of the individual tables according to their RowId. Given the nuclei (Channel 1), the cytoplasm (Channel 2) and their corresponding labelings, which were derived in the previous Segmentation meta node, these nodes calculate the first order statistics, Haralick texture features and several geometric properties for each identified object. Each row of the output table of the Feature Extraction meta node corresponds to the numerical description of a single object.
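
As a rough illustration of what per-segment feature extraction followed by a join on the row identifier amounts to, consider the Python sketch below. It computes only three features (size, mean and standard deviation) and is not the implementation of the Image Segment Feature or Joiner nodes. The function names, the column prefixes, the use of the population standard deviation and the assumption that a nucleus and its cytoplasm share the same label are all choices made for this example.

```python
import math


def segment_features(image, labels):
    """Return {label: {"size", "mean", "std"}} for every label > 0."""
    values = {}
    for img_row, lab_row in zip(image, labels):
        for value, lab in zip(img_row, lab_row):
            if lab > 0:
                values.setdefault(lab, []).append(value)
    features = {}
    for lab, vals in values.items():
        n = len(vals)
        mean = sum(vals) / n
        # Population variance (divide by n); an assumption of this sketch.
        var = sum((v - mean) ** 2 for v in vals) / n
        features[lab] = {"size": n, "mean": mean, "std": math.sqrt(var)}
    return features


def join_by_row_id(left, right, left_prefix, right_prefix):
    """Inner join of two feature tables that are keyed by row id."""
    joined = {}
    for row_id, left_row in left.items():
        if row_id in right:
            row = {left_prefix + k: v for k, v in left_row.items()}
            row.update({right_prefix + k: v
                        for k, v in right[row_id].items()})
            joined[row_id] = row
    return joined


nuclei_img = [[10, 10, 0, 0], [0, 0, 0, 30]]
nuclei_lab = [[1, 1, 0, 0], [0, 0, 0, 2]]
cyto_img = [[0, 0, 5, 0], [7, 0, 9, 0]]
cyto_lab = [[0, 0, 1, 0], [1, 0, 2, 0]]

table = join_by_row_id(
    segment_features(nuclei_img, nuclei_lab),
    segment_features(cyto_img, cyto_lab),
    "nuc_", "cyt_",
)
```

Each entry of `table` corresponds to one row of the joined output: the numerical description of one cell, with nucleus and cytoplasm measurements side by side.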

Fig. 14 Output table of Image Segment Feature node. Each row corresponds to the numerical measurements of a cell's nucleus and cytoplasm.

3.5 Model Learning

The output of the Feature Extraction meta node can now be connected to KNIME nodes that operate on numerical data.

Fig. 15 Detailed look into the Model Learning meta node.

Typical examples include nodes for statistical testing, machine learning and data mining, or visualization. In this example, a supervised classification is performed based on the calculated features of the nuclei and cytoplasm, in order to determine whether each cell is considered positive or negative (see Figure 15). Therefore, as a first

step, the ground-truth data, which is part of the publicly available high-content screening image data set, is read into KNIME as a text file and joined with the already loaded image data. The ground-truth data contains class labels for several cells, which then serve as the training data for a supervised learning algorithm [11].

[11] For details see Phenotype Classification workflow at http://knime.imagej.net/aaec.

However, if this ground-truth data is not available, users can manually create this information by using the Interactive Labeling Editor node. Given the ground truth and the numerical description of the nuclei and cytoplasm, the Decision Tree Learner node can be used to train a decision tree model, which in turn can be applied to as yet unseen cells using the Decision Tree Predictor node (see Figure 15). The output of the Decision Tree Predictor node comprises an additional column with the classification result.

The Decision Tree model with default configuration settings is used in this example. However, other machine learning techniques can easily be applied instead, for example Support Vector Machines (Schölkopf and Smola, 2001), Random Forests (Breiman, 2001) or any other algorithm that is either included in KNIME or available through a KNIME extension, such as Weka, R or Python.

Fig. 16 Boosting of a Naive-Bayes learner.

Figure 16, for instance, depicts the well-known Boosting algorithm, which is offered with the KNIME Ensemble Learning plugin. The plugin also comprises nodes for Bagging and Stacking.

3.6 Evaluation and Validation

Users often want to manually explore and validate the information extracted from the raw images, such as the features or the results of a classification task. KNIME itself offers a wide range of functionalities to visualize numerical data. Scatter plots, line

plots, bar plots or histograms are just some examples of those that are offered. Even more plots are available with the R and Python extensions in KNIME. Furthermore, KNIME provides nodes for statistical significance testing, for example T-Tests or ANOVA Testing.

Fig. 17 Detailed look into the Evaluation and Validation meta node.

Fig. 18 Line-Plot visualizing the classification results of the Model Learning meta node.

The Evaluation and Validation meta node comprises one example of how to visualize the results from the classification conducted in the Model Learning meta node (see Figure 17). The resulting line-plot (see Figure 18) contains the counts of positive and negative cells of the images that are part of row D in the well-plate. First, the Row Filter node removes all images that are not part of row D. The subsequent Group By node counts the number of positive and negative cells for each image in row D, while the Pivoting node arranges the KNIME table such that the cell counts appear next to

each other. It can be observed that the number of positive cells increases over the columns of row D of the well-plate, which meets the expectations [12].

4 Conclusions

In this tutorial the basic concepts of KNIME Image Processing are introduced, and the advantages of combining different software packages in a single understandable, multi-domain workflow through KNIME are demonstrated by means of an example workflow for phenotype classification. The techniques applied in this use case are simple and exemplary. However, the already published workflows in (Saha et al, 2013; Lodermeyer et al, 2013; Gunkel et al, 2014; Strauch et al, 2014; Aligeti et al, 2014) solve simple problems like counting cells or measuring the intensity of segmented cells, as well as more complex tasks involving machine learning and data mining techniques. For instance, in (Gunkel et al, 2014) the entire image acquisition process was controlled by a KNIME workfl
