BayesiaLab User Guide
www.bayesia.com
Table of Contents

Welcome
Introduction
I. Main window
II. Graph Windows
    1. Status Bar
        1.1. Database Report
    2. Information panel
    3. Graph panel
        3.1. Graph panel use
        3.2. Node Management
        3.3. Node Edition
            3.3.1. Discrete Node
            3.3.2. Continuous Node
            3.3.3. Conditional Probability Table
            3.3.4. Deterministic nodes
            3.3.5. Equations
            3.3.6. Assessments
        3.4. Node States
        3.5. Constraint node
        3.6. Utility node
        3.7. Decision node
        3.8. Arc Management
        3.9. Arc states
        3.10. Filtered States
        3.11. Class management
        3.12. Constant management
        3.13. Cost Management
        3.14. Forbidden arc management
        3.15. Temporal Indices
        3.16. Structural Complexity Influence Coefficient
        3.17. Local Structural Coefficients
        3.18. State Virtual Numbers
        3.19. Experts
            3.19.1. Assessment Report
        3.20. Variations
        3.21. Comments
        3.22. Contextual menus
    4. Monitor panel
        4.1. Monitor use
        4.2. Monitor contextual menus
    5. Shortcuts
III. Menus
    1. Network
        1.1. Network export
        1.2. Network Locking
    2. Data
        2.1. Data Importation Wizard
        2.2. Data Association Wizard
        2.3. Evidence Scenario File
        2.4. Graphs
            2.4.1. Bar chart
            2.4.2. Occurrence matrix
            2.4.3. Distribution function
            2.4.4. Scatter of points (2D)
            2.4.5. Colored Line Plot
            2.4.6. Scatter of points (3D)
            2.4.7. Bubble chart
    3. Edit
    4. View
    5. Learning
        5.1. Association Discovering
        5.2. Characterization of the target node
        5.3. Clustering
        5.4. KMeans Clustering
        5.5. Multiple clustering
        5.6. Policy learning of Static Bayesian networks
        5.7. Policy Learning of Dynamic Bayesian networks
    6. Inference
        6.1. Adaptive Questionnaire
        6.2. Interactive inference
        6.3. Interactive Bayesian updating
        6.4. Batch labeling
        6.5. Batch inference
        6.6. Batch most probable explanation labeling
        6.7. Batch most probable explanation inference
        6.8. Batch joint probability
        6.9. Batch Likelihood
    7. Analysis
        7.1. Graphical analysis
        7.2. Analysis reports
        7.3. Network Performance
            7.3.1. Network Targeted Performance
            7.3.2. Network global performance (Log-likelihood)
        7.4. Target Optimization
        7.5. Target Optimization Tree
        7.6. Target Interpretation Tree
    8. Monitor
    9. Tools
        9.1. Network Comparison
            9.1.1. Graphical Structure Comparator
            9.1.2. Comparison of Joint Probabilities
        9.2. Arc Confidence
            9.2.1. Data Perturbation
            9.2.2. Targeted Cross Validation
            9.2.3. Structural Coefficient Analysis
            9.2.4. Network Extraction
        9.3. Multi-Quadrant Analysis
    10. Dynamic Bayesian networks
        10.1. Time variable
    11. Options
        11.1. Settings
    12. Help
        12.1. Analysis of the use of BayesiaLab
IV. Toolbars
V. Search
VI. Console
Welcome
Welcome to the BayesiaLab Help document. BayesiaLab is a complete tool for creating and using Bayesian networks.
Documentation
BayesiaLab integrates its Help in the JavaHelp format. This Help document describes the features and the user interface of BayesiaLab. When you are using BayesiaLab, pressing the contextual help button displays the contextual help cursor.
Once this mode is enabled, you can click on any component (including menus and submenus) to display its specific contextual help.
About JavaHelp
JavaHelp is a cross-platform help system developed by Sun Microsystems as the standard Help tool for Java applications and applets.
Introduction
Bayesian networks are represented by graphical structures (nodes and arcs). Nodes correspond to random variables and arcs correspond to direct probabilistic relations between the connected variables. These probabilistic relations are quantified by means of probability distributions (usually a conditional probability table associated with each node). Bayesian networks can be automatically learned from databases and/or manually modeled by experts. It is then possible to update the probability distribution of each variable by taking into account the states of the other variables.
BayesiaLab is a tool for the graphical manipulation of Bayesian networks. It allows you to define, modify, use, and learn models based on Bayesian networks.
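The updating described above can be sketched with a toy two-node network. The node names and probabilities below are purely illustrative (this is a hand-written sketch of the inference idea, not BayesiaLab code):

```python
# Two nodes, Rain -> WetGrass, with a prior on Rain and a CPT on WetGrass.
# Observing WetGrass updates the distribution of Rain via Bayes' rule.

p_rain = {"yes": 0.2, "no": 0.8}                      # prior P(Rain)
p_wet_given_rain = {                                  # CPT P(WetGrass | Rain)
    "yes": {"wet": 0.9, "dry": 0.1},
    "no":  {"wet": 0.2, "dry": 0.8},
}

def posterior_rain(wet_state):
    """P(Rain | WetGrass = wet_state) by Bayes' rule."""
    joint = {r: p_rain[r] * p_wet_given_rain[r][wet_state] for r in p_rain}
    z = sum(joint.values())                           # P(WetGrass = wet_state)
    return {r: v / z for r, v in joint.items()}

# Observing wet grass raises the probability of rain above its 0.2 prior.
print(posterior_rain("wet"))
```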
The main window is the BayesiaLab work environment. It is divided into three main parts:
1. The command zone (menus and toolbar), which contains all the commands that operate either on all the graphs or on the active graph.
2. The graph zone, in which graph windows are opened. The minimal contextual menu, activated by right-clicking on the graph zone, allows managing the display of the console.
3. The graph bar, which allows handling the graph windows. A left click on a network name button makes the corresponding graph active. The contextual menu (right click) of this bar makes it possible to reorganize the graph windows (Cascade, Horizontal and Vertical Mosaics, and Reduction).
The contextual menu of a network name button allows acting on the state of the graph window (Maximize/Minimize/Icon/Close and Rename).
A graph window can be in two different modes, and two buttons allow switching from one mode to the other:
1. Modeling Mode: the graph visualization panel is visible, and modeling and learning actions are carried out.
2. Validation Mode: the graph visualization panel and the monitor panel are visible, and validation and exploitation actions are carried out.
1. Status Bar
The status bar of a graph window (the lower part of the window) is made up of several elements:
Mode buttons
A graph window can be in two different modes. The following two buttons are used to switch from one mode to the other:
To switch to Modeling mode.
To switch to Validation mode. Switching to this mode can require a series of blocking tasks due to the computation of node probabilities and to the inference options.
Selection counters
An indicator displays the number of selected nodes and the number of selected arcs: the node count is on the left and the arc count is on the right.
Cost indicator
When costs other than 1 are associated with the nodes of the network, the cost indicator icon is displayed in the status bar. A click on this icon opens the cost editor dialog.
Class indicator
When classes are defined in the network, the class indicator icon is displayed in the status bar. A click on the icon opens the class editor dialog. A right click on the icon displays the list of classes: if a class is selected, it is displayed; if deselected, it is hidden.
The checkbox named All is a shortcut to select or deselect all the checkboxes at the same time. When a class is not checked, the nodes belonging to it become transparent and are no longer selectable. If an arc connects two transparent nodes, it also becomes transparent. Note that when an arc or a node is transparent, whatever the reason, it is no longer selectable. The Intersection checkbox displays only the nodes contained in all the selected classes.
Constant indicator
When constants are defined in the network, the constant indicator icon is displayed in the status bar. A click on the icon opens the constant editor dialog. This icon is only accessible in Modeling mode.
Experts Indicator
When experts are associated with the network, the expert indicator icon is displayed in the status bar. A click on the icon opens the network's Expert Editor. A right click on the icon displays a menu that allows the user to display the assessment report.
Temporal indicator
When the time variable (named "t") is used in the formulas describing the probability distributions, the temporal indicator icon is displayed in the status bar. A click on the icon removes the use of the time variable in the network; in this case, the nodes using the time variable in their formulas are displayed with a warning icon. This icon is only accessible in Modeling mode.
Scenario file
When an evidence scenario file has been imported or manually created in Validation mode, the scenario file icon is displayed in the status bar. A click on this icon removes the evidence file associated with the network. The icon's tooltip displays the number of scenarios contained in the file. In Validation mode, a right click on the icon displays the list of the evidence sets contained in the file. A click on a line sets the corresponding observations, and the associated comment, if any, is displayed in the status bar.
Virtual database
It is possible to learn a Bayesian network starting from an initial structure. To take this a priori knowledge into account, we consider a virtual database with N samples, where N corresponds to the number of cases that were used to establish this a priori knowledge. The distribution of these samples corresponds to the joint probability distribution represented by the initial Bayesian network. This virtual database and the real database are then both taken into account by the learning algorithms to induce a new Bayesian network. In this case, the virtual database icon is displayed at the right-hand side of the bar. A click on this icon removes the association between the network and the virtual database.
Associated database
When the Bayesian network has an associated database, the database icon at the right-hand side of the bar is activated. Pointing at the icon displays a tooltip containing the complete path of the associated database, the number of examples contained in the database, the number of rows used for learning and for testing if defined, and whether the database contains weights or missing values. A click on this icon removes the association between the network and the database.

Additional symbols are added to the database icon according to the database's properties: a question mark when the database contains missing values, a weight symbol when the database has a weight variable, a learning/test symbol when the database contains rows used for learning and rows used for tests, and a stratification symbol when the database is stratified. The tooltip associated with the icon shows, depending on the case: the name of the database (or whether it is an internal database), the total number of examples, the number of learning examples, the number of test examples, the sum of the weights, the stratification over a node with the corresponding probability distribution, and the presence of missing values.

A right click on the icon displays a menu that allows the user to: remove the data type (test, learning), remove the database stratification, remove the weights, or display the database report.

In Validation mode, if the database has row identifiers, a shortcut + right click on the icon displays a floating panel allowing the user to search among the identifiers. The search is done through a text field. The wildcard characters ? and * can be used: ? replaces one and only one character, and * replaces zero or more characters. The Case Sensitive option makes the search respect the case. After pressing Enter, the search is performed and the list of results is displayed; the number of matching rows is displayed at the bottom of the panel. A click on a line sets the corresponding observations, and the row identifier is displayed in the status bar:
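The wildcard rules above (? for exactly one character, * for zero or more) match the behavior of Python's standard fnmatch module, which can be used to prototype such an identifier search. The identifiers below are made up for illustration:

```python
import fnmatch

ids = ["row-001", "row-013", "ROW-101", "case-9"]

# fnmatchcase is case-sensitive, like the Case Sensitive search option;
# '?' matches exactly one character, '*' matches zero or more.
matches = [i for i in ids if fnmatch.fnmatchcase(i, "row-0?3")]
print(matches)                                            # only 'row-013' fits
print([i for i in ids if fnmatch.fnmatchcase(i, "row-*")])
```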
Lock indicator
It is possible to lock the editing of a network so that it can only be used in Validation mode. When the network is locked, the lock icon is displayed in the status bar. By clicking on this icon, the user can enter the password to unlock the network; the unlocked icon is then displayed in the status bar. When the network is unlocked, a click on the icon locks it immediately. The lock manager, which can be accessed from the Network menu, allows adding a lock, removing a lock, changing the password of the lock, etc.
1.1. Database Report

The report shows the number of variables, the number of variables with missing values, and the number of variables with associated continuous values. The global database is analyzed and the number of examples is indicated. If the database is stratified on a node, this is also indicated, with the corresponding probability distribution. The sum of the weights and the normalization factor of the weights are displayed if the database has associated weights. The report indicates the number of missing values and whether the database has row identifiers. If the database has associated data types, the learning and test databases are also analyzed: for each one, the number of examples, the sum of the weights, and the number of missing values are shown. The second part of the report details the content of the database for each variable:
For each discrete or continuous variable, the number of missing values and of filtered values, with their associated percentages, are displayed when relevant. For the continuous variables, the report indicates whether they have associated continuous values, as well as the minimum, the maximum, and the mean of each continuous variable.
All this information is displayed for the global, learning, and test databases if data types are associated.
2. Information panel
The information panel of a graph window is visible only in Validation mode. It can be closed or opened with the button located at the bottom right of the panel. It displays up to six different values:
Joint Probability
The joint probability corresponds to the current set of observations (hard positive and negative evidence, and soft evidence). Without evidence, this joint probability is obviously equal to 100%. It is automatically updated at each modification of the monitors.
Log-Likelihood
The Log-Likelihood is the log2 of the joint probability. When the joint probability equals 100%, its value is 0; it tends towards minus infinity as the joint probability tends towards 0. It is automatically updated at each modification of the monitors.
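This relation is easy to check numerically (a small sketch, not BayesiaLab code):

```python
import math

def log_likelihood(joint_probability):
    # log2 of the joint probability of the current evidence:
    # 0 when the probability is 1, and increasingly negative
    # (towards minus infinity) as the probability approaches 0.
    return math.log2(joint_probability)

print(log_likelihood(1.0))    # 0.0
print(log_likelihood(0.25))   # -2.0
```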
Cases
When a database is associated with the network, the information panel contains an estimate of the number of records that correspond to the current evidence (hard positive and negative evidence, and soft evidence). Without any evidence, the number of records is obviously equal to the database size. It is automatically updated at each modification of the monitors.
Total value
When there is at least one node with associated values in the Bayesian network, the expected total value of the network is displayed in the information panel.
Mean value
When there is at least one node with associated values, the expected mean value of the nodes having associated values is displayed in the information panel.
Uncertainty
This value represents the uncertainty variation over the unobserved nodes relative to the fully disconnected network. It is computed from the entropy (the highest entropy corresponds to the uniform distribution, and the lowest to a probability of 100% on a single state). This value is computed only if the corresponding option is checked in the monitor panel's contextual menu.
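As a sketch of the entropy computation mentioned above (illustrative code, not BayesiaLab's implementation), the two extreme cases look like this:

```python
import math

def entropy(dist):
    """Shannon entropy in bits of a discrete probability distribution."""
    return -sum(p * math.log2(p) for p in dist if p > 0)

# Uniform distribution over 4 states: maximum entropy (2 bits).
print(entropy([0.25, 0.25, 0.25, 0.25]))   # 2.0
# All the mass on one state: minimum entropy (0 bits).
print(entropy([1.0, 0.0, 0.0, 0.0]))       # 0.0
```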
Likelihood
This value represents the likelihood variation of the Bayesian network relative to the fully disconnected network. The likelihood is computed from the joint probabilities of the current evidence. This value is computed only if the corresponding option is checked in the monitor panel's contextual menu.
A tooltip displays all the previous values without rounding.
3. Graph panel
In Modeling mode, a graph window presents only the graph visualization panel. In this panel and mode, a Bayesian network can be built and modified manually or by learning.
Decision node creation mode: in this state, a click on the graph background creates a new decision node, named automatically.
Arc creation mode: in this state, a click on a node followed by a drag towards another node creates an arc between these two nodes, provided this arc does not introduce a loop.
Deletion mode: in this mode, a click on an object (arc or node) deletes it. A click on the graph background initiates the definition of a deletion zone.
By default, a right click returns to selection mode. However, it is possible to return to selection mode automatically after any action; this behavior can be changed through the preferences.
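The loop check mentioned for arc creation amounts to testing whether the proposed parent is already reachable from the proposed child; if it is, the new arc would close a directed cycle. A minimal sketch of that test (function and node names are hypothetical, not BayesiaLab's API):

```python
def creates_cycle(arcs, new_arc):
    """Return True if adding new_arc = (parent, child) to the directed
    graph described by arcs would introduce a directed cycle, i.e. if
    parent is already reachable from child."""
    parent, child = new_arc
    children = {}
    for a, b in arcs:
        children.setdefault(a, []).append(b)
    stack, seen = [child], set()
    while stack:                      # depth-first search from the child
        node = stack.pop()
        if node == parent:
            return True               # path child -> ... -> parent exists
        if node in seen:
            continue
        seen.add(node)
        stack.extend(children.get(node, []))
    return False

arcs = [("A", "B"), ("B", "C")]
print(creates_cycle(arcs, ("C", "A")))   # True: A -> B -> C -> A would loop
print(creates_cycle(arcs, ("A", "C")))   # False: still a DAG
```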
Node Deletion
The deletion of a node is done either in deletion mode (see Use) or by using the shortcut + click.
Node Moving
The moving of a node is done either in selection mode (see Use) or by using the shortcut + click. Automatic node positioning tools can be used to lay out the nodes while trying to satisfy contradictory constraints (moving the nodes apart, decreasing the length of the arcs, etc.).
Node Edition
The node edition dialog box allows you to change the type of the node, to create, delete, or change states, and to edit the probabilities. This dialog box is activated through the contextual menu associated with the node, or by double-clicking on the node. The name of the node can be changed directly by double-clicking on it.
Node Monitoring
In Validation mode, the contextual menu associated with a node allows creating a monitor.
Node Exclusion
Excluding a node prevents it from being taken into account during learning. A node can be excluded through its contextual menu, or with the shortcut + double-click.
The top area contains the name of the node currently being edited. The associated combo box allows changing the node to edit. The Rename button allows renaming the current node; it displays the following dialog box:
Note that a node can also be renamed simply by double-clicking on its name in the graph. Below is the area with the tabs that allow selecting the different properties to modify. There are seven different tabs:
States
The States panel contains two areas:
The area containing the type of the variable/node being edited. Two types are available:
    Discrete: for discrete variables.
    Continuous: for discretized variables with continuous numerical values.
The states list edition panel with its associated buttons. A node has at least two states.
Probabilities distribution
This panel allows editing the probabilities associated with the node. It contains two areas:
The View Mode area, which allows displaying and editing the conditional probability table in three forms: probabilistic, deterministic, and equations.
As illustrated above, this button replaces the Conditional Probability Table with the corresponding occurrence matrix, taking into account the smoothing factor, if any.
The conditional probability table edition area, which allows modifying the probabilities associated with the node. It is possible to change the order of the parents by clicking and dragging the name of a parent in the header of the left part of the table. Once a parent is reordered, the modifications are propagated in the conditional probability table. When experts are associated with the network, an Assessment button is displayed and is activated when a cell is selected. The border of the cells with assessments becomes green, and the icon displayed in each cell indicates how important the disagreement between the experts is for that cell.
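Conceptually, reordering the parents of a node is just a permutation of the parent positions in the table's keys; the conditional distributions themselves are unchanged. A sketch of that idea (the parents, states, and probabilities below are purely illustrative, not BayesiaLab internals):

```python
# A CPT stored as a mapping from a tuple of parent states to a
# distribution over the node's own states.
parents = ["Season", "Sprinkler"]                     # current parent order
cpt = {
    ("summer", "on"):  {"wet": 0.95, "dry": 0.05},
    ("summer", "off"): {"wet": 0.10, "dry": 0.90},
    ("winter", "on"):  {"wet": 0.90, "dry": 0.10},
    ("winter", "off"): {"wet": 0.30, "dry": 0.70},
}

def reorder_parents(cpt, old_order, new_order):
    """Rewrite the CPT keys to follow new_order, a permutation of old_order."""
    perm = [old_order.index(p) for p in new_order]
    return {tuple(key[i] for i in perm): dist for key, dist in cpt.items()}

swapped = reorder_parents(cpt, parents, ["Sprinkler", "Season"])
# Same distribution, new indexing order.
print(swapped[("on", "summer")])   # {'wet': 0.95, 'dry': 0.05}
```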
Properties
This panel allows editing the properties of the node. The edition of each property is also available from the node's contextual menu.

The color: displays a colored tag on the node, its comment, and its monitor. Checking the option or clicking on the preview rectangle displays the color chooser dialog box. Once the color is selected, it is displayed inside the rectangle and can be modified again by clicking on the rectangle. To remove the color, uncheck the option. The Propagate color to the classes button displays a dialog box for selecting the classes, associated with the node, to which the color should be propagated. In this case, all the nodes belonging to the chosen classes display the selected color, or none if there is no selection.

The image: displays an image instead of the node's default representation. Checking the option or clicking on the preview rectangle displays a file chooser dialog box for choosing the desired image. The display size is 30x30 pixels: a bigger image is reduced, a smaller one is centered. The image is saved in the network's file. To remove the image, uncheck the option. The Propagate image to the classes button displays a dialog box for selecting the classes, associated with the node, to which the image should be propagated. In this case, all the nodes belonging to the chosen classes display the selected image, or none if there is no selection.

The temporal index: associates a temporal index with the node. This index is a positive or null integer. It indicates a temporal order between the nodes that is taken into account by the learning algorithms: a node with a temporal index greater than that of another node cannot be its ancestor. To remove the index, uncheck the option. The Propagate index to the classes button displays a dialog box for selecting the classes, associated with the node, to which the index should be propagated. In this case, all the nodes belonging to the chosen classes have the same index, or none if there is no index.

The cost: associates a cost with the node. This cost is a real number greater than or equal to 1 representing the cost of an observation of the node. The cost is used in the adaptive questionnaire. It is possible to make a node not observable by unchecking the option; in this case, the node is not proposed in the adaptive questionnaire. It is also possible to use the "not observable" cost to ignore the values of the node that are read in a database (cf. Interactive inference, Batch labeling, Batch joint probability), to indicate the node to update (cf. Interactive Bayesian updating), or to indicate the
19
Graph Windows
node for which one wants to compute the posterior probability distribution for each case described in a database (cf Batch inference). The Propagate cost to the classes button displays a dialog box allowing selecting the classes, associated to the node, on which we want to propagate the cost. In this case, all the nodes belonging to these chosen classes will have the same cost or will be not observable if there is no cost. The state virtual number: allows replacing the real number of states during the learning with the MDL score. The node's state number has an important impact on the MDL score computed during the structural learning. This allows influencing the network's structural complexity locally to the node. More a node has states the less it has chance of having linked parents during the learning and vice versa. Decreasing this parameter decreases the MDL score of the node and vice versa. The Propagate state virtual number button displays a dialog box allowing selecting the classes, associated to the node, on which we want to propagate the state virtual number. In this case, all the nodes belonging to these chosen classes will have the same state virtual number or any if there is no one. The local structural coefficient: this parameter acts like the network's global structural coefficient but is proper to each node. It can increase or decrease the structural complexity of the network at the node. This parameter acts on the whole MDL score of the node contrary to the state virtual number. More a node has a high MDL score the less it has chance of having linked parents during learning and vice versa. Decreasing this parameter decreases the node's MDL score and vice versa. The Propagate local structural coefficient button displays a dialog box allowing selecting the classes, associated to the node, on which we want to propagate the local structural coefficient. 
In this case, all the nodes belonging to these chosen classes will have the same local structural coefficient or any if there is no one. The exclusion: a node can be excluded during structural learning, meaning the learning algorithm won't add any arc having this node as extremity. This is particularly useful if you wish to learn the structure of a network on a subset of nodes. The Propagate exclusion button displays a dialog box allowing selecting the classes, associated to the node, on which we want to propagate the exclusion status. In this case, all the nodes belonging to these chosen classes will have been excluded or not.
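The effect of the state virtual number can be sketched with a generic MDL penalty for Bayesian networks. This is an illustrative assumption, not BayesiaLab's exact scoring formula: the class name `MdlPenaltySketch` and the penalty form (log2(N)/2 bits per free parameter) are hypothetical.

```java
// Hypothetical illustration (not BayesiaLab's formula): in a generic MDL
// score for Bayesian networks, a node's structural penalty grows with its
// (virtual) state count, so nodes with more states attract fewer parents.
public class MdlPenaltySketch {
    // Free parameters of a node's CPT: (states - 1) * product of parent state counts.
    static long parameterCount(int states, int[] parentStates) {
        long rows = 1;
        for (int s : parentStates) rows *= s;
        return (long) (states - 1) * rows;
    }

    // Generic MDL structural penalty: (log2 N) / 2 per parameter, for N samples.
    static double structuralPenalty(int states, int[] parentStates, int sampleCount) {
        return (Math.log(sampleCount) / Math.log(2)) / 2.0
                * parameterCount(states, parentStates);
    }

    public static void main(String[] args) {
        int[] parents = {2, 3};
        // Lowering the virtual state count from 4 to 2 shrinks the penalty,
        // making it "cheaper" for learning to keep these parents.
        System.out.println(structuralPenalty(4, parents, 1000));
        System.out.println(structuralPenalty(2, parents, 1000));
    }
}
```

This illustrates why decreasing the state virtual number makes it easier for the node to acquire parents during learning.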
Classes
This panel allows managing the classes associated with the node. A class is defined as a named set of nodes of a network. Classes make it possible to group nodes that share common properties and to manage these properties globally. A node can belong to several classes at the same time. Classes can also be managed with the class editor.
The left list contains the classes to which the node belongs. The buttons allow adding a class that already exists in the network, adding a new class that does not yet exist in the network, and deleting the selected classes.
Values
This panel allows managing the values associated with the node's states. A numerical value can be associated with each state of the node. These values make it possible to compute an expected numerical value for the node, even if it is purely symbolic. It is also possible not to assign a value to certain states; states without an associated value are then excluded from the computation of the expected value.
The values are used much like Utility nodes: an expected numerical value could be obtained by associating a Utility node with each node, except that states without values cannot be represented with that kind of node. These values are thus used to evaluate the network and to measure the impact of a given lever on the quality of the network. However, unlike Utility nodes, these values are not taken into account during action policy learning. On the other hand, the values are used in Pearson's linear correlation coefficient. When the node is continuous and has associated data, a Generate Values button is displayed at the bottom; it automatically computes the values associated with the states from the database. Weights are taken into account.
State names
This panel allows managing the long names associated with the node's states. A long name can be associated with each state of the node. These long names can be used in the monitors, in the reports, and during data export (saving the database, imputation).
Filtered State
This panel allows specifying whether there is a filtered state. Only one filtered state is allowed per variable, continuous or discrete. This state is used to represent the cases where the variable has no real existence, for example in an analysis that is only carried out when the result of a test is positive.
Comment
The last panel allows editing the comment associated with the node. This comment is in HTML: it is possible to add hypertext links and images and to modify the background and foreground colors, and the fonts are fully customizable. A complete description of the integrated HTML editor is available. Comment editing is also available from the node's contextual menu.
3.3.1. Discrete Node
It is then possible to set the desired number of states. The automatic naming prefixes all the states with an L followed by the rank of the state. With numbered names, it is possible to change the string prefix. In the first two cases, the variable is symbolic. A numerical variable is defined by using integer or real labels; it is possible to automatically generate a sequence of integers, or of real values, by specifying the starting value and the step. The "Generate names" button launches the automatic naming that uses the L prefix. The "Aggregates" button opens the aggregate editor, which allows adding or removing, for the selected state, different names considered equivalent (aliases). This is useful when you need to associate a database with a network. An aggregate can be renamed directly by double-clicking on its name in the table.
Clicking the "Add" button displays a new dialog box for entering a new aggregate name:
The "Up" and "Down" buttons change the order of the states. The modification of this order is propagated to the conditional probability tables, to the long state names and to the values associated with the states. Once the modification of the node is validated, the new order is also propagated to the probability tables of the node's children.
3.3.2. Continuous Node

The states of a continuous node are intervals, which can be manually set and/or automatically learned. If data are associated with this node, the Curve button displays the interface for automatic and manual discretization from data.
A state of a discretized variable is composed of three fields: a label, the lower bound, and the upper bound. The intervals can be modified in four ways:
1. via the interval chart (colored zone): a left click moves the threshold along the axis;
2. via the table: manually entered bounds must respect the following constraints: the lower bound must be lower than or equal to the upper bound and higher than the upper bounds of the preceding states; the upper bound must be higher than or equal to the lower bound and lower than the lower bounds of the following states;
3. with the "Normalize" button, which computes intervals of equal width;
4. with the "Generate Intervals" button, which opens the following window:
It is possible to set the desired number of intervals as well as the global lower and upper bounds. A discretization with equal widths is then performed between these global bounds. With the automatic naming, the labels use the syntax <=lower bound and >upper bound; otherwise they are prefixed by a string followed by the order number of the label. The "Generate Names" button launches the automatic naming that uses <=lower bound and >upper bound. The "Aggregates" button opens the aggregate editor, which allows adding or removing, for the selected state, different names considered equivalent (aliases). This is useful when you need to associate a database with a network. An aggregate can be renamed directly by double-clicking on its name in the table.
Clicking the "Add" button displays a new dialog box for entering a new aggregate name:
Clicking "Curve" at the top of the panel displays the discretization-from-data interface. This button is displayed only if data are associated with this node:
This interface is similar to the manual discretization interface of the data import/association. It represents the distribution function of the current node's data: the X-axis represents the number of individuals and the Y-axis the values of the continuous variable. The user can switch from this view to a representation of the density curve generated by the Batch-Means method. In that view, the values of the continuous variable are represented along the X-axis and the probability density along the Y-axis. The two red areas at the extremities indicate where the curve may not be accurate and cannot be used to place discretization points.
This window is fully interactive and allows, in both views:
- Adding a threshold: right click.
- Removing a threshold: right click on the threshold.
- Selecting a threshold: left click on a threshold.
- Moving a threshold: hold the left button down and move the mouse; the current Y-coordinate appears in the Point box.
- Zooming: Ctrl + left click down + move + release to define the area to enlarge. In the distribution function the zoom is done vertically, and in the density curve horizontally. It is possible to zoom successively as much as needed.
- Unzooming: Ctrl + double left click.
Besides this distribution function, a button gives access to the three automatic discretization methods through a new dialog. This can be considered a wizard for manual discretization, as it is possible to launch these methods, see the resulting discretization on the distribution function, and then modify the result by moving, deleting and adding thresholds.
If the chosen discretization fails, a dialog box is displayed to warn the user, who can then change the chosen discretization. When an interval is modified, deleted or added, in whatever way, and data are associated with the node, the modified interval values are automatically updated from the data. In the same way, the modified interval names are automatically regenerated if they are numerical. If the new manual or automatic discretization is validated by clicking the editor's Accept button, the database is automatically updated to take this new discretization into account. It is also stored as a manual discretization in the database.
3.3.3. Conditional Probability Table

A conditional probability table in BayesiaLab is read in the following way: zone 1 corresponds to the value combinations of the parents of the node. Of course, this zone exists only if the node has at least one incoming arc. When a node has parents, it is possible to change the order of the parents by dragging a parent's name inside the header of zone 1; when a parent is moved, the conditional probability table is reorganized to take the modification into account. Zone 2 corresponds to the probability distributions, conditional on each case described by zone 1, or to the a priori probability distribution in the absence of an incoming arc. If the network has a database and the states of the node have not been modified, a tooltip displays, for each probability, the number of corresponding cases in the database (taking into account the smoothing factor, if any) and the percentage of the database it represents. The first line of this table thus reads: the probability of having Dyspnea is 10% when TbOrCa is False and Bronchitis is False.
Table edition
The main problem with conditional probability tables is the exponential growth of the number of lines with respect to the number of parents: entering their probabilities can quickly become a tedious task. When databases exist, it is possible to fill the tables automatically (parameter learning). It is also possible to use equations to describe the probability distributions more concisely. Probability entry can also be alleviated by using the cut & paste facilities, inside the same table, between different tables, or with external applications. These tables also come with the classical cell selection tools, either by directly clicking on the cells or by clicking on the line/column headers. A click on a header while holding one of the modifier keys (Shift or Ctrl) makes an OR with the previous selection, while the same click with the other modifier key makes an AND. A click on a cell while holding one modifier key selects or unselects the cell without changing the rest of the selection; this is the way to edit a cell that belongs to a selection and to apply this edit to all the other selected cells. A click on a cell while holding the other modifier key selects from the active cell to the pointed cell. Probabilities can be copied and pasted to other tables or to external applications.
Complete
This operation consists in equally distributing the residual probability, defined as 100% minus the sum of the probabilities already defined on the line. If there are blank cells, this residual probability is distributed over these cells. Otherwise, if the residual probability is positive, this operation works like Normalize over all the cells of the line.
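As a sketch, the Complete operation described above could be implemented as follows. The helper class `CompleteRow` and the use of NaN for blank cells are hypothetical; probabilities are handled on a 0-1 scale rather than percentages.

```java
// Sketch of the "Complete" operation (hypothetical helper, not BayesiaLab code):
// the residual probability of a row is spread equally over its blank cells.
// Blank cells are represented here by Double.NaN.
public class CompleteRow {
    static double[] complete(double[] row) {
        double sum = 0.0;
        int blanks = 0;
        for (double p : row) {
            if (Double.isNaN(p)) blanks++;
            else sum += p;
        }
        double residual = 1.0 - sum; // the guide expresses this as 100% minus the sum
        double[] out = row.clone();
        if (blanks > 0) {
            for (int i = 0; i < out.length; i++) {
                if (Double.isNaN(out[i])) out[i] = residual / blanks;
            }
        }
        return out;
    }

    public static void main(String[] args) {
        // {0.2, blank, blank, 0.3}: residual 0.5 is split over the two blanks.
        double[] r = complete(new double[]{0.2, Double.NaN, Double.NaN, 0.3});
        for (double p : r) System.out.print(p + " ");
    }
}
```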
Normalize
Depending on whether the sum of the line exceeds or falls short of 100%, the probabilities are proportionally reduced or increased in such a way that the relative weight of each remains unchanged. If there is at least one blank cell and the residual probability is positive, this operation corresponds to Complete.
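The proportional rescaling of Normalize can be sketched as below; the helper class `NormalizeRow` is hypothetical, with probabilities on a 0-1 scale.

```java
// Sketch of the "Normalize" operation (hypothetical helper): every cell is
// scaled by the same factor so the row sums to 1, which keeps the relative
// weight of each cell unchanged.
public class NormalizeRow {
    static double[] normalize(double[] row) {
        double sum = 0.0;
        for (double p : row) sum += p;
        double[] out = new double[row.length];
        for (int i = 0; i < row.length; i++) out[i] = row[i] / sum;
        return out;
    }

    public static void main(String[] args) {
        // {2, 3, 5} keeps its relative weights: 0.2, 0.3, 0.5
        double[] r = normalize(new double[]{2, 3, 5});
        for (double p : r) System.out.print(p + " ");
    }
}
```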
Randomize
All the cells of the Conditional Probability Table, or just the selected cells, are randomly filled.
Assessment
When experts are associated with the network, an Assessment button is displayed and is activated when a cell is selected. The border of cells with assessments becomes green, and the icon displayed in the cell indicates the degree of disagreement between the experts for this cell.
3.3.4. Deterministic nodes
Table edition
Each cell has an associated list of values (the node's values). This list appears after a double left click in the cell. It is possible to select a set of cells by using the line or column headers, or directly by clicking on the cells while keeping the Shift or Ctrl key pressed, as in the conditional probability tables. When several cells are selected, the value is chosen by double-clicking on a cell while keeping the Ctrl key pressed; the chosen value is then set for all the selected cells.
3.3.5. Equations
Conditional probability tables can be described efficiently by using equations. These equations are applied for each cell of the probability tables, and the results are automatically normalized if necessary. Equations are edited in the following node edition panel.
1. The equation type: specifies whether the equation returns node values (deterministic) or numerical values that are considered, after normalization, as probabilities (probabilistic). The heading of the equation edition window (part 2) reflects the choice. Note that the edited variable appears in the probabilistic equation to indicate the corresponding column in the probability table.
2. Equation edition window: a text field for entering equations. The description language of the equations is strongly typed; the syntax uses functions and infix operators.
3. Message panel displaying the errors of the formula. If a continuous node has a deterministic equation, some computed values may fall outside the domain of the variable. In this case, a dialog box proposes to automatically enlarge the limits of the variable to contain these values.
4. This part specifies the number of random samples used for the generation of the probability tables. These samples are used to draw values inside each interval of the discretized continuous variables that appear in the equation. This part also specifies a smoothing parameter used to initialize the occurrence count of each table cell. Setting this parameter to a value greater than 0 not only smooths the probability distribution but also gives all the states a nonzero probability. This smoothing parameter is therefore unavailable for deterministic equations.
5. List of the probability distributions, functions and operators available to write an equation:
a. Discrete probability distributions
b. Continuous probability distributions
c. Special functions
d. Arithmetic functions
e. Transformation functions
f. Trigonometric functions
g. Relational operators
h. Boolean operators
i. Arithmetic operators
j. User functions
A double click on the name of a distribution, function or operator inserts it in the equation.
6. List of the variables available for the equation, i.e. the edited node and its parents, if any. If the Time parameter node is defined in the network, it is available in every equation. If constants are defined in the network, the word Constants is displayed; selecting it gives access to the list of available constants, as shown in the following image. A double click on the name of a variable inserts it in the equation. As can be seen in the following screenshot, variable names are flanked by question marks when they are referenced in the equation.
7. List of the values of the variable selected in part 6, or of the available constants if the Constants indicator is selected in part 6. A double click on a value or a constant inserts it in the equation.
The constants
To use previously defined constants, as in this example, simply select the Constants indicator in part 6 and double-click on the needed constant in part 7, as shown in the following image. The conditional probability table will be recomputed according to the current values of the used constants.
The variable type is dynamically set during the equation evaluation by using the following rules:
- Discrete variables with only two states defined as true and false, yes and no, vrai and faux, or oui and non (depending on the locale; case and order insensitive) are typed as Boolean. They can also be typed as String, depending on the context of use.
- Discrete variables with states made of characters are typed as String.
- Discrete variables whose states are all integers are typed as Integer. They can also be typed as String, depending on the context of use.
- Discrete variables whose states are all numbers, at least one of which is a real number, are typed as Real. They can also be typed as String, depending on the context of use.
- Continuous variables are typed as Real.
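The typing rules above can be sketched as follows. The class and method names are hypothetical, and the regular expressions used to recognize integer and real labels are an assumption; this is an illustration, not BayesiaLab's implementation.

```java
import java.util.List;

// Hypothetical sketch of the typing rules: given a node's state labels,
// infer a default type. The Boolean token pairs follow the guide's list.
public class StateTyping {
    enum VarType { BOOLEAN, STRING, INTEGER, REAL }

    static VarType infer(List<String> states) {
        if (states.size() == 2 && isBooleanPair(states.get(0), states.get(1))) {
            return VarType.BOOLEAN;
        }
        boolean allInteger = true, allNumeric = true;
        for (String s : states) {
            if (!s.matches("[+-]?\\d+")) allInteger = false;
            if (!s.matches("[+-]?\\d+(\\.\\d+)?")) allNumeric = false;
        }
        if (allInteger) return VarType.INTEGER;
        if (allNumeric) return VarType.REAL; // at least one real number
        return VarType.STRING;
    }

    // true/false, yes/no, vrai/faux, oui/non: case and order insensitive
    static boolean isBooleanPair(String a, String b) {
        String x = a.toLowerCase(), y = b.toLowerCase();
        String[][] pairs = {{"true", "false"}, {"yes", "no"}, {"vrai", "faux"}, {"oui", "non"}};
        for (String[] p : pairs) {
            if ((x.equals(p[0]) && y.equals(p[1])) || (x.equals(p[1]) && y.equals(p[0]))) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        System.out.println(infer(List.of("Yes", "No")));   // BOOLEAN
        System.out.println(infer(List.of("1", "2", "3"))); // INTEGER
        System.out.println(infer(List.of("1", "2.5")));    // REAL
        System.out.println(infer(List.of("Low", "High"))); // STRING
    }
}
```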
Binomial(k, n, p)
Description: Probability of ending up with exactly k occurrences of the same event of probability p among n independent experiments. Number of Parameters: 3 Parameter type: integer, integer, numerical Result type: real Example: The probability distribution below corresponds to Binomial(?N1?, 20, 0.3)
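As a worked check, the standard Binomial probability mass function C(n, k) * p^k * (1 - p)^(n - k) can be evaluated directly. This is an illustrative sketch with a hypothetical class name, not BayesiaLab code.

```java
// Standard Binomial pmf: P(k; n, p) = C(n, k) * p^k * (1 - p)^(n - k).
public class BinomialPmf {
    static double pmf(int k, int n, double p) {
        return choose(n, k) * Math.pow(p, k) * Math.pow(1 - p, n - k);
    }

    // Binomial coefficient computed multiplicatively to limit overflow.
    static double choose(int n, int k) {
        double c = 1.0;
        for (int i = 1; i <= k; i++) c *= (double) (n - k + i) / i;
        return c;
    }

    public static void main(String[] args) {
        // Binomial(?N1?, 20, 0.3): the probabilities over k = 0..20 sum to 1.
        double sum = 0.0;
        for (int k = 0; k <= 20; k++) sum += pmf(k, 20, 0.3);
        System.out.println(sum);
    }
}
```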
NegBinomial(k, n, p)
Description: Probability of needing k trials to obtain n successes of the same event of probability p among independent experiments. Number of Parameters: 3 Parameter type: integer, integer, numerical Result type: real Example: The probability distribution below corresponds to NegBinomial(?N1?, 4, 0.3)
Geometric(k, p)
Description: Probability of needing k independent experiments to have the first observation of an event of probability p. Number of Parameters: 2 Parameter type: integer, numerical Result type: real Example: The probability distribution below corresponds to Geometric(?N1?, 0.3)
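The Geometric pmf described above has the standard closed form (1 - p)^(k - 1) * p for k >= 1; a small sketch (hypothetical class name):

```java
// Standard Geometric pmf: probability that the first success occurs at
// trial k, given success probability p: P(k; p) = (1 - p)^(k - 1) * p.
public class GeometricPmf {
    static double pmf(int k, double p) {
        return Math.pow(1 - p, k - 1) * p;
    }

    public static void main(String[] args) {
        System.out.println(pmf(1, 0.3)); // success at the first trial
        System.out.println(pmf(2, 0.3)); // one failure, then success
    }
}
```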
Hypergeometric(k, n, m, N)
Description: Probability of ending up with k winning objects when choosing n objects among N where m are winning objects. Number of Parameters: 4 Parameter type: integer, integer, integer, integer Result type: real Example: The probability distribution below corresponds to Hypergeometric(?N1?, 5, 5, 20)
Poisson(k, l)
Description: Probability of ending up with k observations of an event during a large number of independent experiments when the mean is l. Number of Parameters: 2 Parameter type: integer, real Result type: real Example: The probability distribution below corresponds to Poisson(?N1?, 18.5)
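The Poisson pmf has the standard form e^(-l) * l^k / k!; a sketch (hypothetical class name) computed in log space for numerical stability:

```java
// Standard Poisson pmf: P(k; lambda) = e^(-lambda) * lambda^k / k!,
// computed in log space to avoid overflow of lambda^k and k!.
public class PoissonPmf {
    static double pmf(int k, double lambda) {
        double logP = -lambda + k * Math.log(lambda);
        for (int i = 2; i <= k; i++) logP -= Math.log(i);
        return Math.exp(logP);
    }

    public static void main(String[] args) {
        // Poisson(?N1?, 18.5): the probabilities over k sum to 1.
        double sum = 0.0;
        for (int k = 0; k < 200; k++) sum += pmf(k, 18.5);
        System.out.println(sum);
    }
}
```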
DiscUniform(k, a, b)
Description: Uniform distribution defined on the discrete interval [a, b]. Number of Parameters: 3 Parameter type: integer, integer, integer Result type: real
Triangular(x, m, l, r)
Description: Triangular probability distribution of x with modal value m, left deviation l and right deviation r. Number of Parameters: 4 Parameter type: numerical, numerical, numerical, numerical Result type: real Example: The probability distribution below corresponds to Triangular(?N1?, 0.5, 0.2, 0.4)
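The Triangular density above can be sketched as follows. Interpreting m as the mode, l as the left deviation (support starting at m - l) and r as the right deviation (support ending at m + r) is an assumption about the parameterization; the class name is hypothetical.

```java
// Sketch of a triangular density with mode m, left deviation l and right
// deviation r (assumed support [m - l, m + r], peak height 2 / (l + r)).
public class TriangularPdf {
    static double pdf(double x, double m, double l, double r) {
        if (x < m - l || x > m + r) return 0.0;
        if (x <= m) return 2.0 * (x - (m - l)) / ((l + r) * l);
        return 2.0 * ((m + r) - x) / ((l + r) * r);
    }

    public static void main(String[] args) {
        // Triangular(?N1?, 0.5, 0.2, 0.4): peak height 2 / (0.2 + 0.4) at x = 0.5
        System.out.println(pdf(0.5, 0.5, 0.2, 0.4));
    }
}
```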
Cauchy(x, m, s)
Description: Cauchy probability distribution of x with modal value m and scale s. Number of Parameters: 3 Parameter type: numerical, numerical, numerical Result type: real Example: The probability distribution below corresponds to Cauchy(?N1?, 0.5, 0.1)
Exponential(x, l)
Description: Exponential probability distribution of x with lambda = l. Number of Parameters: 2 Parameter type: numerical, numerical Result type: real Example: The probability distribution below corresponds to Exponential(?N1?, 2)
Weibull(x, a, l)
Description: Weibull probability distribution of x. Note that Weibull(x, 1, l) = Exponential(x, l). Number of Parameters: 3 Parameter type: numerical, numerical, numerical Result type: real Example: The probability distribution below corresponds to Weibull(?N1?, 1.5, 1.5)
Gamma(x, a, l)
Description: Gamma probability distribution of x. Gamma(x, 1, l) = Exponential(x, l). Number of Parameters: 3 Parameter type: numerical, numerical, numerical Result type: real Example: The probability distribution below corresponds to Gamma(?N1?, 1.5, 1.5)
Beta(x, a, b, min, max)
Description: Beta probability distribution of x. Number of Parameters: 5; the last two represent the lower and upper bounds of the variable, with default values 0 and 1 respectively. Parameter type: numerical, numerical, numerical, numerical, numerical Result type: real Example: The probability distribution below corresponds to Beta(?N1?, 2, 5, 0, 5)
ChiSquare(x, n)
Description: Chi-Square probability distribution of x with n degrees of freedom. Number of Parameters: 2 Parameter type: numerical, integer Result type: real Example: The probability distribution below corresponds to ChiSquare(?N1?, 3)
LogNormal(x, m, s)
Description: Log normal probability distribution of x. Number of Parameters: 3
Parameter type: numerical, numerical, numerical Result type: real Example: The probability distribution below corresponds to LogNormal(?N1?, 0.4, 0.8)
Uniform(x, a, b)
Description: Uniform probability distribution of x on the interval [a, b]. Number of Parameters: 3 Parameter type: numerical, numerical, numerical Result type: real
CumulBinomial(k, n, p)
Description: Normalized cumulative probability of ending up with k or fewer occurrences of the same event of probability p among n independent experiments. Number of Parameters: 3
RandomUniform(min, max)
Description: Returns a real value greater than or equal to min and less than max. Returned values are chosen pseudo-randomly with an (approximately) uniform distribution over that range. Number of Parameters: 2 Parameter type: numerical, numerical Result type: real
RandomGaussian(m, s)
Description: Returns the next pseudo-random, Gaussian ("normally") distributed real value with mean m and standard deviation s from the random number generator's sequence. Number of Parameters: 2 Parameter type: numerical, numerical Result type: real
Ceil(x)
Description: Nearest integer greater than or equal to x. Number of Parameters: 1 Parameter type: numerical Result type: real
Exp(x)
Description: Exponential of x. Number of Parameters: 1 Parameter type: numerical Result type: real
Fact(n)
Description: Factorial of n. Number of Parameters: 1 Parameter type: integer >= 0 Result type: integer
Floor(x)
Description: Nearest integer less than or equal to x. Number of Parameters: 1 Parameter type: numerical Result type: real
Frac(x)
Description: Fractional part of x. Number of Parameters: 1 Parameter type: numerical Result type: real
Integer(x)
Description: Integer corresponding to x: Floor(x) when x is positive, Ceil(x) otherwise. Number of Parameters: 1 Parameter type: numerical Result type: real
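A quick check of the rounding helpers above: Integer(x) truncates toward zero (Floor for positive x, Ceil for negative x). The class name is hypothetical, and defining Frac(x) as x - Integer(x), with a sign that follows x, is an assumption.

```java
// Sketch of Integer(x) and Frac(x) as described above (hypothetical class).
public class IntegerFunction {
    // Truncation toward zero: Floor for positive x, Ceil for negative x.
    static double integer(double x) {
        return x >= 0 ? Math.floor(x) : Math.ceil(x);
    }

    // Assumed definition: the fractional remainder, keeping the sign of x.
    static double frac(double x) {
        return x - integer(x);
    }

    public static void main(String[] args) {
        System.out.println(integer(2.7));  // 2.0
        System.out.println(integer(-2.7)); // -2.0
    }
}
```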
Ln(x)
Description: Natural logarithm of x. Number of Parameters: 1 Parameter type: numerical Result type: real
Log(x)
Description: Common (base 10) logarithm of x. Number of Parameters: 1 Parameter type: numerical Result type: real
Log2(x)
Description: Binary (base 2) logarithm of x. Number of Parameters: 1 Parameter type: numerical Result type: real
Max(x1, x2)
Description: Maximum of x1 and x2. Number of Parameters: 2 Parameter type: numerical, numerical Result type: numerical
Min(x1, x2)
Description: Minimum of x1 and x2. Number of Parameters: 2 Parameter type: numerical, numerical Result type: numerical
Sqrt(x)
Description: Square root of x. Number of Parameters: 1 Parameter type: numerical Result type: real
Clip(x, min, max)
Description: Clip x to the interval [min, max]. Number of Parameters: 3 Parameter type: numerical, numerical, numerical Result type: real
Round(x)
Description: Nearest integer less than or equal to x if the fractional part is less than 1/2, nearest integer greater than or equal to x otherwise. Number of Parameters: 1 Parameter type: numerical Result type: real
RoundTo(x, n)
Description: Rounds x to n digits after (if n is negative) or before (if n is positive) the decimal point. When n equals 0, this function is identical to Round(x). Number of Parameters: 2 Parameter type: numerical, integer Result type: real
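The two rounding functions can be sketched as below. Note the sign convention stated in this guide, which is the reverse of many libraries: a negative n rounds after the decimal point, a positive n before it, which amounts to RoundTo(x, n) = Round(x / 10^n) * 10^n. The class name is hypothetical.

```java
// Sketch of Round and RoundTo per the descriptions above (hypothetical class).
public class Rounding {
    // Floor if the fractional part is below 1/2, Ceil otherwise.
    static double round(double x) {
        double f = x - Math.floor(x);
        return f < 0.5 ? Math.floor(x) : Math.ceil(x);
    }

    // Sign convention from this guide: negative n rounds after the decimal
    // point, positive n before it.
    static double roundTo(double x, int n) {
        double scale = Math.pow(10, n);
        return round(x / scale) * scale;
    }

    public static void main(String[] args) {
        System.out.println(round(2.5));           // 3.0
        System.out.println(roundTo(3.14159, -2)); // ~3.14
        System.out.println(roundTo(1234.0, 2));   // 1200.0
    }
}
```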
Sign(x)
Description: Returns -1, 0 or 1 depending on the sign of x. Number of Parameters: 1 Parameter type: numerical Result type: integer
Cos(x)
Description: Cosine of x. Number of Parameters: 1 Parameter type: numerical Result type: real
Tan(x)
Description: Tangent of x. Number of Parameters: 1 Parameter type: numerical Result type: real
Asin(x)
Description: Inverse sine of x. Number of Parameters: 1 Parameter type: numerical Result type: real
Acos(x)
Description: Inverse cosine of x. Number of Parameters: 1 Parameter type: numerical Result type: real
Atan(x)
Description: Inverse tangent of x. Number of Parameters: 1 Parameter type: numerical Result type: real
Sinh(x)
Description: Hyperbolic sine of x. Number of Parameters: 1 Parameter type: numerical Result type: real
Cosh(x)
Description: Hyperbolic cosine of x. Number of Parameters: 1 Parameter type: numerical Result type: real
Tanh(x)
Description: Hyperbolic tangent of x. Number of Parameters: 1 Parameter type: numerical Result type: real
Relational operators
The priority of the operators ranges from 1 to 9, 9 being the highest priority.
==, !=
Description: Equality and inequality operators. Number of Parameters: 2 Parameter type: the two parameters must be of the same type, or numerical Result type: boolean Priority: 4
Boolean operators
The priority of the operators ranges from 1 to 9, 9 being the highest priority.
!
Description: Negation operator. Number of Parameters: 1 Parameter type: boolean Result type: boolean Priority: 8
XOR
Description: Exclusive-or operator. Number of Parameters: 2 Parameter type: boolean, boolean Result type: boolean Priority: 3
&
Description: And operator. Number of Parameters: 2 Parameter type: boolean, boolean Result type: boolean Priority: 2
|
Description: Or operator. Number of Parameters: 2 Parameter type: boolean, boolean Result type: boolean Priority: 1
Arithmetic operators
The priority of the operators ranges from 1 to 9, 9 being the highest priority.
^
Description: Exponentiation operator. Number of Parameters: 2 Parameter type: numerical, integer Result type: numerical Priority: 9
-
Description: Negation (unary minus) operator. Number of Parameters: 1 Parameter type: numerical Result type: numerical Priority: 8
%
Description: Modulo operator. Number of Parameters: 2 Parameter type: integer, integer Result type: integer Priority: 7
*
Description: Multiplication operator. Number of Parameters: 2 Parameter type: numerical, numerical Result type: numerical Priority: 7
/
Description: Division operator. Number of Parameters: 2
+, -
Description: Addition and subtraction operators. Number of Parameters: 2 Parameter type: numerical, numerical Result type: numerical Priority: 6
Architecture
To allow this integration, a Java interface is included in the BayesiaLab.jar library located in BayesiaLab's installation directory. In order to create a custom function, the user must implement this interface in a Java class. Once the class is created, a jar library file must be built and copied into a directory named "ext", a sub-directory of BayesiaLab's running directory. If the created Java library needs additional files, they must be located in BayesiaLab's running directory or in its "ext" sub-directory; in the latter case, remember to make your program look for the additional files in the ext sub-directory. For Windows, the path is: C:\Documents and Settings\All Users\Application Data\BayesiaLab\ext
Interface
The Java interface that must be implemented is called ExternalFunction and is located in the package com.Bayesia.BayesiaLab.ext. Here is its content:

package com.Bayesia.BayesiaLab.ext;

public interface ExternalFunction {

    /**
     * Define the real type.
     */
    public static final int REAL = 1;

    /**
     * Define the integer type.
     */
    public static final int INTEGER = 2;

    /**
     * Define the boolean type.
     */
    public static final int BOOLEAN = 3;

    /**
     * Define the String type.
     */
    public static final int STRING = 4;

    /**
     * This method is called once when the object is created, i.e.
     * when the formula is built.
     * It can be used to create structures that will be constant
     * for the whole lifetime of the formula it is part of.
     * @return true if the initialization process succeeds, otherwise false
     */
    public boolean initialization();

    /**
     * This method is called once, when the object is no longer used
     * and is about to be destroyed.
     * @return true if the finalization process succeeds, otherwise false
     */
    public boolean finalization();

    /**
     * This method is called before the generation of the conditional
     * probability tables.
     * It indicates how many calls to the method evaluate() will be done.
     * For example, a binary node with two binary parents will need
     * 2*2*2 = 8 combinations, and if this node or any of its parents is
     * continuous, the number of samples that will be generated for each
     * combination (1000 for example) will be given as the sample number.
     * @param combinationNumber the number of combinations the
     * conditional probability table represents.
     * @param sampleNumber the number of samples that will be executed
     * if at least one node is continuous.
     */
    public void beforeEvaluation(int combinationNumber, int sampleNumber);

    /**
     * This method is called after the evaluation of all the combinations.
     */
    public void afterEvaluation();

    /**
     * This is the main method that evaluates the function for each
     * given combination of parameters.
     * If an error occurs during the evaluation, an ArithmeticException
     * must be thrown with the appropriate error message.
     * @throws ArithmeticException if an error has occurred during the
     * evaluation
     */
    public void evaluate() throws ArithmeticException;

    /**
     * Return the name of the function that will be displayed
     * in BayesiaLab and recognized by the formula parser.
     * @return a String representing the name of the function
     */
    public String getName();

    /**
     * Return the description of the function that will be displayed
     * as a tooltip in BayesiaLab. This string can contain HTML tags.
     * @return a String describing the function
     */
    public String getDescription();

    /**
     * This method indicates if the number of parameters is variable.
     * If true, the method getParameterNumber() indicates the minimum
     * number of parameters that the function accepts.
     * @return true if the parameter number is variable, false otherwise
     * @see #getParameterNumber()
     */
    public boolean isVariableParameterNumber();

    /**
     * This method is used to indicate to the function the number of
     * parameters that will be used when the parameter number is variable.
     * This method is called after the creation of the instance.
     * parameterNumber must be greater than or equal to the
     * value returned by getParameterNumber().
     * @see #getParameterNumber()
     * @see #isVariableParameterNumber()
     */
    public void setUsedParameterNumber(int parameterNumber);

    /**
     * Return the number of parameters the function uses.
     * If the number of parameters is variable, it indicates the minimum
     * number of parameters used by the function.
     * @return the number of parameters
     * @see #isVariableParameterNumber()
     */
    public int getParameterNumber();

    /**
     * Return the name of the parameter at the specified index
     * (from 0 to n). If the parameter number is variable, this method
     * will be called only for the first parameters until
     * getParameterNumber().
     * @param parameterIndex the index of the parameter
     * @return the name of the parameter
     * @see #getParameterNumber()
     * @see #isVariableParameterNumber()
     */
    public String getParameterNameAt(int parameterIndex);

    /**
     * Return the type of the parameter at the specified index
     * (from 0 to n).
     * The type must be one of REAL, INTEGER, BOOLEAN or STRING.
* @param parameterIndex the index of the parameter * @return the type of the parameter */ public int getParameterTypeAt(int parameterIndex); /** * Return the type of the value returned by the function. * The type must be one of REAL, INTEGER, BOOLEAN or STRING. * @return the type of the returned value
57
Graph Windows
*/ public int getReturnType(); /** * Sets the integer value of the parameter at parameterIndex. * The type of the parameter must be one of INTEGER or REAL. * @param parameterIndex the index of the parameter * @param value the integer value associated to the parameter */ public void setParameterValueAt(int parameterIndex, int value); /** * Sets the double value of the parameter at parameterIndex. * The type of the parameter must be REAL. * @param parameterIndex the index of the parameter * @param value the double value associated to the parameter */ public void setParameterValueAt(int parameterIndex, double value); /** * Sets the boolean value of the parameter at parameterIndex. * The type of the parameter must be BOOLEAN. * @param parameterIndex the index of the parameter * @param value the boolean value associated to the parameter */ public void setParameterValueAt(int parameterIndex, boolean value); /** * Sets the string value of the parameter at parameterIndex. * The type of the parameter must be STRING. * @param parameterIndex the index of the parameter * @param value the String value associated to the parameter */ public void setParameterValueAt(int parameterIndex, String value); /** * This method is called to get the result of the evaluation if * the return type is INTEGER. * @return the result of the evaluation as integer */ public int getIntegerResult(); /** * This method is called to get the result of the evaluation if * the return type is REAL. * @return the result of the evaluation as double */ public double getRealResult(); /** * This method is called to get the result of the evaluation if * the return type is BOOLEAN. * @return the result of the evaluation as boolean */ public boolean getBooleanResult(); /** * This method is called to get the result of the evaluation if * the return type is STRING. * @return the result of the evaluation as String */ public String getStringResult(); }
Example of implementation
The following example uses a Windows dynamic-link library (DLL) that returns the sum of two real numbers.

import com.Bayesia.BayesiaLab.ext.*;

public class Sum implements ExternalFunction {

    /**
     * Loads the library that performs the computation, located in the ext
     * sub-directory of the execution directory of BayesiaLab.
     */
    static {
        System.loadLibrary("ext/Sum");
    }

    /** An array defining the names of the parameters. */
    private static final String[] parameters = {"a", "b"};

    /** An array defining the types of the parameters. */
    private static final int[] parameterTypes = {REAL, REAL};

    /** An array storing the values of the parameters. */
    private double[] values = new double[2];

    /** The result of the computation, initialized with Not a Number. */
    private double result = Double.NaN;

    /**
     * This method is linked to the method defined in the library Sum that
     * performs the computation.
     * @param a
     * @param b
     * @return the sum of a and b
     */
    private native double compute(double a, double b);

    /** Calls the native method compute with the parameters. */
    public void evaluate() throws ArithmeticException {
        result = compute(values[0], values[1]);
    }

    /** Nothing to do. */
    public boolean initialization() {
        return true;
    }

    /** Nothing to do. */
    public boolean finalization() {
        return true;
    }

    /** Nothing to do. */
    public void beforeEvaluation(int combinationNumber, int sampleNumber) {
    }

    /** Nothing to do. */
    public void afterEvaluation() {
    }

    /** The name of the function is Sum. */
    public String getName() {
        return "Sum";
    }

    public String getDescription() {
        return "Return the sum of the real numbers a and b.";
    }

    public boolean isVariableParameterNumber() {
        return false;
    }

    /** Nothing to do because the parameter number is not variable. */
    public void setUsedParameterNumber(int parameterNumber) {
    }

    public int getParameterNumber() {
        return parameters.length;
    }

    public String getParameterNameAt(int parameterIndex) {
        return parameters[parameterIndex];
    }

    public int getParameterTypeAt(int parameterIndex) {
        return parameterTypes[parameterIndex];
    }

    public int getReturnType() {
        return REAL;
    }

    /** Nothing to do because there is no integer parameter. */
    public void setParameterValueAt(int parameterIndex, int value) {
    }

    /** Gives the real value to the parameter at the specified index. */
    public void setParameterValueAt(int parameterIndex, double value) {
        values[parameterIndex] = value;
    }

    /** Nothing to do because there is no boolean parameter. */
    public void setParameterValueAt(int parameterIndex, boolean value) {
    }

    /** Nothing to do because there is no String parameter. */
    public void setParameterValueAt(int parameterIndex, String value) {
    }

    /** Returns 0 by default; it won't be called. */
    public int getIntegerResult() {
        return 0;
    }

    /** Returns the result of the computation. */
    public double getRealResult() {
        return result;
    }

    /** Returns false by default; it won't be called. */
    public boolean getBooleanResult() {
        return false;
    }

    /** Returns null by default; it won't be called. */
    public String getStringResult() {
        return null;
    }
}
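For comparison, here is a sketch of a function implemented purely in Java, with no native library. To keep the example self-contained, a trimmed-down local copy of the interface is inlined below, reduced to the members this example uses; this trimmed interface and the Product class are illustrative only. A real plugin must be compiled against BayesiaLab.jar and implement every method of the full ExternalFunction interface shown earlier.

```java
// Trimmed local stand-in for com.Bayesia.BayesiaLab.ext.ExternalFunction,
// reduced to the members used by this illustrative example.
interface ExternalFunction {
    int REAL = 1;
    boolean initialization();
    boolean finalization();
    void beforeEvaluation(int combinationNumber, int sampleNumber);
    void afterEvaluation();
    void evaluate() throws ArithmeticException;
    String getName();
    String getDescription();
    boolean isVariableParameterNumber();
    void setUsedParameterNumber(int parameterNumber);
    int getParameterNumber();
    String getParameterNameAt(int parameterIndex);
    int getParameterTypeAt(int parameterIndex);
    int getReturnType();
    void setParameterValueAt(int parameterIndex, double value);
    double getRealResult();
}

/** Illustrative pure-Java function returning the product of two real numbers. */
public class Product implements ExternalFunction {

    /** Storage for the two real parameters. */
    private final double[] values = new double[2];

    /** Result of the last evaluation. */
    private double result = Double.NaN;

    /** Computes the product; no native call is needed. */
    public void evaluate() { result = values[0] * values[1]; }

    public boolean initialization() { return true; }
    public boolean finalization() { return true; }
    public void beforeEvaluation(int combinationNumber, int sampleNumber) { }
    public void afterEvaluation() { }
    public String getName() { return "Product"; }
    public String getDescription() { return "Return the product of the real numbers a and b."; }
    public boolean isVariableParameterNumber() { return false; }
    public void setUsedParameterNumber(int parameterNumber) { }
    public int getParameterNumber() { return 2; }
    public String getParameterNameAt(int parameterIndex) { return parameterIndex == 0 ? "a" : "b"; }
    public int getParameterTypeAt(int parameterIndex) { return REAL; }
    public int getReturnType() { return REAL; }
    public void setParameterValueAt(int parameterIndex, double value) { values[parameterIndex] = value; }
    public double getRealResult() { return result; }
}
```

Since the computation stays in Java, no JNI header, native compilation, or System.loadLibrary call is required for such a function.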
Compilation
The Java class is compiled with the following command line:

javac -cp "C:\Program Files\Bayesia\BayesiaLab\BayesiaLab.jar" Sum.java

It may be necessary to give the full path to the javac.exe program, located in the bin directory of the JDK 1.6. The file Sum.class will be generated.
/* Header for class Sum */
#ifndef _Included_Sum
#define _Included_Sum
#ifdef __cplusplus
extern "C" {
#endif
/*
 * Class:     Sum
 * Method:    compute
 * Signature: (DD)D
 */
JNIEXPORT jdouble JNICALL Java_Sum_compute
  (JNIEnv *, jobject, jdouble, jdouble);
#ifdef __cplusplus
}
#endif
#endif

This file must not be modified. For more information on JNI (Java Native Interface), refer to the JNI documentation available online.
Starting
Once the plugin has been created and BayesiaLab restarted, the plugin is loaded and available in the equation editor:
3.3.6. Assessments
BayesiaLab's knowledge elicitation mechanism allows a set of experts to provide assessments on the conditional probability tables of nodes. A "facilitator" can mediate, asking relevant questions to the experts and gathering their assessments. The experts must be declared in the network with the expert editor. Two ways of capturing the assessments are proposed:

- Each assessment is input directly in BayesiaLab.
- Each assessment is input via the online assessment tool.

When a node has assessments, the icon is displayed at the bottom left of the node. The greater the disagreement between the assessments for this node, the darker the background of the icon becomes. A slideshow explaining the principle of knowledge elicitation in BayesiaLab can be found at the following address: http://www.bayesia.com/en/products/bayesialab/resources/tutorials/bayesiaLab-knowledge-elicitation-environment.php

Assessments can be entered from the tab Probability Distribution in the node editor. The experts must already have been created. In that case, a button Assessment is displayed and becomes enabled when a cell is selected. After pressing the button, the assessment is edited for the selected line.
When an assessment has been made for a line of the conditional probability table, the corresponding cells have a green border. The icon displayed in the cell indicates the magnitude of the disagreement between the experts for this cell: the more visible the icon, the greater the disagreement. Moving the mouse over a cell displays a tooltip showing:

- the minimum given by an expert for this cell
- the maximum given by an expert for this cell
- the number of assessments for this cell
Editor
The assessment editor displays the list of assessments for the selected line of the conditional probability table.
A line in this table defines an assessment, which consists of:

- A probability for each state of the node; the probabilities must sum to 100. The background of the corresponding cells is gray. These values can be edited by double-clicking, or by typing the value when the cell is selected.
- The name of the expert who made this assessment. The expert is chosen from a combo box containing the name of each expert.
- The confidence the expert gives to his assessment. This value is editable.
- The comment the expert makes about his assessment. A tooltip containing the entire comment is displayed when the mouse moves over the cell.
- The time taken to make this assessment, measured in seconds. This value is editable. It is 0 by default when the assessment is filled in manually, but it is filled in automatically when the assessment is made online.

The button Add adds a default assessment at the end of the list; this assessment can then be edited manually. It is possible to have several assessments from the same expert but, for clarity, this is not recommended. The button Delete deletes the selected assessments. When an assessment is selected, the expert's image is displayed on the right side:
Assessment Validation
When you press the button Accept of the dialog box, the assessments are automatically normalized if necessary, so that the probabilities sum to 100. Then the consensus is computed; it serves as the probability distribution of the corresponding row in the conditional probability table of the node. This consensus is computed by averaging the assessments for each state, weighted by the confidence of each expert. Assessments whose confidence is equal to zero are not taken into account.
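The weighting scheme just described can be sketched as follows. This is an illustrative reconstruction of the computation as stated here (a confidence-weighted average that skips zero-confidence assessments), not BayesiaLab's actual code; the class and method names are hypothetical, and other factors such as expert credibility may also enter the real computation.

```java
import java.util.Arrays;

public class Consensus {
    /**
     * Computes the consensus distribution from a set of assessments as a
     * confidence-weighted average of the per-state probabilities.
     * Assessments with a confidence of zero are ignored.
     *
     * @param assessments one probability distribution per expert (each summing to 100)
     * @param confidences one confidence value per assessment
     * @return the consensus distribution (summing to 100)
     */
    static double[] consensus(double[][] assessments, double[] confidences) {
        double[] avg = new double[assessments[0].length];
        double totalWeight = 0.0;
        for (int e = 0; e < assessments.length; e++) {
            if (confidences[e] == 0.0) continue; // zero-confidence assessments are skipped
            totalWeight += confidences[e];
            for (int s = 0; s < avg.length; s++) {
                avg[s] += confidences[e] * assessments[e][s];
            }
        }
        for (int s = 0; s < avg.length; s++) {
            avg[s] /= totalWeight;
        }
        return avg;
    }

    public static void main(String[] args) {
        // Two experts assess a binary node; the second is twice as confident.
        double[][] assessments = { {60, 40}, {90, 10} };
        double[] confidences = { 1.0, 2.0 };
        System.out.println(Arrays.toString(consensus(assessments, confidences)));
        // -> [80.0, 20.0]
    }
}
```

With confidences 1 and 2, the first state gets (1*60 + 2*90) / 3 = 80, matching the weighted-average description above.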
Online Editing
When an online knowledge elicitation session has been created, a button Post Assessment is displayed in the window.
When you press this button, the current question is posted online so that the experts connected via a dedicated web interface can answer it. The question asks the experts for the probability distribution that corresponds to the chosen line of the conditional probability table. If assessments already exist for this question, the web interface of the relevant experts will be pre-filled. A window displays, in real time, the experts who have answered the question:
Web Interface
Once the online session has been opened, an expert can connect with his Internet browser to the following secure address: https://www.bayesialab.com He then enters his expert name and the name of the session, respecting the case:
Once a question is posted through BayesiaLab, it appears in the interface of each expert. If the expert has already answered this question, the corresponding fields are already filled in; they may be modified if necessary.
On the right side, the name and the comment of the node are displayed. A colored slider and a field allow editing the probability associated with each state of the node. When the probability of a state is changed, the others are adjusted so that they always sum to 100. A lock located on the left allows blocking the corresponding probability so that it is no longer editable; the probability can be unlocked by clicking on the lock again. The expert can also enter the confidence he grants to his assessment, and a text field lets him enter the comment associated with this assessment. On the left, the context of the question is displayed, i.e. the current states of each parent of the node. Hovering over the names of the parents displays their comments in a tooltip. The pie chart at the bottom left reproduces the current probability distribution, and a textual indicator summarizes the confidence of the assessment. The button Validation sends the answer of the expert to the server, and the indicator of experts who have answered is then updated in BayesiaLab. After that, the waiting interface is displayed again.
Tools
The contextual menu of the expert indicator allows generating an assessment report. The menu Tools > Assessment provides access to various tools for exporting assessments. The menu Analysis > Graphic provides access to the assessment sensitivity analysis.
Furthermore, a node (Discrete or Continuous) has various possible states, each one being represented by a different visualization:
- Normal: the node in the standard state (node N1);
- Selected: the selection makes it possible to carry out operations like Move, Delete, Monitor, etc. (node N2);
- Monitored: a monitored node has an associated monitor in the monitoring panel (node N3);
- Hard evidence: an observed node (node N4) represents a random variable that has been observed. Based on these observations, it is possible to run probabilistic inference to compute the probabilities of the unobserved random variables given the values of the observed ones;
- Soft evidence: a soft-observed node (node N5) represents a random variable for which the user has entered likelihoods;
- Simple probability: a node with a simple probability set (node N5) represents a random variable for which the user has entered likelihoods derived from probabilities;
- Fixed probability: a node with a fixed probability (node N9) represents a random variable for which the user has entered fixed probabilities;
- Target: a target node (node N6) is a particular node (possibly having a target value) that focuses the analysis algorithms on that node (in validation mode), or directs the learning process (supervised learning):
  - Machine learning: the design of the structure is fully focused on the characterization of this node;
  - Evaluation: the precision of the prediction of the Bayesian network is measured with respect to the target node. The target value is used for measuring the precision with the Gain, Lift and ROC curves;
  - Analysis: the analysis measures the influence of the network variables on the knowledge of the target variable (Target Node Analysis, Target Analysis Report), or on the knowledge of the target state (Target State Analysis: accurate analysis of the influence of a variable on the target state). The target node is also used to specify the root evidence needed to generate the Evidence Analysis Report;
  - Inference: the value of the target is inferred for each case described in a database (Batch Exploitation);
  - Decision-making aid: development of an adaptive questionnaire focused on the knowledge of the target variable, or on the knowledge of the target state;
- Temporally spied: a node can be temporally spied only if it belongs to a dynamic Bayesian network. This state indicates that the probability evolution of a particular value of this node (node N7) will be followed in a chart, and possibly in an output file, during the temporal simulation;
- Hidden: this kind of node (node N8) corresponds to nodes created while a database is associated with the network. These nodes have no corresponding data in the base (typically, the Cluster node is a hidden node). They can be used to create new concepts based on nodes that exist in the base. As in Clustering, it is possible to connect hidden nodes to a subset of nodes, to indicate the number of states, and to learn the parameters. Once the parameter learning is achieved, Taboo learning (the only Association Discovery algorithm that does not remove the existing arcs when launched) can be used to integrate these hidden nodes into the rest of the network;
- Not observable: this kind of node (node N9) corresponds to nodes that have no associated cost;
- Translucent: a node (node N10) can become translucent for various reasons; in this case, it is no longer selectable. The possible causes are:
  - it doesn't belong to a displayed class,
  - its arcs are filtered by Pearson's correlation,
  - the current mode is the neighborhood analysis,
  - etc.
  If the comment associated with the node is shown on screen, it becomes translucent as well;
- Excluded: this kind of node (node N11) corresponds to nodes that must not be taken into account during learning. It is possible to exclude or include a node through the contextual menu of the node. When a node is excluded, the supervised or structural learning algorithms do not add any relationship of which one of the ends is this node. The MDL score is modified accordingly.

Beyond these states, the nodes can be tagged with a Warning or an Error. In the latter case, it means that there is an error in the formula of the node (after changing the parent nodes). The warning indicates either that the formula has not been verified, or that the conditional probability table has not been edited, or that the probabilities do not sum to 100. Keeping pressed while pointing at one of those icons displays the corresponding error/warning message.

The icon, displayed at the bottom left of the node, indicates whether assessments are associated with this node. The greater the disagreement between the assessments for this node, the darker the background of the icon becomes. Pressing while hovering over the icon displays the following information:

- the number of rows of the conditional probability table with assessments, compared to the total number of rows
- the total number of assessments
- the number of experts involved, compared to the total number of experts
- the global disagreement
- the maximum disagreement
The icon, displayed at the top right of the node, indicates whether there are missing values in the database associated with the network. Pressing while hovering over the icon displays the number of missing values associated with this node. The icon, displayed at the bottom right of the node, indicates whether one of the node's states is defined as a filtered state. The icon, displayed at the right of the node's name, indicates whether there is a comment associated with the node. Pressing while hovering over the icon displays the comment associated with this node. All these indicators can be hidden with the Hide information button of the display toolbar, or with the corresponding menu item in the View menu.
The arcs that link chance nodes and utility nodes are fixed, to prevent arc inversion: utility nodes cannot have outgoing arcs. In Validation Mode, the expected value of a utility node is displayed in a specific monitor.
As can be seen in the screenshot above, the best policy (the best action for each state) is described with blue-background cells in the quality table of the Decision node. In Validation Mode, the best action is displayed in a specific monitor, and its expected value can be displayed by pressing while pointing at the monitor.
The button Display loops displays the following dialog box, showing the loops that have been introduced and their lengths. Selecting a loop in the list displays the arcs belonging to that loop in pink in the graph.
Arc inversion
An arc can be inverted by using the contextual menu that is activated by a left click on the arc (or on an arc belonging to the selection). In Modeling mode, any arc can be inverted (but the conditional probability tables are lost). In Validation mode, an arc can only be inverted when the inversion leads to an equivalent graph, i.e. a graph belonging to the same equivalence class (the conditional probability tables are then modified so that the probability law remains unchanged).
Arc deletion
The deletion of an arc is made either in the deletion mode (see Use) or by using the shortcut + click.
Move an arc
An arc is moved by keeping pressed during the drag operation. An arc can only be moved if the initial origin node has the same number of states as the final origin node. The probability table is preserved.
Edge display
Displaying edges is only allowed in Validation mode. It can be done by using the Inference menu or directly by means of the shortcut. This function is unavailable for dynamic Bayesian networks and approximate inference.
Edge orientation
Orienting an edge is only possible in Validation mode. It can be done by using the Inference menu or directly by means of the shortcut. Edges are automatically directed when switching to Modeling mode. This function is unavailable for dynamic Bayesian networks and approximate inference.
An arc can be in one of these four states:

- normal: the arc in a standard state (arc from N1 towards N3);
- selected: the selection makes it possible to carry out operations like Invert, Delete, etc. (arc from N2 towards N3);
- fixed: a fixed arc (arc from N1 towards N2) is a means to introduce a priori knowledge about the network structure before learning;
- temporal: a temporal arc indicates a temporal relation between two particular nodes: the father node (Node t) represents a variable at time step t, whereas the child node represents the same variable at time step t+1 (Node t+1). These two nodes must therefore be strictly identical. In addition, a node can only have one single temporal child. A temporal arc transforms the network into a dynamic Bayesian network.

An arc can become translucent for various reasons: one of its extremities doesn't belong to a displayed class, it is filtered by Pearson's correlation, etc. In this case, it is no longer selectable. If the comment associated with the arc is shown on the screen, it becomes translucent as well.

In Validation mode, it is also possible to have non-oriented arcs (edges), indicating that the orientation of the arc can be changed without modifying the joint probability law.

The icon, displayed at the middle of the arc's name, indicates whether there is a comment associated with the arc. Pressing while hovering over the icon displays the comment associated with this arc.

Lastly, in Validation mode, the contextual menu of a node allows highlighting the arcs that belong to the influence paths between that node and the target node. Those paths indicate by which way the information flows from the node to the target node (and conversely). The screenshot below illustrates this functionality: three information paths link Dyspnea and Cancer; the number in brackets corresponds to the length of each path.
This analysis, which illustrates the d-separation concept, takes into account the context, i.e. the evidence that has been entered, as illustrated below. Without any evidence concerning the value of Dyspnea, there is no information path linking VisitAsia and Cancer. However, knowing the value of that node opens three influence paths.
If a node has a filtered state, its monitor displays the icon, as can be seen in the following picture:
1. No initially filtered state in the continuous node, filtered values in the database: an interval of width 1E-7 is added after the intervals defined in the node.
2. An initially filtered state in the continuous node and no filtered values in the database: the imported values that fall within the interval of the filtered state are treated as filtered values.
3. An initially filtered state in the continuous node and filtered values in the database: the imported values that fall within the interval of the filtered state are treated as filtered values, and the imported filtered values are also associated with this interval.
Class editor
The class editor is accessible through the contextual menu associated with the graph. The table shows the list of the classes with the number of nodes they contain. It is possible to rename a class by double-clicking on its name in the table.
Add a class
The Add button displays the following dialog box in order to enter the name of the class, which must be different from the others:
Once a valid name is entered, you have to select the nodes that you want to add in the class:
Modify a class
The Modify button displays a dialog box identical to the previous one, which allows managing the content of the class. It is possible to add or remove nodes from this class. If no node remains when you validate, the class is deleted.
Delete a class
The Delete button removes all the classes selected in the table.
Apply a color
The Color button displays the following dialog box, which associates a chosen color with the nodes of the selected classes if the first checkbox is selected, or a distinct color with each selected class if the last checkbox is selected; in that case, a random color is associated with the nodes of each class. If no checkbox is selected, the color of the nodes is removed.
Apply an image
The Image button displays the following dialog box in order to associate a chosen image to the nodes of the selected classes. If the checkbox is not selected, the images of the nodes will be removed.
The Temporal index button displays the following dialog box in order to associate a temporal index to the nodes of the selected classes. If the checkbox is not selected, the temporal indices of the nodes will be removed.
Apply a cost
The Cost button displays the following dialog box, which associates a cost with the nodes of the selected classes. If the checkbox is not selected, the nodes will not be observable.
Note
These properties are applied not to the classes themselves but to the nodes contained in these classes at the time the properties are applied. If a node is added to a class after a property has been applied, this node will not have this property.
Constant editor
The constant editor is accessible by the contextual menu associated with the graph. This table displays the list of the constants. For each constant, its name, its type and its current value are displayed. The values are editable by double-clicking on the corresponding cell.
Add a constant
The Add button displays the following dialog box for entering the name of the constant, which must be unique in the network, along with the type of the constant and its value:
Delete a constant
The Remove button deletes all the selected constants. After validating, the nodes whose equations use a deleted constant are displayed with the icon: .
Modify a value
To modify the value of a constant, double-click on it to edit it. In the following case, the editor of a Boolean constant allows only the values True or False:
Using constants
Constants are used when defining the equations of the nodes; they are handled like variables.
Cost table
The cost table can be edited through the contextual menu associated with the graph, or through the assistant of the adaptive questionnaire available in the Inference menu. This table associates a cost with each variable (1.0 is the default value). A cost must be greater than or equal to 1. To indicate that a variable is not observable, delete the associated cost (or enter a value less than 1).
The table is sortable by clicking on the header of each column. The costs can also be edited individually for each node, either through the contextual menu of the node or in the Properties tab of the node editor.
Import
The Import button allows importing a list of costs from a dictionary as in Import Dictionary menu.
Export
The Export button allows exporting the list of costs in a dictionary as in Export Dictionary menu.
The table is sortable by clicking on the header of each column. In this example, you can see eight kinds of arc constraints:

1. Forbidden arc from N1 to N2
2. Forbidden arcs from N1 to N3 and from N3 to N1
3. Forbidden arcs from N1 to all the nodes of the class Class1
4. Forbidden arcs from N1 to all the nodes of the class Class2 and from all the nodes of the class Class2 to N1
5. Forbidden arcs from all the nodes of the class Class1 to N2
6. Forbidden arcs from all the nodes of the class Class1 to N3 and from N3 to all the nodes of the class Class1
7. Forbidden arcs between the nodes of the class Class1
8. Forbidden arcs from all the nodes of the class Class1 to all the nodes of the class Class2 and from all the nodes of the class Class2 to all the nodes of the class Class1
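All eight kinds of constraints above reduce to forbidding arcs between two endpoints, where each endpoint is either a node or a class, optionally in both directions. The following sketch shows one possible data structure for such rules; the class and method names (ArcConstraints, forbid, isForbidden) are hypothetical, for illustration only, and do not reflect BayesiaLab's internals.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.Set;

public class ArcConstraints {

    /** One forbidden-arc rule between two endpoints (each a node or class name). */
    static final class Rule {
        final String from;
        final String to;
        final boolean bothDirections;
        Rule(String from, String to, boolean bothDirections) {
            this.from = from;
            this.to = to;
            this.bothDirections = bothDirections;
        }
    }

    private final Map<String, Set<String>> classes; // class name -> member node names
    private final List<Rule> rules = new ArrayList<>();

    ArcConstraints(Map<String, Set<String>> classes) {
        this.classes = classes;
    }

    /** Registers a forbidden-arc rule; endpoints may be node or class names. */
    void forbid(String from, String to, boolean bothDirections) {
        rules.add(new Rule(from, to, bothDirections));
    }

    /** True if the endpoint (a node name or a class name) covers the given node. */
    private boolean covers(String endpoint, String node) {
        if (endpoint.equals(node)) return true;
        Set<String> members = classes.get(endpoint);
        return members != null && members.contains(node);
    }

    /** True if an arc from source to target is forbidden by some rule. */
    boolean isForbidden(String source, String target) {
        for (Rule r : rules) {
            if (covers(r.from, source) && covers(r.to, target)) return true;
            if (r.bothDirections && covers(r.from, target) && covers(r.to, source)) return true;
        }
        return false;
    }
}
```

Constraint kind 7, forbidding arcs among the nodes of one class, would be expressed here as a symmetric rule whose two endpoints are the same class.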
The Add button displays the following dialog box to define a new constraint on the arcs. It is possible to choose between a node and a class for each end of the arc, and to choose whether the constraint applies in one or both directions, using the corresponding buttons:
The table is sortable by clicking on the header of each column. When at least one node has a temporal index, the indicator is displayed in the status bar of the network. A click on this icon displays the previous dialog box.
Import
The Import button allows importing a list of temporal indices from a dictionary as in Import Dictionary menu.
Export
The Export button allows exporting the list of temporal indices in a dictionary as in Export Dictionary menu.
The following editor can be accessed from the contextual menu of the network or from the indicator in the status bar.
The table is sortable by clicking on the header of each column. When at least one node has a local structural coefficient, the indicator is displayed in the status bar of the network. A click on this icon displays the previous dialog box.
Import
The Import button allows importing a list of local structural coefficients from a dictionary as in Import Dictionary menu.
Export
The Export button allows exporting the list of local structural coefficients in a dictionary as in Export Dictionary menu.
The table is sortable by clicking on the header of each column. When at least one node has a state virtual number, the corresponding indicator is displayed in the status bar of the network. A click on this icon displays the previous dialog box.
Import
The Import button allows importing a list of state virtual numbers from a dictionary, as in the Import Dictionary menu.
Export
The Export button allows exporting the list of state virtual numbers to a dictionary, as in the Export Dictionary menu.
3.19. Experts
The assessment mechanism requires defining the experts associated with the network. An expert is defined by:
a name, which must be unique;
a credibility, a real number between 0 and 1. When an expert's credibility is 0, his expertise is not taken into account; a fully credible expert has a credibility of 1. This credibility weighs his expertise relative to the others';
an image for rapid identification, for example a photo or an avatar;
a comment describing the expert (e.g. competence, domain, etc.).
In the expert editor, these four properties can be edited by double-clicking on the corresponding cell. The fifth property displayed is the number of assessments the expert has made in the network. This property is not editable.
Add:
The Add button opens a new window allowing the user to enter information about a new expert:
To add or change the image, simply click in the box. The uniqueness of the name is verified upon validation.
Remove:
The Remove button removes the selected experts from the list. The assessments made by these experts are also removed from the nodes. However, the conditional probability tables won't be regenerated.
Import:
The Import button allows loading a dictionary of experts. The format of the dictionary is the following:
Structure of the dictionary file: Experts
Name of the expert, then Equal, Space, or Tab, then the credibility, a Space, the path of the image (optional), a Space, and //comment (optional).
The path of the image is relative to the directory containing the dictionary file; it is therefore simpler to put the images in that same directory. If the same expert is present several times, the last occurrence is chosen.
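Following this structure, a small experts dictionary might look like this (the names, credibility values, and image files are purely illustrative):

```text
Alice = 0.9 alice.png //senior domain expert
Bob	0.6 bob.png
Carol 1.0
```

Alice uses "=", Bob a tab, and Carol a space as separator; Carol has neither image nor comment.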
Export:
The Export button exports a dictionary of experts. The images of the experts are saved in PNG format in the same directory as the dictionary, under the name expertImageN.png.
Open Session:
The Open Session button allows opening an online assessment session.
The user must provide a (unique) session identifier and a password. Once the session is opened, the experts can connect with their browser to the following secured address: https://www.bayesialab.com. The user must enter his name exactly as given in the expert editor (respecting case) and the session identifier. To create an online assessment session, you must contact Bayesia directly: [email protected]. If a session with the same name has already been opened, a dialog box will offer to overwrite it if that session comes from the same machine. If it comes from another machine, it will not be possible to use this session name.
Close Session:
The Close Session button ends the current online assessment session. Experts who are still connected will be disconnected. Closing a network or BayesiaLab while a session is running automatically closes the session.
Generate Tables:
This button is active if at least one expert is selected. It generates the conditional probability tables of the nodes by taking into account the assessments of the selected experts. If none of the selected experts has made an assessment for a node, the consensus of all experts is used. The credibility of the experts is used for the generation. When at least one expert is associated with the network, the corresponding indicator is displayed in the status bar of the network. A click on this icon displays this editor.
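As an illustration of how credibility can weigh assessments, the sketch below forms a consensus row as a credibility-weighted average of the experts' assessed distributions. This is an assumption for illustration only; the function name is hypothetical and this is not BayesiaLab's documented algorithm.

```python
# Hypothetical sketch: credibility-weighted consensus for one CPT row.
# Not BayesiaLab's documented algorithm; for illustration only.

def consensus_row(assessed_rows, credibilities):
    """assessed_rows: one probability distribution per expert;
    credibilities: one weight in [0, 1] per expert (0 = ignored)."""
    total = sum(credibilities)
    if total == 0:
        raise ValueError("no credible expert for this row")
    n_states = len(assessed_rows[0])
    return [
        sum(c * row[i] for row, c in zip(assessed_rows, credibilities)) / total
        for i in range(n_states)
    ]
```

With equal credibilities this reduces to a plain average; an expert with credibility 0 has no influence on the result.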
It shows the list of experts: their name, credibility, comment, and the number of assessments made, as in the expert editor. The last column shows the average time each expert needed to make an assessment; this time is computed only when an online session is used.
Two tables are then displayed:
The first table lists the nodes with assessments. The first column is the node name, the next column is the comment of the node (depending on the settings), and the last one is the global disagreement of the assessments. This percentage represents the average deviation of each assessment with respect to the mean of each cell. It takes into account the confidence associated with each assessment; an assessment with a confidence of zero is not taken into account in the global disagreement. The nodes are sorted according to the global disagreement.
The second table also lists the nodes with assessments. The first column is the node name, the next column is the comment of the node (depending on the settings), and the last one is the maximum disagreement of the assessments. This percentage represents the maximum deviation over all assessments in the whole table. An assessment with a confidence of zero is not taken into account in the maximum disagreement. The nodes are sorted according to the maximum disagreement.
The global and maximum disagreements between the experts make it easy to find the nodes on which the experts' knowledge differs and where the knowledge elicitation should be verified. The second part of the report details the assessments for each variable:
For each node with assessments, a table contains:
the number of rows of the conditional probability table that have associated assessments, compared to the total number of rows;
the total number of assessments made on this node;
the number of experts involved, compared to the total number of experts;
the global disagreement of the node;
the maximum disagreement in the conditional probability table;
the global assessment time, which is the sum of all assessment times;
the mean assessment time per row of the conditional probability table;
the mean expert assessment time for this node.
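The global and maximum disagreement measures described above can be sketched as follows for a single cell of the table. The exact formulas are not documented here, so the weighting details (a confidence-weighted cell mean, deviations measured against it) are assumptions, and all names are illustrative.

```python
# Hypothetical sketch of the disagreement measures for one CPT cell:
# assessments with zero confidence are ignored, the cell mean is
# confidence-weighted, and deviations are measured against that mean.

def cell_disagreements(assessments):
    """assessments: list of (probability, confidence) pairs for one cell.
    Returns (average deviation, maximum deviation) from the cell mean."""
    used = [(p, c) for p, c in assessments if c > 0]
    if not used:
        return 0.0, 0.0
    total_confidence = sum(c for _, c in used)
    mean = sum(p * c for p, c in used) / total_confidence
    deviations = [abs(p - mean) for p, _ in used]
    return sum(deviations) / len(deviations), max(deviations)
```

The global disagreement of a node would then aggregate the average deviations over the cells, and the maximum disagreement would take the largest deviation over the whole table.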
3.20. Variations
Several analyses need positive and negative variations to be defined in order to apply them to the mean of each node. The following editor allows the user to edit these variations. It is accessible from the parameter panels of the concerned analyses. It is possible to associate a negative and a positive variation, in percent, with each node. These variations are saved with the network and are available for each analysis. A variation, positive or negative, is a positive real number between 0 and 100%. The default value is 10%.
The Import button allows importing a list of variations from a dictionary, as in the Associate Dictionary menu. The syntax is the following:
Structure of the dictionary file: Variations
Name of a node or a class, then Equal, Space, or Tab, then either a single variation (the same value is used for the negative and positive variations) or the negative variation, a Space, and the positive variation.
The variation is a real number between 0 and 100. If a node is present several times, the last occurrence is chosen.
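Under the syntax above, a variations dictionary might look like this (node and class names are purely illustrative; the second line uses a tab separator):

```text
Price = 10
Demand	5 15
Class1 = 20
```

Here Price gets -10%/+10%, Demand gets -5%/+15%, and all the nodes of the class Class1 get -20%/+20%.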
Export
The Export button allows exporting the list of negative and positive variations to a dictionary, as in the Export Dictionary menu.
3.21. Comments
It is possible to associate a comment with the network, a node, or an arc by using the contextual menus. Concerning the comments associated with the network, some actions automatically add fields to them: the date and the author, when a network is created; the database, the number of lines, the value of the structural complexity if it differs from the default one, the missing value processing, the learning method, and the final score and compression rate, at the end of learning. Concerning the nodes, it is also possible to associate comments by using a dictionary file. The associated comments must be written in HTML (3.2) or in plain text; in the latter case, they are automatically embedded inside an HTML document. The comments associated with the nodes can be edited with the node editor. The comments are in HTML (version 3.2). The following editor allows creating complex comments in HTML.
The File menu allows: creating a new empty HTML document; opening an HTML (3.2) file; saving the comment in an HTML file. The Edit menu allows: copying, cutting and pasting; undoing or redoing an action.
The Insert menu allows: inserting a link towards a file or a URL; inserting an image. The Format menu allows: displaying the following dialog that allows modifying the page properties:
The Tools menu allows: displaying the HTML source of the comment, which can be directly modified:
With the buttons of the toolbar, it is possible to change, for the current selection, the font, the text alignment, the bold, italic and underlined attributes and the color of the foreground and background. According to the position of the cursor, the contextual menu, accessible with a right click, allows: copying or cutting the selection,
inserting, editing or removing a link, displaying page properties, displaying image properties
The nodes and the arcs that have an associated comment are presented as follows:
Comments are displayed by keeping the corresponding key pressed while pointing at the icon displayed on the node or the arc. To display the graph's comment, the key must be pressed while hovering over the graph's window. If a node has a color tag, the corresponding color is displayed as a border at the top and the left of the display. A click on a hypertext link opens the associated document.
It is also possible to display all the comments in a specific window by using the contextual menu associated with the network background. The contents of that window (HTML text) can then be copied and pasted to external applications, saved in an HTML file, or printed.
On a node
Edit (modeling mode): opens a dialog box for editing the node properties.
Rename (modeling mode): allows editing the name of the corresponding node.
Copy (any mode): allows copying the node (and the selection to which it belongs, if any).
Delete (modeling mode): deletes the current node and all the selected ones.
Set as Target Node (any mode): the target node (and its target value) is unique in the graph. It is used to supervise the Bayesian network learning algorithms in order to construct a structure dedicated to the prediction of that node, or to indicate that the analysis focuses on it.
Influence Paths to Target (validation mode): shows the different influence paths that link the node and the target node.
Monitor (validation mode): associates a monitor with this node and all the other selected nodes (see monitor panel).
Exclude (modeling mode): sets the node as excluded so that it won't be taken into account during learning.
Graphs (any mode, with a database): shortcut to display the graphs editor with the selected nodes as entry variables (see graphs).
Imputation (modeling mode, with a database and missing values): allows performing imputation of the missing values on the selected nodes if they are hidden or have missing values in the database. See database imputation.
Follow the Temporal Evolution (any mode, dynamic Bayesian network): allows defining one or several states of a node for which the probability evolution will be followed during the temporal simulation:
Select:
Connected Nodes: selects the nodes that are directly or indirectly connected to this node.
Connected Root Nodes: selects the nodes that are directly or indirectly connected to this node and that have no parent.
Markov Blanket: selects the nodes of this node's Markov blanket in the network.
Classes: selects the nodes that belong to the same classes as this node.
Alignment: aligns the set of selected nodes with respect to the node from which the contextual menu has been activated:
Horizontal (any mode): aligns the nodes while keeping their horizontal spacing.
Vertical (any mode): aligns the nodes while keeping their vertical spacing.
Horizontal Distribution (any mode): aligns the nodes and gives them the same horizontal spacing.
Vertical Distribution (any mode): aligns the nodes and gives them the same vertical spacing.
Properties:
Color: this property can be modified with the node editor.
Edit (any mode): allows associating a color tag with the node and all the selected nodes. If classes are associated with these nodes, the chosen color can be associated with all the nodes of the selected classes.
Remove (any mode): removes the color tag of the node and of all the selected nodes. If classes are associated with these nodes, colors can be removed from all the nodes of the selected classes.
Image: this property can be modified with the node editor.
Edit (any mode): allows associating an image with the node and all the selected nodes; it will be displayed instead of the default node's view. If classes are associated with these nodes, the chosen image can be associated with all the nodes of the selected classes.
Remove (any mode): removes the image of the node and of all the selected nodes. If classes are associated with these nodes, images can be removed from all the nodes of the selected classes.
Classes: this property can be modified with the node editor or with the class editor.
Add (any mode): allows adding an existing class or a new class to the node and all the selected nodes.
Remove (any mode): removes the selected classes from this node. If a removed class does not contain any node anymore, it is deleted.
Temporal Index (modeling mode): this property can be modified with the node editor. It allows setting or removing the temporal index of the node and of all the selected nodes. If classes are associated with these nodes, the temporal indices can be set or removed for all the nodes of the selected classes.
Cost (any mode): this property can be modified with the node editor. It allows setting or removing the cost of the node and of all the selected nodes. If classes are associated with these nodes, costs are set or removed for all the nodes of the selected classes.
State Virtual Number (modeling mode): this property can be modified with the node editor. It allows setting or removing the state virtual number of the node and of all the selected nodes. If classes are associated with these nodes, the state virtual numbers can be set or removed for all the nodes of the selected classes.
Local Structural Coefficient (modeling mode): this property can be modified with the node editor. It allows setting or removing the local structural coefficient of the node and of all the selected nodes.
If classes are associated with these nodes, the local structural coefficients can be set or removed for all the nodes of the selected classes.
Exclusion (modeling mode): this property can be modified with the node editor. It allows excluding, or not, the node and all the selected nodes. If classes are associated with these nodes, all the nodes of the selected classes can be excluded or not.
Comment (any mode): this property can be modified with the node editor.
Edit (any mode): allows editing the comment of the node.
Remove (any mode): removes the comment of the node and of all the selected nodes. If classes are associated with these nodes, comments can be removed from all the nodes of the selected classes.
On an arc
Change Orientation (modeling mode): inverts the arc. The probabilistic data associated with the two nodes are lost.
Invert Orientation within the Equivalence Class (validation mode): inverts the arc only if the resulting graph makes it possible to encode the same probability law. The conditional probability tables of the implied nodes are automatically updated. These arc inversions are propagated to the other arcs and edges in order to preserve the probability law.
Edge Orientation (validation mode): gives an orientation to an arc whose orientation has previously been removed (see edges display).
Temporal Relation (any mode): defines the arc as a temporal arc, i.e. an arc with a particular semantic: the two linked nodes represent the same variable at two consecutive time steps. It is thanks to this functionality that it is possible to define dynamic Bayesian networks.
Delete (modeling mode): deletes the current arc and all the selected ones. The probabilistic data associated with the destination node are lost.
Properties:
Color:
Edit (any mode): allows associating a color tag with the arc and all the selected arcs.
Remove (any mode): removes the color tag of the arc and of all the selected arcs.
Fix (any mode): allows regarding the arc as certain a priori knowledge for the Taboo learning algorithm (see learning), as well as in the edge orientation context.
Comment (any mode):
Edit (any mode): allows editing the comment of the arc.
Remove (any mode): removes the comment of the arc and of all the selected arcs.
Paste (modeling mode): pastes the arcs and nodes previously copied inside the network, renaming the nodes if necessary.
Delete Selection (modeling mode): deletes all the selected arcs and nodes.
Delete All Arcs (modeling mode): deletes all the arcs to obtain an unconnected network.
Delete All Unfixed Arcs (modeling mode): deletes all the arcs that are not fixed.
Delete All Unconnected Nodes (modeling mode): deletes all the nodes without any arcs.
Delete All Virtually Unconnected Nodes (KL Force) (modeling mode): deletes all the nodes without any arcs (only if the arc force analysis has been carried out and the arc force trimming has been used in validation mode).
Edit Structural Coefficient (any mode): opens the dialog box that allows modifying the structural complexity influence coefficient of the network for learning.
Edit Costs (any mode): opens the dialog box that allows associating a cost with the observation of a variable (see Cost management).
Edit Classes (any mode): opens the dialog box that allows the creation and edition of the classes associated with the nodes (see Classes management).
Edit Constants (modeling mode): opens the dialog box that allows the creation and edition of the constants that will be used in the formulas describing the probability distributions of the nodes (see Constants management).
Edit the Forbidden Arcs (modeling mode): opens the dialog box that allows the creation of forbidden arcs in the network's structure (see Forbidden arcs management).
Edit Temporal Indices (any mode): opens the dialog box that allows editing the temporal indices associated with the nodes for learning.
Edit State Virtual Numbers (modeling mode): opens the dialog box that allows editing the state virtual numbers associated with the nodes and used for learning.
Edit Local Structural Coefficients (modeling mode): opens the dialog box that allows editing the local structural coefficients of each node. This coefficient is used for structural learning.
Edit Experts (modeling mode): opens the dialog box that allows editing the experts of the network. Experts are used for assessment sessions.
Use Time Variable (modeling mode): allows using the parameter variable that represents the time in the equations.
Display Comments: allows displaying all the comments in a specific window.
Graph Report: allows displaying a report containing some properties of the network (connectivity, number of nodes, number of arcs, etc.) as well as the various warnings and errors of the nodes. This report also contains the conditional probability tables of the selected nodes; if no node is selected, all the tables are displayed. A list of the excluded nodes and a sorted list of the forbidden arcs are added if necessary.
Properties:
Background Image:
Edit (any mode): allows selecting an image file in order to set it as the background of the current graph window. It is possible, through the settings, to activate or deactivate this functionality.
Remove (any mode): removes the image from the background of the active graph window.
Font: allows changing the font used to display the nodes' names.
Comment: allows adding a comment to the network. A default comment with the date and the author's name is automatically associated at the creation of a network. The comment is displayed by keeping the corresponding key pressed while pointing at the background of the graph panel.
4. Monitor panel
In Validation mode, a graph window presents two panels: the graph panel and the monitor panel. The monitor panel is used to visualize the probability distributions of the monitored variables returned by the inference process. It is also the panel that easily allows entering evidences (hard and soft) about the variables.
2. entering evidences (hard or soft) about the variable. Each observation/removal causes the dynamic update of the displayed monitors. The monitors that are associated with Utility nodes are used to display the expected value of the node and the sum of expected utilities. The monitors associated with Decision nodes are used to display the marginal probabilities of the decision, to indicate what the optimal action is, and to choose the action to apply.
Creation
It is only possible to associate a monitor with a node in Validation mode. BayesiaLab offers two ways to create monitors: 1. a double-click on the node, 2. via the contextual menu of the node. The monitor then appears in the monitor panel.
Removal
There are four possibilities to delete a monitor: 1. a double-click on the corresponding node, 2. via the contextual menu associated with the node, 3. via the contextual menu associated with the monitor, 4. by selecting the monitor and by pressing the key.
States
A monitor can be in different states, depending on the type and the state of the associated node: 1. normal: probability distribution of a chance node;
2. hard evidence: the observed value is highlighted with the green bar;
4. soft evidences validated: after validating soft evidences with the light green button;
6. likelihoods validated from probabilities: after validating probabilities with the light green button;
7. fixed probabilities validated: after validating probabilities with the mauve button;
9. target + targeted value: probability distribution of the target node within the framework of an adaptive questionnaire centered on a target value. The target value probability appears in light blue;
10. temporal: probability distribution of a temporal father node (current time step);
11. utility: expected value of the utility node and the sum of all the expected utilities (in bold face). These values appear with their respective minimal and maximal values.
12. decision: displays the probability distribution of the actions corresponding to a decision node (equiprobable except when descendant nodes are observed). The recommended action with respect to the context appears in light blue. This action corresponds to the one with the best expected quality; this quality can be displayed by pressing the corresponding key while pointing at the monitor.
13. not observable: no cost is associated with the node; it won't be proposed in the adaptive questionnaire, and it won't be observed in the interactive inference, the interactive updating, the batch labeling, or the batch inference.
A node state with a zero likelihood value is an impossible state. If all the states have the same likelihood, the probability distribution remains unchanged. The soft evidence edition mode is available by two means: by pressing the corresponding key while clicking on a state bar, or by using the contextual menu associated with the monitor. Green and red buttons are then added to the monitor. The likelihoods can be entered: by keeping the left mouse button pressed while choosing the desired likelihood level, or directly by editing the likelihood value after double-clicking on it. Once all the likelihoods are entered, the light green button allows validating the data entry, and the probability distribution is updated. The red button allows cancelling the likelihood edition. Setting likelihoods:
The observed node takes the light green color of the evidence.
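The two behaviours noted above (a zero likelihood makes a state impossible, and equal likelihoods leave the distribution unchanged) follow from the standard Bayes update sketched below; the function name is illustrative, not part of BayesiaLab.

```python
# Minimal sketch of a likelihood (soft evidence) update on a node's
# marginal distribution; names are illustrative.

def apply_likelihoods(prior, likelihoods):
    """Return the distribution proportional to prior * likelihoods."""
    weighted = [p * l for p, l in zip(prior, likelihoods)]
    z = sum(weighted)
    if z == 0:
        raise ValueError("evidence is inconsistent: all states ruled out")
    return [w / z for w in weighted]

# A zero likelihood makes the corresponding state impossible, and
# equal likelihoods leave the distribution unchanged after normalization.
```

Because only the ratios of the likelihoods matter, scaling all of them by the same constant changes nothing.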
Probability Setting
Setting the probabilities allows directly indicating the probability distribution of a node. Likelihoods are recomputed so that the final probability distribution of the node is the one entered by the user. The probability edition mode is available by two means: by pressing the corresponding keys while clicking on a state bar, or by using the contextual menu associated with the monitor. Light green, mauve, and red buttons are then added to the monitor. The probabilities can be entered: by keeping the left mouse button pressed while choosing the desired probability level, or directly by editing the probability value after double-clicking on it. A click on the name of a state (on the right) fixes the current probability value (the probability bar turns green). Once all the probabilities are entered, the light green button allows setting the probabilities and the mauve button allows fixing them; the probability distribution is then updated. The red button allows cancelling the probability edition.
1. Simply setting the probabilities: when the probabilities are validated with the light green button, the likelihoods associated with the states of the node are recomputed in order to make the marginal probability distribution correspond to the distribution entered by the user. It is, in fact, an indirect capture of the likelihoods. Note that, at the next observation of another node, the probability distribution of this node will change, because the likelihoods are not recomputed. The result is displayed with light green bars, as for likelihoods:
The observed node takes the light green color of the evidence. 2. Fixing the probabilities: when the probabilities are validated with the mauve button, the likelihoods associated with the states of the node are recomputed in order to make the marginal probability distribution correspond to the distribution entered by the user, as in the previous case. However, at each new observation of another node, a specific algorithm will again try to make the probability distribution of the node converge towards the distribution entered by the user. Fixing probabilities is also done in the evidence scenario files with the notation p{...}. Note that fixing probabilities is only valid for exact inference. If approximate inference is used, fixing probabilities is treated like simply setting the probabilities: the convergence algorithm is no longer used. The result is displayed with mauve bars:
Caution
To obtain the indicated distribution, a convergence algorithm is used. Sometimes, however, this algorithm cannot converge towards the target distribution. In that case, the probabilities are not fixed and the node returns to its initial state; a warning dialog box is displayed and an information message is also written in the console.
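The "indirect capture of the likelihoods" used when setting probabilities can be sketched as follows: given the node's current marginal and the target distribution entered by the user, likelihoods proportional to target/prior reproduce the target after a Bayes update. This is a sketch under that assumption, with illustrative names, not BayesiaLab's actual code.

```python
# Hypothetical sketch: recover likelihoods that turn the current
# marginal (prior) into the distribution entered by the user (target).

def implied_likelihoods(prior, target):
    """Return l such that normalizing prior[i] * l[i] yields target."""
    if any(p == 0 and t > 0 for p, t in zip(prior, target)):
        raise ValueError("cannot give mass to a state with zero prior")
    raw = [t / p if p > 0 else 0.0 for p, t in zip(prior, target)]
    peak = max(raw)
    return [x / peak for x in raw]  # scale into [0, 1]; the scale is arbitrary
```

Because these likelihoods are computed once, a later observation on another node shifts the marginal again, which is why the fixing mode re-runs a convergence algorithm instead.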
Once the target value is entered, there are three options:
No Fixing: the distribution found is observed as likelihoods.
Fix Mean: the indicated mean is observed as a fixed mean. When the mean is fixed and an observation is made on another node, the convergence algorithm automatically determines a new distribution in order to obtain the target mean, taking the other observations into account. If this evidence is stored in the evidence scenario file, only the target mean is stored. Fixing the mean is also done in the evidence scenario files with the notation m{...}. Note that fixing the mean is only valid for exact inference. If approximate inference is used, fixing the mean is treated like simply setting the likelihoods corresponding to the target mean: the convergence algorithm is no longer used.
Fix Probabilities: the distribution found is set as a fixed probability distribution. Fixing probabilities is also done in the evidence scenario files with the notation p{...}. Note that fixing probabilities is only valid for exact inference. If approximate inference is used, fixing probabilities is treated like simply setting the likelihoods corresponding to the target distribution: the convergence algorithm is no longer used.
Simply click on one of the evidence sets in order to set the corresponding observations (if it is possible). This option is disabled in temporal networks and during interactive inference and updating.
At each example, the nodes are observed with the corresponding value in the database except if this value is missing or the nodes are declared as not observable or as target. This option is disabled in temporal networks and during interactive inference and updating.
Node centering
Keeping the corresponding key pressed while clicking on a monitor, or on the selected monitor, allows centering the graph on the associated node and makes that node blink for a few seconds.
Node selection
Keeping the corresponding key pressed while clicking on a monitor, or on the selected monitor, allows selecting the corresponding node.
Monitor list
The list of the monitors can be reordered by drag&drop.
Probability Variation
This option allows visualizing the variation of the probabilities between each modification of the set of evidences.
A double-click on the monitor, outside the state zone, or pressing the button
The button of the monitor's toolbar allows specifying the current probability distributions as the reference state for the computation of the probability variation. The reset of the arrows is then unavailable. The button displays the maximum positive variation and the maximum negative variation among all the states of all the nodes.
It contains the comment of the node, if any, or the name of the node if the comment is displayed instead of the name in the monitor. It also contains a table with the states' names in the first column, the corresponding probabilities in the second, and the number of cases each state represents according to the evidence. This last column is displayed only when a database is associated with the network.
Copy&Paste
Monitors can be copied and pasted to external applications. They can be pasted as images, or as data arrays in plain text or in HTML format. In addition to the probabilities, the mean, standard deviation, and value are also copied if they are present.
Zoom
Monitors can be zoomed in or out thanks to the corresponding buttons in the monitor toolbar.
On the monitor
The monitor contextual menu is dynamic. It depends on the type and the number of states of the monitored node.
Enter Likelihoods: allows entering the soft evidence entry mode.
Enter Probabilities: allows entering the probability entry mode.
Distribution for Target Value/Mean: allows entering a target mean/value; the corresponding probability distribution is automatically determined, if it exists. This menu is always available for continuous nodes, and for discrete nodes whose states are numbers or have associated values.
Fix Probabilities: allows fixing the probabilities of the selected nodes to the current marginal probabilities.
Absolute and relative bars and relative curve: allows changing the display mode of the chart that represents the marginal probability distributions. These menus are not available on utility nodes.
Remove Likelihoods: allows removing the evidence entered by the user, whatever the kind of evidence used. This option is only available when the node is observed.
State List: a click on a state in this menu selects/unselects an observation of this state.
Copy: copies the selected monitors into the clipboard, allowing pasting them into word processing software, for example.
Delete: deletes the selected monitors. The states of the nodes (observed or not) remain unchanged.
5. Shortcuts
The following shortcuts are available:
Create a new Bayesian network
Open a file
Save the active Bayesian network
Close the active graph window
Print the active Bayesian network
Export the active Bayesian network
Undo
Redo
Select all the nodes and arcs (in the graph window) or all the monitors (in the monitor window)
Copy the selected objects to the clipboard; these objects can then be pasted into BayesiaLab or into external applications (nodes and arcs, conditional probability tables, equations, graphs, confusion matrices, monitors)
Paste the copied objects (Modeling mode)
Cut the selected objects (Modeling mode)
Search for nodes, arcs and monitors
Zoom in and out
Default zoom
Adjust and optimize the size of the network to fit the window
Zoom the network in or out (also with the mouse wheel over the network)
Rotate the network counterclockwise or clockwise
Center the network in the working window
Flip the network horizontally or vertically
Put the network in the upper left corner of the working window
Add the pointed object to the selection (click, Modeling mode)
Move the network in the graph window (drag)
Close BayesiaLab
Move an arc (drag)
Zoom the monitors in or out (also with the mouse wheel over the monitors)
Define the corresponding node and state as target (double-click on a state of a monitor, Validation mode)
Enter the Soft Evidence Edition mode (click on a state of a monitor, Validation mode)
Enter the Probability Edition mode (click on a state of a monitor, Validation mode)
Center the node corresponding to the selected monitor
Select the node corresponding to the selected monitor
Show edges
Arc Force Analysis (Validation mode)
Pearson's Correlation Analysis (Validation mode)
Node Force Analysis (Validation mode)
Hide or display the various node indicators
Display the comments associated with a node or arc (key pressed while hovering over the node's or arc's comment indicator)
Display the warning or error message (key pressed while hovering over the node's error indicator)
Display the number of missing values (key pressed while hovering over the node's missing values indicator)
Display the exact probability values (key pressed while hovering over a monitor)
Influence Analysis with respect to the Target Node (Validation mode)
Mosaic Analysis (Validation mode)
Global orientation of the edges
Target Dynamic Profile (Validation mode)
Generate the target node analysis report (Validation mode)
Generate the evidence analysis report (Validation mode)
Generate the relationship analysis report (Validation mode)
Variable Clustering
Select and center the nodes corresponding to the selected monitors
Total Effects on Target (Validation mode)
Target Mean Analysis (Validation mode)
Switch to the Decision Node Creation mode while the key is pressed (Modeling mode)
Switch to the Constraint Node Creation mode while the key is pressed (Modeling mode)
Switch to the Deletion mode while the key is pressed
Display or hide the positioning grid
Display or hide the images associated with the nodes
Switch to the Arc Creation mode while the key is pressed (Modeling mode)
Switch the node comments on or off
Switch the arc comments on or off
Switch to the Node Creation mode while the key is pressed (Modeling mode)
Set or unset the node and its first state as target, depending on its current state
Display or hide the colored tags on the nodes
Switch to the Utility Node Creation mode while the key is pressed (Modeling mode)
Display the network's comment (key pressed while hovering over the background of the graph)
Exclude or include the node depending on its current state
Switch to the Selection mode while the key is pressed
Delete the selected objects: nodes and arcs in Modeling mode, monitors in Validation mode
Quick automatic positioning of the nodes
Display this help document
Activate the contextual help
Switch to Modeling mode
Switch to Validation mode
1. Network
New: creates a new graph window.
Open: reads a file (XBL, BIF, NET or DNE format) and opens a new graph window containing the Bayesian network described in this file. Three icons can be displayed in the file chooser:
If the selected network has a database associated with it, this option proposes whether to load this database with the network.
If the selected network has an evidence scenario file associated with it, this option proposes whether to load this file with the network.
If the selected network has its junction tree saved with it, this option proposes whether to load the junction tree with the network.
Save: saves the Bayesian network of the active graph window in the XBL format. If the network has a database and/or an evidence scenario file associated with it, they are saved in the same file as the network so that they can be loaded with it later. This behavior can be disabled in the database settings. If the network has a junction tree, it is also automatically saved in the same file.
Save As: saves the Bayesian network of the active graph window under a specified name. Three icons can be available in the file chooser:
Menus
If the network has a database associated with it, this option proposes whether to save the database with the network.
If the network has an evidence scenario file associated with it, this option proposes whether to save the file with the network.
If the network has a junction tree, this option proposes whether to save the junction tree with the network.
Close: closes the active graph window, proposing to save the network, the associated database and the evidence scenario file if any of them has been modified.
Close All: closes all the graph windows, prompting to save the networks, the associated databases and evidence scenario files if any of them has been modified.
Recent Networks: keeps a list of the recently opened networks. The size of this list can be modified through the menu settings. If a network is loaded from this list, the associated database, evidence scenario file and junction tree are automatically loaded.
Choose Working Directory: allows indicating the directory in which to work. A name is associated with this directory, and it is added to the list of recent working directories. The following dialog box allows configuring it:
The directory changing policy can be modified in the directory preferences.
Recent Working Directories: keeps a list of the recently created or used working directories.
Export: allows exporting the Markov blanket of the target variable of the current network into a language selected in the following dialog box:
Once the network is exported to a language, it can be used to infer the value of the target variable according to the observations of the other variables.
Lock: allows locking the network with a password to prevent it from being edited. The network can then only be used in Validation mode. This menu gives access to the lock manager.
Print: prints the Bayesian network of the active graph window. An assistant gives access to: the page setup, the printer configuration, the selection of the desired scale for the network,
the possibility of displaying reference marks (these marks are useful when the network has to be printed on more than one page: they indicate the page number (column, row), the border and the vicinity), and the possibility of centering the network.
Exit: Closes all the graphs, prompting for saving if needed, and closes BayesiaLab.
Several export languages are available, depending on the options of your license. If you need other languages, you must update your license. It is also possible to ask Bayesia to develop export formats that have not yet been taken into account. Some of these export formats exist but are only used by Bayesia S.A.; you can contact us about the conditions under which Bayesia can generate the corresponding outputs. The Settings button can display a dialog box for specifying the options of the selected language.
SAS
This format is available with the corresponding license.
JavaScript
This format belongs to Bayesia S.A. To export under this language, please contact us.
PHP
This format belongs to Bayesia S.A. To export under this language, please contact us.
Simply enter a password and confirm it. The lock indicator is then displayed in the status bar of the network. The network now has a lock, but it is still modifiable because it is not yet locked. To prevent editing, click on the indicator; it then changes to the locked icon, which indicates that the network is no longer editable. To be able to edit it again, simply click on the icon (or use the menu Network > Lock); a dialog box asking for the password is displayed:
When the network is unlocked, the menu Network>Lock displays the following dialog box:
This dialog box allows the user to: lock the network using the existing password, completely remove the lock, or change the lock password.
2. Data
Open data source: this menu item opens the file or the database selector and then calls the Data Importation Wizard.
Text file: once the file is read and the pre-processing done, a fully unconnected network is created in a new graph window, each attribute having one corresponding node. The set of Bayesian network learning methods then becomes available.
Database: once the database table is loaded and the pre-processing done, a fully unconnected network is created in a new graph window, each attribute having one corresponding node. The set of Bayesian network learning methods then becomes available.
Recent databases: keeps a list of the recently opened databases. The Data Importation Wizard is directly opened on the selected file. The size of this list can be modified through the menu settings.
Associate data source: this menu item opens the Data Association Wizard in order to associate data from a text file or a database with an existing Bayesian network.
Recent databases: keeps a list of the recently opened databases. The Data Association Wizard is directly opened on the selected file. The size of this list can be modified through the menu settings.
Warning: when the network structure is modified during the association (addition of nodes or states), the conditional probability tables are automatically recomputed from the database. If the structure remains unmodified, the conditional probability tables are not modified.
Associate dictionary: this menu item allows defining the properties of the active Bayesian network from text files. These properties concern arcs, nodes and states:
Arc:
Arcs: allows associating a set of arcs with the network. The indicated arcs can be added to or removed from the network. Arc removals are always performed before arc additions. Before adding an arc, all the constraints of the Bayesian network, as well as the arc constraints and the temporal indices, are checked. If a constraint is not verified, the arc is not added.
Forbidden Arcs: allows associating a set of forbidden arcs with the network.
Arc Comments: allows associating a set of arc comments with the network.
Arc Colors: allows associating a set of colors with the arcs of the network.
Fixed Arcs: allows defining whether some arcs are fixed or not.
Node:
Node Renaming: allows renaming each node with a new name. These new names must, of course, all be different.
Comments: allows associating a comment with each node listed in the file.
Classes: allows organizing nodes into subsets called classes. A node can belong to several classes at the same time. These classes allow generalizing some node properties to the nodes belonging to the same classes. They also allow creating constraints over arc creation during learning.
Colors: allows associating colors with the nodes or classes listed in the file. The colors are written as Red Green Blue with 8 bits per channel in hexadecimal (web) format: for example, red is 255 red, 0 green, 0 blue, which gives FF0000; green gives 00FF00, yellow gives FFFF00, etc.
Images: allows associating images with the nodes or classes listed in the file. The images are represented by their path relative to the directory containing the dictionary.
Costs: allows associating a cost with each node. A node without cost is called not observable.
Temporal Indices: allows associating temporal indices with the nodes listed in the file. These indices are used by BayesiaLab's learning algorithms to take into account constraints over the probabilistic relations, such as forbidding arcs from future nodes to past nodes. The rule used to add an arc from node N1 to node N2 is: if the temporal index of N1 is positive or zero, then the arc from N1 to N2 is only possible if the temporal index of N2 is greater than or equal to the index of N1.
Local Structural Coefficients: allows setting the local structural coefficient of each specified node or each node of each specified class.
State Virtual Numbers: allows setting the state virtual number of each specified node or each node of each specified class.
Locations: allows setting the position of each node.
State:
State Renaming: allows renaming each state of each node with a new name.
State Values: allows associating a numerical value with each state of each node.
State Long Names: allows associating with each state of each node a long name, more explicit than the default state name. This name can be used in the different database export formats, in the HTML reports and in the monitors.
Filtered States: allows defining one state of each node as a filtered state.
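The temporal index rule described above can be sketched as a small predicate (a hedged Python illustration, not BayesiaLab code; the function name is hypothetical):

```python
def arc_allowed(index_n1, index_n2):
    # Rule from the text: if N1's temporal index is positive or zero,
    # the arc N1 -> N2 requires N2's index to be greater than or equal
    # to N1's index.
    if index_n1 >= 0:
        return index_n2 >= index_n1
    return True
```

For example, a node indexed 0 (past) may point to a node indexed 2 (future), but not the reverse.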
Dictionary File Structures

Arc:
Arcs: name of the arc's starting node or class, -> , <- or -- (to indicate that both orientations are possible), name of the arc's ending node or class, Equal, Space or Tab, true for an added arc or false for a removed arc. The last occurrence is always chosen.
Forbidden Arcs: name of the arc's starting node or class, -> , <- or -- , name of the arc's ending node or class.
Arc Comments: name of the arc's starting node or class, -> , <- or -- , name of the arc's ending node or class, Equal, Space or Tab, comment. The comment can be any character string without line return (in HTML or not). The last occurrence is always chosen.
Arc Colors: name of the arc's starting node or class, -> , <- or -- , name of the arc's ending node or class, Equal, Space or Tab, color. The color is defined as Red Green Blue, 8 bits per channel, written in hexadecimal (web format): for example, green gives 00FF00, yellow gives FFFF00, blue gives 0000FF, pink gives FFC0FF, etc. The last occurrence is always chosen.
Fixed Arcs: name of the arc's starting node or class, -> , <- or -- , name of the arc's ending node or class, Equal, Space or Tab, true for a fixed arc or false for a non-fixed arc. The last occurrence is always chosen.

Node:
Node Renaming: name of the node, Equal, Space or Tab, new node name. The new name must be valid (different from t or T and without ?). A node can be present only once; otherwise the last occurrence is chosen.
Comments: name of the node or class, Equal, Space or Tab, comment. The comment can be any character string without line return (in HTML or not). A node can be present only once; otherwise the last occurrence is chosen.
Classes: name of the node, Equal, Space or Tab, name of the class. The class can be any character string. A node can be present several times, associated with different class names.
Colors: name of a node or class, Equal, Space or Tab, color. The color is defined as Red Green Blue, 8 bits per channel, written in hexadecimal (web format): for example, green gives 00FF00, yellow gives FFFF00, blue gives 0000FF, pink gives FFC0FF, etc. A node can be present only once; otherwise the last occurrence is chosen.
Images: name of a node or class, Equal, Space or Tab, path to the image relative to the directory containing the dictionary. The image path must be a valid relative path or an empty string. A node can be present only once; otherwise the last occurrence is chosen.
Costs: name of the node, Equal, Space or Tab, value of the cost, or empty if the node is to be not observable. The cost is an empty string or a real number greater than or equal to 1. A node can be present only once; otherwise the last occurrence is chosen.
Temporal Indices: name of the node, Equal, Space or Tab, value of the index, or empty to delete an existing index. The index is an integer. A node can be present only once; otherwise the last occurrence is chosen.
Local Structural Coefficients: name of the node, Equal, Space or Tab, value of the local structural coefficient, or empty to reset it to the default value 1. The local structural coefficient is an empty string or a real number greater than 0. A node can be present only once; otherwise the last occurrence is chosen.
State Virtual Numbers: name of the node, Equal, Space or Tab, virtual number of states, or empty to delete an existing number. The state virtual number is an empty string or an integer greater than or equal to 2. A node can be present only once; otherwise the last occurrence is chosen.
Locations: name of the node, Equal, Space or Tab, position. The location is represented by two real numbers separated by a Space: the first number is the node's x-coordinate, the second its y-coordinate. A node can be present only once; otherwise the last occurrence is chosen.

State:
State Renaming: name of the node or class, dot (.), name of the state, Equal, Space or Tab, new state name; or name of the state, Equal, Space or Tab, new state name to rename the state for all nodes. The new name is a valid state name. A state can be present only once; otherwise the last occurrence is chosen.
State Values: name of the node or class, dot (.), name of the state, Equal, Space or Tab, real value; or name of the state, Equal, Space or Tab, real value to associate a value with a state whatever the node. The value is a real number. A state can be present only once; otherwise the last occurrence is chosen.
State Long Names: name of the node or class, dot (.), name of the state, Equal, Space or Tab, long name; or name of the state, Equal, Space or Tab, long name to associate a long name with a state whatever the node. The long name is a string. A state can be present only once; otherwise the last occurrence is chosen.
Filtered States: name of the node or class, dot (.), name of the filtered state. A state can be present only once; otherwise the last occurrence is chosen.
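For illustration, here are two hypothetical dictionary files following the structures above (the node names Age, Smoker and Visit Asia are made-up examples, not part of any shipped network). An Arcs dictionary that adds an arc from Age to Smoker and removes any arc between Smoker and Visit Asia:

```
Age -> Smoker true
Smoker -- Visit\ Asia false
```

And a Colors dictionary that paints Age red and Smoker green:

```
Age = FF0000
Smoker = 00FF00
```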
Caution
As indicated by the syntax, the names of nodes, classes or states in the text file cannot contain equal, space or tab characters. If node names contain such characters in the network, those characters must be preceded by a backslash (\) in the text file: for example, the node named Visit Asia will be written Visit\ Asia in the file.
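This escaping rule can be sketched as a small helper (a hedged Python illustration; escape_name is a hypothetical function, not part of BayesiaLab):

```python
def escape_name(name):
    # Prefix the equal, space and tab characters with a backslash,
    # as required by the dictionary file syntax.
    out = []
    for ch in name:
        if ch in "= \t":
            out.append("\\")
        out.append(ch)
    return "".join(out)

print(escape_name("Visit Asia"))  # Visit\ Asia
```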
Caution
To differentiate a name that is identical for a class, a node or a state, you must append a suffix to the name: "c" for a class, "n" for a node and "s" for a state.
Important
If your network contains non-ASCII characters, you must save your own dictionaries with UTF-8 (Unicode) encoding. For example, in MS Excel, choose "Save As" and select "Unicode Text (*.txt)" as the file type. In Notepad, choose "Save As" and select "UTF-8" as the encoding. If your file contains only ASCII characters, you can keep the default encoding (which depends on the platform), but using UTF-8 (Unicode) encoding is strongly encouraged in order to create dictionary files that do not depend on the user's platform. For example, a Chinese dictionary can then be read by a German user without any problem, whatever platforms are used. If you are not sure how to save a file with UTF-8 encoding, you can export a dictionary with BayesiaLab, modify and save it (with any text editor), and load it back into BayesiaLab.
Export dictionary: this menu item allows exporting the different kinds of dictionaries as text files.
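The encoding advice above can be checked programmatically (a Python sketch; the dictionary line is a made-up example and this is not BayesiaLab code):

```python
line = "Visit\\ Asia = FF0000\n"

# "utf-8-sig" produces the same UTF-8 bytes preceded by a BOM
# (byte order mark), which helps Microsoft applications detect
# the encoding; plain "utf-8" omits it.
with_bom = line.encode("utf-8-sig")
without_bom = line.encode("utf-8")

assert with_bom == b"\xef\xbb\xbf" + without_bom
```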
Important
The dictionary files are saved with UTF-8 (Unicode) encoding in order to support any character of any language. An option in the Import and Associate preferences, Save Format, allows saving or not the BOM (Byte Order Mark) at the beginning of the file. The BOM increases compatibility with Microsoft applications. On other platforms, such as Unix, Linux or Mac OS X, the BOM is not necessary and is, in some cases, treated as extra characters at the beginning of the file.
Associate an evidence scenario file: this menu item allows associating an evidence scenario file with the network.
Export an evidence scenario file: this menu item allows exporting an evidence scenario file associated with the network into a text file.
Generate data: this menu item allows generating a base of n cases in agreement with the probability distribution described by the active Bayesian network. It is possible to generate the data as an internal database. We can also indicate the rate of missing values of the base, and use the long names of the states if the database is written to a file. It is possible to generate a database with test examples by indicating the wanted percentage.
The states' long names can be saved instead of the states' names. If the user wants to save continuous values, the numerical values are created by randomly generating a value in each concerned interval. If the data are generated in Validation mode, the evidence is taken into account.
Save data: this menu item allows saving the base associated with the network, including the results of the various pre-processing steps carried out in the Data Importation Wizard (discretization, aggregation, filtering, etc.). If the imported database still contains missing values and the algorithm selected to process them is one of the two imputation algorithms (static or dynamic), this option allows you to complete all your imputation tasks by saving a database without any missing values. Indeed, each missing value is replaced by taking into account its conditional probability distribution, returned by the Bayesian network given all the known values of the line. If the database contains both test data and learning data, the user can choose which kind of data to save: only learning data, only test data, or all the data. It is also possible to save only the data corresponding to the selected nodes.
The states' long names can be saved instead of the states' names. The numerical values associated with the continuous nodes in the database can be saved if they exist. If there are no numerical values associated with the database and the option is checked, the numerical values are created by randomly generating a value in each concerned interval. If the database contains weights, they are saved as the first column of the output file.
Imputation: allows the imputation of the missing values of the associated database, according to the mode selected in the following dialog box:
The data are saved in the specified file, and the long names of the states are used as specified. If the database contains both test data and learning data, the user can choose on which kind of data to perform the imputation: only learning data, only test data, or all the data. The states' long names can be saved instead of the states' names. The numerical values associated with the continuous nodes in the database can be saved if they exist. If there are no numerical values associated with the database and the option is checked, the numerical values are created by randomly generating a value in each concerned interval. However, if there are numerical values in the database, the missing numerical values are generated from the distribution function of each interval. If the database contains weights, they are saved as the first column of the output file.
Graphs: opens the graph editor if a database is associated with the current network.
In this step, the options allow:
Specifying the separators, i.e. the characters used to separate the variables
Specifying the values that will be considered missing values
Specifying the values that will be considered filtered values for discrete or continuous variables
Indicating the presence of a title line, i.e. the values of the first line will be used to define the names of the variables
Indicating the end-of-line character
Ignoring single and/or double quotes
Taking consecutive separators into account
It is also possible to import only a sample of the data:
There are three available sampling definitions:
Random Sample with percentage: a percentage is specified and BayesiaLab randomly selects the corresponding number of lines from the file.
Random Sample with size: the number of lines is defined and BayesiaLab randomly selects them from the file.
Custom Sample: the first and last indices to be imported are specified.
It is also possible to specify that a sample of the rows will be used as a test database; the remaining rows are used as a learning database. Simply specify in the following dialog box the percentage of rows to use as the test database:
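The three sampling definitions and the test split can be sketched as follows (a Python illustration with made-up row data, not BayesiaLab's implementation):

```python
import random

random.seed(0)  # for reproducibility
rows = list(range(1000))  # stand-ins for the data lines of the file

# Random Sample with percentage: keep 10% of the lines.
pct_sample = random.sample(rows, k=int(len(rows) * 0.10))

# Random Sample with size: keep exactly 50 lines.
size_sample = random.sample(rows, k=50)

# Custom Sample: keep the lines between a first and a last index.
custom_sample = rows[100:200]

# Test/learning split: 20% of the rows as test, the rest as learning.
test_rows = set(random.sample(rows, k=int(len(rows) * 0.20)))
learning_rows = [r for r in rows if r not in test_rows]
```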
1. JDBC Driver Specification.
2. The ... button allows displaying the following wizard for choosing a driver from a list of drivers.
3. This wizard indicates the available drivers, the syntax of the database URL and the installation directory.
4. Database URL. Each driver has its own syntax. When the driver is chosen through the driver selection wizard, the URL syntax is described.
Note
All the URLs used are saved and remain available in later sessions.
5. User: allows specifying the database user.
6. Password: allows entering the password of the database user.
7. Once the connection with the database is established, this panel lists the available tables of the database. Above it, the user can fill in the schema and the catalog in the two corresponding combo boxes to refine the table retrieval (if empty, no filter is applied). The schema and catalog can be filled in before the connection, or afterwards by clicking the Apply button, to limit the number of available tables proposed. These two combo boxes are disabled if the schema or catalog concepts are not part of the target database.
8. Allows selecting the fields that will be used in the network. An SQL query is automatically generated.
9. Allows entering an SQL query. The ... button opens the SQL Request dialog:
10. Open: opens a text file containing an SQL request. Save: saves the edited request into a file.
Validate: closes the SQL Request dialog and copies the edited request into the data selection panel. Cancel: closes the dialog without any change.
11. Allows specifying the number of lines displayed in the window.
12. Allows entering a string indicating missing values.
13. Allows transposing the data.
Note
The left panel lists the available tables of the connected database (limited to 500). On some database managers, however, this list cannot be initialized; in that case, an error dialog is displayed and the left panel is disabled (but you remain connected to the database and can still send SQL requests through control 8).
1. Format: allows choosing the variables to include in the future Bayesian network and indicating their type:
Not distributed: specifies that this column will not be included in the Bayesian network.
Discrete: each distinct value of the variable will be considered a state.
Continuous: the values are considered numerical and will be discretized.
Weight: allows considering a numerical column as a weighting variable for the lines. Note that there can be only one weight variable.
Data type: allows defining which rows will be used for learning and which for test. Note that there can be only one data type variable. The column must contain exactly two different states.
Row Identifier: allows defining an identifier for each row. Identifiers can be of any type (strings or numbers) but cannot contain missing values. However, uniqueness is not required: two rows can have the same identifier. These identifiers are saved with the database and kept in any derived database (generated by some analysis or tool). An identifier can be used to select a line in the database to be observed (during Interactive Inference, Interactive Updating or manual selection in the database). The current identifier is displayed in the status bar.
2. Data View: allows viewing the database and selecting variables. It is also possible to select all the columns and to choose the same processing for all the variables.
3. Information: gives information about the data that will be imported, as the variables are processed. The Others statistics group information about the row identifier, weight and data type columns.
1. Missing Value Processing: allows specifying, for each variable that has missing values, the kind of processing to apply:
a. Filter: two filters are available:
OR filter: each line that has at least one missing value for one of the variables belonging to the OR rule is discarded.
AND filter: each line that has missing values for all the variables belonging to the AND rule is discarded.
b. Replace by: allows specifying the value used to replace the missing values. It is possible to set the value directly, to use the mean value computed from the available data (continuous variables), to use the modal value computed from the available data (discrete variables), or to choose one of the values proposed in the combo box. These proposed values are the existing states of the selected column; if several columns are selected, only the common states are proposed.
c. Infer:
Static Imputation: the probability distributions of the missing values are estimated from the available data, by considering that all the variables are independent (fully unconnected network). Whereas the "Replace by" option replaces the missing values with the most probable values, here the missing values are replaced by values randomly chosen according to the probability distributions. Even if this decision rule is not optimal at the line level (the optimal rule being the one used by the "Replace by" option), it is the best rule at the population level. This imputation process occurs only at the end of the data loading. However, it can also be launched, with respect to the current Bayesian network, by using the Learning menu with the Learning the Probabilities menu item.
Dynamic Imputation: the conditional probability of the missing values is dynamically estimated from the current network and the available data of the line. Each missing value is then replaced by a value randomly selected with respect to that probability distribution. During learning, a new imputation is performed after each structure modification. This option thus brings a rigorous solution to imputation tasks, as it is possible to save the database with all the data processing results included, i.e. without missing values in that case.
Structural EM: the probability of each state is dynamically estimated during learning, using the current structure and the available data. These probabilities are directly used for learning the structure and the parameters, i.e. there is no completion with a specific state.
Dynamic Imputation and Structural EM are the most rigorous ways to process missing values. However, these methods are costly in time, as they require carrying out inference while learning the structure of the network. Note also that choosing one of these methods applies it to all the variables for which an inference processing has been set. This choice can be changed once the data are loaded, using the Learning menu. Note that the missing value replacement is dynamically taken into account in the information panel.
2. Data View: a left click on the icon gives access to statistical information about the distribution of the variable. If a variable has missing values that have not been replaced, an icon indicates it in the header of the column.
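The difference between "Replace by" the modal value and static imputation can be sketched as follows (a hedged Python illustration with made-up data; BayesiaLab's actual implementation may differ):

```python
import random
from collections import Counter

random.seed(0)
observed = ["yes"] * 70 + ["no"] * 30  # available (non-missing) data
n_missing = 10

# "Replace by" the modal value: every missing cell gets the most
# frequent state; optimal per line, but it distorts the distribution.
mode = Counter(observed).most_common(1)[0][0]
replaced = [mode] * n_missing

# Static imputation: each missing cell is drawn at random from the
# marginal distribution estimated on the available data, which
# preserves the 70/30 distribution at the population level.
counts = Counter(observed)
states = list(counts)
imputed = random.choices(states, weights=[counts[s] for s in states],
                         k=n_missing)
```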
144
Menus
Note
To check or uncheck several states at once, select the relevant states (by clicking on their names) and then check or uncheck one of the filters while keeping the key pressed, so as not to lose the selection.
Note
The Required minimum and Required maximum zones allow creating filters. It is also possible to use the corresponding button to define filters. Two types of filters are available: OR and AND. The filtering system defines the lines that will be imported; if you want to describe the lines to discard instead, invert the corresponding logical expression. The example below shows how to filter out smokers who are less than 15 years old: we keep the lines where the individual does not smoke OR is older than 15. 1. Choose the OR filter
2. Click on Age and specify 15 in the Required minimum field
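The inversion described above follows De Morgan's law: discarding rows where "smokes AND age < 15" is the same as keeping rows where "NOT smokes OR age >= 15". A minimal sketch with hypothetical column names:

```python
def keep_row(row):
    """Keep rule obtained by negating the discard rule
    'smokes AND age < 15' (De Morgan): NOT smokes OR age >= 15.
    The keys 'smokes' and 'age' are hypothetical column names."""
    return (not row["smokes"]) or row["age"] >= 15

rows = [
    {"smokes": True,  "age": 14},   # discarded: smoker under 15
    {"smokes": True,  "age": 40},   # kept
    {"smokes": False, "age": 12},   # kept: non-smoker
]
kept = [r for r in rows if keep_row(r)]
```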
Step 4: Discretization of the continuous variables, state aggregation of the discrete variables and data type
If a weight column is specified, it is used by the various discretization and aggregation algorithms. If a data type column is specified, only the current learning rows are used by those algorithms.
A manual method and four automatic discretization methods are proposed: Decision tree: supervised induction of the most informative discretization thresholds with respect to a target variable. The target variable must be discrete; if it is not, it must be manually discretized beforehand.
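The idea behind the decision-tree method can be illustrated with a single supervised split: among all candidate cut points, keep the one that minimizes the weighted entropy of the discrete target. BayesiaLab's algorithm induces several thresholds recursively; this one-split sketch only shows the criterion.

```python
import math

def entropy(labels):
    """Shannon entropy (bits) of a list of discrete labels."""
    n = len(labels)
    if n == 0:
        return 0.0
    return -sum((labels.count(c) / n) * math.log2(labels.count(c) / n)
                for c in set(labels))

def best_threshold(values, target):
    """Single most informative cut point: the midpoint between
    consecutive distinct values that minimizes the weighted entropy
    of the discrete target on both sides."""
    pairs = sorted(zip(values, target))
    best, best_score = None, float("inf")
    for i in range(1, len(pairs)):
        if pairs[i - 1][0] == pairs[i][0]:
            continue                         # no cut between equal values
        cut = (pairs[i - 1][0] + pairs[i][0]) / 2
        left = [t for v, t in pairs if v <= cut]
        right = [t for v, t in pairs if v > cut]
        score = (len(left) * entropy(left)
                 + len(right) * entropy(right)) / len(pairs)
        if score < best_score:
            best, best_score = cut, score
    return best
```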
KMeans: data clustering with the KMeans algorithm on standardized data; each resulting cluster defines an interval. Equal distances: intervals of identical width. Equal frequencies: intervals of identical weight, i.e. containing the same number of points. Manual: intervals designed by hand using a graphical interface. The selected discretization method applies to the selected column. To use a single method (except Manual) for all the continuous variables, click Select all continuous. If you design your own intervals, the distribution function is displayed in the right window (lines are sorted by the values of the continuous variable; the X-axis represents the number of individuals and the Y-axis the values of the continuous variable). Any manually discretized variable can be used as the target variable for the decision-tree discretization.
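The Equal distances and Equal frequencies methods can be sketched directly: the first splits the variable's range into k intervals of identical width, the second places the cut points at empirical quantiles so that each interval holds (as nearly as possible) the same number of points. A minimal illustration, not BayesiaLab's implementation:

```python
def equal_width_edges(values, k):
    """Interval edges for 'Equal distances': k bins of identical width."""
    lo, hi = min(values), max(values)
    step = (hi - lo) / k
    return [lo + i * step for i in range(k + 1)]

def equal_frequency_edges(values, k):
    """Interval edges for 'Equal frequencies': cut points at empirical
    quantiles, so each bin holds roughly the same number of points."""
    s = sorted(values)
    n = len(s)
    return [s[0]] + [s[(i * n) // k] for i in range(1, k)] + [s[-1]]
```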
The user can switch from the data view to a representation of the density curve generated by the Batch-Means method. In this view, the continuous variable's values are represented along the X-axis and the probability density along the Y-axis. The two red areas at the extremities indicate where the curve may not be accurate; no discretization points can be placed there.
This window is fully interactive and allows, in both views:
Adding a threshold: right click.
Removing a threshold: right click on the threshold.
Selecting a threshold: left click on the threshold.
Moving a threshold: left click and drag; the current Y-coordinate appears in the Point box.
Zooming: Ctrl + left click, drag, and release to define the area to enlarge. In the distribution function the zoom is vertical; in the density curve it is horizontal. You can zoom successively as many times as needed.
Unzooming: Ctrl + double left click.
Besides this distribution function, a dedicated button gives access to the three automatic discretization methods through a new dialog. This can be used as a wizard for manual discretization: launch one of these methods, inspect the resulting discretization on the distribution function, and then modify the result by moving, deleting, and adding thresholds. If the chosen discretization fails, a warning dialog is displayed in which you can change the chosen discretization. It is also possible to transfer the defined discretization points to other variables: another button displays the list of the continuous variables; simply select from that list the variables to process.
Note
The transfer applies only if the variation domain of the selected variables is compatible with that of the original variable. NB: if a filtered value is defined for a continuous variable, a filtered state is created at the end of the import as a new interval, appended after the intervals defined by the discretization. The state associated with this interval is named * by default.
It allows:
Creating an aggregate: the list of states appears in the Aggregation zone. To make a selection, click on the chosen state, keeping the appropriate key pressed for a contiguous selection or for a multiple selection. Once the selection is done, click the add button; the new aggregate appears in the Aggregates list.
Modifying an aggregate: selecting an aggregate allows adding new states to it.
Renaming an aggregate: double-click the name to edit it.
Removing an aggregate: select the aggregate and press the remove button.
Besides this a priori aggregation process, it is also possible to use the correlation of the variable with a target variable, which can be discrete, or continuous but manually discretized beforehand. By checking the box and selecting a target variable, BayesiaLab displays the conditional probability table of the target variable given the current variable.
In this screen capture, the probability distribution of the target node (Cancer) is displayed for each state of the current variable. For binary or "binarized" target variables (a target state has been set), the Correlation column highlights the difference between the conditional probability of the first target state and its marginal probability. Green bars (conditional probability higher than the marginal) and red bars (lower) greatly improve the readability of the results, making it possible to aggregate states that have the same relation with the target simply by using the colors. NB: the exact value of the difference is shown in a tooltip when you point to the corresponding bar. Besides this visual help, a dedicated aggregation wizard is available for binary or "binarized" target variables by clicking on the corresponding button.
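The quantity behind the green and red bars, the difference between the conditional and the marginal probability of the first target state, can be computed directly from the data. The following sketch assumes rows stored as dictionaries with hypothetical column names:

```python
def state_deltas(rows, var, target, target_state):
    """For each state of `var`, return
    P(target = target_state | var = state) - P(target = target_state):
    positive deltas correspond to green bars, negative ones to red bars."""
    marginal = sum(r[target] == target_state for r in rows) / len(rows)
    deltas = {}
    for state in {r[var] for r in rows}:
        sub = [r for r in rows if r[var] == state]
        cond = sum(r[target] == target_state for r in sub) / len(sub)
        deltas[state] = cond - marginal
    return deltas
```

States with deltas of the same sign are natural candidates for aggregation, which is exactly what a single threshold at 0 does in the wizard described below the bar.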
The colored bar represents the variation domain of the differences between the conditional and marginal probabilities of the first target state. This bar is interactive and allows defining aggregation thresholds: Adding a threshold: right click. Moving a threshold: left click and drag along the deltas axis, or select the threshold value in the list and edit it directly with a double left click. Removing a threshold: right click on the threshold.
Defining a single threshold equal to 0 automatically groups all the "green" states together and all the "red" states together. A button displays a new dialog that automatically detects how to group the states: the algorithm uses a decision tree to find the best thresholds for creating the requested number of aggregates.
Once the desired final number of states is chosen, the best thresholds found are displayed in the table; click OK to apply the generated aggregates. NB: if a filtered value is defined for a discrete variable, a filtered state named * by default is created. This state cannot be aggregated with the other states.
It performs an automatic aggregation, as in the Automatic Aggregation Wizard, over the selected variables whose initial number of states is greater than or equal to the specified number. Simply specify the target node and state, and indicate the maximum number of states you want. If the target node is one of the selected variables, it remains unmodified.
Sometimes the algorithm cannot find any grouping because the current variable and the specified target are independent. In this case, the unmodified variables are listed in a dialog at the end of the automatic aggregation process. The process can be stopped by clicking the close button of the progress bar: the variables already treated are kept, and the others stay as they were.
Data type
If a column is used as data type, you can select which state to use for learning and which for test. Use the corresponding combo boxes to change the type associated with each state.
Import Report
After a successful database import, it is possible to display the HTML import report.
The first column displays the names of the imported variables. The second column displays the type associated with each variable. The third column shows the states of each variable, where applicable. The content of the last column depends on the type of the variable: Data Type: indicates, for each state, whether it is used for learning or test and how many rows it covers in the database. Weight: no additional information. Discrete: indicates, for each state, which states were aggregated into it; the color of the last cell indicates whether the node's states were actually aggregated. Continuous: indicates, for each state, the interval limits, followed by the requested discretization and the discretization actually obtained. The background of this cell is colored according to the obtained discretization, allowing quick identification.
The Unmatched Columns button displays all the columns of the database that are not in the network. The following dialog is displayed and allows the user to choose whether to distribute the selected columns:
Discrete column - Discrete node
Discrete column - Continuous node
Continuous column - Continuous node
Where possible, the valid associations are detected automatically the first time the panel is displayed.
Zone 1 contains the variables from the database that are not yet associated with a node of the network, nor added as new nodes. For example, the variable Geographic Zone in the database is discrete and has no corresponding node in the network: to add it as a new node, simply select it and click the add button; otherwise, do nothing. The same applies to the continuous node N. You can also select and add several nodes at once.
Zone 2 contains the nodes of the network that are not yet associated with a column of the database. To link a database variable to a network node, select both and press the association button. Any node remaining in this list is not linked to a column of the database and is treated as a hidden node in the network.
Zone 3 contains the buttons used to add or remove associations.
Zone 4 contains the list of associations. It may also contain variables added from the database, which are treated as new nodes in the network. A double click on an association displays, if necessary, a dialog for editing a discrete or continuous association. Some associations show a warning icon, indicating an unusual behavior in that association.
Zone 5 lists the details of each warning for the associations in zone 4. Selecting a warning in the list selects the corresponding association in zone 4. When the mouse hovers over the list, a tooltip shows the content of the warning. A double click on a warning opens the
appropriate association editor, in which the association can be verified or modified. To remove an association or an added node, select it in the list and press the remove button. Zone 6 contains three buttons: the first two automatically extend the minimum and maximum of each continuous node that does not fit the database's limits; the third automatically filters out each row that does not fit the network's limits.
Zone 1 contains the states from the database that are not yet linked to a state of the node, nor directly added as new states. To add a state without linking it to a state of the node, simply select it and press the add button.
Zone 2 contains the states of the node from the network. This list is never modified: even when an association is made, the corresponding state stays in the list and can be reused for another association, which allows linking several database states to the same node state.
To perform an association, select a state in zone 1 and a state in zone 2 and press the association button: the state is removed from zone 1 and the association is added to zone 4.
Zone 3 contains the buttons to add or remove state associations.
Zone 4 contains all the associated and added states. An association can be removed by selecting it in the list and pressing the remove button.
After association, the dialog looks like:
If some states remain unlinked in zone 1, they are removed from the database. By default, the database states that match the network's states, their aggregates, or the states' long names are linked automatically. NB: if filtered values exist in the database but are not declared in the network, they can be merged with the specific state *, if it exists. In this case, this state is automatically defined as filtered for each concerned node.
This dialog is displayed only if the limits of the variable from the database lie outside the limits of the node from the network. By default, the limits of the node of the network are used and all the values outside these limits are removed from the database; use the corresponding options to keep them instead. NB: if filtered values exist in the database but are not declared in the network, they can be merged with the specific state *, if it exists. In this case, this state is automatically defined as filtered for each concerned node.
Step 5: Discretization of the continuous variables and state aggregation of the discrete variables
This step occurs only when some columns of the database are not linked to nodes of the network but are distributed. These columns create new nodes in the network: continuous columns must be discretized, and the states of discrete columns can be aggregated. This step is the same as Step 4 of the Data Import Wizard.
This report may contain three tables: 1. The modified nodes table: for discrete nodes it indicates, where applicable, the correspondence between the states in the database and in the network; for continuous nodes it indicates, where applicable, the initial minimum of the data and the retained final minimum, and likewise for the maximum. 2. The hidden nodes table: indicates the nodes that are in the network but have no associated data. 3. The added nodes table: lists the variables added to the network from the database; this table is the same as in the import report.
Kinds of evidence
There are four possible kinds of evidence, the same as those obtained with the monitors: 1. Exact evidence on a state of a node 2. A likelihood distribution over the states of a node 3. A probability distribution over the states of a node (a fixed distribution with exact inference, or computation of the corresponding likelihoods with approximate inference) 4. A target mean for a node (a probability distribution corresponding to the target mean is determined, taking the other observations into account)
Caution
Sometimes a probability distribution cannot be fixed because the algorithm used to reach the target distribution fails to converge. In this case the corresponding probability fixing is not performed and the node returns to its initial state. An information message is displayed in the console.
Temporal
A network is temporal when it uses the Time variable or contains at least one temporal node. In that case, the time step can be set by indicating its value as a positive or null integer.
Syntax
It is a text file in which each line follows this grammar:
For temporal networks:
<line> ::= <time step> [<semicolon> [<evidence> | <likelihood> | <probability> | <mean>]]+ [<comment>]
<time step> is an integer representing the time step at which the following evidences are set.
For non-temporal networks:
<line> ::= [<evidence> | <likelihood> | <probability> | <mean>] [<semicolon> [<evidence> | <likelihood> | <probability> | <mean>]]* [<comment>]
<semicolon> is the character ; (semicolon)
<colon> is the character : (colon)
<evidence> ::= <variable> <colon> <state>
For a continuous node, a numerical value can be used directly: <evidence> ::= <variable> <colon> <numerical value>
<likelihood> ::= <variable> <colon> l{ <likelihood list> [<semicolon> <likelihood list>]+ }
<likelihood list> ::= <state> <colon> <degree of likelihood>
<probability> ::= <variable> <colon> p{ <probability list> [<semicolon> <probability list>]+ }
<probability list> ::= <state> <colon> <degree of probability>
<mean> ::= <variable> <colon> m{ <numerical value> }
<variable> is the variable name, flanked by question marks, for which there is evidence; <state> indicates the concerned state.
<comment> ::= <two slashes> <any character string>
The following example contains exact evidences, likelihoods and probabilities for four time steps:
0;?Valve1?:OK;?Valve2?:OK;?Valve3?:OK //All the valves are working
2;?Valve1 t+1?:l{OK:0.8;RC:0.9;RO:0.9} 20;?Valve2 t+1?:l{OK:0.3;RO:0.3;RC:0.3};?Valve1 t+1?:p{OK:0.2;RO:0.4;RC:0.4} 30;?Valve3 t+1?:OK;?Valve1 t+1?:p{OK:0;RO:0.8;RC:0.2}
When a temporal evidence file is associated with a temporal network, the evidences are taken into account each time the time counter reaches one of the specified time steps. When neither the file nor the network is temporal, the evidences are taken into account during interactive inference or interactive updating. The following example shows non-temporal evidences, including evidences with numerical values:
?Smoker?:Yes;?Age?:25.5;?Bronchitis?:p{Yes:0.8;No:0.2} //Young smoker with a large probability of bronchitis ?Smoker?:No;?Age?:70;?Dyspnea?:l{Yes:0.8;No:0.5} //Non-smoking senior
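As an illustration of the grammar above, a minimal parser for a single non-temporal evidence line can be sketched as follows. This is a hypothetical helper for the documented syntax, not BayesiaLab's own parser, and it ignores corner cases such as braces or colons inside state names.

```python
import re

# A variable name flanked by '?', then ':' and either a p{...}/l{...}/m{...}
# block or a plain state / numerical value (anything up to the next ';').
_EVIDENCE = re.compile(r"\?([^?]+)\?:([plm]\{[^}]*\}|[^;]+)")

def parse_evidence_line(line):
    """Parse one non-temporal evidence line such as
    '?Smoker?:Yes;?Bronchitis?:p{Yes:0.8;No:0.2} //comment'.
    Exact evidence is returned as a string; distributions as
    ('p'|'l', {state: value}) and means as ('m', value)."""
    line = line.split("//")[0]                     # strip the comment
    out = {}
    for var, value in _EVIDENCE.findall(line):
        value = value.strip()
        if value[:2] in ("p{", "l{", "m{"):
            kind, body = value[0], value[2:-1]
            if kind == "m":
                out[var] = (kind, float(body))
            else:
                pairs = (p.split(":") for p in body.split(";"))
                out[var] = (kind, {s: float(x) for s, x in pairs})
        else:
            out[var] = value                       # exact state or number
    return out
```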
The menu Data>Export an Evidence Scenario File allows saving, into a text file, the current evidence scenario file.
Batch exploitation
If an evidence scenario file is associated with the network, it can be used as the data source for the batch exploitation of the network:
Batch labeling
Batch inference
Batch labeling with most probable explanation
Batch inference with most probable explanation
Batch joint probability
Batch likelihood
2.4. Graphs
The graph editor can display six different graphs, built from the data stored in the database. Some of these graphs are unavailable if the network contains no continuous node; in addition, the database must contain the exact values of the continuous nodes, which requires checking the corresponding option in the settings. If the database contains both test and learning data, the user can choose which data to display: only learning data, only test data, or all data.
Bar chart: always available.
Occurrences matrix: always available.
Distribution function: always available.
Scatter of points (2D): only with continuous nodes.
Colored Line Plot: only with at least one continuous node.
Scatter of points (3D): only with continuous nodes.
Bubble chart: only with continuous nodes.
Graph: when you move the mouse over the graph, the information in the top panel is updated. It displays which state the cursor is on and how many times this state occurs in the database. The total number of states is also indicated.
Graph: the right panel contains all the information about the Chi2 or G independence tests and the computed degrees of freedom. The missing values are not taken into account in the computation of the independence test, but the weights are.
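The statistic shown in this panel can be illustrated with a minimal Pearson chi-square computation on a contingency table of (possibly weighted) counts. This is a generic sketch of the test, not BayesiaLab's implementation; the G test differs only in the formula of the statistic.

```python
def chi2_and_dof(table):
    """Pearson chi-square statistic and degrees of freedom for a
    contingency table given as a list of rows of (weighted) counts.
    Weights simply enter the counts, as noted in the text above."""
    row_sums = [sum(r) for r in table]
    col_sums = [sum(c) for c in zip(*table)]
    total = sum(row_sums)
    chi2 = 0.0
    for i, row in enumerate(table):
        for j, obs in enumerate(row):
            exp = row_sums[i] * col_sums[j] / total   # expected count
            chi2 += (obs - exp) ** 2 / exp
    dof = (len(table) - 1) * (len(table[0]) - 1)
    return chi2, dof
```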
Graph: the horizontal lines represent the limits of the node's intervals. When you move the mouse over the graph, the information in the top panel is updated: it displays the coordinates of the cursor in the graph. The total number of displayed values is also indicated. You can zoom on the graph by pressing the left button, dragging the mouse, and then releasing the button: the selected area is magnified. To remove the zoom, double-click on the graph; the default view is restored.
Graph: the black lines represent the limits of the nodes' intervals. When you move the mouse over the graph, the information in the top panel is updated: it displays the coordinates of the cursor in the graph. The total number of displayed values is also indicated. If you click inside a point (or several, if the points overlap), a dialog appears containing the rows of the database corresponding to the selected points.
You can zoom on the graph by pressing the left button, dragging the mouse, and then releasing the button: the selected area is magnified. To remove the zoom, double-click on the graph; the default view is restored.
Graph: the black lines represent the limits of the vertical node's intervals. When you move the mouse over the graph, the information in the top panel is updated: it displays the coordinates of the cursor in the graph. The total number of displayed values is also indicated. If you click inside a point (or several, if the points overlap), a dialog appears containing the rows of the database corresponding to the selected points.
You can zoom on the graph by pressing the left button, dragging the mouse, and then releasing the button: the selected area is magnified. To remove the zoom, double-click on the graph; the default view is restored.
Graph: the black lines represent the limits of the nodes' intervals. When you move the mouse over the graph, the information in the top panel is updated: it displays the coordinates of the cursor in the graph. The total number of displayed values is also indicated. If you click inside a point (or several, if the points overlap), a dialog appears containing the rows of the database corresponding to the selected points.
You can zoom on the graph by pressing the left button, dragging the mouse, and then releasing the button: the selected area is magnified. To remove the zoom, double-click on the graph; the default view is restored. The right panel displays the color associated with each state of the third variable. You can modify the colors directly by clicking on each colored square.
Graph: the black lines represent the limits of the nodes' intervals. When you move the mouse over the graph, the information in the top panel is updated: it displays the coordinates of the cursor in the graph. The total number of displayed values is also indicated. If you click inside a bubble (or several, if the bubbles overlap), a dialog appears containing the rows of the database corresponding to the selected bubbles.
You can zoom on the graph by pressing the left button, dragging the mouse, and then releasing the button: the selected area is magnified. To remove the zoom, double-click on the graph; the default view is restored. The right panel displays the color of the bubbles. You can change it by clicking on it.
3. Edit
Undo: (modeling mode) Redo: (modeling mode)
Select All: Select all nodes and arcs.
Select Nodes:
All: Select all the nodes.
Discrete: Select the discrete nodes.
Continuous: Select the continuous nodes.
Constraint: Select the constraint nodes.
Decision: Select the decision nodes.
Utility: Select the utility nodes.
Excluded: Select the excluded nodes.
Disconnected: Select the disconnected nodes.
Missing Values: Select the nodes whose percentage of missing values is greater than a given threshold.
Assessments: Select all the nodes having at least one assessment in their conditional probability tables.
Select Arcs:
All: Select all the arcs.
Fixed: Select the fixed arcs.
Temporal: Select the temporal arcs.
Not Oriented: Select the non-oriented arcs.
Invert Selection:
All: Select the arcs and nodes that are not selected and unselect the others.
Nodes: Select the nodes that are not selected and unselect the others.
Arcs: Select the arcs that are not selected and unselect the others.
Delete Selection (modeling mode): Delete the selected nodes and arcs.
Delete (modeling mode):
All Arcs: Delete all arcs.
Unfixed Arcs: Delete all unfixed arcs.
Disconnected Nodes: Delete all disconnected nodes.
Virtually Disconnected Nodes (KL Force): Delete all virtually disconnected nodes; available only if the Arc force analysis has been run and the Arc force trim has been used in validation mode.
Edit Structural Coefficient (any mode): opens a dialog box to change the structural coefficient, from 0 to 150. The default value is 1.
Edit Costs (any mode): opens the dialog box that associates a cost with the observation of a variable (see Cost management).
Edit Classes (any mode): opens the dialog box for creating and editing the classes associated with the nodes (see Classes management).
Edit Constants (modeling mode): opens the dialog box for creating and editing the constants used in the formulas describing the probability distributions of the nodes (see Constants management).
Edit the Forbidden Arcs (modeling mode): opens the dialog box for declaring forbidden arcs in the network's structure (see Forbidden arcs management).
Edit the Temporal Indices (modeling mode): opens the dialog box for editing the temporal indices associated with the nodes of the network (see Temporal indices).
Edit State Virtual Numbers (modeling mode): opens the dialog box for editing the state virtual numbers associated with nodes and used for learning.
Edit Local Structural Coefficients (modeling mode): opens the dialog box for editing the local structural coefficient of each node; this coefficient is used for structural learning.
Edit Experts (modeling mode): opens the dialog box for editing the experts of the network; experts are used for assessment sessions.
Use Time Variable (modeling mode): allows using the parameter variable that represents the time in the equations.
Cut: Cut the selected nodes and put them in the clipboard. Copy: Copy the selected nodes in the clipboard. Paste: Paste the nodes that are in the clipboard (a right click before pasting allows specifying the destination). Search: Search for nodes, arcs and edges by the names of the nodes and/or the classes.
4. View
Modeling Mode: mode where the graph visualization panel is visible and where actions of modeling and learning are carried out. Validation Mode: mode where the graph visualization panel and the monitor visualization panel are visible and where actions of validation and exploitation of the networks are carried out.
Automatic Positioning: uses original algorithms to lay out the network as well as possible. A slider is added to the tool bar to set the arc length.
1. The Symmetric algorithm uses repulsive and attractive forces to define the graph layout. It is a very effective algorithm that returns good layouts for moderately connected graphs. The Settings allow modifying the parameters of this algorithm.
2. The Dynamic algorithm is particularly efficient with weakly connected networks. It tries to place the parents above their children, to avoid crossing arcs, and to make the length of each arc proportional to its force (the stronger the probabilistic relation represented by the arc, the shorter the arc) when the automatic positioning is launched in Validation mode with the Arc Analysis option active.
3. The Genetic algorithm is useful for highly connected networks. It uses an evaluation function based on: the relationship between the nodes (parents try to get a position above their children), the verticality of the arcs, the overlapping of the nodes, the force of the arcs when the automatic positioning is launched in Validation mode with the Arc Analysis option active, and the intersections of the arcs with other arcs and with the nodes. The Settings allow weighting the evaluation function parameters and changing those of the genetic algorithm.
4. The Mutual information mapping is only available for fully unconnected networks with an associated database. This layout algorithm computes the mutual information matrix and then uses the genetic algorithm to find a global node mapping in which the proximity of two nodes is inversely proportional to their mutual information. The Settings allow weighting the genetic algorithm parameters.
5. The Random positioning assigns a random position to each node.
Zoom: zooming the nodes
Zoom in: increase the size of the network.
If a node or a group of nodes is selected, the selection will be centered in the window (if possible).
Zoom out: decrease the size of the network. If a node or a group of nodes is selected, the selection will be centered in the window (if possible). Default zoom: go back to the default zoom level. If a node or a group of nodes is selected, the selection will be centered in the window (if possible). Best fit: move and adjust the size of the network in order to fit the window. If a node or a group of nodes is selected, the selection will fit the window. Center: center the network in the window. Horizontal mirror: perform a horizontal inversion of the positions of the nodes. Vertical mirror: perform a vertical inversion of the positions of the nodes. Top left corner: move the graph in the top left corner of the window.
Hide Node Names: the node names are no longer displayed under the nodes. Hide information: hide the comment indicators of nodes and arcs, as well as the missing values, filtered state, and error indicators of each node. Display node comments: display the comment of the nodes over them. Display arc comments: display the comment of the arcs over them. Display node tags: display the color tags of the nodes. Display arc tags: display the color tags of the arcs. Display the node's image: display the image associated with a node instead of its default representation. Display grid: display the positioning grid; the spacing can be modified through the display preferences. Display the network's skeleton: hide the arrowheads of the arcs in order to avoid invalid causal interpretation of the arc directions. Only in Validation mode.
5. Learning
This menu gives access to the different learning algorithms.
Missing value processing: allows choosing the missing value processing algorithm: Static completion, Dynamic completion, or Structural EM.
Stratification: when the target value is very weakly represented (as fraud usually is, for example), stratification allows modifying the probability distribution of the target variable (by using the internal weights associated with the states). This modification of the probability distribution can make it possible to learn a structurally more complex network. Once the structure is learned, the parameters (i.e. the probability tables) are estimated on the unstratified data. In the following dialog box, you indicate the desired proportion of each state of the specified node; the initial value corresponds to the proportion in the database. Simply move the slider or edit the value directly for each state.
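The reweighting behind stratification can be sketched as follows: to obtain the requested proportions, each row is internally weighted by the ratio between the wanted and the observed proportion of its target state. The function name and the dictionary representation are illustrative, not BayesiaLab's API.

```python
def stratification_weights(counts, wanted):
    """Per-state weights turning the observed state counts into the
    requested proportions (`wanted` sums to 1): each row of state s
    receives the weight wanted[s] / observed_proportion[s]."""
    total = sum(counts.values())
    return {s: wanted[s] / (counts[s] / total) for s in counts}
```

For example, a fraud state observed in 10% of the rows and requested at 50% would give fraud rows a weight of 5, making the rare state heavy enough for the structural learning step described below.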
When stratification is active, the icon is displayed in the status bar. The stratification can be removed by right-clicking on this icon and choosing the corresponding item in the contextual menu. Estimation of the probabilities: updates the probability tables using the frequencies of the cases observed in the database, smoothed or not depending on the user choice defined in the settings. If the database contains missing values, this algorithm also launches the missing value processing. At the end of the estimation, the score of the Bayesian network (structure and new probabilities) is displayed in the console and automatically inserted into the network's comment. Structural learning of Bayesian networks: a broad set of learning algorithms is proposed to solve any kind of Data Mining task: unsupervised learning for discovering all the probabilistic relations in the data, supervised and semi-supervised learning for the characterization of a particular variable, and unsupervised learning to invent new concepts. If the current Bayesian network already has arcs, a dialog box asks whether this network represents a priori knowledge that the learning algorithms have to take into account.
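Conceptually, estimating a probability table from observed frequencies with smoothing looks like the following sketch. It is illustrative only; the function name, data layout and Laplace-style smoothing are assumptions, not BayesiaLab's internals:

```python
# Sketch: estimate P(child | parent) from case frequencies, with an
# optional smoothing constant added to every count.
from collections import Counter, defaultdict

def estimate_cpt(cases, parent, child, child_states, smoothing=1.0):
    """cases: list of dicts {node_name: state}. Returns P(child | parent)
    as {parent_state: {child_state: probability}}."""
    counts = defaultdict(Counter)
    for case in cases:
        counts[case[parent]][case[child]] += 1
    cpt = {}
    for p_state, ctr in counts.items():
        total = sum(ctr.values()) + smoothing * len(child_states)
        cpt[p_state] = {s: (ctr[s] + smoothing) / total for s in child_states}
    return cpt

cases = [{"Rain": "yes", "Wet": "yes"},
         {"Rain": "yes", "Wet": "yes"},
         {"Rain": "no",  "Wet": "no"}]
cpt = estimate_cpt(cases, "Rain", "Wet", ["yes", "no"], smoothing=1.0)
# with smoothing, P(Wet=yes | Rain=yes) = (2+1)/(2+2) = 0.75
```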
In that case, an equivalent number has to be specified to indicate how many cases have been used for the construction of that network (by learning or expertise). This number is automatically set when the network is learned from a database: it corresponds to the sum of the weights of the database's learning set if weights are associated, or to the number of examples in the learning set. A virtual database representing that knowledge is then added to the current database in order to take this a priori knowledge into account. A new icon is then added in the task bar. All the learning algorithms in BayesiaLab are based on the minimization of the MDL score (Minimum Description Length). This score takes into account both the fit of the Bayesian network to the data and the structural complexity of the graph. The score values are available in the console during learning. The score of a given network can always be computed by updating its probabilities. The excluded nodes are not taken into account by the learning algorithms. The filtered states are taken into account. A compression rate is also available in the console. This indicator measures the data compression obtained by the network with respect to the previous network (usually, the unconnected network). This rate, which corresponds to the "fit of the network to the data" part of the MDL score, gives an indication not only of the probabilistic links present in the network, but also of the strength of these links. For example, with a database containing two binary variables that are strictly identical, the corresponding network will link these variables and describe in the conditional probability table that the value of the second variable is deterministically defined by the first one. The compression rate will then be equal to 50%. Learning policies: if the network has Decision and Utility nodes, action policies can be learned for static Bayesian networks and for Dynamic Bayesian networks.
The scheme used for policy learning relies on dynamic programming and reinforcement learning principles, which makes this learning available with exact as well as approximate inference. Learning policies is only available in Validation mode.
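As a rough sketch, an MDL-style score balances data fit against structural complexity. The following illustrative formula uses a generic MDL/BIC form, not necessarily BayesiaLab's exact definition; it shows why denser structures are only kept when they improve the fit enough:

```python
# Sketch of an MDL-style score: coding cost of the data given the
# network, plus a penalty growing with the number of free parameters.
import math

def mdl_score(log2_likelihood, n_free_params, n_cases):
    """Smaller is better."""
    data_cost = -log2_likelihood                            # fit term, in bits
    structure_cost = 0.5 * n_free_params * math.log2(n_cases)
    return data_cost + structure_cost

# a denser network fits better (-80 vs -100 bits) but pays for 9 extra
# parameters; with 256 cases the simpler network still wins here
sparse = mdl_score(-100.0, 3, 256)    # 100 + 12 = 112 bits
dense = mdl_score(-80.0, 12, 256)     # 80 + 48 = 128 bits
```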
The number of Bayesian networks that can be designed for a given number of variables is so great that it is impossible (except in extreme cases) to carry out an exhaustive search for the best network. The learning algorithms therefore rely on a set of heuristics that reduce the search space. BayesiaLab comes with four structural learning algorithms (discovery of the network structure and estimation of the corresponding conditional probability tables) that are conceptually different, from the fastest to the slowest. As the heuristics used are different, the results of each method can differ. However, as all the learning methods use the same metric (the MDL score), the resulting networks can easily be compared. The score is available in the console and is also automatically inserted into the comment associated with the network. The lower the score, the better the network. For each learning algorithm there are startup options indicating that all arcs will be deleted before learning and allowing the user to define a Structure Equivalent Example Number, i.e. a virtual database of the indicated size representing the current structure:
Arcs that are fixed (the blue ones) are treated as normal arcs, but the forbidden arcs are taken into account. The temporal indices are also taken into account. At the end of the learning, a tree with unoriented arcs is obtained. To obtain a Bayesian network, the arcs are then oriented so as to avoid introducing V-structures. However, the use of fixed arcs can introduce V-structures.
Taboo
Structural learning implementing the Taboo search in the space of Bayesian networks. This method is particularly useful for refining a network built by human experts or for updating a network learned on a different data set. Indeed, beyond taking into account the a priori knowledge represented by a network and an equivalent number of cases, the starting point of Taboo is the current network (and not the fully unconnected network, i.e. without any arc, as is the case for SopLEQ and Taboo Order). Furthermore, arcs that are fixed (the blue ones) remain unchanged and the forbidden arcs are taken into account. The temporal indices are also taken into account. It is possible to define the size of the taboo list as well as the maximum numbers of parents and children allowed. If these options are not checked, they are not taken into account. In addition to the standard options, it is possible to keep the current structure of the network as the starting point of the learning.
EQ
Search method looking for the equivalence classes of Bayesian networks. This method is very efficient because it avoids many local minima and strongly reduces the size of the search space. Like the Taboo algorithm, EQ can start with the current network. Furthermore, the fixed arcs are treated as normal arcs and the forbidden arcs are taken into account. The temporal indices are also taken into account. In addition to the standard options, it is possible to keep the current structure of the network as the starting point of the learning.
Here is a scientific reference for this learning method: P. Munteanu, M. Bendou, The EQ Framework for Learning Equivalence Classes of Bayesian Networks, First IEEE International Conference on Data Mining (IEEE ICDM), San José, November 2001.
SopLEQ
Search method based on a global characterization of the data and on the exploitation of the equivalence properties of Bayesian networks. Arcs that are fixed (the blue ones) are treated as normal arcs, but the forbidden arcs are taken into account. The temporal indices are also taken into account. Here are some scientific references for this learning method (the direct reference is in French):
L. Jouffe, Nouvelle classe de méthodes d'apprentissage de réseaux bayésiens, Journées francophones d'Extraction et de Gestion des Connaissances (EGC), Montpellier, January 2002. P. Munteanu, M. Bendou, The EQ Framework for Learning Equivalence Classes of Bayesian Networks, First IEEE International Conference on Data Mining (IEEE ICDM), San José, November 2001. L. Jouffe, P. Munteanu, New Search Strategies for Learning Bayesian Networks, Proceedings of the Tenth International Symposium on Applied Stochastic Models and Data Analysis, Compiègne, June 2001.
Taboo Order
Learning method using Taboo search in the space of node orderings of the Bayesian network. Indeed, finding the best Bayesian network for a fixed node order is an easy task that only consists in choosing the parents of each node among the nodes that appear before it in the considered order. This is the most complete search method, but also the most time-consuming. Arcs that are fixed (the blue ones) are treated as normal arcs, but the forbidden arcs are taken into account. The temporal indices are also taken into account. In addition to the standard options, it is possible to define the size of the taboo list for Taboo Order.
Naive Bayes: Bayesian network with a predefined architecture in which the target node is the parent of all the other nodes. This structure thus states that the target node is the cause of all the other nodes and that the knowledge of its value makes each node independent of the others. In spite of these strong assumptions, which are false in the majority of cases, the low number of probabilities to estimate makes this structure very robust, with a very short learning time, as only the probabilities have to be estimated.
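The factorization behind this architecture is what makes inference cheap: given the target, every child is independent, so the posterior is a product of one-dimensional terms. A minimal sketch with illustrative names (not BayesiaLab code):

```python
# Sketch of naive Bayes posterior computation: log-space product of the
# prior and one conditional term per observed child.
import math

def naive_bayes_posterior(prior, cpts, evidence):
    """prior: {target_state: p}; cpts: {node: {target_state: {value: p}}};
    evidence: {node: observed_value}. Returns the normalized posterior."""
    log_scores = {}
    for t, p in prior.items():
        log_p = math.log(p)
        for node, value in evidence.items():
            log_p += math.log(cpts[node][t][value])
        log_scores[t] = log_p
    m = max(log_scores.values())
    unnorm = {t: math.exp(s - m) for t, s in log_scores.items()}
    z = sum(unnorm.values())
    return {t: v / z for t, v in unnorm.items()}

posterior = naive_bayes_posterior(
    {"sick": 0.5, "ok": 0.5},
    {"test": {"sick": {"pos": 0.9, "neg": 0.1},
              "ok":   {"pos": 0.2, "neg": 0.8}}},
    {"test": "pos"})
# P(sick | test=pos) = 0.9 / (0.9 + 0.2), about 0.818
```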
Augmented Naive Bayes: partially predefined structure that relaxes the strong conditional independence constraint mentioned above. This architecture is made up of a naive architecture enriched with the relations that hold between the child nodes given the value of the target node (the common parent). The prediction accuracy of this algorithm is better than that obtained with the naive architecture, but the unsupervised search for the child relationships can be time-consuming.
Tree Augmented Naive Bayes: partially predefined structure that relaxes the strong conditional independence constraint mentioned above. This architecture is made up of a naive architecture on which a maximum spanning tree is learned. The prediction accuracy of this algorithm is better than that obtained with the naive architecture, but not as good as that of Augmented Naive Bayes; however, this algorithm is much quicker.
Sons & Spouses: structure in which the target node is the parent of a subset of nodes that may have other parents (spouses). This structure is to some extent an augmented naive architecture in which the set of children is not fixed a priori, but searched for according to the marginal dependence of the nodes on the target. This algorithm thus has the advantage of highlighting the nodes that are not correlated with the target. The learning duration is comparable to that of the augmented naive architecture.
Markov Blanket Learning: algorithm that searches for the nodes belonging to the Markov Blanket of the target node, i.e. its fathers, sons and spouses. Knowing the values of each node of this subset makes the target node independent of all the other nodes. The search for this structure, which is entirely focused on the target node, makes it possible to obtain the subset of really useful nodes much more quickly than the two previous algorithms. Furthermore, this method is a very powerful selection algorithm and is the ideal tool for the analysis of a variable: a restricted number of connected nodes, with different kinds of probabilistic relations: fathers: nodes that bring more information jointly than alone; sons: nodes having a direct probabilistic dependence with the target; spouses: nodes that are marginally independent of the target but become informative once the value of the son is known.
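Once a structure is known, the Markov Blanket described above (fathers, sons, and the sons' other parents) can be read off directly. A small illustrative helper, not a BayesiaLab function:

```python
# Sketch: compute the Markov Blanket of a node from a structure given
# as a parent map.
def markov_blanket(parents, target):
    """parents: {node: set of its parent nodes}. Returns the Markov
    Blanket of target: its fathers, sons, and spouses."""
    children = {n for n, ps in parents.items() if target in ps}
    spouses = set()
    for child in children:
        spouses |= parents[child]          # the sons' other parents
    spouses.discard(target)
    return parents.get(target, set()) | children | spouses

# toy structure: A -> B, A -> C, S -> C  (S is a spouse of A through C)
structure = {"A": set(), "S": set(), "B": {"A"}, "C": {"A", "S"}}
markov_blanket(structure, "A")   # the set {"B", "C", "S"}
```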
Augmented Markov Blanket Learning: algorithm that is initialized with the Markov Blanket structure and that uses an unsupervised search to find the probabilistic relations that hold between the variables belonging to this Markov Blanket. This unsupervised search implies an additional time cost but yields better prediction results than the plain Markov Blanket version.
Minimal Augmented Markov Blanket Learning: the selection of variables performed by the Markov Blanket learning algorithm is based on a heuristic search. The set of selected nodes can therefore be non-minimal, especially when there are various influence paths between the nodes and the target. In that case, the target analysis takes too many nodes into account. By applying an unsupervised learning algorithm to the set of selected nodes, the Minimal Augmented Markov Blanket learning reduces this set of nodes, which results in a more accurate target analysis.
However, if the task is a pure prediction task (as, for example, a scoring function), the Augmented Markov Blanket algorithm is usually more accurate than its Minimal version since it uses more "pieces of evidence". Semi-Supervised Learning: unsupervised learning algorithm that searches for the relationships between the nodes located within a predefined distance of the target. This distance is computed by using the Markov Blanket learning algorithm. The semi-supervised learning algorithm learns a network fragment centered on the target variable. This algorithm is very useful for tasks that involve a lot of nodes, as for example in micro-array analysis (thousands of genes), and for prediction tasks where the Markov Blanket nodes have missing values, as these nodes no longer separate the target node from the other nodes.
In addition to the standard options, the interface allows defining the search depth from the target node:
5.3. Clustering
This menu gives access to algorithms that allow clustering data in an unsupervised way, in order to find partitions of homogeneous elements. These algorithms are based on a naive architecture in which node CLUSTERS, which is used to model the partitions, is the parent of all the other variables. Unlike supervised learning, the values of node CLUSTERS are never observed in a database. All these algorithms then rely on Expectation-Maximization methods for the estimation of these missing values.
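The Expectation-Maximization idea used here can be sketched on a toy naive model with binary children. This is illustrative code only, not BayesiaLab's implementation: the E-step imputes the never-observed CLUSTERS values as probabilities, and the M-step re-estimates the parameters from those soft assignments.

```python
# Toy EM for a latent cluster node with binary children.
import random

def em_cluster(data, k, iters=50, seed=0):
    """data: list of equal-length tuples of 0/1 values. Returns
    (priors, thetas) with thetas[c][j] = P(x_j = 1 | cluster c)."""
    rng = random.Random(seed)
    d = len(data[0])
    priors = [1.0 / k] * k
    thetas = [[rng.uniform(0.25, 0.75) for _ in range(d)] for _ in range(k)]
    for _ in range(iters):
        # E-step: probability of each cluster for each case
        resp = []
        for x in data:
            w = []
            for c in range(k):
                p = priors[c]
                for j, v in enumerate(x):
                    p *= thetas[c][j] if v else 1.0 - thetas[c][j]
                w.append(p)
            z = sum(w)
            resp.append([wi / z for wi in w])
        # M-step: re-estimate priors and per-cluster Bernoulli parameters
        for c in range(k):
            nc = max(sum(r[c] for r in resp), 1e-9)
            priors[c] = nc / len(data)
            for j in range(d):
                t = sum(r[c] * x[j] for r, x in zip(resp, data)) / nc
                thetas[c][j] = min(max(t, 1e-6), 1.0 - 1e-6)  # avoid 0/1
    return priors, thetas

priors, thetas = em_cluster([(1, 1), (1, 1), (1, 1), (0, 0), (0, 0), (0, 0)], 2)
```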
Output: it is possible to create a cluster node with ordered numerical states. These values are the mean of the scores of each connected node for each state of the cluster node. This score is weighted by the binary mutual information of each node in order to be more representative of the relationships. If two of these values are strictly identical, an epsilon is added to one of them to obtain two different values. The excluded nodes are not taken into account for the computation of the numerical values. Clustering Settings: the assistant gives access to the different search methods: Fixed Class Number: the algorithm tries to segment the data according to a given number of classes (ranging from 2 to 127). However, it is possible to obtain fewer clusters than desired; Automatic Selection of the Class Number: a random walk is used to find the optimal number of classes, starting with the specified number of clusters and increasing that number until empty or unstable clusters are obtained, or until the specified maximum number of clusters is reached. The random walk is guided by the results obtained at each step; Options:
Sample Size: makes it possible to search for the optimal number of classes on data subsets to improve the convergence speed (one sample per step/trial). The partition obtaining the best score is then used as the initial partition for the search on the entire data set. It is possible to indicate either a percentage or the exact number of lines to use. Steps Number: the number of steps of the random walk. Since the search can be stopped by clicking on the red light of the status bar while preserving the best clustering, this number can be set very high. Maximum Drift: indicates the maximum difference between the cluster probabilities during learning and those obtained after missing value completion, i.e. between the theoretical distribution during learning and the effective distribution after imputation over the learning data set. Minimum Cluster Purity in Percentage: defines the minimum purity allowed for a cluster to be kept. Minimum Cluster Size in Percentage: defines the minimum size allowed for a cluster to be kept. Edit Node Weights: a button displays a dialog box for editing the weights associated with each variable.
These weights, with default value 1, are associated with the variables and guide the clustering. A weight greater than 1 implies that the variable carries more weight during the clustering. A zero weight makes the variable purely illustrative. At the end of clustering, an algorithm automatically determines whether one of the Clusters node's states is a filtered state. If so, this state is marked as filtered. At the end of the segmentation, an automatic analysis of the obtained segmentation is carried out and returns a textual report. This report is a Target Analysis Report, but contains some additional information. It is made of: a summary of the learning characteristics (method and parameters, best number of clusters, score and time to find the solution); a sorted list of the obtained results (number of clusters and corresponding score); a list of the clusters sorted according to the marginal probabilities of each cluster (cf. Target Analysis Report); a list of the clusters sorted according to the number of examples actually associated with each cluster when using a decision rule based on the maximum likelihood for each case described in the learning set; a list of the clusters sorted according to the purity of each cluster. The purity corresponds to the mean of the cluster probability computed over the examples of the learning set associated with the cluster. This list also comes with the neighborhood of each cluster. The neighborhood of a cluster is the set of clusters that have a non-zero probability when an example is associated with that cluster;
A list of the nodes sorted according to the quantity of information brought to the knowledge of the Cluster node (cf. Target Analysis Report); the probabilistic profile of each cluster (cf. Target Analysis Report). The Mapping button of the report window displays a graphical representation of the created clusters:
This graph displays three properties of the found clusters: the color represents the purity of the clusters: the bluer a cluster, the purer it is; the size represents the prior probability of the cluster; the distance between two clusters represents the mean neighborhood of the clusters. The rotation buttons at the bottom right allow rotating the entire graph.
Note
To make the obtained clusters easier to understand, and if at least one variable used in the clustering has numerical values associated with its states, long names are automatically associated with the states of the Cluster node. Each name contains the mean value of all the clustered variables obtained when observing that state of the Cluster.
In the Output area, the wizard allows selecting the directory where the various generated networks will be saved (one network per class [Factor_i], plus the final network with all the latent variables). The intermediate and final databases are saved with the generated networks. The continuous values can be saved with the databases. This wizard also allows choosing whether to add all the nodes of the initial network to the final one, and whether to display the intermediate report generated for each intermediate network. In any case, the intermediate reports are always saved in the target directory. It is possible to create, for each network, a cluster node with ordered numerical states. These values are the mean of the scores of each connected node for each state of the cluster node. If two of these values are strictly identical, an epsilon is added to one of them to obtain two different values. As in data clustering, the number of values of the latent variables can be fixed a priori or found by a random walk. It can also be defined as being equal to the average number of values of the variables belonging to [Factor_i]. The remainder is strictly identical to data clustering. At the end of each clustering, an algorithm automatically determines whether one of the [Factor_i] node's states is a filtered state; if so, this state is marked as filtered. At the end of each clustering, an automatic analysis of the obtained Bayesian network is carried out and a target analysis report is generated. This report is identical to the one generated by data clustering. It can be displayed if the corresponding option has been selected. In any case, it is always saved in the target directory.
At the end of the last clustering, a synthetic report is generated. At the beginning of the report, a summary indicates the number of factors found, and the minimum, average and maximum number of clusters, mean purity and contingency table fit. This report describes, for each latent variable, the mean purity, the contingency table fit and the deviance. These indices are described in the Correlation with the Target Node report. They can be used to qualify the clustering result, i.e. to measure how well the Joint Probability Distribution is represented through each [Factor_i] variable. The report also describes the distribution of each latent variable's values over the learning set, and the list of the nodes sorted according to the quantity of information brought to its knowledge (cf. Target analysis report). The final network is automatically opened and the final database is associated with it. If the initial database contains a test set, it is also transferred, and the missing value imputation is performed on the new [Factor_i] variables. Finally, the final database is saved in the target directory.
The button activates Exploration (testing some random actions) during the temporal simulation. The button deactivates Exploration. The button activates the Learning of the state/action qualities during the temporal simulation. The button deactivates Learning. The parameters of the reinforcement learning algorithm that will update the quality of each state/action pair based on the discounted sum of expected utilities can be changed by using the settings.
Caution
The complexity of this type of problem makes it impossible to guarantee that the obtained policy is the optimal one. It is thus advised to carry out several tests and to keep the policy giving the best results.
6. Inference
This menu gives access to: the adaptive questionnaire over a network, interactive inference over a network, interactive Bayesian updating of a network, batch labeling of a network, batch inference over a network, batch most probable explanation labeling over a network, batch most probable explanation inference over a network, batch joint probability over a network, and batch likelihood over a network; it also allows choosing the inference type for the active network. The different batch analyses can be performed on a database loaded from a text file or a JDBC database. Whatever the database, it cannot modify the structure of the network when it is imported, contrary to the classic database association. However, if a database is already associated with the network, these batch analyses can be performed on it. Two inference methods are proposed: Exact inference by junction trees (by default). This inference method is based on the construction of a new structure (the junction tree) before the first inference is made. This construction can be costly in time and memory depending on the network size and connectivity. On the other hand, once the junction tree is constructed, inference is performed very quickly. This tree has to be reconstructed after each modification of the network (structure or parameters). Approximate inference by Likelihood Weighting. When the construction cost of the junction tree is prohibitive (not enough memory), it is possible to use approximate inference. This method is based on the law of large numbers and uses network simulation to approximate the probabilities. While the cost of exact inference lies in the junction tree construction (when passing from Modeling to Validation mode only), approximate inference requires very little memory but has a time cost at each inference. Some networks are too complex for exact inference to be performed on them.
The junction tree may be too big to be represented in memory and the inference time can be extremely long. In this case, when the user asks to enter inference mode, a dialog box is displayed that proposes several options:
Using approximate inference avoids the memory size problem, but the exactness of the computation is lost, as well as some analyses that are designed to work only with exact inference. A complexity reduction algorithm allows removing the least important arcs in the network. To do this, it uses the current database, or generates one according to the probability distributions, in order to compute the importance of each arc in the network. The least important arcs are removed until exact inference becomes possible in memory and time. It is possible to go back to Modeling mode in order to modify the network structure by hand so that it becomes usable. It is also possible to continue with exact inference without taking the warning into account.
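The Likelihood Weighting method mentioned above can be illustrated on a toy two-node network. The network, probabilities, and function below are hypothetical, not BayesiaLab code: evidence nodes are not sampled; instead, each simulated case is weighted by the likelihood of the evidence.

```python
# Likelihood weighting sketch on a toy network Rain -> WetGrass.
import random

def likelihood_weighting(n_samples, evidence_wet=True, seed=0):
    """Estimate P(Rain | WetGrass = evidence_wet) with P(Rain)=0.2,
    P(Wet|Rain)=0.9, P(Wet|~Rain)=0.1."""
    rng = random.Random(seed)
    p_wet_given_rain = {True: 0.9, False: 0.1}
    num = den = 0.0
    for _ in range(n_samples):
        rain = rng.random() < 0.2            # sample the non-evidence node
        likelihood = p_wet_given_rain[rain]  # evidence is weighted, not sampled
        w = likelihood if evidence_wet else 1.0 - likelihood
        num += w * rain
        den += w
    return num / den

estimate = likelihood_weighting(50_000)
# converges, by the law of large numbers, to 0.18 / 0.26 (about 0.692)
```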
It lets the user indicate the target node and possibly a target state in order to perform an adaptive questionnaire based on a particular target value. It is also possible to specify the number of monitors to display (the ordered list of questions), to force the removal of previous variable observations, and to save these options in the BayesiaLab settings. This wizard also allows editing the costs associated with the observation of the variables. Once the adaptive questionnaire is launched, a new toolbar is displayed:
The button resets the adaptive questionnaire. The button all the monitors and removes all the observations.
The adaptive questionnaire automatically and dynamically organizes the monitors while taking into account the information each question contributes to the knowledge of the target variable and the corresponding cost of the questions. The target monitor appears with a pink background at the top of the list, followed by the monitors ranked by their relevance.
The monitors of the observed variables appear at the end of the list (grayed) in the order of observation. The last observed variable then appears just after the last most relevant variable. It is possible to remove an observation or to change the observed value by double-clicking on the monitor. It is also possible to observe a variable that does not appear in the proposed list by monitoring it manually. The translucent nodes are never proposed to the user.
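One common way to rank questions in such a questionnaire is by the information a variable brings on the target per unit of cost. The exact criterion BayesiaLab uses is not shown here; the idea can be sketched as:

```python
# Sketch: rank candidate questions by mutual information with the
# target divided by observation cost.
import math

def mutual_information(joint):
    """joint: {(x, y): p} for a question/target pair. Returns I(X;Y) in bits."""
    px, py = {}, {}
    for (x, y), p in joint.items():
        px[x] = px.get(x, 0.0) + p
        py[y] = py.get(y, 0.0) + p
    return sum(p * math.log2(p / (px[x] * py[y]))
               for (x, y), p in joint.items() if p > 0)

def rank_questions(candidates):
    """candidates: {name: (joint_with_target, cost)}. Most valuable first."""
    return sorted(candidates, reverse=True,
                  key=lambda n: mutual_information(candidates[n][0]) / candidates[n][1])

ranking = rank_questions({
    "A": ({("y", "y"): 0.5, ("n", "n"): 0.5}, 1.0),        # fully informative
    "B": ({("y", "y"): 0.25, ("y", "n"): 0.25,
           ("n", "y"): 0.25, ("n", "n"): 0.25}, 1.0)})     # independent of target
# "A" is ranked first
```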
The button returns to the first example of the data source; the button navigates to the last one. The button goes to the previous example when possible, and the button goes to the next one. The text field indicates the index of the current example. It is possible to enter an index in the field to go to it directly. The button stops the interactive inference and removes all the observations. If the data source is an evidence scenario file, the comment associated with each evidence set, if any, is displayed in the status bar of the network. In the same way, if a database with row identifiers is used, they are displayed in the status bar. If the inference is done from an evidence scenario file, a right click on the index text field displays the list of the evidence sets contained in the file. A click on a line sets the corresponding observations:
If the inference is done from a database that contains row identifiers, a right click on the index text field displays a floating panel allowing the user to search among the identifiers. The search is done via a text field. The wildcard characters ? and * can be used: ? replaces one and only one character; * replaces zero or more characters. The Case Sensitive option makes the search case-sensitive. After pressing Enter, the search is performed and the list of results is displayed. The number of matching rows is displayed at the bottom of the panel. A click on a line sets the corresponding observations:
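The ? and * wildcards behave like shell-style globbing, so the identifier search can be sketched with the standard fnmatch module. This is an illustrative helper, not the actual BayesiaLab search:

```python
# Sketch of identifier search with ? and * wildcards.
import fnmatch

def search_identifiers(ids, pattern, case_sensitive=False):
    """? matches exactly one character, * matches zero or more."""
    if not case_sensitive:
        pattern = pattern.lower()
        return [i for i in ids if fnmatch.fnmatchcase(i.lower(), pattern)]
    return [i for i in ids if fnmatch.fnmatchcase(i, pattern)]

rows = ["Case-001", "Case-002", "Test-01"]
search_identifiers(rows, "case-*")            # both Case rows (case ignored)
search_identifiers(rows, "case-*", True)      # no match: wrong case
search_identifiers(rows, "?est-01", True)     # Test-01
```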
Caution
Sometimes a fixed probability distribution cannot be set exactly because the algorithm used fails to converge towards the target distribution. In this case a warning dialog box is displayed and an information message is also written in the console. In the following picture, Cancer is the target node (pink background) and is not observed. The corresponding value in the database is No (sky blue) and corresponds to the value predicted by the network (99.97%). The node TbOrCa is not observed because it is declared as not observable (mauve background), and the corresponding value in the database is False (sky blue). The node Smoking is not observed because the corresponding value is missing in the database:
Thus, it is possible to see interactively the behavior of the network and to check its validity.
The button returns to the first example of the database and resets the probability distributions of the "not observable" nodes. The button performs an update from the current index to the last one in the database. This process can be stopped while running by clicking on the red light of the Status bar. The button steps to the next example. The text field indicates the index of the current example. It is possible to enter an index in the field to perform the update from the current index to the new index. If the new index is lower than the current one, the probability distributions are reset and the update goes from index 0 to the specified one. The button validates the updated conditional probability tables. The button stops the interactive updating and reinitializes the conditional probability tables; it also removes all the observations. Going back to the Modeling mode has the same effect. If the data source is an evidence scenario file, the comment associated with each evidence set, if any, is displayed in the status bar of the network. In the same way, if a database with row identifiers is used, they are displayed in the status bar. If the updating is done from an evidence scenario file, a right click on the index text field displays the list of the evidence sets contained in the file. A click on a line performs the update from the current index up to the specified index, taking into account the corresponding observations:
If the updating is done from a database that contains row identifiers, a right click on the index text field displays a floating panel allowing the user to search among the identifiers. The search is done via a text field. The wildcard characters ? and * can be used: ? replaces one and only one character; * replaces zero or more characters. The Case Sensitive option makes the search case-sensitive. After pressing Enter, the search is performed and the list of results is displayed. The number of matching rows is displayed at the bottom of the panel. A click on a line performs the update from the current index up to the specified index, taking into account the corresponding observations:
Let us take an example with a network made of two nodes. The first node Alpha is continuous and contains 10 intervals from 0 to 1. The other node named Measure is Boolean (its values are True and False) and is the son of Alpha. The conditional probability table of Measure is obtained from the probabilistic formula: P(?Measure?|?Alpha?) = If(?Measure?, ?Alpha?, 1-?Alpha?).
We associate with the network a database containing the values of the node Measure. After each observation, the probability distribution of Alpha is updated to take into account its initial distribution and the set of all the observations of Measure that have been set. Here is a short simulation: Initial distributions:
Example 0:
Example 1:
Example 2:
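The updating of Alpha after each observation of Measure can be reproduced with a minimal Bayes-rule sketch. This assumes each of the 10 intervals is represented by its midpoint and starts from a uniform prior; the observation sequence (True, True, False) is hypothetical, not taken from the screenshots above.

```python
# Hedged sketch of the Alpha/Measure updating described above.
# Alpha has 10 intervals on [0, 1]; each interval is represented by its
# midpoint, and P(Measure = True | Alpha) = Alpha, per the formula
# If(?Measure?, ?Alpha?, 1 - ?Alpha?).
midpoints = [(i + 0.5) / 10 for i in range(10)]
posterior = [0.1] * 10  # uniform initial distribution over the intervals

def observe(posterior, measure):
    """Update the distribution over Alpha after one observation of Measure."""
    likelihood = [a if measure else 1 - a for a in midpoints]
    unnorm = [p * l for p, l in zip(posterior, likelihood)]
    z = sum(unnorm)
    return [u / z for u in unnorm]

for measure in (True, True, False):  # hypothetical Examples 0, 1, 2
    posterior = observe(posterior, measure)

best = midpoints[posterior.index(max(posterior))]
print(round(best, 2))  # 0.65: the interval maximizing a^2 * (1 - a)
```

After two True observations and one False, the posterior over Alpha is proportional to a²(1−a), which peaks near 2/3, i.e. the interval centered at 0.65.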
Caution
If "not observable" nodes are marginally independent but are linked, directly or not, to a common measure node, it is necessary to add an arc between these nodes to indicate that they are conditionally dependent given the observations. In the example below, we simply have to add an arc between the hyper-parameters a and b of the Beta distribution, without modifying the conditional probability table, in order to keep the marginal independence.
Caution
Sometimes a fixed probability distribution cannot be matched exactly, because the algorithm used fails to converge towards the target distribution. In this case, a warning dialog box is displayed and an information message is also written in the console.
The results are stored in an exploitation file that takes the selected fields of the input file and adds two new fields: one for the predicted value, the other for its corresponding probability. If the data source is an external database, the fields of the input file that are included in the exploitation file are selected via the wizard illustrated below:
If the data source is the associated database, a dialog allows the user to choose which part of the database (all, learning, or test) the operation is applied to and which nodes are saved in the destination file. It is also possible to choose whether the states' long names are used and whether the continuous values are saved:
The observations are made according to the cost of each node: if a node is not observable, it will not be observed, even if there is a corresponding value in the associated database.
If the data source is the associated database, a dialog allows the user to choose which part of the database (all, learning, or test) the operation is applied to and which nodes are saved in the destination file. It is also possible to choose whether the states' long names are used and whether the continuous values are saved:
The observations are made according to the cost of each node: if a node is not observable, it will not be observed, even if there is a corresponding value in the associated database.
If the data source is the associated database, a dialog allows the user to choose which part of the database (all, learning, or test) the operation is applied to and which nodes are saved in the destination file. It is also possible to choose whether the states' long names are used and whether the continuous values are saved:
If the data source is the associated database, a dialog allows the user to choose which part of the database (all, learning, or test) the operation is applied to and which nodes are saved in the destination file. It is also possible to choose whether the states' long names are used and whether the continuous values are saved:
The observations are made according to the cost of each node: if a node is not observable, it will not be observed, even if there is a corresponding value in the associated database.
If the data source is the associated database, a dialog allows the user to choose which part of the database (all, learning, or test) the operation is applied to and which nodes are saved in the destination file. It is also possible to choose whether the states' long names are used and whether the continuous values are saved:
Sorting the resulting file by this joint probability thus makes it possible to detect atypical records, the outliers (cases with a very low joint probability), while taking all the variables into account.
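The outlier-detection idea above amounts to an ascending sort on the joint-probability field of the exploitation file. A minimal sketch, with hypothetical record identifiers and field names:

```python
# Hypothetical exploitation records: each row carries the joint probability
# computed by the network for that case.
records = [
    {"id": "r1", "joint_probability": 0.0421},
    {"id": "r2", "joint_probability": 0.00003},
    {"id": "r3", "joint_probability": 0.0198},
]
# Ascending sort puts the weakest joint probabilities -- the candidate
# outliers -- first.
outliers_first = sorted(records, key=lambda r: r["joint_probability"])
print(outliers_first[0]["id"])  # 'r2'
```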
If the data source is the associated database, a dialog allows the user to choose which part of the database (all, learning, or test) the operation is applied to and which nodes are saved in the destination file. It is also possible to choose whether the states' long names are used and whether the continuous values are saved:
The observations are made according to the cost of each node: if a node is not observable, it will not be observed, even if there is a corresponding value in the associated database.
7. Analysis
This menu gives access to: several graphic analysis tools that modify the graph view or display diagrams, several analysis reports, tools for displaying the edges (i.e. arcs that can be inverted), two tools for the performance evaluation of networks, a tool for the optimization of the target state, a target optimization tree, and a target interpretation tree. This menu is visible in Validation mode only.
Variable Clustering
This tool clusters the network variables into groups of semantically close variables (from the Analysis menu or the shortcut ). These clusters are built according to the proximity of the nodes in the graph, based on the force of the arcs. A color is automatically associated with each cluster to highlight the clustering:
The number of clusters is automatically computed from the force of the arcs. However, the associated toolbar contains a slider for choosing the desired number of clusters (4 in the previous example). The button displays the hierarchical representation of the current clustering as a dendrogram. It is always possible to interactively modify the number of clusters and observe the result on the dendrogram. A contextual menu allows displaying the comment associated with each node instead of its name; you can also copy the graph as an image. The length of the links joining the clusters is inversely proportional to the strength of the relationship between the two sets of variables: the shorter the link, the stronger the relationship. When the cursor is moved over the junction point of the links, a tooltip containing the value of the link, based on the arc force, is displayed.
The button validates the current clustering and associates with each set of variables a class named [Factor_i]. When the button is pressed, an HTML report of the current clustering is displayed:
Once the classes are created and associated with the clusters, it is possible to perform, in Modeling mode, the multiple clustering that generates, for each class named [Factor_i], a synthetic variable from the nodes belonging to this class.
The number of clusters automatically determined by the algorithm can be limited by the corresponding option in the settings. The user can also modify the stop threshold, i.e. the maximum KL weight a cluster can reach to be kept.
Arc Force
This tool highlights the importance of each arc with respect to the complete structure (from the Analysis menu or the shortcut ). The thickness of an arc is proportional to the importance of the probabilistic relation it represents in the joint probability distribution.
You can use this tool to make translucent all the arcs having a force lower than the indicated value. The buttons allow going back to the previous threshold, going to the next threshold, storing the current arc forces in the arc comments (if those comments are displayed), and stopping the analysis.
The KL force of each arc, as well as its global contribution, is displayed in the arc comments. You have to press the corresponding button in the display toolbar to show them. The filtered values are taken into account when the KL forces are computed.
The thickness of an arc is directly proportional to the mutual information. Three values are displayed in the arcs' comments: in black, the mutual information of the arc; in blue, the relative mutual information in the direction of the arc; in red, the relative mutual information in the opposite direction of the arc.
You can use the slider to change the arc display threshold. The buttons allow going back to the previous threshold, going to the next threshold, storing the mutual information in the arc comments (if those comments are displayed), and stopping the analysis.
If all the arcs of a node become transparent, the node becomes transparent as well.
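Mutual information, the quantity driving arc thickness here, can be computed from a joint probability table. A minimal sketch of the standard definition, not of BayesiaLab's internal computation; the example joint table is hypothetical:

```python
from math import log2

def mutual_information(joint):
    """I(X;Y) in bits, computed from a joint probability table joint[x][y]."""
    px = [sum(row) for row in joint]            # marginal P(X)
    py = [sum(col) for col in zip(*joint)]      # marginal P(Y)
    return sum(
        pxy * log2(pxy / (px[i] * py[j]))
        for i, row in enumerate(joint)
        for j, pxy in enumerate(row)
        if pxy > 0
    )

# Two perfectly dependent binary variables carry 1 bit of mutual information.
print(mutual_information([[0.5, 0.0], [0.0, 0.5]]))  # 1.0
```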
Pearson's Correlation
The association of values with the nodes' states allows computing R, Pearson's linear correlation coefficient, between two nodes linked by an arc (from the Analysis menu or the shortcut ). If the states do not have associated values, values are generated as follows: if the node is continuous, the mean of each interval is used; if the node is discrete with integer or real states, those states are used as values; otherwise, default values from 0 to n-1 are used for a node with n states, so that R can always be computed.
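Given state values and a joint probability table for the two linked nodes, Pearson's R follows from the usual weighted moments. This is a sketch of that computation under the fallback rules just described (indices 0..n-1 when no values exist), not BayesiaLab's implementation:

```python
from math import sqrt

def pearson_r(joint, x_values=None, y_values=None):
    """Pearson correlation between two discrete nodes from their joint
    table joint[i][j]; missing state values default to indices 0..n-1."""
    nx, ny = len(joint), len(joint[0])
    xs = x_values or list(range(nx))
    ys = y_values or list(range(ny))
    px = [sum(row) for row in joint]
    py = [sum(col) for col in zip(*joint)]
    mx = sum(v * p for v, p in zip(xs, px))        # E[X]
    my = sum(v * p for v, p in zip(ys, py))        # E[Y]
    cov = sum(joint[i][j] * (xs[i] - mx) * (ys[j] - my)
              for i in range(nx) for j in range(ny))
    sx = sqrt(sum(p * (v - mx) ** 2 for v, p in zip(xs, px)))
    sy = sqrt(sum(p * (v - my) ** 2 for v, p in zip(ys, py)))
    return cov / (sx * sy)

print(round(pearson_r([[0.4, 0.1], [0.1, 0.4]]), 2))  # 0.6
```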
You can use the slider to change the arc display threshold according to the selected filter button. The buttons allow going back to the previous threshold for the selected correlation, going to the next threshold for the selected correlation, displaying only the arcs having a negative correlation greater than the given threshold in absolute value, displaying only the arcs having a correlation greater than the given threshold in absolute value, displaying only the arcs having a positive correlation greater than the given threshold, storing the current Pearson's correlations (and the associated color) in the arc comments if those comments are displayed, and stopping the analysis.
If all the arcs of a node become transparent, the node becomes transparent as well.
Node Force
This tool highlights the importance of each node with respect to the complete structure (from the Analysis menu or the shortcut ). Three kinds of node forces are computed: the incoming node force, which is the sum of the forces of the entering arcs; the outgoing node force, which is the sum of the forces of the outgoing arcs; and the global node force, which is the sum of the forces of both the entering and the outgoing arcs.
You can use this tool to make translucent all the nodes having a force lower than the indicated value. The buttons allow going back to the previous threshold for the selected force, going to the next threshold for the selected force, computing only the incoming force of the nodes and displaying those greater than the given threshold, computing the global force of the nodes and displaying those greater than the given threshold, computing only the outgoing force of the nodes and displaying those greater than the given threshold, and stopping the analysis.
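The three node forces are simple sums over the arc forces. A sketch with hypothetical arc-force values, following the definitions above:

```python
# Hypothetical arc forces, keyed by (parent, child).
arc_force = {("A", "B"): 0.8, ("A", "C"): 0.3, ("B", "C"): 0.5}

def node_forces(node, forces):
    """Incoming, outgoing, and global node force as sums of arc forces."""
    incoming = sum(f for (p, c), f in forces.items() if c == node)
    outgoing = sum(f for (p, c), f in forces.items() if p == node)
    return incoming, outgoing, incoming + outgoing

print(node_forces("B", arc_force))  # incoming 0.8, outgoing 0.5, global 1.3
```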
The symbol inside each node corresponds to the pattern of the probability distribution of the target value (CLUSTERS), conditionally on each value of this node. This symbol schematizes the probability distribution curve (cf. the pattern of node CATECHOL relative to the associated distribution, obtained through the contextual menu of node CATECHOL: Influence Analysis wrt target node). The symbol indicates that the distribution curve does not correspond to any of the four basic patterns.
Neighborhood Analysis
This kind of analysis allows visualizing, for the selected nodes, the set of nodes belonging to their neighborhood according to the mode chosen in the toolbar:
The nodes that do not belong to the selected node's neighborhood are made translucent and are no longer selectable. When you click on a visible node, the nodes that do not belong to its neighborhood are made translucent. To make all the nodes visible again, simply click anywhere except on a visible node. Through the combo box, it is possible to display: the Markov blanket; the spouses; the parents (a field specifies up to which distance the ancestors are displayed); the children (a field specifies up to which distance the descendants are displayed); the neighbors (a field specifies up to which distance the neighbors are displayed). The number of displayed nodes is indicated in the status bar of the graph window. In the following example, the Markov blanket of the selected node is displayed, i.e. the nodes not belonging to it are made translucent:
Mosaic Analysis
This analysis displays, on a two-dimensional graph, the marginal probabilities of a node for all possible combinations of evidence set on other nodes (from the Analysis menu or the shortcut + ). The Pearson's standardized residual is also computed for each combination. These probabilities are displayed as colored rectangles that can easily be identified and compared with each other. The analysis is performed only on the nodes selected in the network. Depending on the number of selected nodes, the settings dialog may vary slightly; the most complete version is displayed when three nodes are selected. The following version is the simple one:
The selected nodes are displayed in the table and their positions in the graph are shown on the left. It is possible to modify their respective positions by selecting the desired node and using the Up and Down buttons. By default, the variables are displayed in alternating horizontal and vertical positions. With one variable, the graphic represents P(Horizontal0). With two variables, it shows P(Vertical0 | Horizontal0). With three variables, it displays P(Horizontal1 | Vertical0, Horizontal0). With four variables, it displays P(Vertical1 | Horizontal1, Vertical0, Horizontal0), and so forth. If Horizontal Diagram is checked, the graphic is displayed with the first variable in vertical position and all the others in horizontal position, inside a separate chart for each horizontal variable, representing P(Vertical | Horizontal i). If Display P(Horizontal | Vertical) is checked, each graphic represents P(Horizontal i | Vertical). If a database is associated with the network, it is possible to choose the data source used to compute the standardized Pearson's residual. Network: the Structure Equivalent Example Number setting allows simulating a data set in order to compute the residual; this number is, by default, the number of examples used to learn the network. Database: the residual is directly computed from the populations of the associated database. If a database is associated with the network and one of the nodes is hidden (there is no data corresponding to the node), the default data source is the network. If the associated database contains learning and test data, a combo box allows choosing which data will be used (all, learning, or test). The following image is a chart with three variables.
The first variable is the horizontal variable Eyes, the second is the vertical variable Hair, and the third is the horizontal variable Sex. The horizontal and vertical cells represent the marginal probabilities of each variable's states without any evidence set. The central cells represent the conditional probabilities P(Eyes | Hair, Sex). For each cell, the standardized Pearson's residual is computed as: Di = (ni - Ni) / sqrt(Ni). The Chi² statistic equals the sum of the Di². The value of the Chi² test, the degrees of freedom, and the associated independence probability are displayed at the top of the graph. It is possible to display the result of the G-test instead of the Chi² test by modifying the corresponding settings in the preferences of the statistical tools. A tooltip is displayed when the cursor hovers over a cell. It contains the list of the evidence corresponding to this cell, as well as the joint probability and the conditional probability of the evidence. The population of the cell and the value of the residual are also displayed.
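The residual formula above can be checked numerically. A small sketch with hypothetical observed and expected cell counts (the expected counts would come from the reference model):

```python
from math import sqrt

def standardized_residuals(observed, expected):
    """Di = (ni - Ni) / sqrt(Ni) for each cell; Chi2 = sum of Di**2."""
    d = [(n - N) / sqrt(N) for n, N in zip(observed, expected)]
    chi2 = sum(x * x for x in d)
    return d, chi2

# Hypothetical cell counts: observed populations vs. counts expected
# under the reference (unconnected) model.
d, chi2 = standardized_residuals([30, 10, 20], [20, 20, 20])
print([round(x, 2) for x in d], round(chi2, 2))  # [2.24, -2.24, 0.0] 10.0
```

A positive Di flags over-representation of the cell, a negative Di under-representation, matching the color code described below.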
The result display panel can also be modified. The option Display Standardized Pearson's Residual toggles between the classic display, with colors corresponding to the states of the first horizontal variable, and the display using the color code of the Pearson's standardized residual. The color code is as follows: simulated data in very significant over-representation (D > 4); simulated data in significant over-representation (D > 2); simulated data in non-significant over-representation (D > 0);
simulated data in non-significant under-representation (D < 0); simulated data in significant under-representation (D < -2); simulated data in very significant under-representation (D < -4); absence of simulated data.
The option Resizable Graphic allows enlarging or reducing the graphic according to the window's size. If this option is unchecked, the graphic has a predefined constant size and scroll bars are displayed if necessary. There are two ways of separating the cells of the graph: Automatic Gap, computed according to the depth and the number of states of each variable (the greater the depth, the smaller the gap); Constant Gap, where the user indicates the number of pixels between two cells, regardless of the variable's depth. 1. 1-dimensional charts:
On the left, the simple chart; on the right, the chart with the Pearson's standardized residual. The width of the cells corresponds to the marginal probability of each state of the horizontal variable; this is the same as the monitor of this variable. 2. 2-dimensional charts:
On the left, the simple chart; on the right, the chart with the Pearson's standardized residual. The width of the cells corresponds to the marginal probability of each state of the horizontal variable, P(H). The height of the cells is the conditional probability of the vertical variable knowing the horizontal variable, P(V | H). The area of each cell thus represents the joint probability P(V, H).
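The 2-dimensional cell geometry follows directly from those definitions: width P(H), height P(V | H), hence area P(V, H). A sketch with a hypothetical joint table:

```python
# Hypothetical joint probabilities P(V, H) for a 2-dimensional mosaic.
joint = {("blond", "blue"): 0.3, ("blond", "brown"): 0.1,
         ("dark", "blue"): 0.1, ("dark", "brown"): 0.5}

def cell_geometry(v, h, joint):
    """Width = P(H), height = P(V | H); the cell area is then P(V, H)."""
    p_h = sum(p for (vv, hh), p in joint.items() if hh == h)
    width = p_h
    height = joint[(v, h)] / p_h
    return width, height

w, h = cell_geometry("blond", "blue", joint)
print(round(w * h, 2))  # 0.3, i.e. the joint probability P(V, H)
```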
3. 3-dimensional charts:
On the left, the simple chart; on the right, the chart with the Pearson's standardized residual. The width of the cells corresponds to P(H1 | V0, H0). The height of the cells represents the conditional probability P(V0 | H0). The area of each cell represents the joint probability P(H1, V0, H0). The Pearson's standardized residuals show, for example, that the correlation between having blond hair and having blue eyes is very significant. When three variables are selected, the dialog box is modified to allow choosing how the Pearson's standardized residual is computed. By default, it is computed in relation to a fully unconnected network. It is possible to choose another reference model in the following combo box:
Here, the comparison is made with the addition of an arc between V0 and H1. 4. Horizontal charts:
Above, the simple chart; below, the chart with the Pearson's standardized residual. This chart corresponds to a sequence of 2-dimensional graphics involving the vertical variable and each horizontal variable. The width of the cells corresponds to the marginal probability of the states of each horizontal variable, P(Hi). The height of the cells is the conditional probability of the vertical variable knowing the horizontal variable, P(V | Hi). 5. Inverted horizontal charts:
Above, the simple chart; below, the chart with the Pearson's standardized residual. Like the previous one, this chart corresponds to a sequence of 2-dimensional graphics involving the vertical variable and each horizontal variable. However, instead of representing P(V | Hi), this chart represents P(Hi | V). The height of the cells corresponds to the marginal probability of the states of the vertical variable, P(V). The width of the cells is the conditional probability of each horizontal variable knowing the vertical variable, P(Hi | V).
graph, under the title. There are two kinds of display, selected with the checkbox Inverse Influence: Display of each target state's probability knowing each state of each node:
The target state's prior probability is shown by a red line. Display of each state's probability for each node knowing each of the target's states:
For each node, each state's prior probability is shown by a red line. In this mode, the nodes' current values and deltas are displayed in each "monitor". These values are computed from each node's values if they exist; if not, they are computed from the data for continuous nodes, or from the intervals if no data is associated. If the node has real or integer states, they are used. Otherwise, integer values are automatically generated, starting from 0. A checkbox Compute from Database is displayed if a database is associated with the network; it performs the analysis using the probabilities directly computed from the database instead of Bayesian inference. A checkbox Fix References becomes available if Inverse Influence is selected. It allows keeping the value of the a priori probability (red line) as reference when new evidence is set in the monitors. When this option is chosen, "a priori" is replaced by "reference" in the caption. This option is also available when data are used. The computed delta values take the reference probabilities into account as well.
When the mouse is moved over a bar chart, a tooltip displays the state's name and the associated exact probability. It is possible to save the graph's image in a file or to print it. When the colors associated with the nodes are displayed thanks to the corresponding button, the borders of each box are painted with the matching color. The graph's contextual menu allows the user to: sort the displayed nodes in three modes (1. Default Order: the nodes' initial order is used, or the order of the displayed monitors if they are monitored; 2. Sort by Global Mutual Information; 3. Sort by Binary Mutual Information); Display Comment Instead of Name; Display States' Long Name; Copy the image of the graph.
This tool allows you to graphically view the impact of changes in the selected nodes' means on the target node's mean. This lets you see the relationship between each node and the target variable in the form of curves. The translucent nodes are not taken into account. For each node, its mean varies from the minimum to the maximum of its variation domain; for each variation, the corresponding probability distribution is determined with the MinXEnt method, in the same manner as for the observation of a node's mean in the monitors. Each node is observed with this probability distribution and the corresponding value of the target node is computed. For each node, the mean is computed from the values associated with the node: if the node has values associated with its states, the mean is computed from them; otherwise, if the node is continuous, its mean is computed from the intervals, and if the node is discrete with integer or real states, the mean is computed from those states. If there is no way to compute the mean, a default set of values from 0 to the number of states minus one is used. A dialog box lets you configure the display of the curves:
For the target variable, it is possible to display either the actual target mean or the delta between the mean and the prior target mean. Likewise, for the other variables, it is possible to display either the actual mean or the delta between the mean and the prior mean. It is also possible to display the variables' real means, all standardized between 0 and 100. The option Use Hard Evidences replaces the algorithm described above: the points of each curve are computed simply by observing each node state and determining the node's mean and the corresponding target mean. The option Order by Strength sorts the variables from the slope closest to 90 degrees to the slope closest to -90 degrees, i.e. from the variables with the most positive impact on the target's mean to those with the most negative impact. This slope is measured at the midpoint of the curve. By default, the nodes are sorted according to the order of the corresponding monitors, if they exist.
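The mean-computation fallback rules described above (associated state values first, then interval means for continuous nodes, then default indices 0..n-1) can be sketched as a small function; this is an illustration of the stated rules, not BayesiaLab's code:

```python
def node_mean(distribution, state_values=None, intervals=None):
    """Mean of a node following the fallback rules: associated state
    values first, then interval midpoints for a continuous node,
    else default values 0 .. n-1."""
    n = len(distribution)
    if state_values is not None:
        values = state_values
    elif intervals is not None:  # continuous node: use interval means
        values = [(lo + hi) / 2 for lo, hi in intervals]
    else:  # default values 0 to n-1
        values = list(range(n))
    return sum(p * v for p, v in zip(distribution, values))

# Continuous node with two intervals [0, 10) and [10, 20):
print(node_mean([0.2, 0.8], intervals=[(0, 10), (10, 20)]))  # 13.0
```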
This figure represents the result of the previous settings. The legends are sorted according to the slope of the curves. When the mouse hovers over any point of a curve, a tooltip displays the name of the variable represented by the curve, and the coordinates are displayed at the top of the window. The evidence context, if it exists, is also displayed at the top of the window. The following figure represents the display of the deltas. A zoom has been done on the central points; zooming works as in the scatter graph. It is also possible to change a curve's color by clicking on the colored square to the left of the node names in the legend.
The chart's contextual menu allows the user to: Display Comment Instead of Name; Copy the image of the chart, or the curves' points as text or HTML; Print the image of the chart.
The result of the simulation can be saved in a file. The result of the analysis is presented with a curve representing the cumulative distribution function of the probabilities of each state, and a bar chart representing the probability density function. Besides these graphical results, the mean and the standard deviation of the probabilities of the target states are also given. Naturally, the mean corresponds to the marginal probability displayed in the monitors.
The analysis is performed over all the nodes or over a subset of selected nodes. The translucent nodes are not taken into account. The evidence context is taken into account and displayed under the graphs. A contextual menu allows displaying the comments associated with the nodes instead of their names, as well as the long names of the states, and copying the chart as an image.
Three sampling policies are available: Random selection of one expert per network: for each sample, an expert is drawn at random and the conditional probability tables are generated only from that expert's assessments. Random selection of one expert per node: for each sample and for each node, an expert is drawn at random and the conditional probability tables are generated accordingly.
Random selection of an assessment per parent combination of each node: that is, an assessment is drawn at random for each row of the conditional probability table. For each sample and each row of the table of each node, an assessment is used to generate the probabilities of this row. In each case, if there is no expertise that matches the sample, the consensus is used. For each target node, the result of the analysis is presented with a curve representing the cumulative distribution function of the probabilities of each state, and a bar chart representing the probability density function. Besides these graphical results, the mean and the standard deviation of the probabilities of the target states are also given. The user selects the node and the state to inspect by clicking on their tabs.
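The "one expert per node" policy can be sketched as follows. The expert names, node names, and probability tables here are entirely hypothetical, and the sketch only illustrates the sampling step, not the full generation of conditional probability tables:

```python
import random

# Hypothetical assessments: expert -> node -> one probability table row.
assessments = {
    "expert_1": {"A": [0.7, 0.3], "B": [0.6, 0.4]},
    "expert_2": {"A": [0.5, 0.5], "B": [0.9, 0.1]},
}

def sample_one_expert_per_node(assessments, rng):
    """'Random selection of one expert per node': each node's table comes
    from an independently drawn expert."""
    experts = list(assessments)
    return {node: assessments[rng.choice(experts)][node]
            for node in assessments[experts[0]]}

tables = sample_one_expert_per_node(assessments, random.Random(0))
print(sorted(tables))  # ['A', 'B']
```

The "one expert per network" policy would instead draw a single expert and take all of that expert's tables; the per-row policy would draw an assessment independently for each row.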
In this example, there is a 32% chance that the state "Weak" of the node "TARGET" has a probability between 70% and 72.5%.
The analysis is performed over all the nodes or over a subset of selected nodes. The translucent nodes are not taken into account. The evidence context is taken into account and displayed under the graphs. A contextual menu allows displaying the comments associated with the nodes instead of their names, as well as the long names of the states, and copying the chart as an image.
The means and the standard deviations of the continuous nodes are no longer computed and displayed. The same holds for the mean value, the total value and, if applicable, the uncertainty and the likelihood, which are no longer computed and displayed in the information panel.
Global Orientation of Edges: this function converts an essential graph, or semi-directed graph, into a Bayesian network graph by directing all the edges while preserving the same probabilistic relations. This function is directly available via the key < I >. It is also possible to direct specific edges by using the contextual menu associated with the arcs. Activating this contextual menu on an edge or on a previously directed edge allows its orientation or inversion. These arc orientations trigger the propagation of the new information through the network so that the probability law remains identical, namely: orientation of the edges that can no longer be inverted, inversion of previously oriented arcs in order to preserve the probabilistic relations encoded by the initial Bayesian network, removal of the orientation of the arcs that had previously been oriented through propagation, and modification of the conditional probability tables.
1. Context of the analysis: description of the observed variables when the analysis is carried out. 2. Marginal probability distribution: probability distribution of the target variable knowing the observed variables (context). 3. Distribution on the Learning Set: if a database is associated with the network, the target's probability distribution is computed on the database. 4. Target Mean Purity: the target's mean purity is computed when a database is associated with the network. For each case in the database, we check whether the network correctly predicts the corresponding target state present in the database. The table below details, state by state, the results of the predictions. The column Neighborhood indicates for which state the network was wrong, with the corresponding percentage.
5. Performance Indices: two performance indices of the network are computed if a database is associated: a. Contingency Table Fit: represents the degree of fit between the network's joint probability distribution and the associated data. The better the network represents the database, the closer the value is to 100%. This measure, computed from the database's mean log-likelihood, is equal to 100% when the joint distribution is fully represented, as in the fully connected network, and to 0% when the joint distribution is represented by a fully disconnected network. The dimensions represented by the not observable nodes are excluded from the computation. b. Deviance: this measure is computed from the difference between the network's mean log-likelihood and the database's mean log-likelihood. The closer the value is to 0, the closer the network is to the database. 6. Node relative significance with respect to the information gain brought by the node to the knowledge of the target node: list of nodes, sorted in descending order according to the information they bring to the knowledge of the target variable. The nodes that do not bring any information do not appear in this list. This corresponds to the target node analysis. a. Mutual Information: amount of information brought by each variable to the target variable. b. Mutual Information (%): amount of information brought by each variable to the target variable, compared to an unconnected network. c. Relative Significance: ratio between the mutual information brought by each variable and the greatest mutual information. d. Mean Value: displays the nodes' mean values. Each node's mean is computed as follows: if the node has values associated with its states, the mean is computed from them; otherwise, if the node is continuous, its mean is computed from the intervals, and if the node is discrete with integer or real states, the mean is computed from those states. If there is no way to compute the mean, a default set of values from 0 to the number of states minus one is used. e. Chi² test or G-test: the independence tests (Chi² or G-test) are computed from the network between each variable and the target variable. The test used can be changed in the statistical tool settings. f. Degree of Freedom: indicates the degrees of freedom between each variable and the target variable in the network. g. p-value: represents the independence probability between each variable and the target variable in the network. h. Chi² or G-test on data: the independence tests are computed as previously, but from the associated database, if it exists. i. Degree of Freedom on data: the degrees of freedom are computed as previously, but from the associated database, if it exists. j. p-value on data: the independence probability of the test is computed as previously, but from the associated database, if it exists. 7. Node relative significance with respect to the information gain brought by the node to the knowledge of the target value: for each value of the target, except for the filtered state, list of nodes sorted in descending order according to their relative contribution to the knowledge of the target value (if the node has only two
Menus
states, this list is identical to the preceding one). The nodes that do not bring any information do not appear in this list. This corresponds to the target state analysis.
a. Binary Mutual Information: Amount of information brought by each variable to the knowledge of the state of the target variable.
b. Binary Mutual Information (%): Amount of information brought by each variable to the knowledge of the state of the target variable, compared to an unconnected network.
c. Binary Relative Significance: Ratio between the mutual information brought by each variable and the greatest mutual information.
d. Mean Value: Displays the nodes' mean value for each target state.
e. Modal Value: For each influencing node, description of the modal value (the most probable one) with respect to the context and the observed state of the target node. This modal value comes with its probability. This section makes it possible to establish the profile of this target value.
f. A priori Modal Value: For each influencing node, description of the modal value when the target node is unobserved (but knowing the context). This makes it possible to define the profile when the target variable is unobserved.
g. Variation: Measure indicating the variation between the a priori modal value and the modal value when the value of the target variable is known. The formula used is: -log2(P(X=modal value)) + log2(P(X=modal value|Target=observed value)). In Information Theory, this measure represents how many bits are gained in representing the probability of X when the target value is known. Values printed in blue represent positive variations (the posterior probability of the modal value is greater than the prior one) while values printed in red represent negative variations. Obviously, no variation is reported if the posterior modal value is different from the prior modal value; the modal value is then displayed in blue.
h.
Maximal positive/negative variation: Measures indicating the states that have been the most impacted by the observation of the corresponding target state. The first one indicates the state with the greatest increase whereas the second one indicates the state with the greatest decrease.
These measures correspond to the longest (right and left) grey arrows in the monitor (Inference mode) showing the probability variation of the states. These measures are particularly useful for variables that have more than two states. When a node has an associated comment, the node appears as an HTML link, and the comment can be displayed by hovering over the node.
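The mutual information and relative significance scores in the lists above can be sketched directly from a joint probability table. A minimal illustration in Python, where the variable names X1 and X2 and all joint probabilities are invented for the example; BayesiaLab's exact computation may differ in its details:

```python
import math

def mutual_information(joint):
    """Mutual information I(X;T) from a joint probability table.

    joint[x][t] is P(X=x, T=t); rows are states of X, columns states of T.
    """
    px = [sum(row) for row in joint]              # marginal P(X)
    pt = [sum(col) for col in zip(*joint)]        # marginal P(T)
    mi = 0.0
    for i, row in enumerate(joint):
        for j, p in enumerate(row):
            if p > 0:
                mi += p * math.log2(p / (px[i] * pt[j]))
    return mi

# Two hypothetical candidate variables and a binary target
joints = {
    "X1": [[0.40, 0.10],
           [0.10, 0.40]],
    "X2": [[0.25, 0.25],
           [0.25, 0.25]],   # independent of the target
}
mis = {name: mutual_information(j) for name, j in joints.items()}
best = max(mis.values())
# Relative Significance: each MI divided by the greatest MI
relative = {name: mi / best for name, mi in mis.items()}
```

A node such as X2, which is independent of the target, gets a mutual information of 0 and therefore would not appear in the report's list; the most informative node always has a relative significance of 1.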
If the target variable is a hidden variable, as is the case, for example, for the variable Cluster induced by data clustering, a button Mapping allows generating a mapping of the values of this variable. A button Quadrants displays the Quadrant chart of each node's relative significance with respect to the target, plotted against the node's mean:
The points represent the variables. If a node has a color, its point is displayed in that color. The node's name is displayed to the right of the point. When moving the mouse over a point, its coordinates are displayed in the top panel. The top panel shows the number of points displayed as well as the evidence context. It is possible to zoom in the chart as in a scatter plot. The chart's contextual menu allows the user to:
Display the nodes' comments instead of their names
Copy the chart as an image or as a table of points (text or html)
Print the chart
The chart is automatically resized when the size of the window changes. This chart is divided into four quadrants whose separations are:
Along the X-axis: the mean of the variables' means
Along the Y-axis: the mean of the nodes' relative significance with respect to the target
Quadrant 1: Top right: contains the important variables whose means are greater than the mean
Quadrant 2: Bottom right: contains the variables of little or no importance whose means are greater than the mean
Quadrant 3: Bottom left: contains the variables of little or no importance whose means are below the mean
Quadrant 4: Top left: contains the important variables whose means stay below the mean
It is possible to save the image of the chart to a file with the corresponding button.
1. Context of the analysis
Description of the observed variables when the analysis is carried out.
2. Probability of the target state
Probability of the target state given the observed variables (context).
3. Maximal Derivative
This table summarizes the nodes with a maximal derivative different from zero. It is ordered from the greatest to the smallest derivative.
4. Insensitive Nodes
This table contains the nodes with a derivative equal to zero.
5. Tables of sensitive nodes
For each sensitive node, a table represents each combination of its values and its parents' values. Each cell contains two parts:
The part on the left indicates the derivative computed for each combination of states. The background of the maximal derivative is yellow.
The part on the right shows the probabilities of the target state when the probability of the current node's state is set to 0% and to 100%. Obviously, those probabilities are symmetrical for binary nodes.
Profile Search Criterion: One of the following profile search criteria must be selected:
Probability: For each state of the node, its associated probability will be maximized or minimized as needed.
Mean: The mean of the target node will be maximized or minimized as needed. If the node has values associated with its states, the mean is computed from them; otherwise, if the node is continuous, its mean is computed from the intervals, and if the node is discrete with integer or real states, the mean is computed from those states. If it is impossible to compute the mean, a default set of values from 0
to the number of states minus one is used. If the network's equivalent number of examples exists, the 95% credible interval of the mean is computed and displayed in the report.
Probability Difference between Two States: The algorithm tries to maximize or minimize the difference between the probabilities of the two selected states.
Criterion Optimization: In the criterion optimization area, the user can choose to minimize or maximize the selected criterion. The user can also take into account the probability of the evidence. In this case, the computed criterion is weighted by the probability associated with the evidence that will be set. In the same way, the costs associated with the nodes can also be taken into account.
Search Method: Four search methods for criterion optimization are available:
Hard Evidences: Only hard evidence will be used for the optimization.
Value/Mean Variations in %: The observation can be done either by fixing means, i.e. the convergence algorithm will find, at each new observation, a probability distribution that obtains the desired mean, or by fixing the probabilities, i.e. at startup, the probability distributions corresponding to the desired means are computed once and for all and will not change anymore. The observed means will vary in percentage:
Of the initial means of the nodes: Each driver variable will see its initial mean vary according to the given negative and positive percentages. The computed mean is bounded by the limits of the variable's variation domain.
Of the variation domains of the nodes: Each driver variable will see its initial mean vary according to the given negative and positive percentages of the variable's variation domain. The computed mean is bounded by the limits of the variable's variation domain.
Of the margin progress of the nodes: Each driver variable will see its initial mean vary according to the given percentages of the difference between the initial mean and the minimum of the domain and of the difference between the initial mean and the maximum of the domain.
Each node's mean is computed from the values associated with its states. If there is no associated value and the node is continuous, its mean is computed from the intervals, and if the node is discrete with integer or real states, the mean is computed from those states. If it is impossible to compute the mean, a default set of values from 0 to the number of states minus one is used. When an observed mean reaches the limits of a node's variation domain, the corresponding hard evidence is used instead. The percentages of the negative and positive variations of each node can be modified with the mean variation editor. In every case, the variables' filtered states will never be used.
Stop Criterion: The search is stopped when the joint probability of the network reaches zero. This stop criterion can be modified by setting a maximum number of evidences and by modifying the minimum joint probability allowed. It is also possible to use the automatic stop criterion, which finds a good balance between search depth and usefulness of the levers.
Options: An option allows computing only prior variations, i.e. without cumulative effect: the previously set evidence is removed before finding the next best evidence. It is also possible to associate with the network the evidence scenario file corresponding to the found results. If an evidence scenario file already exists, it can be replaced by the new one or the evidences can be appended to it. Here is the result corresponding to the parameters above:
In the report's window, in addition to the usual saving and printing options, a button Save Scenario allows you to save to a file the evidence scenario corresponding to the chosen optimization.
Chi²-test or G-test: The Chi² or G independence tests are computed from the network between each variable and the target variable. The independence test used can be changed in the statistical tool settings.
Degree of Freedom: Indicates the degrees of freedom between each variable and the target variable in the network.
p-value: Represents the independence probability between each variable and the target variable in the network.
If a database is associated with the network, the independence test, the degrees of freedom and the p-value are also computed on the data.
In addition to the classical saving and printing options, a button Quadrants allows displaying a Quadrant chart of the nodes' means relative to their total effect or standardized total effect on the target. The choice is made in the following dialog box:
The points represent the variables. If a node has a color, its point is displayed in that color. The name of the node is displayed to the right of the point. When moving the mouse over a point, its coordinates are displayed in the top panel. The top panel shows the number of points displayed as well as the evidence context. The chart's contextual menu allows the user to:
Display the nodes' comments instead of their names
Copy the chart as an image or as a table of points (text or html)
Print the chart
The chart is automatically resized when the size of the window changes. This chart is divided into four quadrants whose separations are:
Along the X-axis: the mean of the variables' means
Along the Y-axis: the mean of the criterion's importance (here the standardized total effect on the target)
Each quadrant has a specific meaning according to the knowledge represented by the Bayesian network (marketing, satisfaction, etc.). In a general way, the quadrants are:
Quadrant 1: Top right: contains the important variables whose means are greater than the mean
Quadrant 2: Bottom right: contains the variables of little or no importance whose means are greater than the mean
Quadrant 3: Bottom left: contains the variables of little or no importance whose means are below the mean
Quadrant 4: Top left: contains the important variables whose means stay below the mean
It is possible to save the image of the chart to a file with the corresponding button.
1. Context of the analysis
Description of the observed variables when the analysis is carried out.
2. Global contradiction measure
The measure used to determine whether the observations are contradictory is defined by: log2( p(e1) p(e2) ... p(ej) / p(e1, e2, ..., ej) ). When this measure is negative (blue background), i.e. when the joint probability of the evidences is greater than the product of their marginal probabilities, the evidences globally support the same conclusion. Otherwise, when this measure is positive (red background), the evidences are globally contradictory. This information is only available for hard evidence.
3. Evidence analysis with respect to the reference evidence
The reference evidence is used to indicate the reference conclusion.
a. Set of evidences that confirm the reference evidence
List of the evidences that confirm the reference evidence (blue).
b. Set of evidences that contradict the reference evidence
List of the evidences that contradict the reference evidence (red).
c. Set of neutral evidences
List of the evidences that neither confirm nor contradict the reference evidence (independent evidences).
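The global contradiction measure from section 2 can be sketched directly from the marginal and joint probabilities of the evidences. A minimal illustration, where all probability figures are invented for the example:

```python
import math

def global_contradiction(marginals, joint):
    """log2( prod_i p(e_i) / p(e_1, ..., e_j) ).

    Negative => the evidences globally support the same conclusion (their
    joint probability exceeds the product of marginals); positive => the
    evidences are globally contradictory. Defined for hard evidence only.
    """
    product = 1.0
    for p in marginals:
        product *= p
    return math.log2(product / joint)

# Two hypothetical pieces of evidence with marginals 0.5 and 0.4
print(global_contradiction([0.5, 0.4], joint=0.35))  # negative: consistent
print(global_contradiction([0.5, 0.4], joint=0.10))  # positive: contradictory
```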
11. Pearson's Correlation: The Pearson correlation is displayed for each arc.
The second table represents the node force analysis. For each node it displays:
1. Outgoing Force: the sum of the forces of the node's outgoing arcs.
2. Entering Force: the sum of the forces of the node's entering arcs.
3. Global Force: the sum of the forces of both the entering and outgoing arcs of the node.
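The three force sums can be illustrated with a toy arc list. The node names and force values below are hypothetical; BayesiaLab computes the individual arc forces itself:

```python
from collections import defaultdict

# Hypothetical arcs as (parent, child, force); arc forces assumed precomputed
arcs = [("A", "B", 0.30), ("A", "C", 0.20), ("B", "C", 0.10)]

outgoing = defaultdict(float)   # sum of forces of each node's outgoing arcs
entering = defaultdict(float)   # sum of forces of each node's entering arcs
for parent, child, force in arcs:
    outgoing[parent] += force
    entering[child] += force

nodes = set(outgoing) | set(entering)
# Global force: entering and outgoing arcs taken together
global_force = {n: outgoing[n] + entering[n] for n in nodes}
```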
In the case of equal probabilities, the following dialog box allows specifying the choice policy:
When the evaluation on multiple thresholds is selected, it is possible to display the confusion matrices for each threshold by selecting the desired threshold in the choice box called Acceptance threshold. The result obtained for multiple states:
There is one thumbnail for each evaluated state; it allows displaying the curves corresponding to that state. As the figure above shows, the evaluation is carried out by four complementary tools: the total precision of the model (based on the number of correct predictions of the target variable), a Confusion Matrix showing the number of occurrences, the reliability and the precision of each prediction, and finally the gain curve, the lift curve and the ROC curve centered on the target value. At the top of the report, the Total Precision of the network is specified. Below that, two indices are displayed in order to measure the quality of the numerical prediction:
R: the Pearson coefficient
R²: the square of the Pearson coefficient
The button Global report allows generating an HTML report that contains, for each acceptance threshold, the total precision and the corresponding confusion matrices. It indicates, for each state, the computed Gini index, Relative Gini index, Mean Lift and Relative Lift index. It also contains the Relative Gini Global Mean and the Relative Lift Global Mean. Below that, the total precision, the Pearson coefficient and its square are added.
Confusion Matrix
The direct evaluation of a network modeled for the prediction of a target variable can be performed by computing its total precision, i.e. the ratio between the number of correct predictions and the total number of cases. This measure is useful but can be too general. The Confusion Matrix proposed by BayesiaLab provides more precise feedback about the model's performance. The predictions of the model appear on the rows, the columns representing the real values that are in the database (the model has
predicted 723 C in the following matrix, whereas there are 730 C in the file). Three visualization modes are available:
1. Occurrence Matrix: number of cases for each Prediction/Real value pair
2. Reliability Matrix: ratio between each prediction and the total number of the corresponding prediction (the sum of the row)
3. Precision Matrix: ratio between each prediction and the total number of the corresponding real value (the sum of the column)
A right click on the matrix allows copying it into the clipboard. It is then possible to paste it directly as an image, or to paste it as a data array.
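The three visualization modes and the total precision can be reproduced from an occurrence matrix. A sketch with invented counts, following the manual's convention of predictions on the rows and real values in the columns:

```python
# Hypothetical occurrence matrix: rows = predicted values, columns = real values
states = ["A", "B", "C"]
occurrence = [
    [50,  5,  2],   # predicted A
    [ 4, 60,  5],   # predicted B
    [ 6,  5, 63],   # predicted C
]

# Total precision: correct predictions (the diagonal) over all cases
total = sum(sum(row) for row in occurrence)
correct = sum(occurrence[i][i] for i in range(len(states)))
total_precision = correct / total

# Reliability: each cell divided by its row total (per prediction)
reliability = [[c / sum(row) for c in row] for row in occurrence]

# Precision: each cell divided by its column total (per real value)
col_totals = [sum(col) for col in zip(*occurrence)]
precision = [[occurrence[i][j] / col_totals[j] for j in range(len(states))]
             for i in range(len(states))]
```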
Gain Curve
When one is faced with the problem of predicting the value of a particular variable (the value of the target node), the evaluation of the model can be performed using the gain curve. The Gain curve is generated by sorting the individuals, in decreasing order, according to the target value probability returned by the network, e.g. the appetence/fraud/churn probability. The X-axis represents the rate of individuals that are taken into account. The Y-axis represents the rate of individuals with the target value that have been identified as such. In the Gain curve below, almost 5% of the individuals have the target value (yellow). The blue curve represents the gain curve of a pure random policy, i.e. choosing the individuals without any order. The red curve represents the gain curve corresponding to the optimal policy, i.e. where the individuals are sorted according to the perfect model. Choosing the first 5% of individuals then makes it possible to get 100% of the individuals with the target value with the optimal policy, against only 5% with the random policy.
A left click on the graphical zone of the curve gives the exact coordinates of the corresponding point and the probability of the target value for the associated case. For example, the screenshot below indicates that the selection of the first 5.87% of cases implies a detection rate of 79.71%. It also indicates that the last case of that selection has a probability of having the target value equal to 52.13%. The Gini Index and the Relative Gini Index are computed from the curve and displayed at the top of the graphic. The Gini Index is computed as the surface under the red curve and above the blue curve, divided by the surface above the blue curve. But, as shown above, the surface of the optimal policy is less than the surface above the blue line, so the Relative Gini Index is computed as the surface under the red curve and above the blue curve, divided by the surface under the curve of the optimal policy and above the blue curve. It is a more representative coefficient. This interactive curve is not only an evaluation tool; it is also a decision support tool that allows defining the best probability threshold from which an individual will be considered as belonging to the target.
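The gain-curve construction and the two Gini indices can be sketched as follows. All probabilities and labels are invented, the areas are approximated with the trapezoid rule, and the exact conventions BayesiaLab uses may differ:

```python
def gain_curve(probs, labels):
    """Cumulative detection rate after sorting cases by decreasing
    predicted probability of the target value (1 = positive label)."""
    ranked = sorted(zip(probs, labels), key=lambda t: -t[0])
    positives = sum(labels)
    xs, ys, found = [0.0], [0.0], 0
    for k, (_, y) in enumerate(ranked, start=1):
        found += y
        xs.append(k / len(ranked))   # rate of individuals considered
        ys.append(found / positives) # rate of positives identified
    return xs, ys

def area(xs, ys):
    """Area under a piecewise-linear curve (trapezoid rule)."""
    return sum((xs[i + 1] - xs[i]) * (ys[i] + ys[i + 1]) / 2
               for i in range(len(xs) - 1))

probs  = [0.9, 0.8, 0.7, 0.6, 0.4, 0.3, 0.2, 0.1]   # hypothetical scores
labels = [1,   1,   0,   1,   0,   0,   0,   0]
xs, ys = gain_curve(probs, labels)

rate = sum(labels) / len(labels)     # share of positive cases
a_model   = area(xs, ys)             # area under the model's gain curve
a_random  = 0.5                      # area under the blue diagonal
a_optimal = 1.0 - rate / 2           # area under the perfect model's curve
gini = (a_model - a_random) / (1.0 - a_random)
relative_gini = (a_model - a_random) / (a_optimal - a_random)
```

As the manual notes, the relative index normalizes by the optimal policy's area rather than by the whole surface above the diagonal, so a perfect model reaches 1.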
A right click on the graphical zone allows choosing between printing the curve and copying it to the clipboard. In the latter case, it is then possible to paste it directly as an image, or to paste the corresponding data points. The gain curve has a tool for automatically analyzing the expected economic gains with the evaluated model. These computations require the definition of the unit cost corresponding to the treatment of each individual (x-axis), of the unit gain corresponding to each positive answer (y-axis), and finally of the target population's size. The economic gain is then defined as the difference between the profit corresponding to the treatment of x% of the population and the profit corresponding to the treatment of the whole population. As the following screen capture shows, the result is displayed as a curve (blue curve) and as a color gradient (the closer to yellow, the closer to optimality).
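The economic-gain definition above can be sketched as follows. The unit cost, unit gain, population size and gain-curve points are all invented; `positive_rate` stands for the share of positive cases in the population:

```python
def economic_gains(xs, ys, positive_rate, unit_cost, unit_gain, population):
    """Economic gain at each point (x, y) of the gain curve.

    profit(x) = unit_gain * positives found - unit_cost * individuals treated;
    the reported gain is profit(x%) minus profit(100%), as in the manual.
    """
    def profit(x, y):
        treated = x * population
        found = y * positive_rate * population
        return unit_gain * found - unit_cost * treated
    whole = profit(1.0, 1.0)   # profit of treating the whole population
    return [profit(x, y) - whole for x, y in zip(xs, ys)]

# Hypothetical gain-curve points: treating 25% of cases finds 60% of positives
gains = economic_gains(xs=[0.0, 0.25, 0.5, 1.0], ys=[0.0, 0.6, 0.9, 1.0],
                       positive_rate=0.05, unit_cost=1.0, unit_gain=30.0,
                       population=10_000)
```

By construction the gain is 0 at 100% of the population, and it peaks where the model's targeting is most profitable relative to treating everyone.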
Lift Curve
When one is faced with the problem of predicting the value of a particular variable (the value of the target node), the evaluation of the model is often performed using the lift curve. This curve is derived from the Gain curve and highlights the improvement of the policy returned by the current Bayesian network over the random policy. The X-axis represents the rate of individuals that are taken into account. The Y-axis represents the lift factor, defined as the ratio between the rate of the targeted population obtained with the current policy and the rate obtained with the random policy. The curve below is the optimal Lift curve corresponding to the first Gain curve. The best lift value, 20.88, is defined as the ratio between 100% and 4.8% (optimal policy/random policy). The lift then decreases when more than 4.8% of the individuals are considered, and equals 1 when all the individuals are taken into account.
The Mean Lift and the Relative Lift Index are computed according to the curve and displayed at the top of the graphic. The Mean Lift is the mean of all the points in the curve. The relative Lift Index is computed as the surface under the Lift curve divided by the surface under the lift curve of the optimal policy.
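The lift factor is simply the gain-curve ordinate divided by its abscissa, since the random policy's detection rate equals the rate of individuals considered. A sketch using the optimal-policy figures quoted above (a 4.8% positive rate, with the curve points otherwise invented):

```python
def lift_curve(xs, ys):
    """Lift at each gain-curve point: model detection rate / random rate (= x)."""
    return [(x, y / x) for x, y in zip(xs, ys) if x > 0]

# Optimal policy with 4.8% positives: 100% detected from 4.8% of cases onwards
points = lift_curve([0.048, 0.25, 0.5, 1.0], [1.0, 1.0, 1.0, 1.0])
best_lift = points[0][1]                           # 1.0 / 0.048, about 20.8
mean_lift = sum(l for _, l in points) / len(points)
```

The lift falls back to 1.0 at 100% of the population, as the manual describes.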
The following figure depicts the ROC curve corresponding to an optimal model. In fact, the optimal curve indicates that all the cases with the target value have a probability greater than those without this target value.
A left click on the graphical zone of the curve allows obtaining the exact coordinates of the corresponding point and the value of the corresponding threshold. For example, the screen shot below indicates that a threshold of 53.13% implies a detection rate of 79.89% with 1.59% of false positive examples.
The ROC index is computed from the curve and displayed at the top of the graphic. It represents the surface under the ROC curve divided by the total surface. A right click on the graphical zone allows choosing between printing the curve and copying it to the clipboard. In the latter case, it is possible to paste the curve directly as an image, or to paste the data points corresponding to the curve.
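The ROC curve and its index can be sketched as follows. The probabilities, labels and thresholds are invented; the index is the area under the piecewise-linear curve, the total surface being 1:

```python
def roc_points(probs, labels, thresholds):
    """(false positive rate, true positive rate) for each decision threshold."""
    pos = sum(labels)
    neg = len(labels) - pos
    pts = []
    for t in thresholds:
        tp = sum(1 for p, y in zip(probs, labels) if p >= t and y == 1)
        fp = sum(1 for p, y in zip(probs, labels) if p >= t and y == 0)
        pts.append((fp / neg, tp / pos))
    return sorted(pts)

probs  = [0.9, 0.8, 0.7, 0.6, 0.4, 0.3, 0.2, 0.1]   # hypothetical scores
labels = [1,   1,   1,   0,   1,   0,   0,   0]
pts = ([(0.0, 0.0)]
       + roc_points(probs, labels, [0.85, 0.65, 0.35, 0.05])
       + [(1.0, 1.0)])
# ROC index: area under the curve (trapezoid rule) over a total surface of 1
auc = sum((x2 - x1) * (y1 + y2) / 2
          for (x1, y1), (x2, y2) in zip(pts, pts[1:]))
```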
Comparison between density functions for the learning and the test databases
The results for learning are always represented in red and the results for test in green. The transparency of the test results allows the learning results to show through underneath. Both functions use the same scale.
The results for learning are always represented in red and the results for test in green. The learning and test database graphics have the same vertical scale but do not share a common horizontal scale: each curve is plotted horizontally against the row percentage computed on its own database. When an example of a database is impossible, i.e. it represents an impossible combination of evidences, the example is not taken into account in the final result and is displayed in a table in an HTML report. This report is displayed by pressing the Skipped Rows button.
The button Extract Database allows saving only the examples of the database whose log-likelihood is contained in the interval specified in the following dialog box. The other parameters are the same as for database saving:
The filtered states are not taken into account during the search. A dialog box indicates the end of the optimization.
Target:
You can select the target node. By default, the current target node is used. After that, you can choose to optimize the selected target state or the mean.
Criterion Optimization:
You can choose to maximize or minimize the criterion. A checkbox allows taking the joint probability of the evidence into account during the search. In this case, we optimize the a posteriori probability (the probability of the target state given the evidence, weighted by the occurrence probability of that evidence). In the
same way, the cost of evidence can be taken into account. The observation context is also taken into account during the analysis.
Search Method:
Several search methods are available:
By hard evidence on nodes' states.
By observation of the value/mean of the nodes. The observation can be done either by fixing means, i.e. the convergence algorithm will find, at each new observation, a probability distribution that obtains the desired mean, or by fixing the probabilities, i.e. at startup, the probability distributions corresponding to the desired means are computed once and for all and will not change anymore. The observed means will vary in percentage:
Of the initial means of the nodes
Of the variation domains of the nodes
Of the margin progress of the nodes, i.e. the difference between the initial mean and the minimum of the domain and the difference between the initial mean and the maximum of the domain.
The percentages of the negative and positive variations of each node can be modified with the mean variation editor.
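The three percentage-variation modes can be sketched for a single driver node. The domain, initial mean and percentages below are invented, and BayesiaLab additionally converts each target mean into a probability distribution over the node's states:

```python
def target_means(initial, lo, hi, neg_pct, pos_pct, mode):
    """Candidate observed means for one driver node under the three modes.

    mode "initial": vary by +/- pct of the initial mean;
    mode "domain":  vary by +/- pct of the variation domain (hi - lo);
    mode "margin":  vary by pct of the margins (initial - lo) and (hi - initial).
    The result is always clamped to the variation domain [lo, hi].
    """
    if mode == "initial":
        deltas = (-neg_pct * initial, pos_pct * initial)
    elif mode == "domain":
        deltas = (-neg_pct * (hi - lo), pos_pct * (hi - lo))
    else:  # "margin"
        deltas = (-neg_pct * (initial - lo), pos_pct * (hi - initial))
    return [min(hi, max(lo, initial + d)) for d in deltas]

# Initial mean 40 on the domain [0, 100], with 20% negative/positive variations
print(target_means(40, 0, 100, 0.2, 0.2, "initial"))  # [32.0, 48.0]
print(target_means(40, 0, 100, 0.2, 0.2, "domain"))   # [20.0, 60.0]
print(target_means(40, 0, 100, 0.2, 0.2, "margin"))   # [32.0, 52.0]
```

The clamping mirrors the manual's rule that when an observed mean reaches a limit of the variation domain, the corresponding hard evidence is used instead.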
Output:
The results will be associated with the network as an evidence scenario file. Each branch of the tree corresponds to a line of evidence in the evidence scenario file. If an evidence file is already associated, you can choose to replace it or to append the examples at the end.
Results:
The results are displayed in a binary tree. Each node of the tree represents the node chosen by the algorithm. The evidence context will be displayed at the top if necessary. A node contains four parts:
The name or comment of the target node. The background used is the node's color.
The probability of the target state of the target node before setting the evidence, if we optimize the target state, or the mean of the target node before setting the evidence, if we optimize the mean.
The joint probability before setting the evidence.
If the node is not a leaf, the name or comment of the chosen node that will be observed. The background used is the node's color. Otherwise, it shows the value of the optimization score with a white background.
The last node of the "optimal branch" has a thick red border in order to find it quickly. Two kinds of branches exist:
The left branches, on which the chosen nodes are observed. If hard evidence is used, the name of the observed state is displayed on the branch. If the mean is observed, the value of the mean of the chosen node is displayed. If this mean corresponds to the negative variation, it is displayed in red; otherwise, if it is the positive variation, it is displayed in blue.
The right branches indicate that the chosen node is not observed but skipped. It won't be used in the corresponding subtree.
Toolbar:
The toolbar at the top of the window allows the user:
To zoom in the tree
To zoom out the tree
To go back to the initial zoom
To fit the tree to the window
To display the tree horizontally
Actions:
The buttons at the bottom of the window allow closing the window, saving the image, printing the tree and displaying the report. The contextual menu allows:
Displaying the comment instead of the name of the nodes
Displaying the long names of the states
Copying the tree as an image or as html. The html copy reproduces the vertical display as well as the horizontal one. The displayed colors are also used.
It is possible to perform a manual zoom by clicking and dragging the mouse to select the area that will be zoomed in. The shortcut + mousewheel performs zooming (and unzooming) on the tree, centered on the cursor. Moving the mouse over a node displays a tooltip containing the chain of evidence set up to this node. A double-click on a node closes its subtrees in the window; double-click on it again to display them.
Report:
Depending on the options, the button Report displays an HTML report containing the list of the best chosen nodes in the whole tree, with their best observation and the corresponding optimization score.
Target:
You can select the target node. By default, the current target node is used. After that, you can choose to center the search on a specific state or not.
Search Method:
Several search methods are available:
By hard evidence on nodes' states.
By observation of the value/mean of the nodes. The observation can be done either by fixing means, i.e. the convergence algorithm will find, at each new observation, a probability distribution that obtains the desired mean, or by fixing the probabilities, i.e. at startup, the probability distributions corresponding to the desired means are computed once and for all and will not change anymore. The observed means will vary in percentage:
Of the initial means of the nodes
Of the variation domains of the nodes
Of the margin progress of the nodes, i.e. the difference between the initial mean and the minimum of the domain and the difference between the initial mean and the maximum of the domain.
The percentages of the negative and positive variations of each node can be modified with the mean variation editor. A checkbox allows taking the cost of the evidence into account during the search, by dividing the mutual-information-based score by the cost. The observation context is also taken into account during the analysis.
Output:
The results will be associated with the network as an evidence scenario file. Each branch of the tree corresponds to a line of evidence in the evidence scenario file. If an evidence file is already associated, you can choose to replace it or to append the examples at the end.
Results:
The results are displayed in a binary tree if mean variations are used, or in an n-ary tree if hard evidence is used. Each node of the tree represents the node chosen by the algorithm. The evidence context is displayed at the top if necessary. Nodes are very similar to monitors. A node contains four parts:
The name or comment of the target node. The background used is the node's color.
The value of the node with its delta, and the joint probability with its delta, before setting the evidence.
The probability distribution of the target node before setting the evidence. Each state is displayed with its probability, as in the monitor. If the search is centered on a state, this state is painted in light blue.
If the node is not a leaf, the name or comment of the chosen node that will be observed. The background used is the node's color. The score based on the mutual information (divided by the cost if the cost option is checked) is displayed as a percentage.
If hard evidence is used, the name of the observed state is displayed on the branch. If the mean is observed, the value of the mean of the chosen node is displayed. If this mean corresponds to the negative variation, it is displayed in red; otherwise, if it is the positive variation, it is displayed in blue.
Toolbar:
The toolbar at the top of the window allows the user:
To zoom in the tree
To zoom out the tree
To go back to the initial zoom
To fit the tree to the window
To display the tree horizontally
Actions:
The buttons at the bottom of the window allow closing the window, saving the image and printing the tree. The contextual menu allows:
Displaying the comment instead of the name of the nodes
Displaying the long names of the states
Copying the tree as an image or as html. The html copy reproduces the vertical display as well as the horizontal one. The displayed colors are also used.
It is possible to perform a manual zoom by clicking and dragging the mouse to select the area that will be zoomed in. The shortcut + mousewheel performs zooming (and unzooming) on the tree, centered on the cursor. Moving the mouse over a node displays a tooltip containing the chain of evidence set up to this node. A double-click on a node closes its subtrees in the window; double-click on it again to display them.
8. Monitor
This menu gives access to:
Monitors Sorted wrt Target Variable Correlations: displays the monitors sorted with respect to the target variable correlations. If a subset of nodes is selected in the graph, this function is restricted to the monitors of those nodes. If some nodes are translucent, their monitors are not displayed.
Monitors Sorted wrt Target Value Correlations: displays the monitors sorted with respect to the target value correlations. If a subset of nodes is selected in the graph, this function is restricted to the monitors of those nodes. If some nodes are translucent, their monitors are not displayed.
Sort by Less Probable Evidence: displays the monitors corresponding to the observed nodes, ordered from the least probable evidence (the one that degrades the joint probability the most) to the most probable evidence. Considering that the current network mainly represents the "normal cases", this tool is very useful for diagnosing the variables that have caused a very low joint probability (atypical cases with respect to the given network).
Sort Monitors by Name: displays the monitors in increasing lexicographic order of their names or comments (if comments are displayed instead of names).
Zoom: zooming the monitors
Zoom In: increases the size of the monitors.
Zoom Out: decreases the size of the monitors.
Default Zoom: goes back to the default zoom level.
Remove All Observations: removes all the observations.
Delete All: deletes all the monitors.
Reset the Probabilities Variations: removes all probability shifts.
Fix the Reference Probabilities: sets the reference for the probability shifts.
Highlight the Probability Maximum Variations: displays the maximum probability shifts.
Store Evidence Scenario: adds the current set of evidences to the current evidence scenario file.
Select All: selects all the monitors.
Display Comment Instead of Name: displays the comment associated with the node instead of its name in the monitor. If the comment is too long, it is truncated just as the name would be. This option is saved with the network.
Display the States' Long Names: displays the long name possibly associated with each node state instead of the state's name. When a state has no long name, its own name is used instead. If the long name is too long, it is truncated just as the state's name would be. This option is saved with the network.
Display Probabilities with Scientific Notation: displays the probabilities in scientific notation in the monitors and in the monitors' tooltips. This option is saved with the network.
Fit Monitor Widths to Content: enlarges the monitors in order to display their whole content without cropping the names. This action applies only to the currently displayed monitors; if a new, wider monitor is displayed afterwards, you must use this command again.
Restore Default Monitor Widths: restores the width of the monitors to the default value.
Display the Uncertainty and Likelihood Variations: displays, in the information panel, the uncertainty and likelihood variations due to the evidence that has been set. This menu is visible in validation mode only.
9. Tools
Compare (any mode):
Structure: starts the structure comparison tool for two networks.
Joint: starts the joint probabilities comparison tool.
Cross Validation (validation mode):
Arc Confidence: starts the arc confidence analysis tool on the current network. This network must have an associated database.
Data Perturbation: starts the data perturbation tool on the current network. This network must have an associated database.
Targeted: starts the targeted cross-validation tool on the current network. This network must have an associated database and a target node.
Structural Coefficient Analysis: starts the structural coefficient analysis tool on the current network. This network must have an associated database.
Multi Quadrant Analysis (validation mode): starts the multi-quadrant analysis tool on the current network. This network must have an associated database.
Assessment (any mode): available only if experts are registered on this network and assessments have been made
Export a Network per Expert: the user selects a destination directory. For each expert, a network with the same structure, but with the probabilities given by this expert only, is saved in this directory. When this expert has given no probability, the consensus of the other experts is used.
Export Probability Assessments: exports into a database all the assessments made by the experts. The database contains a column for each node with assessments, a column with the corresponding probability given by the expert, a column for the confidence of this assessment, and also the name of the expert and the time taken to make this assessment (see Assessments).
Export Expert Assessments: exports into a database, for each cell of each table with assessments, the probability given by each expert and the associated confidence. There are thus two columns per expert: one for the probability and one for the confidence.
A column Weight gives each row a weight corresponding to 1 / number of states (see Assessments).
Parameters
This dialog box allows choosing the networks to be compared:
The left-hand network is the "reference" network for the comparison. Clicking the button below the picture allows loading another reference network. The right-hand network is the "comparison" network; in the same manner, clicking the button below the picture allows loading another network. An option allows choosing whether the Bayesian networks themselves or their equivalence classes are compared.
Comparison report
The following report (HTML format) is created by clicking the Compare button:
The first line indicates the names of the compared networks. The rest of the report contains up to ten lists:
Common arcs: arcs that are identical in both networks
Inverted arcs: arcs whose orientation changes in the comparison network
Added arcs: arcs that exist only in the comparison network
Deleted arcs: arcs that exist only in the reference network
Common edges: edges that are identical in both networks
Added edges: edges that exist only in the comparison network
Deleted edges: edges that exist only in the reference network
Common V-Structures: V-Structures that are identical in both networks
Added V-Structures: V-Structures that exist only in the comparison network
Deleted V-Structures: V-Structures that exist only in the reference network
Graphs
The Graphs button from the report allows displaying the structure comparison graphic tool. With this tool, the data from the report can be easily viewed.
Toolbar
The window toolbar contains eleven buttons: 4 for navigation, 6 for modifying the display, and one for saving networks.
1. Navigation bar
displays the synthesis structure
displays the reference structure
displays the previous structure (relative to the one currently displayed)
displays the next structure (relative to the one currently displayed)
Navigating through the structures follows this order:
a. Synthesis structure
b. Reference structure
c. 1st comparison structure
d. 2nd comparison structure
e. ...
2. Display bar
Zoom in
Zoom out
Display the structure with the default zoom
Adjust the zoom to the window size
Rotate the structure to the left
Rotate the structure to the right
3. Save bar
Save the currently displayed network. This option is only available for comparison structures; it is disabled for the Synthesis and Reference structures.
This button positions the graph in the window.
Popup menu
The popup menu allows displaying the node comments and copying the structure.
Synthesis structure
This structure summarizes all the differences between the reference Bayesian network and the comparison structures.
black: arc or edge that exists in the reference and in at least one comparison network
blue: arc or edge that does not exist in the reference network but exists in at least one comparison network
red: arc or edge that exists in the reference network but does not appear in any comparison network
An arc is displayed with an arrow, an edge is a simple line, and a V-Structure is displayed as a portion of a circle. The thickness of arcs, edges and V-Structures depends on the frequency of the object concerned: the thicker, the higher its frequency. A hint appears when the mouse is moved over an arc, an edge or a V-Structure; it contains the name of the object and whether it has been added, removed, or is unchanged with regard to the reference structure. It also contains the frequencies:
The synthesis structure can be printed or saved as a separate image file by clicking the buttons.
Reference structure
The reference structure is the initial Bayesian network (or its equivalence class) that is used as the basis for the comparison. The V-Structures of the reference are also displayed.
Comparison structures
Each network obtained from the cross validation is displayed and numbered from zero; the V-Structures of the network or of its equivalence class are also displayed. Since a comparison structure can represent several identical networks, the total number of networks represented in this manner is indicated in the picture caption. The frequency equals the number of networks represented by the comparison structure divided by the total number of networks produced by the cross validation.
For each panel, the mean, standard deviation, minimum, maximum, and number of rows used in the computation are indicated below the chart. First, each file of joint probabilities is analyzed separately:
Second file:
Comparison:
The Kolmogorov-Smirnov test is computed to measure the distribution adequacy between the two samples. This test is present in the comparison panels only. It compares the two distributions of log-likelihoods. The values Z, D, and the corresponding p-value are displayed.
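The two-sample statistic D mentioned above is simply the largest gap between the two empirical distribution functions. The following is a minimal pure-Python sketch of that idea, for illustration only (it is not BayesiaLab's implementation, which also reports Z and the p-value):

```python
import bisect

def ks_statistic(sample_a, sample_b):
    """Two-sample Kolmogorov-Smirnov D: the maximum absolute
    difference between the two empirical CDFs."""
    a, b = sorted(sample_a), sorted(sample_b)

    def ecdf(sorted_sample, x):
        # fraction of observations <= x
        return bisect.bisect_right(sorted_sample, x) / len(sorted_sample)

    return max(abs(ecdf(a, x) - ecdf(b, x)) for x in sorted(set(a) | set(b)))

# Identical samples give D = 0; fully separated samples give D = 1.
d_same = ks_statistic([1, 2, 3], [1, 2, 3])
d_far = ks_statistic([1, 2, 3], [10, 20, 30])
```

A small D thus means the two log-likelihood distributions are close; a D near 1 means they barely overlap.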
Parameters
In the following window, the learning algorithm shall be selected, as well as the number of data samples to be generated.
The window displays the sample size, which depends on the size of the database and on the number of samples. The samples are drawn randomly in order to avoid the errors that can occur when the database is sorted. All the networks and databases learnt during the Jackknife process can be saved in the output directory. Depending on the chosen learning algorithm, a dialog box displays specific settings:
Analysis report
Once the networks have been learnt on each data sample, the following report is displayed:
It is composed of four parts:
1. The learning context: recalls the learning method and the number of data samples. It also indicates the structural complexity coefficient used (the same as the one used in the initial network).
2. Arc confidence analysis: lists of arcs grouped into three colored types
Black: the arcs that exist both in the reference structure and in the networks learnt from the samples. The arc frequency represents how often the arc appeared with the same orientation in the sample networks. The inverted arc frequency represents how often the arc appeared with the reverse orientation in the sample networks. The edge frequency represents how often the arc appeared without any orientation in the sample networks (the equivalence class of the learnt networks is used). The total frequency is the sum of all the previous ones; it indicates the overall strength of the relationship between the variables.
Blue: the arcs that exist in at least one sample network but do not exist in the reference network. In this case, the frequencies are displayed with a negative value. The reference orientation of the arc is arbitrarily given by the first arc found in the first sample network. The arc frequency represents how often this arc appeared with the same orientation as the first one. The inverted arc frequency represents how often this arc appeared with the reverse orientation compared with the first one. The edge frequency represents how often the arc appeared without any orientation (equivalence class). The total frequency is the sum of all the previous ones; it indicates the overall strength of a relationship that does not exist in the reference network.
Red: arcs that exist in the reference structure but have never been found in any learnt sample.
3. V-Structures confidence analysis: lists of V-Structures grouped into three colored types:
Black: V-Structures that exist both in the reference structure and in the networks learnt from the samples.
Blue: V-Structures that do not exist in the reference structure but appeared at least once in a sample network. The frequencies are displayed with a negative value.
Red: V-Structures that exist in the reference network but never appeared in any sample network.
4. Comparison structure array: it summarizes all the networks learnt from the samples. Identical structures are gathered together.
The first column is the structure identifier.
The second column is the number of identical structures learnt.
The third column represents the frequency of the whole structure: the number of times the current structure appears divided by the total number of structures.
The last column indicates whether or not the reference structure is included in the current structure.
The report can be saved as an HTML file. It can also be printed. Two other options exist: displaying the graphs and extracting the network.
Graphs
The Graphs button from the report allows displaying the graphical structure comparator. With this tool, data contained in reports can be viewed and interpreted easily.
BayesiaLab's structural learning algorithms are based on heuristic search and can therefore be trapped in local minima. As they are based on different heuristics, those local minima can differ. A first way to optimize the obtained network is to apply every algorithm and keep the network with the lowest score. This data perturbation tool provides another solution: it adds noise to the weights associated with each line of the database in order to try to escape from local minima. The noise is generated using a Gaussian distribution, with mean 0 and a standard deviation set by the user. The selected learning algorithm is applied to the perturbed database, and the score of the final structure is computed using the original weights. A decay factor is applied after each iteration to reduce the standard deviation.
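The perturbation scheme described above can be sketched as follows. The function name and shape are hypothetical; only the idea — Gaussian noise added to the row weights, with the standard deviation shrunk by the decay factor at each iteration — comes from the text:

```python
import random

def perturb_weights(weights, sigma, decay, iterations, seed=0):
    """Illustrative sketch, not BayesiaLab's code: at each iteration,
    add Gaussian noise N(0, sigma) to every row weight (clamped at 0),
    then multiply sigma by the decay factor so that later iterations
    perturb the database less and less."""
    rng = random.Random(seed)
    history = []
    for _ in range(iterations):
        history.append([max(0.0, w + rng.gauss(0.0, sigma)) for w in weights])
        sigma *= decay  # the decay factor reduces the standard deviation
    return history

runs = perturb_weights([1.0] * 5, sigma=0.5, decay=0.5, iterations=3)
```

Each perturbed weight vector would then be handed to the learning algorithm, while the final structure is scored with the original, unperturbed weights.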
Parameters
Select the learning algorithm you want to use, and indicate the initial standard deviation, the decay factor, and the number of tests to perform in the following dialog box:
An output directory can be specified, in which all the learnt networks will be saved. Depending on the chosen learning algorithm, a dialog box displays specific settings:
Analysis report
Once the networks have been learnt, the following report is displayed:
This report is similar to the Arc Confidence report, except for the last table, where a column indicating the structure's mean score has been added. The report can be saved as an HTML file. It can also be printed. Two other options exist: displaying the graphs and extracting the network.
Graphs
The Graphs button from the report allows displaying the graphical structure comparator. With this tool, data contained in reports can be viewed and interpreted easily.
Parameters
The learning algorithm and the number of folds shall be chosen from this dialog box:
The sample size is calculated from the size of the database and the number of folds. The samples are drawn randomly in order to avoid the errors that can occur when the database is sorted. An output directory can be specified, in which all the intermediate networks learnt from the folds are saved together with their corresponding databases.
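The random fold assignment described above can be sketched like this (an illustrative stand-in, not BayesiaLab's code): shuffling the rows before splitting is what avoids the bias of a sorted database.

```python
import random

def random_folds(n_rows, k, seed=0):
    """Assign the database rows to k folds at random; shuffling first
    avoids the bias that arises when the database is sorted."""
    rows = list(range(n_rows))
    random.Random(seed).shuffle(rows)
    # rows[i::k] takes every k-th shuffled row starting at offset i
    return [rows[i::k] for i in range(k)]

folds = random_folds(100, 5)
# Every row lands in exactly one of the 5 folds of 100 / 5 = 20 rows.
```

Each fold then serves once as the held-out sample while a network is learnt on the remaining rows.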
Results
Network's targeted performance is displayed in this window:
The first panel, "results synthesis", displays the global results computed on all the samples:
global precision
R: Pearson's coefficient
R2: squared Pearson's coefficient
relative Gini value
relative Lift value
confusion matrices: for occurrences, for reliability and for precision
The node frequency array indicates how often a node appears in a network built on a fold (whether or not it is directly connected to the target node). The Global report button (in the synthesis panel) displays the cross-validation synthesis report. The tabs contain the targeted performance results of each network learnt on the folds:
The details about the panel contents can be found in the targeted performance report section.
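Among the indices listed in the synthesis panel, Pearson's R and its square R2 are standard quantities; a minimal sketch of their computation (illustrative only, not BayesiaLab's code):

```python
import math

def pearson_r(xs, ys):
    """Pearson's correlation coefficient; R2 is simply its square."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

# A perfectly linear relationship gives R = 1 (and hence R2 = 1).
r = pearson_r([1, 2, 3, 4], [2, 4, 6, 8])
```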
The report is built on the same template as the global targeted evaluation report, except that it summarizes the values of each index and each matrix calculated for each fold. The rest of the report contains the node frequencies.
The last part of the report contains the structural comparison of the reference network with the generated networks:
Its contents are the same as in the arc confidence analysis. This report can be saved as an HTML file and can also be printed. Two other options exist: displaying the graphs and extracting the network.
Graphs
The Graphs button from the report allows displaying the graphical structure comparator. With this tool, data contained in reports can be viewed and interpreted easily.
Caution
Use only when data is scarce. To analyze the structural coefficient, the chosen learning algorithm is tested with different values of the structural coefficient within a given interval. At each iteration, the network structure is learnt on the entire database with an increasing structural coefficient.
Parameters
Select the learning algorithm you want to use, and indicate the minimum and maximum limits for the structural coefficient and the number of iterations to perform in the following dialog box:
The structural coefficient can vary from 0 to 150. Three options are available, allowing the computation at each iteration of:
Structure/Data Ratio: use it for unsupervised tasks. It is the ratio between the natural structural complexity (with a coefficient of 1) of the obtained networks and their data likelihood (with a coefficient of 0).
Target's Precision in %: use it for supervised tasks. A target node is necessary. The precision of the target prediction is computed for each obtained network.
Structure/Target's Precision: use it for supervised tasks, taking into account both the structural complexity and the precision. A target node is necessary.
The tested structural coefficients vary from the given minimum to the maximum in steps of (maximum - minimum) / number of iterations. An output directory can be specified, in which all the learnt networks will be saved.
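The sweep described above evaluates the learning algorithm at evenly spaced coefficient values; a small sketch of the tested grid (the function name is illustrative):

```python
def coefficient_grid(minimum, maximum, iterations):
    """Tested coefficients from the minimum to the maximum, in steps
    of (maximum - minimum) / number of iterations, as described above."""
    step = (maximum - minimum) / iterations
    return [minimum + i * step for i in range(iterations + 1)]

grid = coefficient_grid(0.5, 2.5, 4)  # [0.5, 1.0, 1.5, 2.0, 2.5]
```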
Depending on the chosen learning algorithm, a dialog box displays specific settings:
Analysis report
Once the networks have been learnt on each structural coefficient, the following report is displayed:
This report is similar to the Arc Confidence report, except for the last table, where a column indicating the maximum structural coefficient of the structure has been added. This indicates, among the tested coefficients, the greatest one that produced this structure. The report can be saved as an HTML file. It can also be printed. Three other options exist: displaying the graphs, extracting the network, and displaying the curves.
Graphs
The Graphs button from the report allows displaying the graphical structure comparator. With this tool, data contained in reports can be viewed and interpreted easily.
Curves
The button Curves of the report allows displaying the curves corresponding to the options chosen in the parameters. A dialog box allows the user to choose the curve to display:
Structure/Data Ratio: the structural coefficient can be chosen in the area around the inflection point of the curve, just before a strong increase when reading the graph from right to left.
Target's Precision in % :
This tool automatically creates a network in which all arcs have a frequency greater than or equal to the indicated threshold. With this operation, only the strongest relationships between variables are kept. The network's conditional probability tables are learnt from the initial data.
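The extraction rule above is a simple frequency threshold; an illustrative sketch (all names and frequencies here are hypothetical):

```python
def keep_confident_arcs(arc_frequencies, threshold):
    """Keep only the arcs whose frequency is greater than or equal
    to the indicated threshold."""
    return {arc for arc, freq in arc_frequencies.items() if freq >= threshold}

# Hypothetical frequencies as produced by an arc confidence analysis.
freqs = {("A", "B"): 0.9, ("B", "C"): 0.4, ("A", "C"): 0.75}
strong = keep_confident_arcs(freqs, 0.75)
```

With a threshold of 0.75, only the two arcs that appeared in at least 75% of the learnt networks survive.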
Parameters :
The following panel lets you define the parameters of the analysis.
Selector Node: choose from a drop-down list the variable that will be used as the selector node. The target node is excluded from this list.
Analysis: allows choosing the analysis among:
Mutual Information: computes the conditional mutual information between each variable and the target
Total Effects: computes the total effects of the variables on the target
Standardized Total Effects: computes the standardized total effects of the variables on the target
Linearize Nodes' Values: when this option is checked, the values associated with the states of the nodes are recomputed and sorted in order to have a positive, increasing impact on the value of the target variable.
Regenerate Values: when this option is checked, the values associated with the states of continuous nodes are recomputed from the data in each database, for each state of the selector variable. If linearization is requested, it is done after the generation of the values.
Output Directory: allows specifying a directory in which each network corresponding to each state of the selector will be saved.
Results :
The results are displayed as points positioned on a 2D chart. It is similar to the quadrant chart in the correlation with the target node report or in the total effects report.
The points represent the variables. If a node has a color then the point will be displayed in this color. The name of the node is displayed at the right of the point. When moving the mouse over a point, the
coordinates are displayed in the top panel. The top panel shows the number of points displayed as well as the evidence context. It is possible to zoom in on the chart as in the scatter plots. The mean value of the nodes is displayed along the x-axis, and the result of the chosen analysis along the y-axis. The title of the chart indicates the performed analysis and the current state of the selector node; the states can be selected via the chart's contextual menu. The chart's contextual menu allows:
Displaying the comments of the nodes instead of their names
Displaying the long names of the states
Displaying the scales
Selecting the state to display by choosing one from the proposed list
Copying the chart as an image or as an array of points (text or HTML)
Printing the chart
Actions:
The chart above represents the value of each node for the chosen state. When the mouse is moved over a node, the other nodes disappear and the values of this node for each state of the selector are displayed. In the example, the relevant node is "Tasty" for state 8 of the selector node "Product". The other values of "Tasty" are displayed with the name of the corresponding state next to them.
When the option Display Scales is chosen from the contextual menu, the chart displays for each node a scale between the minimum and maximum values over all the states of the selector. The mean is also indicated by a vertical line. It is therefore easy to see where the value of the node for the current state lies among all the other states, and thus to identify the possible negative and positive margins of progress of the current product for each of its qualities.
By hovering the mouse over the point, we obtain, as previously, the different values of the node for each state, in addition to the scale:
The Export Variations button allows exporting the percentages of negative and positive variation of each node in order to use them in other analyses, thanks to the mean variation editor.
Dynamic Bayesian networks provide a much more compact representation of such stochastic dynamic systems. This compactness is based on the following assumptions:
The process is Markovian, i.e. the variables of time step t depend on a set of variables that belong to a limited set of previous time steps (usually only the previous time step t-1: the first-order Markov assumption).
The system is time invariant, i.e. the probability tables do not evolve over time.
This last assumption is partially relaxed in BayesiaLab thanks to the Time variable, which makes it possible to modify the probability distributions according to the value of the current time step by means of equations. Under the first-order Markov assumption, it is then possible to represent these systems with only two time slices. The screenshot below represents the same network as the unrolled network presented above, without any limitation on the number of time steps. The first slice describes the initial network at time step t0 and the second one describes the temporal transitions to t+1.
The specification of the t+1 slice is carried out by means of temporal arcs (red arcs). A temporal arc indicates that the two connected nodes correspond to the same node at two consecutive time steps. The set of nodes that belong to the t+1 slice is then made of the destination nodes of the temporal arcs and the descendants of these destination nodes. For a Markov assumption of order higher than one, the nodes of time slice t+1 are also linked to nodes belonging to time slices earlier than time slice t. However, temporal arcs can only be used to link nodes from time slices t and t+1. The initial slice is particular because its synchronous arcs (connecting nodes that belong to the t0 slice, such as N0 -> N1 and N0 -> N2 at t0 in the example above) are removed after the first temporal inference. We then have the common representation of dynamic Bayesian networks, in which the nodes of the t slice have no parents.
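Under the first-order Markov and time-invariance assumptions, the whole temporal behaviour is determined by the t0 distribution and a single transition table. A toy sketch of rolling such a model forward (illustrative only, not BayesiaLab's inference engine; the node and state names are hypothetical):

```python
def simulate_markov(initial, transition, steps):
    """First-order temporal sketch: the distribution at t+1 depends
    only on the distribution at time t, and the same (time-invariant)
    transition table is reused at every step."""
    dist = dict(initial)
    for _ in range(steps):
        dist = {s2: sum(dist[s1] * transition[s1][s2] for s1 in dist)
                for s2 in initial}
    return dist

# Hypothetical two-state node with a time-invariant transition table.
p0 = {"on": 1.0, "off": 0.0}
T = {"on": {"on": 0.9, "off": 0.1}, "off": {"on": 0.5, "off": 0.5}}
p1 = simulate_markov(p0, T, 1)  # {"on": 0.9, "off": 0.1}
```

This is exactly why two slices suffice: each new step reapplies the same transition table to the previous distribution.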
Inference
Inference in dynamic networks is not as simple as in static Bayesian networks. BayesiaLab proposes two kinds of inference:
inference based on a junction tree: exact inference for static networks, but it can return approximate results in the dynamic case. In some particular cases, such as the Valve system illustrated below on the left, the inference result is exact because the valves are independent from each other. To get exact results with dependent temporal nodes, it is necessary, in a similar fashion to the one employed for Bayesian updating, to qualitatively indicate these dependencies (without modifying the original conditional probability tables, to keep the possible marginal independence) by simply adding arcs between the dependent nodes, as illustrated below on the right with the red arcs.
inference based on Monte Carlo simulations (particle filtering): approximate inference in both the static and dynamic cases, but the approximation is of the same order in both: it is not related to the dependence of the nodes but is only due to the randomness of the simulation.
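One step of the particle-filtering idea mentioned above can be sketched as follows (a toy illustration with hypothetical transition and likelihood models, not BayesiaLab's algorithm): particles are propagated through the transition model, then resampled in proportion to the evidence likelihood.

```python
import random

def filter_step(particles, transition_sample, likelihood, rng):
    """One particle-filtering step: move each particle with the
    transition model, then resample in proportion to how well it
    explains the current evidence."""
    moved = [transition_sample(p, rng) for p in particles]
    weights = [likelihood(p) for p in moved]
    return rng.choices(moved, weights=weights, k=len(moved))

rng = random.Random(0)
# Hypothetical chain: a binary state flips with probability 0.1.
step = lambda s, r: 1 - s if r.random() < 0.1 else s
# Hypothetical evidence that strongly favours state 1.
like = lambda s: 0.9 if s == 1 else 0.1
particles = filter_step([1] * 1000, step, like, rng)
```

The fraction of particles in each state approximates the filtered distribution; the error comes only from the randomness of the simulation, as noted above.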
Temporal Simulation
reset the network (the synchronous arcs and the probability tables of the t0 slice are reset). The network is also reset after new probability distributions are entered for the temporal parent nodes, and when switching to Modeling mode.
time meter indicating the current time step. This meter is also used to enter the number of time steps to reach, i.e. the length of the simulation. If this number is lower than the current time step, the network is reset; otherwise, the simulation is carried out from the current time step to the entered number. The simulation can be stopped by using the red light in the status bar. If a temporal evidence scenario file is associated with the network, a right click on the index text field displays the list of evidence sets contained in the file. A click on a line performs the temporal simulation from the current index to the specified index, taking into account the corresponding evidence:
simulate one temporal step at a time.
graphical view of the probability evolution of the temporally spied nodes, as illustrated below.
If a temporal evidence scenario file is associated with the network, the evidence corresponding to each time step is taken into account. It is possible to add a set of evidence for the current time step by pressing the button located in the monitor's toolbar. Note that the way the probability distribution of a node is set changes according to the chosen inference: a fixed distribution with exact inference, or the computation of the corresponding likelihoods with approximate inference.
Caution
Sometimes a fixed probability distribution cannot be achieved exactly because the algorithm used fails to converge towards the target distribution. In this case, a warning dialog box is displayed and an information message is also written in the console. The temporal chart button displays the following window:
For each spied state, the mean of the probability over the simulated period appears in the caption. If the network has utility nodes, the mean of the expected value of each utility node appears in the upper right corner, as well as the total of these expected values. A right click on the graph activates a contextual menu that allows:
having a relative view of the graph (the Y-axis takes into account the minimal and maximal values instead of 0 - 100 as in the default view)
printing the graph
copying the graph; it can be pasted as an image or as a data array into external applications. The points can also be saved directly into a text file.
During the temporal inference, after each time increment, the probability distributions described by the equations that use this parameter are updated. Note that in Modeling mode, its value is set to zero (the resulting probability tables are then computed by replacing t with zero). The use of this parameter transforms the network into a dynamic Bayesian network.
11. Options
Console: Allows managing the console (open, clear and save as) Settings: Allows managing the preferences of BayesiaLab
11.1. Settings
The different settings of BayesiaLab are located in the following dialog box that you can reach through the menu Options>Settings:
The left side displays a tree in which you can select the settings you want to display. Each branch of the tree can have sub-branches that can also be selected. The right side displays the selected settings. It contains a Default values button that loads the default values for the selected settings only. The Apply button validates the displayed settings modifications only. The dialog box also contains two buttons: the Accept button validates all the settings that have been modified, displayed or not, and closes the dialog box; the Cancel button cancels all the modifications of the settings, displayed or not, if they have not already been validated, and closes the dialog box.
General
This item is visible only if your license allows it. It is also not available for Mac OS X.
When an instance of BayesiaLab is already running and the user opens BayesiaLab again or double-clicks an XBL file, the previous instance is automatically reused. The following options allow managing this behavior:
Allow opening multiple instances of BayesiaLab: when an instance of BayesiaLab is already running, it is possible to open a new instance if the license allows it. Opening a new instance can be done either by running BayesiaLab directly or by double-clicking an XBL file, for example (depending on the second option). It works as well from the command line on any platform. If this option is unchecked, the opened instance is always reused and the second option is always checked.
Reuse opened BayesiaLab when double-clicking on XBL files: when an instance of BayesiaLab is already running, it is possible to reuse it when the user double-clicks an XBL file (or uses the command line). If both the first option and this option are checked, running the program directly opens a new instance of BayesiaLab, while double-clicking an XBL file reuses an already opened instance. It works as well from the command line on any platform.
BayesiaLab needs to be restarted for the changes to take effect.
Language
Allows choosing BayesiaLab's language. The following languages are available: 1. English 2. Spanish 3. French 4. Japanese 5. Chinese. BayesiaLab needs to be restarted for the language modification to take effect.
Display
Customize the Bayesian networks windows:
Font smoothing: the drawing of the fonts is smoothed.
Display background image: displays or hides the background image of the Bayesian network windows. Images can be changed using the contextual menu.
Background color: double-clicking the colored square allows choosing the background color of the Bayesian network windows.
Positioning grid spacing: indicates the size in pixels of the grid spacing.
Default Node Font: indicates the default font used to display the nodes' names for all networks. This font can be modified for each network through the network's contextual menu.
Default Comment Font: indicates the default font used to display the comments if no font is manually specified in the comment. This font can be modified for each comment in the comment editor.
Console
Customize the console:
Auto save console: automatically saves the content of the console.
Background color: double-clicking the colored square allows choosing the color of the console background.
Text color: double-clicking the colored square allows choosing the color of the console text.
Menus
Customize the menus:
Recent network list size: allows specifying the number of recently opened networks to keep in the menu Networks>Recent networks.
Recent database list size: allows specifying the number of recently opened databases to keep in the menu Data sources>Recent databases.
Display Copy Format Choice When Copy: displays, when a table, a graph, monitors, or other objects are copied, a dialog box asking in which of the available formats the user wants to copy. Depending on the copied object, it is possible to copy as plain text, HTML, or image. Under Windows, it is possible to uncheck this option, since the clipboard system allows copying several formats at the same time, the choice of the format being made when pasting.
Copy as Images with Transparent Background: when a copy as image is done on a network or on monitors (for example), the background of the image is transparent instead of white. On some software, such as MS Word, a transparent background is sometimes interpreted as black. In these cases, you can uncheck this option, or perform the copy in another application that supports transparency directly (like MS Excel or MS PowerPoint) and copy the image again into the target application.
Perspectives allows the user to configure the different menus of BayesiaLab. Each menu of the menu bar is listed in the column Component and can be made visible or not, by checking or unchecking the corresponding option in the column Visible. It is also possible to configure the keyboard shortcut of each menu item: the column Shortcut shows the current shortcut associated with the item. Some components, such as Settings and Perspectives, cannot have their visibility modified, for obvious reasons. The last column, Count, shows how many times each action has been performed; it is useful for understanding how BayesiaLab is used, and a special tool allows analyzing these results. It is possible to edit the shortcut of a menu by selecting the component in the list, for example Target's Optimization Tree:
Once the component is selected, the buttons and the field in the bottom part become available. You can then choose none, one, or several modifiers among Ctrl, Meta (for Mac users), Alt, Shift, and Alt Graph. Click on the field and type the key you want for the shortcut. In our example, we want to create the shortcut Ctrl + O, so the button Ctrl is pressed and the key O is typed in the text field. The interface indicates that this shortcut is already used by the Open menu, so we change the keystroke by pressing the button Alt in order to obtain Ctrl + Alt + O.
To remove a shortcut, select it and click in the text field; once the focus is on the text field, press the key (not the buttons). The button Load allows importing a perspective file. The button Save allows exporting the current perspective into a text file. To validate all the changes, you must click on the Apply or Accept button.
Directories
Customize the user directories:
Allows choosing the directories that will be used for the graphs, the databases, the images, and other files (reports, etc.). It is possible to use the same directory for all of them by selecting the corresponding option.
The option that fixes the different paths prevents the paths from being updated when one of them changes while opening or saving files.
Editing
Customize network editing:
Maximum state number allowed: specifies the maximum number of states allowed for a node. This number is used in database import and in node editing.
Warn when the Data of a Deleted Node will be Lost: displays a warning when the user wants to delete a node that has data associated with it, which will be lost.
Warn when CPTs are lost when inverting arc: displays a warning when inverting an arc, indicating that the content of the conditional probability table will be lost.
Warn when the size of a CPT is too big: checking this option activates the option below, Size allowed before warning, which makes it possible to specify the CPT size that is allowed before the overflow is reported. This option helps avoid the creation of overly large CPTs, which decrease performance and use too much memory.
Automatically back to selection mode: returns to selection mode after any action (creation or deletion of a node or an arc).
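As a rough illustration of why large CPTs are costly, the number of cells in a node's table is its own state count multiplied by the product of its parents' state counts (a hypothetical sketch, not BayesiaLab code):

```python
from math import prod

# Illustrative arithmetic only: a CPT has one column per node state
# and one row per combination of parent states.
def cpt_size(node_states, parent_states):
    return node_states * prod(parent_states)

# A 4-state node with three 5-state parents already needs
# 4 x 5 x 5 x 5 = 500 cells.
print(cpt_size(4, [5, 5, 5]))  # 500
```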
Data
Import exact values of the continuous variables: allows choosing whether the continuous values of the discretized variables are loaded in memory. These values are used by the Graph module, and also to compute the mean and standard deviation in the monitors. Warning: this option can use a lot of memory.
Save Database with Network: allows the user to save the database associated with a network in the same file, in order to be able to reopen them together. When a network is opened with an associated database, an option in the file chooser allows the database to be loaded or not.
Save Evidence Scenario File with Network: allows the user to save the evidence scenario file associated with a network in the same file, in order to be able to reopen them together. When a network is opened with an evidence scenario file, an option in the file chooser allows the evidence scenario file to be loaded or not.
Default completion: completion mode for missing values used by default during the import process.
Default discretization: discretization mode used by default during the import process.
Default interval number: number of intervals used by default for discretization during the import process.
Minimum Interval in Database Size Percent for KMeans Discretization: indicates the minimum population an interval needs in order to be kept during KMeans discretization when importing.
KMeans Filter Size in Percentage: gathers data into packets of the given size in order to remove the effect of outliers (abnormal values). A size of 0 deactivates the filter.
Set State Value to Interval Mean at Import: for continuous nodes, sets the value associated with each state to the corresponding interval's mean at import time. The same applies to a variable added during association.
Set State Value to Interval Mean at Associate: for continuous nodes, sets the value associated with each state to the corresponding interval's mean at association time. Added variables follow the same option as with import.
Weight normalization: if the database has a weight column, two alternatives are available: no normalization, where the weight of each line is the one in the database; and normalization, where the weight of each line is normalized so that the sum of the weights equals the number of lines of the database multiplied by the user-defined normalization factor.
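The normalization alternative can be pictured with a short sketch (illustrative Python, not BayesiaLab's implementation; the function name is made up):

```python
# Scale the weight column so that the weights sum to
# (number of lines) x (normalization factor).
def normalize_weights(weights, factor=1.0):
    target = len(weights) * factor
    total = sum(weights)
    return [w * target / total for w in weights]

raw = [2.0, 1.0, 1.0]            # weight column read from the database
print(normalize_weights(raw))    # sums to 3.0: 3 lines, factor 1.0
```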
Data > Import & Association > Missing and Filtered Values
Default Missing Values: defines the default list of values that will be considered missing values during import and association. It is of course possible to modify this list during import or association.
Default Filtered Values: defines the default list of values that will be considered filtered values during import and association. It is of course possible to modify this list during import or association.
It is possible to choose the column separator. It is also possible to choose the comma or the dot as the decimal separator symbol.
It is possible to define whether the dot is used to indicate the end of the line. The output file encoding can be changed in the combo box, according to the available character sets. When the chosen encoding is UTF, it is possible to write or not the Byte Order Mark at the beginning of the file.
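The export choices above (separator, decimal symbol, encoding, and optional Byte Order Mark) can be sketched as follows; the function and its defaults are hypothetical, not BayesiaLab's code:

```python
# Write one text line per data row, honoring the chosen separator,
# decimal symbol, and encoding, optionally prefixed by a UTF BOM.
def export_rows(rows, sep=";", decimal=",", encoding="utf-8", bom=False):
    lines = []
    for row in rows:
        cells = [str(v).replace(".", decimal) if isinstance(v, float) else str(v)
                 for v in row]
        lines.append(sep.join(cells))
    data = "\n".join(lines).encode(encoding)
    if bom and encoding.lower().startswith("utf"):
        data = "\ufeff".encode(encoding) + data  # Byte Order Mark
    return data

print(export_rows([["x", 1.5]], bom=True))  # BOM bytes, then b"x;1,5"
```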
Reporting
Customize reports generation:
Display the node's color in the cell's background: if the node has a color associated with it, this color is used as the background color of the cells containing the node's name.
Display node comments in tables: adds a column to the reports containing the comment of each node.
Display node comments in tool tips: allows displaying the comment associated with each node in a tool tip when moving the mouse over the node. However, if you need to open the saved HTML file with external software such as Excel, it can be necessary to disable this option.
Automatic Layout
Customize automatic layout:
Allows choosing the automatic layout algorithm associated with the shortcut, between the Dynamic layout algorithm and the Symmetric layout algorithm.
Multi-level: hierarchical algorithm in which the initial network is recursively decomposed into sub-networks. The symmetric algorithm is then applied in an ascending fashion to each of the obtained networks.
Repulsion factor: indicates the repulsion force between the nodes.
Optimization factor: optimizes the precision of the layout.
Number of races: number of distinct populations used in the genetic algorithm.
Size of the populations: number of solutions in each race.
Individual mutation rate: mutation rate associated with each solution.
Gene mutation rate: if an individual mutates, each of its genes can mutate with this rate.
Selection rate: rate of the solutions that will be replaced by new solutions generated by crossover.
Family relationships weight: importance attached to having child nodes below their parents.
Angle weight: importance attached to vertical arcs.
Node overlapping weight: importance attached to preventing node overlapping.
Arc length weight: importance attached to reducing the difference between the arc's length and the length indicated by the slider.
Intersection weight: importance attached to preventing arc intersections.
Class weight: importance attached to reducing the distance between nodes of the same class.
Analysis
Customize analysis
Maximum Cluster Size: indicates the maximum number of clusters automatically chosen by the algorithm.
Stop Threshold: indicates the threshold corresponding to the maximum KL weight a cluster can have to be kept.
Inference
Customize inference:
Display probability variations: arrows are displayed on the monitors in order to indicate the difference between a node's previous marginal probability and its current one.
Network complexity reduction rate: indicates the threshold from which the complexity reduction algorithm must act when the network is too complex to allow exact inference. A low reduction rate leaves the network strongly connected, and inference will be more time-consuming; the opposite holds for a high rate. In all cases, if the algorithm needs to run, the user will be informed.
Likelihood Weighting iteration number: indicates the number of samples used to compute the probabilities. The higher this number, the more precise the computed probabilities, but the longer the computation time.
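For readers unfamiliar with Likelihood Weighting, here is a minimal sketch on a made-up two-node network A -> B (the CPT values are invented for illustration): evidence nodes are not sampled, and each sample is weighted by the likelihood of the evidence, which is why more samples give a more precise but slower estimate.

```python
import random

P_A_TRUE = 0.3                         # hypothetical prior P(A = true)
P_B_GIVEN_A = {True: 0.8, False: 0.1}  # hypothetical P(B = true | A)

def estimate_p_a_given_b_true(n_samples, seed=0):
    rng = random.Random(seed)
    num = den = 0.0
    for _ in range(n_samples):
        a = rng.random() < P_A_TRUE    # sample the non-evidence node
        weight = P_B_GIVEN_A[a]        # likelihood of evidence B = true
        den += weight
        if a:
            num += weight
    return num / den

# The exact value is 0.24 / 0.31, about 0.774; the estimate
# approaches it as the number of samples grows.
print(estimate_p_a_given_b_true(100000))
```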
Learning
Customize learning:
Repaint while learning: displays the intermediate states of the Bayesian network during the learning process.
Smoothed probability estimation: the conditional probability tables can be estimated with a smoothing parameter. This parameter corresponds to the initial virtual occurrences that are generated in agreement with the probability law described by the fully unconnected Bayesian network with uniform probability distributions, i.e. each variable is independent and all the values of the variables are equiprobable.
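The smoothing parameter can be read as virtual uniform counts added to the observed counts before normalization. A hedged sketch under that reading (illustrative, not BayesiaLab's code):

```python
# s virtual occurrences are spread uniformly over the k states,
# so each state receives s / k virtual counts before normalization.
def smoothed_distribution(counts, s=1.0):
    k = len(counts)
    total = sum(counts) + s
    return [(c + s / k) / total for c in counts]

# With observed counts [8, 0] and s = 2, the never-observed state
# still gets a non-zero probability: [0.9, 0.1].
print(smoothed_distribution([8, 0], s=2.0))
```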
Temporal threshold to estimate missing values: defines the maximum duration between two missing-value estimations during structural learning. Beyond this threshold, a dialog box allows stopping the learning. This option is very useful for learning a Bayesian network whose structure may be suboptimal but that allows carrying out exact inference in a time proportional to the defined threshold (the time between two learning time steps corresponds to the inference of the missing values with the current network).
Maximum number of iterations for the completion process: defines the maximum number of iterations for the missing-value estimation algorithms. The higher this value, the better the probability estimation. Caution: the time necessary to estimate the probabilities also increases proportionally to this parameter.
Taboo list size: defines the forbidden-states list size for the Taboo algorithm.
Taboo Order list size: defines the forbidden-states list size for the Taboo Order algorithm.
Maximum Drift: indicates the maximum difference between the cluster probabilities during learning and those obtained after missing-value completion, i.e. between the theoretical distribution during learning and the effective distribution after imputation over the learning data set.
Minimum Cluster Purity in Percentage: defines the minimum purity allowed for a cluster to be kept.
Minimum Cluster Size in Percentage: defines the minimum size allowed for a cluster to be kept.
Discount factor: this factor is used to mathematically bound the sum of the reinforcements that will be received over an unlimited horizon. It is also used to weight the future with respect to the present (0 <= factor <= 1): 0 reduces the action quality value to the expected direct reinforcement, whereas 1 implies that the quality value is the total sum of all the reinforcements received over the (finite) horizon.
Adaptive learning rate: allows the automatic adjustment of the learning rates during learning. If several consecutive modifications have the same direction, the corresponding rate is increased; otherwise, it is reduced. This option can speed up the convergence of the learning algorithm.
Learning rate: rate used in the update rule (0 <= rate <= 1): 0 prevents learning, while 1 implies that the error term fully impacts the quality value.
Initial exploration rate: rate used to balance Exploration and Exploitation during learning. Exploration means that actions are chosen randomly, whereas Exploitation means that actions are chosen according to their quality value (0 <= rate <= 1): 0 prevents exploring, while 1 implies a fully random policy. This rate is automatically decreased during learning to converge toward a pure exploitation phase.
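The interplay of the learning rate and the discount factor can be illustrated with a generic quality-value update of the Q-learning family (a sketch under that assumption; BayesiaLab's exact update rule may differ):

```python
# q is the current quality value of a state/action pair; the target
# combines the direct reinforcement with the discounted best quality
# of the next state.
def update_quality(q, reward, best_next_q, learning_rate, discount):
    target = reward + discount * best_next_q
    return q + learning_rate * (target - q)  # error term scaled by the rate

# learning_rate = 0 leaves q unchanged; learning_rate = 1 replaces it
# with the target entirely; discount = 0 keeps only the direct reward.
print(update_quality(0.0, reward=1.0, best_next_q=0.5,
                     learning_rate=0.5, discount=0.9))
```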
Statistical Tools
Customize the statistical tools:
Used Independence Test: allows the user to choose between the Chi² test and the G-test as the independence test used in mosaic analysis, the correlation with the target node report (network and data), total effects on the target (network and data), and the relationship analysis report (data).
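The two statistics agree asymptotically but are computed differently; here is a self-contained sketch on a made-up 2x2 contingency table (for illustration only, not BayesiaLab's implementation):

```python
import math

# Pearson Chi² sums (O - E)² / E; the G-test sums 2 * O * ln(O / E),
# where E is the expected count under independence.
def chi2_and_g(observed):
    row_totals = [sum(row) for row in observed]
    col_totals = [sum(col) for col in zip(*observed)]
    n = sum(row_totals)
    chi2 = g = 0.0
    for i, row in enumerate(observed):
        for j, o in enumerate(row):
            e = row_totals[i] * col_totals[j] / n
            chi2 += (o - e) ** 2 / e
            if o > 0:
                g += 2 * o * math.log(o / e)
    return chi2, g

chi2, g = chi2_and_g([[30, 10], [20, 40]])
print(round(chi2, 3), round(g, 3))  # close but not identical values
```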
12. Help
Help: displays this help.
Context Help: changes the mouse cursor into a ?; the user can then click on the component he wants help on.
Feedback Form: online form for evaluating BayesiaLab.
Use Analysis: allows analyzing the use of BayesiaLab.
About BayesiaLab: displays information on the current version of BayesiaLab.
When you select a menu or submenu in the tree on the left, the chart then shows only the use of the selected menu and its submenus. In the following figure, the Analysis menu is selected:
To see the detailed chart, simply press the button Display Detailed Use, which displays a window containing a larger chart as well as the detail of the use on the right-hand side.
When you click on the second option, the chart displays the number of times each function was used, according to the menu selected in the tree:
To see the detailed chart, simply press the button Display Detailed Use. The right part shows, for each sector, the number and percentage of uses compared to the sum of all uses.
The button Display Table opens a new window containing a table of all of BayesiaLab's menus with their name, icon, visibility, shortcut, and number of uses. When a row is selected, pressing the Help button displays the help corresponding to the function, if it exists. The table can be sorted by clicking on each column's header.
Network toolbar
Create a new graph window
Open a Bayesian network (in a new graph window)
Save a graph
Print a graph
Edit toolbar
Cut (modeling mode)
Copy
Paste (a right-click before pasting allows specifying where to paste) (modeling mode)
Undo the last possible action
Redo the last undone action
Find
View toolbar
Zoom in
Zoom out
Default zoom
Resize and center the Bayesian network on the current window
Rotate the graph left around the selected node, or the center of the graph if no node is selected
Rotate the graph right around the selected node, or the center of the graph if no node is selected
Display toolbar
Display or hide the node comments
Display or hide the arc comments
Toolbars
Display or hide the color tags of nodes
Display or hide the color tags of arcs
Display or hide the images associated with the nodes
Hide or display the comment indicators of nodes and arcs
Display or hide the orientation of the arcs (validation mode)
Creation toolbar
Selection mode
Node creation mode (modeling mode)
Constraint node creation mode (modeling mode)
Utility node creation mode (modeling mode)
Decision node creation mode (modeling mode)
Arc creation mode (modeling mode)
Deletion mode (modeling mode)
Zoom in monitors
Zoom out monitors
Default monitor zoom
Remove all the observations
Remove all the monitors
Remove all probability shifts
Referencing of the probability shifts
Display the maximum probability shifts
Add the current set of evidence to the current evidence scenario file
Display the graphic representing the evolution of the temporally spied nodes
Temporal inference and dynamic learning of policies toolbar (validation mode + temporal variable or temporal arc + decision node)
Deactivate the exploration during the temporal simulation
Activate the exploration (testing some random actions) during the temporal simulation
No learning of the state/action qualities during the temporal simulation
Learning of the state/action qualities during the temporal simulation
Go back to the first case of the database (index 0)
Go back to the previous case
Go to the next case
Go up to the last case
Stop the interactive inference
Go back to the first case of the database (index 0)
Go back to the previous case
Go to the next case
Validate the current updating
Stop the interactive updating
Go back to the previous threshold
Go to the next threshold
Store the current arc forces in the arc comments
Stop the arc force analysis
Arc's mutual information analysis toolbar (validation mode + arc's mutual information)
Go back to the previous threshold
Go to the next threshold
Store the current information in the arc comments
Stop the arc's mutual information analysis
Go back to the previous threshold according to the selected force
Go to the next threshold according to the selected force
Compute only the incoming force of the nodes and display it if greater than the given threshold
Compute the global force of the nodes and display it if greater than the given threshold
Compute only the outgoing force of the nodes and display it if greater than the given threshold
Stop the node force analysis
Display the current clustering as a dendrogram
Validate the current clustering
Stop the variable clustering
Correlation with the target node toolbar (validation mode + correlation with the target node)
Correlation with a state of the target toolbar (validation mode + correlation with a state of the target node)
Combo box to choose the kind of neighborhood
Field to modify the neighborhood depth
Stop the neighborhood analysis
Chapter V. Search
Searching nodes
The Nodes tab displays the node search user interface. The editable combo box allows entering the name of the node, or of its class, that we are searching for. We can also use the special characters * and ?: the character * represents a series of 0 to n unspecified characters; the character ? represents exactly one unspecified character. The check box Case sensitive means that capital and small letters will be interpreted as-is. The search options allow searching among the nodes only, the classes only, or both.
Examples: if we want to find all the nodes beginning with "Comb", we enter the following search: "Comb*" (without the quotes); if we want to find all the nodes containing the letters "A" and "B" separated by a single character, we enter: "*A?B*" (without the quotes). Once the search is done, the list of the found nodes and the number of nodes are displayed. By selecting a node in the list, that node begins to blink in the graph window. If the window is closed, the current selection remains displayed.
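The * and ? wildcards behave like shell-style patterns, so the examples above can be reproduced with Python's fnmatch module (the node names here are made up):

```python
import fnmatch

nodes = ["Combined", "Comb", "AxB_node", "Other"]

# Case-sensitive wildcard match, as with the Case sensitive box checked.
def search(pattern, names):
    return [name for name in names if fnmatch.fnmatchcase(name, pattern)]

print(search("Comb*", nodes))   # ['Combined', 'Comb']
print(search("*A?B*", nodes))   # ['AxB_node']
```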
Searching arcs
The Arcs tab displays the arc search user interface. The search is done from the names of the extremity nodes and the orientation of the arc. The user interface for each node is the same as for the search of a single node. As for the nodes, it is possible to specify the search field of each extremity (nodes, classes, or both). Three buttons indicate the orientation of the arcs we want to find.
Example: if we want to find all the arcs whose end node begins with "Comb", we enter the following search for the second node: "Comb*" (without quotes), choose the convenient orientation, and enter "*" for the first node. If you also want to add the arcs whose start node begins with "Comb", simply press the button.
Once the search is done, the list of the found arcs and the number of arcs are displayed. By selecting an arc in the list, that arc begins to blink in the graph window. If the window is closed, the current selection remains displayed.
Searching monitors
The Monitors tab displays the monitor search user interface. You must obviously be in validation mode and have monitored nodes. The editable combo box allows entering the name of the node, or of its class, that we are searching for. We can also use the special characters * and ?: the character * represents a series of 0 to n unspecified characters; the character ? represents exactly one unspecified character. The check box Case sensitive means that capital and small letters will be interpreted as-is. The search options allow searching among the nodes only, the classes only, or both.
Examples: if we want to find all the monitors of nodes beginning with "Cap", we enter the following search: "Cap*" (without the quotes); if we want to find all the monitors of nodes containing the letters "A" and "B" separated by a single character, we enter: "*A?B*" (without the quotes). Once the search is done, the list of the found monitors and the number of monitors are displayed. By selecting a monitor in the list, that monitor begins to blink in the monitor windows. If the window is closed, the current selection remains displayed.
As the console is editable, it is possible to insert messages into these results (thanks to the text zone at the top of the console window). The console can be saved and cleared.