DOI 10.1007/s00180-007-0023-6
ORIGINAL PAPER
Excel :: COM :: R
1 R
T. Baier
Department of Statistics, Vienna University of Technology, Vienna, Austria
E. Neuwirth (B)
Department of Scientific Computing,
University of Vienna, Vienna, Austria
e-mail: [email protected]
92 T. Baier, E. Neuwirth
presented with a GUI which is not very common for modern applications. User
interaction does not follow the model of menu- and dialog-driven application
software. In the case of R, the user is interacting with an interpreter for an
object-oriented statistical programming language. A command prompt presents
a very steep learning curve for the beginner and requires quite an investment of
time by users new to the system. Packages like Rcommander (Fox et al. 2005)
try to assist in learning R by providing a menu-driven GUI, but from the user’s
point of view, the command prompt is still the main user interface. Many users
familiar with applications like OpenOffice (OpenOffice.org 2006) or Microsoft
Office do not want to invest the time necessary to learn to use R.
So there is a large group of potential users who cannot use R because it
either is too complex or at least seems to be too complex. In addition to that,
there are tasks where the command line is not the ideal way of user interaction.
While R contains an integrated spreadsheet component which can be used to
comfortably enter and edit tabular data, users familiar with Microsoft Excel
or other full featured spreadsheet programs will miss many features they have
grown accustomed to. Therefore we decided to connect Microsoft Excel with
R. The software package implementing (among other things) this connection
is called R (D)COM Server V2.01 (Baier and Neuwirth 2005) and can be
downloaded from http://www.cran.r-project.org/ in the section Other.
2 Excel
3 COM
COM components are used very similarly to objects from a class library. The
component author can decide whether the component is integrated into the
client application (running in the same process context as the client
application) or whether it runs in a separate process. In the latter case, COM
transparently handles sharing components across process boundaries and so
allows components provided by one executable file or program to be integrated
into another one. For a client application (like Microsoft Excel) the component
is treated just like an object provided by the application itself. Using this
component is then similar to calling an internal Visual Basic for Applications
(VBA, Microsoft Corporation 2001b) function or object. Therefore, with respect
to the programming language, nothing new has to be learnt; the programmer just
has access to additional objects. Of course, the properties and methods of the
object itself are new. The integration into the VBA IDE, however, is so tight
that it is even possible to use the integrated object browser in the IDE to
browse objects provided via COM.
One of the major advantages of COM is its wide support on the Microsoft
Windows platform. Nearly every programming language, e.g. Visual Basic
(Microsoft Corporation 2001c), Delphi (McNab et al. 1996) or C++, scripting
languages like JavaScript (Flanagan 2001), Perl (Wall et al. 1996, or more
specifically ActiveState Tool Corporation 2000 for a Windows version able to
use COM) or Python (Martelli 2003), and even applications providing macro
support, such as the Microsoft Office family of products (VBA), support
using the functionality exposed through COM objects by a server application.
DCOM stands for distributed COM and extends the COM model with a very
important feature. While COM itself provides methods for performing function
or method calls across process boundaries, DCOM goes one step further and
makes COM objects transparently available in a network of computers. In our
previous example of Microsoft Excel utilizing R as a computational component,
DCOM allows the component (R) and the client application (Microsoft Excel) to
run on different machines. DCOM takes care of the necessary communication
over the network to make the services exposed by the R component available to
Excel.
COM requires developers to separate interface and implementation. Along
these lines we have defined a COM interface called IStatConnector (see
Sect. 4) which formally defines the functionality our COM server provides.
Client applications work with the COM interface, and the actual binding to the
implementation is done when the COM object is instantiated (while the
application is running). This separation of interface and implementation,
together with the run-time binding mechanism, makes it possible to ensure
compatibility between different versions of the COM server. For example, client
applications created in 1999 for the first release of our COM server for R still
work without modifications with the current version released in October 2005.
The COM interface defines the functions (and variables) a COM object
provides. An implementation of a COM interface is called a coclass.
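The separation of interface and coclass, and the binding at object creation time, can be illustrated outside of COM. The following is only a rough Python analogy, not COM code: the class names, the registry dictionary and the names "Demo.Background" and "Demo.Foreground" are hypothetical, and only the method names init and close mirror the Init and Close methods of IStatConnector.

```python
from abc import ABC, abstractmethod


class StatConnectorInterface(ABC):
    """Plays the role of a COM interface: signatures only, no behaviour."""

    @abstractmethod
    def init(self, connector_name: str) -> None: ...

    @abstractmethod
    def close(self) -> None: ...

    @abstractmethod
    def is_running(self) -> bool: ...


class BackgroundServer(StatConnectorInterface):
    """One hypothetical 'coclass': an engine without any GUI."""

    def __init__(self) -> None:
        self._running = False

    def init(self, connector_name: str) -> None:
        self._running = True

    def close(self) -> None:
        self._running = False

    def is_running(self) -> bool:
        return self._running


class ForegroundServer(BackgroundServer):
    """A second hypothetical 'coclass' that would also show a console."""

    shows_console = True


# Toy stand-in for the system registry: names map to implementations.
REGISTRY = {
    "Demo.Background": BackgroundServer,
    "Demo.Foreground": ForegroundServer,
}


def create_object(name: str) -> StatConnectorInterface:
    """Bind to an implementation at run time, when the object is created."""
    return REGISTRY[name]()


def run_client(name: str) -> bool:
    """Client code that is written against the interface only."""
    connector = create_object(name)
    connector.init("R")
    started = connector.is_running()
    connector.close()
    return started and not connector.is_running()
```

The client function never names a concrete class, so exchanging the implementation only requires passing a different name to create_object.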
In the last few years, Microsoft has developed a new component technology
as part of the .NET (Microsoft Corporation 2001a) system. Why we chose COM
as our component technology is easy to explain:
• COM and DCOM can be used on all 32-bit Windows platforms (Windows
9x, ME, NT 4, 2000, XP and even Windows CE/Pocket PC).
• Mature COM support is found in most programming languages and
applications, while good support for .NET’s component technology is still
not found very often. Even Microsoft Excel does not have native .NET support
at the moment.
• When the concepts for integrating R into Excel were developed back in
1999, .NET was not available at all.
• The .NET → COM bridge technology allows our COM components to be used
from .NET applications quite well.
For the near future we want to stay with COM as our base component
technology, but to make things easier in the new .NET environment, we will
develop native .NET components both for the computational components and for
the controls and applications. Since the .NET technology is fully documented
and is also implemented on non-Microsoft platforms (Mono Project 2006), this
will allow our mechanism to be used on a wider range of operating systems.
Our goal was to make R’s computational engine available to third party
applications in general, and to Microsoft Excel in particular. In COM
terminology, this makes Excel a COM client application and R a COM server.
Whenever we talk about COM, this also includes the DCOM technology. COM
clients do not distinguish between COM and DCOM when accessing an object’s
methods and properties. Only the client machine’s configuration and the process
of object creation may be different.
In this integrated system with components Excel and R, Excel (the client)
is the controlling part, whereas R (the server) offers its services on request to
Excel. Figure 1 shows the connection between the two applications including
data flow.
The COM server provides R’s functionality through a COM interface called
IStatConnector. Below we show relevant parts of this COM interface.
interface IStatConnector : IDispatch
{
// starting and stopping the interpreter
HRESULT Init([in] BSTR bstrConnectorName);
HRESULT Close();
...
};
another application (Excel in this case) and completely hide R’s own GUI from
users.
The COM interface IStatConnector used by Excel (or more specifically, by
RExcel) is separated from the interface’s implementation, the coclass
StatConnector. The package rcom makes use of this concept and provides an
alternative implementation of the IStatConnector interface which displays
R’s “normal” GUI and allows manual interaction with R in parallel to using the
COM client. By simply changing the object creation mechanism in Microsoft
Excel, we can exchange the COM implementation and provide access to an R
process with its own GUI accessible to the user, while R is still integrated into
Excel.
The COM server (the coclass StatConnector) mainly consists of two parts.
The first part is tightly coupled with the implementation of R itself; the second
is the “real” implementation of the COM interface IStatConnector. It is not
the goal of this article to give a detailed description of either the R
implementation or the COM implementation. We will only provide a short
description of the design goals and the advantages (and disadvantages) of our
approach.
On Windows platforms, as well as on most other operating systems, GCC
(see Stallman 2005) is used to build the R executables and libraries from source
code. When we started our project, there was not much support for creating
COM servers using GCC. Commercial compilers, on the other hand, provided
good support for creating COM server applications and contained class
libraries that made creating COM servers an easy task. Unfortunately, unlike on
most Unix-alike platforms, interoperability between different vendors’ C and
C++ compilers is not easily possible. Only by using the so-called system
calling convention, which is defined for C code only, was it possible to create
implementations with one compiler (e.g., with GCC) which could safely be
called from an implementation compiled with a different compiler (in our case
Microsoft VC++). The abstraction layer between the two parts therefore only
uses C functions (no COM or C++) and utilizes Microsoft Windows’ system
calling convention. This guarantees that the functions can safely be called from
a C program compiled with any C compiler for Windows. The abstraction layer
(referred to below as the proxy object SC_Proxy_Object) consists of a set of
pointers to C functions stored in a structure. It defines not only functions to
access R but also a data format which maps the R-specific internal storage
format (SEXPs, see R Development Core Team 2005f for more information) to
the so-called BDX data format (Binary Data eXchange format), designed
specifically for this purpose. This data format has been designed for efficient
(structured) data exchange with as few memory/conversion operations as
possible.
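The idea of a proxy object, a table of entry points plus an engine-neutral data format, can be sketched as follows. This is a Python analogy under stated assumptions, not the layout of SC_Proxy_Object or of BDX: the names ProxyTable, ExchangeArray and make_toy_engine, the three entry points, and the toy command "seq N" are all hypothetical.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple, Union

Scalar = Union[bool, float, str]


@dataclass
class ExchangeArray:
    """Hypothetical engine-neutral container (a stand-in for BDX)."""
    dims: Tuple[int, ...]   # e.g. (2, 3) for a 2 x 3 matrix
    values: List[Scalar]    # flat storage of the elements


@dataclass(frozen=True)
class ProxyTable:
    """Table of entry points, analogous to a C struct of function pointers."""
    init: Callable[[str], int]
    evaluate: Callable[[str], ExchangeArray]
    close: Callable[[], int]


def make_toy_engine() -> ProxyTable:
    """Build a table for a toy engine that only understands 'seq N'."""
    state = {"open": False}

    def init(name: str) -> int:
        state["open"] = True
        return 0  # 0 signals success, in the style of a C status code

    def evaluate(command: str) -> ExchangeArray:
        if not state["open"]:
            raise RuntimeError("engine not initialised")
        n = int(command.split()[1])
        # Convert the engine's native result into the neutral format.
        return ExchangeArray(dims=(n,), values=[float(i) for i in range(1, n + 1)])

    def close() -> int:
        state["open"] = False
        return 0

    return ProxyTable(init=init, evaluate=evaluate, close=close)
```

A caller only ever touches the table and the exchange format, so a different engine could be plugged in by supplying another table with the same three entries.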
The proxy object SC_Proxy_Object is a general interface object and its
definition is independent from R. The same interface object (definition) and
the data format could be used for other systems than R, too (e.g., GNU Octave,
this time many new versions of both StatConnector and R have been
released.
+ SC_Proxy_Object provides a stable interface for C and C++ programmers
using any Windows C compiler. For C programmers this may be
easier than going the COM way via IStatConnector but still decouples
the implementations from a specific R version.
+ Changes in R may require code to be changed. As the R interface is
implemented in rproxy.dll and rproxy.dll is part of R (and the R
distribution/setup), this can be done easily. It is very unlikely that
changes in R will require changes to the StatConnector implementation.
This helps keep maintenance costs low.
+ Installing a new R version does not require switching to a new version of
StatConnector in most cases.
+ Multiple versions of R can be installed at the same time and used by the
client applications by simply setting a registry key to point to the version
of R which is to be used.
+ Different applications use different R processes. This makes the client
applications independent of each other and does not require any
cooperation between them.
− A small overhead for data conversion and additional function calls
is imposed by this architecture. In practice, this does not have any impact on
a typical application’s performance. Minimizing data transfer and function
calls keeps this overhead low.
+ Problems in R code or a bug in a package (resulting in a crash)
do not affect the client application. The client application only ever
gets an error code from the COM implementation and can handle such faults
gracefully.
+ The same infrastructure which is used by StatConnector can be used by
alternative implementations, too. rcom (see Baier 2005) is another COM
server for R implementing IStatConnector. The implementation reuses
rproxy.dll to provide a different level of integration between the COM
client and R and also makes it possible to implement a different user
interface paradigm.
Next, we will describe how the Add-In for Microsoft Excel uses
StatConnector to integrate both spreadsheet and R functionality.
6 Excel implementation
R has many complex data types implemented in R’s object system. Excel
essentially only has vectors (columns or rows) and matrices. The more complex
data types of R are conceptually incompatible with Excel’s tabular paradigm.
Therefore, our interface between the two applications only handles arrays
(containing only one basic data type like string or real) and dataframes.
Dataframes in Excel are represented as arrays of columns. Each column has a
name (of the variable) in the top row and consists of data of equal type (string,
real, time, …). Different columns may have different types. The interface makes
it possible to transfer Excel ranges of one underlying data type to R as arrays
and to transfer rectangular areas of cells (called “ranges”) following the
“dataframe convention” to R as dataframes. Similarly, scalars, vectors, and
arrays in R can be transferred to Excel ranges. The current version of the
interface will even handle dates, times and complex numbers reasonably. These
data types are defined both in Excel and in R, but they are not implemented in
the same way. Therefore, great care must be taken when transferring them.
Incidentally, most parts of RExcel are implemented in VBA, which is an
interpreted language. To speed up data transfer for large arrays and
dataframes, some routines had to be implemented in compiled Visual Basic.
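The “dataframe convention” for a rectangular range can be made concrete with a small sketch. It is written in Python rather than VBA, is not taken from RExcel, and the function name range_to_dataframe is hypothetical; it assumes the range arrives as a list of rows whose first row holds the variable names.

```python
from typing import Any, Dict, List


def range_to_dataframe(cells: List[List[Any]]) -> Dict[str, List[Any]]:
    """Convert a rectangular range (list of rows) into named columns.

    The first row holds the variable names; every column below it must
    contain values of one single type.
    """
    if len(cells) < 2:
        raise ValueError("need a header row and at least one data row")
    header, rows = cells[0], cells[1:]
    if any(len(row) != len(header) for row in rows):
        raise ValueError("range is not rectangular")

    frame: Dict[str, List[Any]] = {}
    for index, name in enumerate(header):
        column = [row[index] for row in rows]
        kinds = {type(value) for value in column}
        if len(kinds) > 1:
            raise ValueError(f"column {name!r} mixes types")
        frame[str(name)] = column
    return frame
```

Different columns may have different types, but a column that mixes, say, numbers and strings is rejected instead of being silently coerced.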
An additional problem is the handling of missing values. Excel treats empty
cells differently under different conditions. For arithmetic functions, empty
cells are in many cases treated the same as cells containing the value 0. For
statistical projects, this is a serious issue which has been extensively
documented in McCullough and Wilson (2002). Therefore, our interface makes it
possible to specify different methods of handling empty cells, and furthermore
allows for different conventions of indicating missing data in Excel.
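Why the treatment of empty cells matters can be seen from a small worked example. The sketch below is in Python, uses None to stand for an empty cell, and the function name and the empty_as_zero switch are hypothetical; it does not reproduce the options RExcel actually offers.

```python
from typing import List, Optional


def column_mean(cells: List[Optional[float]], empty_as_zero: bool) -> float:
    """Mean of a column in which None stands for an empty cell."""
    if empty_as_zero:
        # Spreadsheet-arithmetic style: an empty cell counts as 0.
        values = [0.0 if cell is None else cell for cell in cells]
    else:
        # Statistical style: an empty cell is a missing observation.
        values = [cell for cell in cells if cell is not None]
    if not values:
        raise ValueError("no observations")
    return sum(values) / len(values)
```

For the column 2, empty, 4 the first policy gives a mean of 2, the second a mean of 3, so the choice has to be made explicitly before data are handed to a statistical routine.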
To allow Excel to connect to R, the interface installs a new menu item RExcel
in Excel’s main menu. This menu item opens a submenu containing, among
other items, commands to transfer the currently selected data to R and to
transfer an array or a dataframe from R to a range in Excel.
This menu also has an item for connecting to R and for selecting the type of R
server to be used (R(D)COM or rcom).
In addition to transferring data from Excel to R and back, a mechanism for
executing R procedures and functions from Excel is needed. If the rcom
mechanism is used, starting R brings up an R command line, so the user can
run R commands from this interface in the same way as when interacting with
a standard R GUI. The advantage is that data can be transferred from Excel
most easily, and results can be transferred directly into Excel. If the underlying
R process is the R(D)COM server, however, no command line interface is
available. Therefore, another way is needed to run R commands. RExcel makes it
possible to enter R commands as text into Excel cells. A range can then be
selected interactively, and the text in these cells will be interpreted as a
sequence of R commands and executed. This way, Excel ranges become the R
command line. Additionally, the R code is saved as part of the worksheet, and
therefore one Excel file can contain all the data and the R code needed to
perform complete statistical analyses.
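Treating a range of cells as a command line amounts to walking over the cells and handing their text to an evaluator. The following Python sketch shows the pattern under stated assumptions: the function name run_cell_range and the evaluate callback are hypothetical, and skipping empty cells is an assumption of this sketch, not documented RExcel behaviour.

```python
from typing import Callable, List, Optional


def run_cell_range(cells: List[Optional[str]],
                   evaluate: Callable[[str], None]) -> int:
    """Send the text of each non-empty cell to `evaluate`, top to bottom."""
    executed = 0
    for cell in cells:
        if cell is None or not cell.strip():
            continue  # assumption: empty cells in the selection are skipped
        evaluate(cell.strip())
        executed += 1
    return executed
```

In a real setting the callback would forward each string to the R server; here any callable that accepts a string will do, for example the append method of a list that records the commands.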
When working with R code from within Excel, debugging can become rather
tedious. Therefore, the interface offers tools to help with debugging. There is a
special debug mode where all R commands executed are displayed in a special
popup window. When an error occurs, this window will also display R’s error
messages. So Excel can even be used as a mini development environment.
The interface also has a command for getting the output of the last command
executed by R. This output can be put into a cell range in Excel as text. This
is useful for commands producing output which cannot easily be represented
as arrays or dataframes. This way, when programming in R, one can inspect the
results of performing operations in R in an informal manner.
Using this mechanism (data transfer in both directions, execution of R
commands initiated by Excel), it is possible to write Excel macros performing
statistical tasks and start them from menus in Excel. This way, complete
statistical applications can be written in Excel and the user only sees some
additional menu items performing these tasks. RExcel enhances Microsoft Excel
with statistical methods not part of Excel itself. These enhancements are
integrated seamlessly into the GUI and Excel’s user interface paradigms, like,
say, Excel’s solver for multivariate equation solving and optimization.
To be able to implement such embedded applications, the implementor has to
know R, the spreadsheet part of Excel, and VBA. The hub for such applications
is VBA. Macros written in this language take care of data transfer. R commands
to be run are constructed as strings in VBA and then executed by calling an
appropriate procedure in VBA.
Here is a typical small example demonstrating the usual pattern of using R
in Excel this way:
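Sub RegreDemo()
    ' Start the R server
    Call RInterface.StartRServer
    ' Transfer the range to R as a dataframe named mydf
    Call RInterface.PutDataframe("mydf", _
        Range("Regression!A1:C26"))
    ' Make the columns of mydf accessible by name
    Call RInterface.RRun("attach(mydf)")
    ' Fit the regression and write its coefficients to the range at F2
    Call RInterface.GetArray("lm(y~x1+x2)$coefficients", _
        Range("Regression!F2"))
    ' Stop the R server again
    Call RInterface.StopRServer
End Sub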
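The R expression passed to GetArray in the macro is an ordinary string, so it can be assembled from its parts before it is sent to R. As a language-neutral illustration of this string-building step, here is a minimal Python sketch; the helper name lm_coefficients_command is hypothetical and is not part of RExcel.

```python
from typing import Sequence


def lm_coefficients_command(response: str, predictors: Sequence[str]) -> str:
    """Assemble the R expression for the coefficients of a linear model."""
    if not predictors:
        raise ValueError("at least one predictor is required")
    formula = f"{response}~{'+'.join(predictors)}"
    return f"lm({formula})$coefficients"
```

Called with the response y and the predictors x1 and x2, the helper yields the same expression that appears literally in the macro.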
RApply("pchisq",C4,D4,E4)
7 Additional tools
So far, this article has focused on the “core components”, which are the coclass
StatConnector (including the COM interface IStatConnector) and the
Microsoft Excel Add-In RExcel. In other words, the missing link between R’s
computational engine and the mathematical (or statistical) part of Microsoft’s
Office suite has been discussed.
Looking at R itself, it is obvious that some important parts of R’s features
have been omitted so far: graphics and text output.
Achieving graphics output seems to be very simple: when calling one of R’s
graphics commands (e.g. plot), R (or more precisely, the R instance running
in the COM server) will open an R graphics window and the graphical output is
shown. Although this approach appears to be a suitable solution at first glance,
on second thought it cannot be the right one.
The graphics window is opened by the COM server and also “belongs”
to the COM server process. If the COM server is run on a remote machine,
the graphics window will be shown on the remote machine, too (see footnote 2).
The correct solution is to show R’s display window on the local machine, while
it is controlled from the remote machine (the graphics are “drawn” by the COM
server). This is achieved by providing a so-called Active X control (see Cluts
2001). Active X controls are user interface components which can be shown in
a window or form. The “programmable” interface of the control is represented
by a (custom) COM interface.
Footnote 2: This is a very simplified approach for explaining the mechanism. In reality, the COM
server tries to open the graphics window on the remote machine, but this will only succeed if launch
and run permissions are set appropriately and the login state of the remote machine allows the
window to be shown.
The implementation uses the same mechanism to communicate between R
and the Active X control as the Excel Add-In does to talk to R. The Active X
control is a COM object, and the R COM server on the remote machine holds
a reference to the control on the local machine. This “callback mechanism” is
implemented in rproxy.dll.
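The callback mechanism, in which the server holds a reference to a display object owned by the client and draws by calling its methods, can be sketched in a few lines. This Python analogy runs in a single process and involves neither COM nor a network; the class names GraphicsControl and ComputeServer and their methods are hypothetical.

```python
from typing import List, Optional, Tuple


class GraphicsControl:
    """Stand-in for the control shown in the client's window."""

    def __init__(self) -> None:
        self.segments: List[Tuple[float, float, float, float]] = []

    def draw_line(self, x0: float, y0: float, x1: float, y1: float) -> None:
        # A real control would paint; this one only records the request.
        self.segments.append((x0, y0, x1, y1))


class ComputeServer:
    """Stand-in for the server; it only keeps a reference to the control."""

    def __init__(self) -> None:
        self._device: Optional[GraphicsControl] = None

    def attach_device(self, device: GraphicsControl) -> None:
        self._device = device

    def plot(self, ys: List[float]) -> None:
        """Draw a polyline by calling back into the attached control."""
        if self._device is None:
            raise RuntimeError("no graphics device attached")
        for i in range(len(ys) - 1):
            self._device.draw_line(i, ys[i], i + 1, ys[i + 1])
```

The server decides what is drawn, while the control decides where it appears, which is the division of labour the text describes for the remote case.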
To capture R’s text output (e.g., texts appearing in the console window
produced using cat) another Active X control is provided. In addition to the GUI
representation (the Active X control StatConnectorCharacterDevice)
a non-GUI object is also provided. The coclass StringLogDevice stores all
text output in a string variable and provides a way to programmatically access
R’s text output.
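A non-GUI output device of this kind is essentially an in-memory text buffer. The sketch below shows the idea in Python; the class StringLog, its methods, and the toy run_with_log function are hypothetical and do not reflect the actual interface of StringLogDevice.

```python
from typing import Iterable, List


class StringLog:
    """Collects text output in memory instead of showing it in a window."""

    def __init__(self) -> None:
        self._chunks: List[str] = []

    def write(self, text: str) -> None:
        self._chunks.append(text)

    def text(self) -> str:
        """Return everything written so far as one string."""
        return "".join(self._chunks)

    def clear(self) -> None:
        self._chunks.clear()


def run_with_log(commands: Iterable[str], log: StringLog) -> None:
    """Toy 'engine' that reports each command to the attached log."""
    for command in commands:
        log.write(f"> {command}\n")
```

A client can read the accumulated text after each call and place it wherever it likes, for instance into a cell range.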
By using the core component StatConnector and the output components
StatConnectorGraphicsDevice and StringLogDevice any COM client
application can make full use of R both as a powerful computational
component and as a high-quality graphics engine (Fig. 3).
References
ActiveState Tool Corporation (2000) Active Perl, 5.6.0.618 edn, ActiveState Tool Corporation.
http://www.ActiveState.com/ActivePerl/
Baier T (2005) rcom: R COM Client Interface and internal COM Server. R package version 1.2.1.
Baier T, Neuwirth E (2005) R (D)COM Server V2.00. http://www.cran.r-project.org/other/DCOM
Chambers JM (1998) Programming with Data, Springer, New York. ISBN 0-387-98503-4
http://www.cm.bell-labs.com/cm/ms/departments/sia/Sbook/
Cluts N (2001) Microsoft activex controls overview, in ‘MSDN Library’, Vol. Backgrounders,
Microsoft Corporation. http://www.msdn.microsoft.com/
DuBois P (2000) MySQL. New Riders
Eaton JW (2005) Octave: interactive language for numerical computations. University of Wisconsin,
Department of Chemical Engineering. http://www.octave.org/doc/index.html
Flanagan D (2001) JavaScript: the definitive guide, 4th edn. O’Reilly Media, Inc. ISBN 0596000480
Fox J with contributions from Michael Ash, Grosjean P, Maechler M, Putler D, Wolf P (2005) Rcmdr:
R Commander. R package version 1.1-1 http://www.r-project.org,
http://www.socserv.socsci.mcmaster.ca/jfox/Misc/Rcmdr/
Free Software Foundation (1991) GNU GENERAL PUBLIC LICENSE. Version 2
Free Software Foundation (1999) GNU LESSER GENERAL PUBLIC LICENSE. Version 2.1
Hornik K (2005) The R FAQ. ISBN 3-900051-08-9. http://www.CRAN.R-project.org/doc/FAQ/
Insightful Corporation (2005) S-PLUS 7. http://www.insightful.com/products/splus/
James D, DebRoy S (2005) RMySQL
Lang DT (2005a) RDCOMClient: R-DCOM client. R package version 0.91-0.
http://www.omegahat.org/RDCOMClient, http://www.omegahat.org,
http://www.omegahat.org/bugs
Lang DT (2005b) RDCOMServer: R-DCOM object server. R package version 0.6-0.
http://www.omegahat.org/RDCOMServer, http://www.omegahat.org,
http://www.omegahat.org/bugs
Lang DT (2005c) XML: Tools for parsing and generating XML within R and S-Plus. R package
version 0.99-1. http://www.omegahat.org/RSXML
Lapsley M, Ripley BD (2005) RODBC: ODBC database access. R package version 1.1-4
Martelli A (2003) Python in a Nutshell. O’Reilly Media, Inc. ISBN 0596001886
McCullough BD, Wilson B (2002) On the accuracy of statistical procedures in Microsoft Excel 2000
and Excel XP. Comput Stat Data Anal 40:713–721
McNab E, Swart RE, Hinks P, Horn D, Jansen A, Jewell D, Wako W, Winning C (1996) The
Revolutionary Guide to Delphi 2. Peer Information Inc. ISBN 1874416672
Microsoft Corporation (2001a) Common language runtime. In: ‘MSDN Library’, vol. .NET
Framework SDK, Microsoft Corporation. http://www.msdn.microsoft.com/
Microsoft Corporation (2001b) Microsoft office 2000/visual basic programmer’s guide. In:
‘MSDN Library’, vol. Office 2000 Documentation, Microsoft Corporation.
http://www.msdn.microsoft.com/
Microsoft Corporation (2001c) Visual basic. In: ‘MSDN Library’, vol. Visual Studio 6.0
Documentation, Microsoft Corporation. http://msdn.microsoft.com/
Microsoft Corporation & Digital Equipment Corporation (1995) The component object model
specification, Technical Report 0.9, Microsoft Corporation (Draft)
Mono Project (2006) The Mono Project. http://www.mono-project.com/
Nardi BA (1993) A Small Matter of Programming. MIT Press, Boston. ISBN 0-262-14053-5
http://mitpress.mit.edu/catalog/item/default.asp?ttype=2&tid=6799
Neuwirth E, Arganbright D (2003) Mathematical Modeling with Microsoft Excel,
Thomson-Brooks/Cole. ISBN 0-534-42085-0.
http://www.brookscole.com/cgi-wadsworth/course_products_wp.pl?fid=M2b&product_isbn_issn=0534420850&discipline_number=1
Object Management Group I (2002) Common object request broker architecture: Core
specification, Technical report, Object Management Group, Inc. 3.0
OpenOffice.org (2006) OpenOffice. http://www.openoffice.org/
R-core members, DebRoy S, Bivand R, others: see COPYRIGHTS file in the sources (2005) foreign:
Read Data Stored by Minitab, S, SAS, SPSS, Stata, Systat, dBase. R package version 0.8-10