Se 23 Multilang
Multi-Language Projects
#1
One-Slide Summary
•Many modern software projects involve code written
in multiple languages. This can involve a common
bytecode or C native method interfaces.
•Native code interfaces can be understood in terms of
(1) data layout and (2) special common functions to
manipulate managed data.
•Almost all aspects of software engineering are
impacted in multi-language projects.
#2
Multi-Language Projects?
• You already know about projects using multiple languages
• Homework assignments have touched on bash, C, Java, Python, etc.
• You’ll definitely need multiple languages as a SWE
• Remember productivity: use the best tool for the job
• C is fast, Python is convenient, Java is a mistake, SQL for DBs, PHP for web…
#5
Speech Perception: Segmentation
Motivating Example
• In un mondo splendido, colorato e magico
• Little ponies vivono, in pace sempre in armonia
• Timidi e simpatici, burberi e romantici
• Sono i caratteri, degli amici che troverai
• Ed ogni giorno crescerai, quanti problemi risolverai
• Insieme agli altri pony, lo sai, ti divertirai!
#7
Motivating Example
• In a world splendid, colorful and magical
#8
Motivating Example
• In un mondo splendido, colorato e magico
• Little ponies vivono, in pace sempre in armonia
• Timidi e simpatici, burberi e romantici
  (timid, sympathetic, brusque, romantic)
• Sono i caratteri, degli amici che troverai
  (characters; amici = amicable = friends; troverai = treasure trove = found)
• Ed ogni giorno crescerai, quanti problemi risolverai
• Insieme agli altri pony, lo sai, ti divertirai!
• Vola e vai, my little pony, se nuovi amici vorrai incontrare
• Prendi il volo, ascolta il cuore, ed ogni avventura potrai affrontare!
• Vola e vai, my little pony, realizza i tuoi sogni e non ti fermare!
#9
Multi-Language Projects In Two Stages
• First, reason about the raw data layout
• Second, translate concepts you already
know
#10
Traditional Multi-Language Projects
•Application Kernel
• Statically Typed, Optimized, Compiled, interfaces with OS and
libraries.
•Scripts
• Dynamically Typed, Interpreted, Glue Components, Business Logic.
•Examples: Emacs (C / Lisp), Adobe Lightroom (C++ / Lua),
NRAO Telescope (C / Python), Google Android (C / Java),
most games (C++ / Lua), etc.
#11
Bytecode Multi-Language Projects
•Microsoft's Common Language Runtime of Managed Code
in the .NET Framework
• C++, C#, J#, F#, Visual Basic, ASP, etc.
• Common Language Infrastructure
•Java Bytecode, Java Virtual Machine, Java Runtime
Environment
• Java, Scala, JRuby, JScheme, Jython, Fortress, etc.
•Others: LLVM Bitcode, Python Bytecode, etc.
#12
Why Cover “Multi-Language”?
•Increasingly common. Developer quote:
• “My last 4 jobs have been apps that called: Java from C#, and C#
from F#; Java from Ruby; Python from Tcl, C++ from Python, and C
from Tcl; Java from Python, and Java from Scheme (And that's not
even counting SQL, JS, OQL, etc.)”
•SE process: choose the best tool for the job
• Example: concurrency might be better handled in F#/OCaml
(immutable functional) or Ruby (designed to hide such details),
while low-level OS or hardware access is much easier in C or C++,
while rapid prototyping is much easier in Python or Lua, etc.
#13
Disadvantages of Multi-Language Projects
•Integrating data and control flow across languages can be
difficult
•Debugging can be harder
• Especially as values flow and control flow from language A to language
B
•Build process becomes more complicated
• Make and Maven?
•Developer expertise is required in multiple languages
• Must understand types (etc.) in all languages
#14
How Will We Do It?
• “In practice, interoperating between F# and C# (or any
other CLR language) is relatively straightforward, once
the 'shape' of the code (what the language turns into
at the IL level) in both languages is well understood.”
• Ted Neward, Microsoft Developer Network
#15
Worked Examples
• We are going to write a fast C-and-assembly routine for low-level processing
• Assume you know C or C++ (e.g., libpng, afl, etc.)
• This is a native kernel
• Then we will call that C code from
• Python (e.g., avl.py, mutate.py, delta.py)
• Java (e.g., JFreeChart, JSoup, EvoSuite)
• OCaml/F# (e.g., Infer)
• This will involve
• Understanding Data
• Translating Familiar Concepts
#16
Native Kernel: One-Time Pad
•One of the building blocks of modern cryptography is the
one-time pad.
• When used correctly it has a number of very desirable properties.
•To encrypt plaintext P with a key K (the one-time pad) you
produce cyphertext C as follows:
• cyphertext[i] = plaintext[i] XOR keytext[i]
• A constant key mask may be also used for testing.
•Decryption also just xors with the key.
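The encrypt/decrypt symmetry above can be sketched in a few lines. This is a minimal illustration for Python 3 bytes objects; the name xor_bytes and the sample plaintext and pad are assumptions made for the example, not part of the lecture code.

```python
def xor_bytes(data, key):
    """XOR data with key; key is either bytes (a pad) or an int mask (0-255)."""
    if isinstance(key, int):
        # constant key mask, as used for testing
        return bytes(b ^ key for b in data)
    if len(key) < len(data):
        raise ValueError("pad must be at least as long as the data")
    return bytes(b ^ k for b, k in zip(data, key))

plaintext = b"attack at dawn"
# Stand-in pad for illustration only; a real pad must be random and never reused.
pad = bytes(range(1, len(plaintext) + 1))

cyphertext = xor_bytes(plaintext, pad)   # encrypt
recovered = xor_bytes(cyphertext, pad)   # decrypt: XOR with the same key again
```

Because x ^ k ^ k == x for every byte, the same routine serves for both directions.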
#17
XOR In Python
def python_string_xor(plain, key):
    cypher = bytearray(' '*len(plain))
    if type(key) is str:
        for i in range(len(plain)):
            cypher[i] = ord(plain[i]) ^ ord(key[i])
    else: # is char
        for i in range(len(plain)):
            cypher[i] = ord(plain[i]) ^ key
    return cypher
#18
Interfacing Python with C
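static PyObject * cpython_string_xor(PyObject *self, PyObject *args)
{
  const char *n_plain, *n_keytext;
  int plain_size, i, n_mask;
  PyObject *result;
  /* First try: (string, string). Assumes the key is at least as long
     as the plaintext. */
  if (PyArg_ParseTuple(args, "s#s", &n_plain, &plain_size, &n_keytext)) {
    char * n_cypher = malloc(plain_size);
    if (n_cypher == NULL) return PyErr_NoMemory();
    for (i=0;i<plain_size;i++)
      n_cypher[i] = n_plain[i] ^ n_keytext[i];
    result = Py_BuildValue("s#", n_cypher, plain_size);
    free(n_cypher); /* Py_BuildValue copied the bytes */
    return result;
  }
  PyErr_Clear(); /* discard the error left by the failed first parse */
  /* Second try: (string, int). */
  if (PyArg_ParseTuple(args, "s#i", &n_plain, &plain_size, &n_mask)) {
    char * n_cypher = malloc(plain_size);
    if (n_cypher == NULL) return PyErr_NoMemory();
    for (i=0;i<plain_size;i++)
      n_cypher[i] = n_plain[i] ^ n_mask;
    result = Py_BuildValue("s#", n_cypher, plain_size);
    free(n_cypher);
    return result;
  }
  return NULL; /* neither form matched; the exception is already set */
}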
#19
“Readability”
#20
Typedef:
#21
Interfacing Python with C
Callouts on the code from the previous slide:
• Function: Py_BuildValue("s#", n_cypher, plain_size) builds a Python string from a C string.
• Duck typing: the second PyArg_ParseTuple(args, "s#i", ...) asks, can we interpret the arguments as a string followed by an int?
#22
Interfacing Python with C, cont'd
static PyMethodDef CpythonMethods[] = {
  {"string_xor", cpython_string_xor, METH_VARARGS,
   "XOR a string with a string-or-character"},
  {NULL, NULL, 0, NULL}
};

/* This function is required (its name is based on your module name). */
PyMODINIT_FUNC initcpython(void)
{
  (void) Py_InitModule("cpython", CpythonMethods);
}
#23
Linking Our Native Python Code
•gcc -pthread -fno-strict-aliasing
-DNDEBUG -g -fwrapv -O2 -Wall
-Wstrict-prototypes -fPIC
-I/usr/include/python2.7 -c cpython.c
-o build/temp.linux-x86_64-2.7/cpython.o
•gcc -pthread -shared -Wl,-O1
-Wl,-Bsymbolic-functions
-Wl,-Bsymbolic-functions -Wl,-z,relro
build/temp.linux-x86_64-2.7/cpython.o
-o build/lib.linux-x86_64-2.7/cpython.so
#24
Linking Our Native Python Code
•gcc -pthread -fno-strict-aliasing
-DNDEBUG -g -fwrapv -O2 -Wall
-Wstrict-prototypes -fPIC
-I/usr/include/python2.7 -c cpython.c
-o build/temp.linux-x86_64-2.7/cpython.o
• -fPIC: Position Independent Code (see EECS 483)
•gcc -pthread -shared -Wl,-O1
-Wl,-Bsymbolic-functions
-Wl,-Bsymbolic-functions -Wl,-z,relro
build/temp.linux-x86_64-2.7/cpython.o
-o build/lib.linux-x86_64-2.7/cpython.so
• -shared: Build Shared Library Code (see EECS 483)
• .so = .dll = shared library
#25
Interfacing C with Python
import cpython # loads cpython.so
...
if do_native:
    result = cpython.string_xor(plaintext, \
                                char_or_string_key)
else:
    result = python_string_xor(plaintext, \
                               char_or_string_key)
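One possible way to set do_native is to attempt the import and fall back to the pure-Python routine when the shared library is not available. The sketch below is written for Python 3 and assumes a hypothetical build of the cpython extension module that accepts bytes; the fallback function is a bytes-based stand-in for the earlier python_string_xor.

```python
def python_string_xor(plain, key):
    """Pure-Python fallback (Python 3 bytes version of the earlier routine)."""
    if isinstance(key, int):
        return bytes(b ^ key for b in plain)
    return bytes(b ^ k for b, k in zip(plain, key))

try:
    import cpython                     # hypothetical: the native module built above
    string_xor = cpython.string_xor
    do_native = True
except ImportError:
    # Shared library not found on the import path: use the slow path.
    string_xor = python_string_xor
    do_native = False

result = string_xor(b"hi", 0x7f)
```

Callers then use string_xor without caring which implementation was selected.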
#26
Programming Paradigms
•This “pass a string or an integer as the second argument” plan ...
• Works well for Dynamic (e.g., Python duck typing)
• Works well for Functional (algebraic datatypes)
• See EECS 490
• Is not a natural fit for Object-Oriented
• More natural: dynamic dispatch on “string-or-int”
•abstract class StringOrInt
• class StringOrInt_IsInt extends StringOrInt
• class StringOrInt_IsString extends StringOrInt
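To contrast the two styles in one language, here is a rough Python 3 sketch of the object-oriented plan: the "string or int" test disappears and the choice is made by dynamic dispatch. The class and method names mirror the outline above but are otherwise illustrative assumptions.

```python
from abc import ABC, abstractmethod

class StringOrInt(ABC):
    @abstractmethod
    def string_xor(self, plain: bytes) -> bytes:
        """XOR plain with whatever key this object holds."""

class StringOrIntIsInt(StringOrInt):
    def __init__(self, mask: int):
        self.my_int = mask              # constant mask, 0-255

    def string_xor(self, plain):
        return bytes(b ^ self.my_int for b in plain)

class StringOrIntIsString(StringOrInt):
    def __init__(self, key: bytes):
        self.my_string = key            # one-time pad

    def string_xor(self, plain):
        return bytes(b ^ k for b, k in zip(plain, self.my_string))

# The call site never inspects the key's type; the object decides.
keys = [StringOrIntIsInt(0x7f), StringOrIntIsString(b"\x01\x02")]
outputs = [k.string_xor(b"hi") for k in keys]
```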
#27
Java Code (1/2)
abstract class StringOrInt {
  abstract public byte[] java_string_xor (byte[] str1);
}
class StringOrInt_IsInt extends StringOrInt {
  public int my_int;
  public StringOrInt_IsInt (int i) { my_int = i; }
  public byte[] java_string_xor (byte[] plain) {
    byte [] cypher = new byte[plain.length];
    for (int i = 0; i < plain.length; i++)
      cypher[i] = (byte) ((int)plain[i] ^ my_int);
    return cypher;
  }
}
#28
Java Code (1/2)
Callouts on the code from the previous slide:
• byte[] instead of String: Java's String is so tied up in encodings that it's not raw-content-preserving.
• The (byte) cast: cutely, Java warns about a lack of precision here (int/byte) unless you cast.
#29
Java Code (2/2)
abstract class StringOrInt {
  abstract public byte[] java_string_xor (byte[] str1);
}
class StringOrInt_IsString extends StringOrInt {
  public byte[] my_string;
  public StringOrInt_IsString (byte[] s) { my_string = s; }
  public byte[] java_string_xor (byte[] plain) {
    byte [] cypher = new byte[plain.length];
    for (int i = 0; i < plain.length; i++)
      cypher[i] = (byte) (plain[i] ^ my_string[i]);
    return cypher;
  }
}
#30
Tell Java about the Native Method
static {
  /* load native library */
  System.loadLibrary("cjava");
}
#31
C Code using JNI (1/2)
JNIEXPORT jbyteArray JNICALL Java_StringXOR_c_1string_1xor
  (JNIEnv * env, jclass self, jbyteArray jplain, jobject jkey)
{
  jbyte * n_plain = (*env)->GetByteArrayElements
    (env, jplain, NULL);
  size_t plainsize = (*env)->GetArrayLength(env, jplain);
  jclass key_cls = (*env)->GetObjectClass(env, jkey);
  jfieldID fid ;
  int i;
  jbyteArray jcypher = (*env)->NewByteArray(env,plainsize);
  jbyte * n_cypher = (*env)->GetByteArrayElements(env,
    jcypher, NULL);
#36
C Code using JNI (2/2)
  } else {
    /* Field lookup again. "[B" == "byte[]" */
    fid = (*env)->GetFieldID(env, key_cls, "my_string", "[B");
    if (fid != NULL) {
      /* key has "byte[] my_string;" field */
      jbyteArray jkeyt = (*env)->GetObjectField(env, jkey, fid);
      /* The last argument (NULL here) can indicate whether
         elements were copied or shared. */
      jbyte * n_keytext = (*env)->GetByteArrayElements
        (env, jkeyt, NULL);
      for (i=0;i<plainsize;i++)
        n_cypher[i] = n_plain[i] ^ n_keytext[i];
      (*env)->ReleaseByteArrayElements(env,jkeyt,n_keytext,0);
    }
  }
  /* Releasing the arrays: playing nice with the garbage collector. */
  (*env)->ReleaseByteArrayElements(env, jplain, n_plain, 0);
  (*env)->ReleaseByteArrayElements(env, jcypher, n_cypher, 0);
  return jcypher;
}
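The odd-looking C function name Java_StringXOR_c_1string_1xor follows the JNI symbol-naming rule, in which an underscore inside a Java name is escaped as "_1". The sketch below reproduces that rule in Python; the helper names are made up for this example, and the Java-side names StringXOR and c_string_xor are inferred from the C symbol rather than shown in the slides. Overloaded methods (which append an argument signature) and characters outside the Basic Multilingual Plane are not handled.

```python
def jni_mangle(part):
    """Escape one component (class or method name) of a JNI symbol."""
    out = []
    for ch in part:
        if ch == "_":
            out.append("_1")                 # literal underscore
        elif ch == ";":
            out.append("_2")
        elif ch == "[":
            out.append("_3")
        elif ch in "./":
            out.append("_")                  # package separator
        elif ch.isascii() and ch.isalnum():
            out.append(ch)
        else:
            out.append("_0%04x" % ord(ch))   # other Unicode characters
    return "".join(out)

def jni_symbol(class_name, method_name):
    """Build the C symbol the JVM looks up for a native method."""
    return "Java_" + jni_mangle(class_name) + "_" + jni_mangle(method_name)

symbol = jni_symbol("StringXOR", "c_string_xor")
```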
#37
Compiling, Linking and Running JNI
gcc -I $(JAVA)/include \
-o libcjava.so -shared -fPIC cjava.c
javac StringXOR.java
java -Djava.library.path=. StringXOR
•That's it!
•“javah” (or “javac -h” in newer JDKs) also exists to automatically
generate header files for C JNI implementations.
#38
Zoology
•These ray-finned fish hatch in fresh water, migrate to
the ocean, and then return to fresh water to
reproduce. Tracking studies have shown that they
often return to the same spot they hatched from to
spawn. Commercial production of them is currently
over three million tonnes. They are often a keystone
species, supporting bears, birds and otters.
Psychology: Memory?
•54 students and 108 community members were posed
questions like:
#43
Basic Ocaml Implementation
type char_or_string =
  | MyChar of char (* constant bit pattern *)
  | MyString of string (* one-time pad *)
#45
Native C Implementation
• Basic idea:
• accept “string” and “char_or_string” as args
• extract contents of “string” (plaintext)
• examine “char_or_string”
• If “char” (mask), extract character code value
• If “string” (keytext), extract contents of string
• create a new string (return value, cyphertext)
• for loop (over length of string)
• cyphertext = plaintext xor key
• return cyphertext
#46
The Problem
•int x = 127;
x: 00 00 00 7f
•char * p = “hi”;
p: 8d 50 00 62
*p: 68 69 00
•let cos = MyChar('\127') in ???
cos: ???
#47
The Problem
•let cos = MyChar('\127') in
cos
ff 00 00 00 00 00 00 00 fc 08 00 00 00 00 00 00 ..
#48
The Problem
•let cos = MyChar('\127') in
cos
ff 00 00 00 00 00 00 00 fc 08 00 00 00 00 00 00 ..
cos2
60 8d 62 00 00 00 00 00 fc 04 00 00 00 00 00 00 ..
#49
The Problem
•let cos = MyChar('\127') in
cos
ff 00 00 00 00 00 00 00 fc 08 00 00 00 00 00 00 ..
0x628d60
68 69 00 00 ..
#50
Run-Time Type Tags
•let cos = MyChar('\127') in
cos
00 04 00 00 00 00 00 00 ff 00 00 00 00 00 00 00 fc 08 00 00 00 00 00 00 ..
0x628d60
68 69 00 00 ..
#51
Run-Time Type Tags
•let cos = MyChar('\127') in
cos
00 04 00 00 00 00 00 00 ff 00 00 00 00 00 00 00 fc 08 00 00 00 00 00 00 ..
• First byte 00: Type Tag 0
• Rest of the first word: “Color” (2 bits, garbage collection) and Size (54 bits)
• ff: C(127) == Ocaml(255)
•let cos2 = MyString(“hi”) in
cos2
01 04 00 00 00 00 00 00 60 8d 62 00 00 00 00 00 fc 04 00 00 00 00 00 00 ..
• First byte 01: Type Tag 1
• 60 8d 62 00 00 00 00 00: Pointer To String (little endian)
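The byte dumps above can be decoded mechanically. The following Python sketch assumes the 64-bit layout named on the slide (8-bit tag, 2-bit color, 54-bit size in one little-endian header word) and the usual tagged-integer encoding in which an integer n is stored as 2n+1; the function names are illustrative.

```python
def decode_header(raw8):
    """Split a 64-bit OCaml block header into (size_in_words, color, tag)."""
    word = int.from_bytes(raw8, "little")
    tag = word & 0xFF             # low 8 bits: type tag
    color = (word >> 8) & 0x3     # next 2 bits: GC color
    size = word >> 10             # remaining 54 bits: size in words
    return size, color, tag

def ocaml_int(n):
    """Tagged immediate form of an integer or char code: shift left, set low bit."""
    return (n << 1) | 1

def c_int(v):
    """Undo the tagging (the job of the Int_val macro on the C side)."""
    return v >> 1

cos_header = bytes.fromhex("00 04 00 00 00 00 00 00")
cos2_header = bytes.fromhex("01 04 00 00 00 00 00 00")
string_ptr = int.from_bytes(bytes.fromhex("60 8d 62 00 00 00 00 00"), "little")
```

Here decode_header(cos_header) gives size 1, color 0, tag 0, decode_header(cos2_header) gives size 1, color 0, tag 1, and ocaml_int(127) is 255, which is the ff byte in the dump.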
#57
Cross-Cutting Implications for
Software Engineering
•Hiring and Expertise
• You need developers experienced with “both” languages
• Per-language experience may not be equal
•Code Inspection and Review
• Recall Google's per-language “badge” policy
• Need badges in all relevant languages
• How would you evaluate a pull request if you do not know all
of the languages?
#58
Cross-Cutting Implications for
Software Engineering
•Design
• Because cross-language coding is so difficult and error-prone, you
must design those interfaces very carefully in advance
• cf. native method interface ← key word
• Think carefully about relevant metrics (e.g., coupling, cohesion, etc.)
• Design patterns can help, but you typically want to encapsulate any
cross-language code inside one component
• e.g., don't have some native code in the Model and some in the View and have
them share data through a backdoor
#59
Cross-Cutting Implications for
Software Engineering
•Readability
• “Glue” code is typically incomprehensible without training
• Recall: look for familiar motifs
• All of our examples have parts that “do the same thing” (e.g., convert value
from X to C)
• But comprehension may also require knowing about both
languages
• Python and Java field queries
• Ocaml integer conversions
#60
Cross-Cutting Implications for
Software Engineering
•Test Input Generation
• Most tools do not support test input generation across
multiple language layers (it is an open research problem)
• AFL is popular because it works on binaries (and thus any
compiled language)
• Microsoft's PEX works for any .NET / common language
runtime program
• But do not assume tools will work for multi-language projects:
plan in advance to mitigate risk!
#61
Cross-Cutting Implications for
Software Engineering
•Test Coverage
• Outside of giant ecosystems (e.g., Java Bytecode, Microsoft
Common Language Runtime), coverage tools do not span
languages
• Pick one or run them separately
•Mutation Analysis
• Similarly, mutation tools are typically language specific
• Exam-style thought question: should you mutate the glue
code when doing mutation testing?
#62
Cross-Cutting Implications for
Software Engineering
•Debugging
• Outside of some bytecode/CLR instances, debuggers almost
never help with multi-language projects
• You “can” run GDB on an Ocaml-produced (etc.) executable,
but it won't see any of your function or variable names
• Basically just a raw assembly view
• cf. C++ name mangling
#63
Cross-Cutting Implications for
Software Engineering
•Debugging
• Typically you pick one language's debugger
• Augment that with print-statement debugging at interface
boundaries
• Debugging multi-language code is merely “annoying” if the
bug is isolated to code in just one language
• It is “very, very difficult” if the bug actually involves crossing
the boundary
#64
Cross-Cutting Implications for
Software Engineering
•Static Analysis and Refactoring
• Unless the tool happens to support all relevant languages it
will only report defects in some of the code
• And it will make conservative assumptions about what happens at the
cross-language interface
• Result: more false positives and/or false negatives
#65
Cross-Cutting Implications for
Software Engineering
•Dynamic Analyses and Profiling
• Similar story: unless the tool happens to support multiple
languages (and most do not), you will have to pick one
language and just use that language's tool
• Example: you can run gprof on a non-C-produced binary, but it
probably will not be able to give recognizable function names
or useful call graphs
• Thought question: would CHESS or Eraser work on multi-
language projects?
#66
Cross-Cutting Implications for
Software Engineering
•Process, Planning and Metrics
• Will developers be as precise at effort estimation for coding in multi-
language projects?
• How will you make high-level QA decisions (e.g., “is it good enough
to ship?”) if coverage metrics only apply to part of the code?
• What additional risks do you take on by choosing to carry out a multi-
language project?
• How would you mitigate those risks?
• Do the benefits outweigh the costs?
#67
Cross-Cutting Implications for
Software Engineering
•Requirements and Quality Properties
• The dominant reason to use multiple languages is to gain the
ease and safety of a high-level language for most of your
program and the speed of a low-level one for critical kernels
• This is a quality (non-functional) requirement
• Another common reason is to make use of an already-written
library (COTS)
• This is usually a functional requirement
• Elicitation: how critical are those to stakeholders?
#68
Actual Numbers (Quality)
• (20 trials, best wall-clock ms time reported)
• Ocaml – Ocaml 143
• Ocaml – Native 103
• Python – Python 598
• Python – Native 29
• Java – Java 165
• Java – Native 183
•C 22
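For readers who want to collect comparable numbers, a "best of 20 trials, wall-clock milliseconds" measurement might be sketched as follows in Python 3. The helper name best_of and the 100,000-byte workload are assumptions for illustration; the lecture does not state the input size it used.

```python
import time

def best_of(fn, trials=20):
    """Run fn several times and return the best wall-clock time in ms."""
    best = float("inf")
    for _ in range(trials):
        start = time.perf_counter()
        fn()
        elapsed_ms = (time.perf_counter() - start) * 1000.0
        best = min(best, elapsed_ms)
    return best

data = bytes(100_000)    # illustrative workload: 100,000 zero bytes

def pure_python_xor():
    return bytes(b ^ 0x7f for b in data)

best_ms = best_of(pure_python_xor)
```

Reporting the minimum rather than the mean reduces the influence of unrelated system activity on the result.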
#69
Actual Numbers (You Explain)
• (20 trials, best wall-clock ms time reported)
• Ocaml – Ocaml 143
• Ocaml – Native 103 (What?)
• Python – Python 598
• Python – Native 29
• Java – Java 165
• Java – Native 183 (What?)
•C 22
#70
Homework
•Exam 2 Released Saturday Night
• “Cumulative” (but mostly second half)
• Open internet/notes/books
(but NO collaboration with anyone)
#71
Bonus: Ocaml Native Interface
Debugging Example
• You try to write this C/OCaml code, but …
• Input:
• 4b50 0403 0014 0000 0008 59b7 42cd 0ed7
• Expected Output, XOR with '\127':
• 342f 7b7c 7f6b 7f7f 7f77 26c8 3db2 71a8
• Actual Output, Deterministic:
• b4af fbfc ffeb ffff fff7 a648 bd32 f128
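One way to start investigating is to compare the dumps byte by byte. The Python 3 sketch below is only a diagnostic aid, not the lecture's solution: it computes which bits differ between expected and actual output and which mask was effectively applied to the input.

```python
source = bytes.fromhex("4b50 0403 0014 0000 0008 59b7 42cd 0ed7")
expected = bytes.fromhex("342f 7b7c 7f6b 7f7f 7f77 26c8 3db2 71a8")
actual = bytes.fromhex("b4af fbfc ffeb ffff fff7 a648 bd32 f128")

# Which bits are wrong in each output byte?
wrong_bits = {e ^ a for e, a in zip(expected, actual)}

# Which mask did the native code effectively XOR with?
effective_mask = {s ^ a for s, a in zip(source, actual)}

# Candidate explanation: the mask arrived in its tagged form, 2*127 + 1.
tagged_127 = (127 << 1) | 1
```

Every byte differs in exactly the high bit, and the effective mask is 0xff rather than 0x7f. That matches the C(127) == Ocaml(255) observation from the run-time type tags slide, which suggests the character was used in C without first converting it from its tagged representation.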