Data Science
“[A data scientist is someone] who combines the skills of software programmer, statistician and
storyteller/artist to extract the nuggets of gold hidden under mountains of data”
And by the end of these courses, hopefully you will feel equipped to do just that!
There is a little anecdote that describes the truly exponential growth of data generation we are
experiencing. In the third century BC, the Library of Alexandria was believed to house the sum of
human knowledge. Today, there is enough information in the world to give every person alive 320
times as much of it as historians think was stored in Alexandria’s entire collection.
And this brings us to the second quality of big data: velocity. Data is being generated and collected
faster than ever before. In our YouTube example, new data is coming at you every minute! In a
completely different example, say you have a question about shipping times or routes. Most
transport trucks have real-time GPS data available, so you could analyze the trucks'
movements in real time… if you have the tools and skills to do so!
The third quality of big data is variety. In the examples I’ve mentioned so far, you have different
types of data available to you. In the YouTube example, you could be analyzing video or audio,
which is a very unstructured data set, or you could have a database of video lengths, views or
comments, which is a much more structured dataset to analyze.
The most basic of definitions would be that a data scientist is somebody who uses data to answer
questions. But more importantly to you, what skills does a data scientist embody?
Figure 2 - Drew Conway’s Venn diagram of data science
And to answer this, we have this illustrative Venn diagram, in which data science is the intersection
of three sectors - Substantive expertise, hacking skills, and math and statistics.
To explain what we mean by this: we use data science to answer questions, so first we need
enough expertise in the area we want to ask about to formulate our questions and to know what
sorts of data are appropriate to answer them. Once we have our question and appropriate data,
the data often needs significant cleaning and formatting, and this often takes computer
programming, or “hacking,” skills. Finally, once we have our data, we need to analyze it, and this
often takes math and stats knowledge.
In this specialization, we’ll spend a bit of time focusing on each of these three sectors, but will
primarily focus on math and statistics knowledge and hacking skills. For hacking skills, we’ll focus
on teaching two different components: computer programming or at least computer programming
with R, which will allow you to access data, play around with it, analyze it, and plot it. Additionally,
we’ll focus on having you learn how to go out and get answers to your programming questions.
One reason data scientists are in such demand is that most of the answers aren’t already outlined in
textbooks - a data scientist needs to be somebody who knows how to find answers to novel
problems.
Data scientist roles have grown over 650 percent since 2012, but currently 35,000 people in the US
have data science skills, while hundreds of companies are hiring for those roles, including
companies you may not expect in sectors like retail and finance. The supply of candidates for these
roles cannot keep up with demand.
This is a great time to be getting into data science. Not only do we have more and more data, and
more and more tools for collecting, storing, and analyzing it, but the need for data scientists is
increasingly recognized in many diverse sectors, not just business and academia.
Additionally, Glassdoor, in its ranking of the 50 best jobs in America, named Data Scientist THE
top job in the US in 2017, based on job satisfaction, salary, and demand.
One place we might not immediately recognize the demand for data science is in sports – Daryl
Morey is the general manager of a US basketball team, the Houston Rockets. Despite not having a
strong background in basketball, Morey was awarded the job as GM on the basis of his bachelor’s
degree in computer science and his M.B.A. from M.I.T. He was chosen for his ability to collect and
analyze data, and use that to make informed hiring decisions.
Another data scientist that you may have heard of is Hilary Mason. She is a co-founder of
Fast Forward Labs, a machine learning company recently acquired by Cloudera, a data science
company, and is the Data Scientist in Residence at Accel. Broadly, she uses data to answer questions
about mining the web and understanding the way that humans interact with each other through social
media.
And finally, Nate Silver is one of the most famous data scientists or statisticians in the world today.
He is founder and editor in chief at FiveThirtyEight - a website that uses statistical analysis -
hard numbers - to tell compelling stories about elections, politics, sports, science, economics
and lifestyle.
He uses large amounts of totally free public data to make predictions about a variety of topics; most
notably he makes predictions about who will win elections in the United States, and has a
remarkable track record for accuracy doing so.
What is data?
Since we’ve spent some time discussing what data science is, we should spend some time looking at
what exactly data is.
Definitions of “data”
First, let’s look at what a few trusted sources consider data to be.
First up, we’ll look at the Cambridge English Dictionary, which states that data is:
Information, especially facts or numbers, collected to be examined and considered and used to help
decision-making.
These are slightly different definitions and they get at different components of what data is. Both
agree that data is values or numbers or facts, but the Cambridge definition focuses on the actions that
surround data - data is collected, examined and most importantly, used to inform decisions. We’ve
focused on this aspect before - we’ve talked about how the most important part of data science is the
question and how all we are doing is using data to answer the question. The Cambridge definition
focuses on this.
The Wikipedia definition focuses more on what data entails. And although it is a fairly short
definition, we’ll take a second to parse this and focus on each component individually.
So, the first thing to focus on is “a set of values” - to have data, you need a set of items to
measure from. In statistics, this set of items is often called the population. The set as a whole is what
you are trying to discover something about. For example, that set of items required to answer your
question might be all websites or it might be the set of all people coming to websites, or it might be a
set of all people getting a particular drug. But in general, it’s a set of things that you’re going to
make measurements on.
So, taking this whole definition into consideration we have measurements (either qualitative or
quantitative) on a set of items making up data - not a bad definition.
Figure 4 - An example of a structured dataset - a spreadsheet of individuals (first initial, last name) and their
country of origin, sex, height, and weight
Unfortunately, this is rarely how data is presented to you. The data sets we commonly encounter are
much messier, and it is our job to extract the information we want, corral it into something tidy like
the imagined table above, analyze it appropriately, and often, visualize our results.
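To make the idea of a tidy table concrete, here is a minimal sketch in R of a dataset with the same columns as the table in Figure 4. Every name and value is made up for illustration, and the column names (height_cm, weight_kg, and so on) are our own assumptions, not taken from the figure.

```r
# Hypothetical records: every name and value below is invented for illustration.
people <- data.frame(
  first_initial = c("A", "B", "C"),
  last_name     = c("Example", "Sample", "Placeholder"),
  country       = c("Canada", "Kenya", "Japan"),
  sex           = factor(c("F", "M", "F")),
  height_cm     = c(165.0, 180.5, 158.2),
  weight_kg     = c(60.1, 82.3, 51.7),
  stringsAsFactors = FALSE
)

# Each row is one item from the set; each column is one variable measured on it.
# Quantitative variables are stored as numbers, qualitative ones as labels.
is_quantitative <- vapply(people, is.numeric, logical(1))
names(people)[is_quantitative]   # the quantitative columns
names(people)[!is_quantitative]  # the qualitative columns

# Quantitative variables support arithmetic summaries.
mean(people$height_cm)
```

Real data rarely arrives in this shape, but this is the kind of table the wrangling steps aim to produce.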
Sequencing data
Population census data
Electronic medical records (EMR), other large databases
Geographic information system (GIS) data (mapping)
Image analysis and image extrapolation
Language and translations
Website traffic
Personal/Ad data (e.g., Facebook, Netflix predictions, etc.)
Figure 5 - A volcano plot is produced at the end of a long process to wrangle the raw FASTQ data into
interpretable expression data
Figure 6 - The US population is stratified by sex and age to produce a population pyramid plot
Here is the US census website and some tools to help you examine it, but if you aren’t from the
US, I urge you to check out your home country’s census bureau (if available) and look at some of the
data there!
There is another fun Google initiative involving image analysis, where you help provide data to
Google’s machine learning algorithm… by doodling!
Admittedly, often the data available will limit, or perhaps even enable, certain questions you are
trying to ask. In these cases, you may have to reframe your question or answer a related question, but
the data itself does not drive the question asking.
Summary
In this lesson we focused on data - both in defining it and in exploring what data may look like and
how it can be used.
First, we looked at two definitions of data, one that focuses on the actions surrounding data, and
another on what comprises data. The second definition embeds the concepts of populations and
variables, and distinguishes between quantitative and qualitative data.
Second, we examined different sources of data that you may encounter, and emphasized the lack of
tidy datasets. Examples of messy datasets, where raw data needs to be wrangled into an interpretable
form, can include sequencing data, census data, electronic medical records, etc. And finally, we
returned to our beliefs on the relationship between data and your question and emphasized the
importance of question-first strategies. You could have all the data you could ever hope for, but if
you don’t have a question to start, the data is useless.
Getting help
One of the main skills you are going to be called upon for as a data scientist is your ability to solve
problems. And sometimes to do that, you need help. The ability to solve problems is at the root of
data science; so the importance of being able to do so is paramount. In this lesson, we are going to
equip you with some strategies to help you when you get stuck with a problem and need some help!
Much of this information has been compiled from Roger Peng’s video on “Getting Help” and Eric
Raymond’s “How to ask questions the smart way” - so definitely check out those resources!
Introduction
In the world of hackers, the kind of answers you get to your technical questions depends as much
on the way you ask the questions as on the difficulty of developing the answer. This guide will
teach you how to ask questions in a way that is likely to get you a satisfactory answer.
The first thing to understand is that hackers actually like hard problems and good,
thought-provoking questions about them. If we didn't, we wouldn't be here. If you give us an
interesting question we will be grateful to you; good questions are a stimulus and a gift. Good
questions help us develop our understanding, and often reveal problems we might not have noticed
or thought about otherwise. Among hackers, "Good question!" is a sincere compliment.
Despite this, hackers have a reputation for meeting simple questions with hostility or
arrogance. It sometimes looks as if we are hostile to newcomers and the ignorant. But this isn't
really true.
What we are, unapologetically, is hostile to people who seem unwilling to think or to do their
own homework before asking questions. People like that are time sinks: they take without giving
back, and they waste time we could have spent on another, more interesting question and on
another person more worthy of an answer. We call people like this "losers" (and for historical
reasons we sometimes spell it "lusers").
We are, largely, volunteers. We take time out of busy lives to answer questions, and at times we
are overwhelmed with them. So we filter ruthlessly. In particular, we throw away questions from
people who appear to be losers in order to spend our question-answering time more efficiently, on
winners.
You don't want to be one of the losers. You don't want to seem like one, either. The best way to
get a fast and useful answer is to ask like a winner: like a person with intelligence,
self-confidence, and signs of needing help with one particular problem.
(Improvements to this guide are welcome. You can send your suggestions (in English)
to [email protected].)
Translator's note: "luser" is a blend of the terms "user" and "loser".
Before you ask
Before asking a technical question by email, in a newsgroup, or on a website forum, do the
following:
When you ask your question, point out that you have already done all of this; it will help
establish that you are not a lazy sponge who is just wasting other people's time. Better yet,
point out what you have learned from doing these things. We like answering questions for people
who have shown that they can learn from the answers.
Prepare your question. Think it through. Hasty questions get hasty answers, or none at all. The
more you do to show that you have put thought and effort into solving your problem before asking
for help, the more likely you are to actually get it.
Beware of asking the wrong question. If you ask one that is based on faulty assumptions,
J. Random Hacker is quite likely to reply with something literal and useless while thinking
"Stupid question...", and hoping that the experience of getting an answer to exactly what you
asked rather than to what you needed to know will teach you a lesson.
Never assume you are entitled to an answer. You are not. You will earn an answer, if you earn it,
by asking a question that is substantial, interesting, and thought-provoking: one that implicitly
contributes to the experience of the community rather than passively demanding knowledge from
others.
On the other hand, a very good start is to make it clear that you are able and willing to take
part in the process of developing the solution. "Does anyone have a pointer?", "What is my
example missing?" and "Is there a page I should have checked?" are more likely to get answered
than "Please post the exact procedure I should follow", because you are making it clear that you
are truly willing to complete the process if someone simply points you in the right direction.
When you ask
Choose your forum carefully
Be careful in choosing where you ask your question. You are likely to be ignored, or written off
as a loser, if you:
post a very elementary question to a forum where advanced technical questions are expected,
or vice versa
For this reason, it is important to express your question clearly. If you can't be bothered to
do that, we can't be bothered to pay attention. Spend the extra effort to polish your language.
It doesn't have to be stiff or formal; in fact, hacker culture values informal speech, slang, and
humorous language used with precision. But it has to be precise; there has to be some indication
that you are thinking and paying attention.
Spell correctly. Don't confuse "its" with "it's" or "loose" with "lose". Don't TYPE IN ALL CAPS;
this reads as shouting and is considered rude. If you write like a semi-literate boob you will
very likely be ignored. Writing like a l33t script kiddie hax0r is the absolute kiss of death and
guarantees you will receive nothing but stony silence (or, if you are lucky, a heap of scorn and
sarcasm).
If you are asking in a forum that does not use your native language, you will get a limited
amount of slack for spelling and grammar errors, but no extra slack at all for sloppy reasoning
(and yes, we can usually tell the difference). Also, unless you know what languages your
respondents speak, write in English. Busy hackers tend to discard questions in languages they
don't understand, and English is the working language of the Internet. By writing in English you
minimize the chances that your question will be discarded unread.
Don't send email in which entire paragraphs are single lines wrapped multiple times.
(This makes it difficult to reply to just part of the message.)
Don't send messages encoded as MIME Quoted-Printable either; all those =20 glyphs scattered
through the text are ugly and distracting.
Never, ever expect hackers to be able to read proprietary document formats like Microsoft
Word. Most hackers react to these about as well as you would to a pile of steaming manure
dumped on your doorstep.
If you are sending mail from a Windows machine, turn off Outlook's stupid "Smart Quotes"
feature. This keeps garbage characters from being sprinkled through your message.
Stupid:
Smart:
Mouse cursor is deformed with XFree86 4.1, Fooware MV1005 video chipset
Describe the research you did to narrow down a possible answer to the problem before you
asked the question.
Describe the diagnostic steps you took to try to solve the problem yourself before you asked
the question.
Do the best you can to anticipate the questions a hacker will ask, and answer them in advance in
your request for help.
Simon Tatham has written an excellent essay entitled How to Report Bugs Effectively. I strongly
recommend that you read it.
Stupid:
I am getting SIG11 errors during kernel compiles, and I suspect a hairline crack on one of the
motherboard traces. What is the best way to check for that?
Smart:
My home-built K6/233 on an FIC-PA2007 motherboard (VIA Apollo VP2 chipset) with 256MB
Corsair PC133 SDRAM starts getting frequent SIG11 errors about 20 minutes after power-on
during the course of kernel compiles, but never in the first 20 minutes. Rebooting doesn't
restart the clock, but powering down overnight does. Swapping out all the RAM didn't help. The
relevant part of the log from a typical compile session follows.
If the program in question has diagnostic options (such as -v for verbose), think carefully
about choosing options that will add useful debugging information to the transcript.
If your message ends up being very long (more than four paragraphs), it can be useful to state
the problem succinctly up top and then follow with the chronological account. That way, hackers
will know what to look for when reading your message.
When you ask for a private reply, you are disrupting both the process and the reward. Don't do
this. It is the respondent's choice whether to reply privately, and if they do, it is usually
because they think the question is too obvious or too badly posed to be interesting to others.
There is one limited exception to this rule. If you think your kind of question is likely to
draw a large number of very similar answers, then the magic words are "email me the answers and
I'll summarize them for the group". It is courteous to save the mailing list or newsgroup a
flood of substantially identical replies, but you do have to keep the promise to summarize.
To be honest, this isn't as important as (and cannot substitute for) being grammatical, clear,
precise and descriptive, avoiding proprietary formats, etc.; hackers in general would rather get
somewhat brusque but technically sharp bug reports than polite vagueness. (If this puzzles you,
remember that we value a question by what it teaches us.)
Still, once your technical ducks are in a row, politeness does increase your chances of getting
a useful answer.
The note doesn't have to be long or involved; a simple "Pepe: it turned out the faulty part was
the cable. Thanks, everyone. - Jose Luis" is better than nothing. In fact, a short and sweet
summary is better than a long dissertation unless the solution has real technical depth.
Besides being courteous and informative, this sort of follow-up helps everyone who assisted you
feel a satisfying sense of closure about the problem. If you are not a hacker yourself, trust us
that this feeling is very important to the gurus and experts you asked for help. Problems that
trail off unresolved are frustrating; hackers itch to see them resolved. The good karma that
scratching that itch earns you will be very helpful the next time you need to pose a
question.
RTFM has a younger relative. If you get a reply that reads "STFW", the person who sent it thinks
you should have Searched The F*cking Web. They are almost certainly right. Go search it.
Often, the person sending one of these replies is looking at the manual or the web page in
question while typing. These replies mean that the responder thinks (a) the information you need
is easy to find, and (b) you will learn more if you seek out the information yourself than if it
is spoon-fed to you.
You shouldn't be offended by this; by hacker standards, you are being shown a certain respect
(a rough one, admittedly) simply by not being ignored. You should be thankful for this extreme
kindness.
If you don't understand...
If you don't understand the answer, do not immediately bounce back a request for clarification.
Use the same tools that you used to try to answer your original question (manuals, FAQs, the
Web, more skilled friends) to understand the answer. If you do need to ask for clarification,
try to show what you have learned.
For example, suppose I tell you: "It sounds like you've got a stuck zentry; you'll need to
clear it." Then:
Here is a good follow-up question: "OK, I read the man page and zentries are only mentioned
under the -z and -p switches. Neither of them says anything about clearing zentries. Is it one
of these, or am I missing something?"
When this happens, the worst thing you can do is whine about the experience, claim to have been
verbally assaulted, demand apologies, cry, hold your breath, threaten lawsuits, complain to
people's employers, leave the toilet seat up, etc. Instead, here is what you do:
Community standards do not maintain themselves: they are maintained by people actively applying
them, visibly, in public. Don't whine that all criticism should have been sent to you by private
email: that is not how it works. Nor is it useful to insist you have been personally insulted
when someone comments that one of your claims was wrong, or that their views differ. Those are
loser attitudes.
There have been hacker forums where, out of some misguided sense of hyper-courtesy, participants
have been banned for posting any message that pointed out faults in another's posts, and told
"Don't say anything if you're unwilling to help the user." The resulting departure of the most
experienced participants to other places has caused those forums to descend into meaningless
babble and lose all usefulness as technical forums.
Remember: when that hacker tells you that you have made a mistake, and (no matter how gruffly)
tells you not to do it again, they are acting out of concern for (1) you and (2) their
community. It would be much easier for them to ignore you and filter you out. If you can't
manage to be grateful, at least have a little dignity, don't whine, and don't expect to be
treated like a fragile doll just because you are a newcomer with a theatrically hypersensitive
soul and delusions of entitlement.
Questions not to ask
Here are some classic stupid questions, along with what hackers are thinking when they don't
answer them.
Smart: I used Google to try to find something about the "Foonly Flurbamatic 2600" on the Web,
but I got no useful hits. Does anyone know where I can find programming information on this
device?
Stupid: I can't get the code from project foo to compile. Why is it broken?
Smart: The code from project foo doesn't compile under Nulix version 6.2. I've read the FAQ, but
it doesn't have anything in it about Nulix-related problems. Here is a transcript of my
compilation attempt; is it something I did wrong?
This person has specified the environment, read the FAQ, shown the error, and is not assuming
the problems are someone else's fault. This person might be worth some attention.
Smart: I tried X, Y, and Z on the S2464 motherboard. When that didn't work, I tried A, B, and C.
Note the curious symptom when I tried C. Obviously the florbish is grommicking, but the results
aren't what one might expect. What are the usual causes of grommicking on multiprocessor
motherboards? Does anyone know of any more tests I can run to pin down the
problem?
This person, on the other hand, seems worthy of an answer. They have shown intelligence in
trying to solve the problem rather than waiting for an answer to drop from the sky.
In the last question, notice the subtle but important difference between demanding "Give me an
answer" and "Please help me figure out what additional diagnostics I can run to
achieve enlightenment."
In fact, the form of that last question is closely based on a real incident that happened in
August 2001 on the Linux kernel mailing list. I (Eric) was the one asking the question that
time. I was seeing mysterious lockups on a Tyan S2464 motherboard. The list members supplied the
critical information I needed to solve the problem.
By asking the question in the way I did, I gave people something to chew on; I made it easy and
attractive for them to get involved. I demonstrated respect for my peers' ability and invited
them to consult with me as a peer. I also demonstrated respect for the value of their time by
telling them the blind alleys I had already run down.
Afterward, when I thanked everyone and remarked how well the process had worked, a member of the
Linux kernel mailing list observed that he thought it had worked not because I was a "name" on
that list, but because I asked the question in the proper form.
We hackers are in some ways a rough meritocracy; I'm certain he was right, and that if I had
behaved like a sponge I would have been flamed or ignored no matter who I was. His suggestion
that I write up the whole incident as instruction to others led directly to the composition of
this guide.
So, if you don't get an answer, please don't take it personally that we don't feel we can help
you. There are other resources that are often better adapted to a novice's needs.
There are many online and local user groups made up of enthusiasts about the software, even
though they may never have written any software themselves. These groups form so that people can
help each other and help new users.
There are also plenty of commercial companies, both large and small, that you can contract with
for help. Don't be dismayed at the idea of having to pay for a bit of help!
After all, if your car engine blows a gasket, chances are you will have to take it to a mechanic
and pay to get it fixed. Even if the software didn't cost you anything, you can't
expect that support will always come for free.
For popular software like Linux, there are at least 10,000 users per developer. It is not
possible for one person to handle support calls from nearly 10,000 users.
Remember that even if you have to pay for support, you are still paying much less than if you
had to buy the software as well (and support for closed-source software is usually much more
expensive and less competent than support for open-source software).”
Also, as we said earlier, being able to solve problems is often one of the core skills of a data
scientist. Data science is new; you may be the first person to come across a specific problem and you
need to be equipped with skills that allow you to tackle problems that are new both to you and to the
community!
Finally, troubleshooting and figuring out solutions to problems is a great, transferable skill! It will
serve you well as a data scientist, but so much of what any job often entails is problem solving.
Being able to think about problems and get help effectively is of benefit to you in whatever career
path you find yourself in!
One of your first stops for data analysis problems should be reading the manuals or help files (for R
problems, try typing ?command) – if you post a question on a forum that is easily answered by the
manual, you will often get a reply of “Read the manual” … which is not the easiest way to get at the
answer you were going for!
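As a quick illustration, here are a few ways to reach R's built-in help from the console. The functions mean and lm and the search phrase "linear model" are just stand-ins we picked for the example; swap in whatever command is giving you trouble.

```r
# Open the help page for a function (these two lines are equivalent).
?mean
help("mean")

# Search the installed help pages when you do not know the function's name.
help.search("linear model")

# Run the examples from a help page, and list a function's arguments.
example(mean)
fit_args <- args(lm)
fit_args
```

Most help pages end with an Examples section, which is often the fastest way to see how a command is meant to be called.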
Next steps are searching on Google and searching relevant forums. Common forums for data science
problems include Stack Overflow and Cross Validated. Additionally, for you in this class, there is
a course forum that is a great resource and super helpful! Before posting a question to any forum,
try and double check that it hasn’t been asked before, using the forums’ search functions.
While you are Googling, things to pay attention to and look for are: tutorials, FAQs, or vignettes of
whatever command or program is giving you trouble. These are great resources to get you started –
either in telling you the language/words to use in your next searches, or outright showing you how to
do something.
First steps for solving coding problems
As you get further into this course and using R, you may run into coding problems and errors and
there are a few strategies you should have ready to deal with these. In my experience, coding
problems generally fall into two categories: your command produces no data and spits out an error
message OR your command produces an output, but it is not at all what you wanted. These two
problems have different strategies for dealing with them.
I’ve been there – you type out a command and all you get are lines and lines of angry red text telling
you that you did something wrong. And this can be overwhelming. But taking a second to check
over your command for typos and then carefully reading the error message solves the problem in
nearly all of the cases. The error messages are there to help you – it is the computer telling you what
went wrong. And when all else fails, you can be pretty assured that somebody out there got the same
error message, panicked and posted to a forum – the answer is out there.
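Here is a small sketch of what reading an error message looks like in practice. The misspelled function name maen is a deliberate typo of our own; tryCatch is used only so that the message can be captured and printed instead of stopping the script.

```r
# A typo in the function name: "maen" instead of "mean".
# tryCatch() intercepts the error so we can look at its message as text.
message_text <- tryCatch(
  maen(c(1, 2, 3)),
  error = function(e) conditionMessage(e)
)

# The message names the thing R could not find, which points at the typo.
print(message_text)

# After fixing the typo, the same command runs.
mean(c(1, 2, 3))
```

The message text is also exactly what you would paste into a search engine if the cause were less obvious.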
On the other hand, if you get an output, but it isn’t what you expected:
Consider how the output was different from what you expected
Think about what the command appears to have actually done, and why it would do that
instead of what you wanted
Most problems like this are because the command you provided told the program to do one thing and
it did that thing exactly… it just turns out what you told it to do wasn’t actually what you wanted!
These problems are often the most frustrating – you are so close but so far! The quickest way to
figure out what went wrong is to look at the output you did get, compare it to what you wanted,
and think about how the program may have produced that output instead of what you wanted.
These sorts of problems give you plenty of practice thinking like a computer program!
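Here is a minimal sketch of this second kind of problem, using made-up values. The command runs without any error, but because the numbers are stored as text, R sorts them as text, which is exactly what it was told to do.

```r
# Numbers stored as text (for example, read from a file as character)
x <- c("10", "9", "2")

# sort() does what it was told: it orders the values as text,
# character by character, so "10" comes before "2"
as_text <- sort(x)
print(as_text)

# What we actually wanted: convert to numbers first, then sort
as_numbers <- sort(as.numeric(x))
print(as_numbers)
```

Comparing the two printed results is the "think like the program" step: the first output only looks wrong until you notice that the values are text.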
Next steps
Alright, you’ve done everything you are supposed to do to solve the problem on your own – you
need to bring in the big guns now: other people!
Easiest is to find a peer with some experience with what you are working on and ask them for
help/direction. This is often great because the person explaining gets to solidify their understanding
while teaching it to you, and you get a hands-on experience seeing how they would solve the
problem. In this class, your peers can be your classmates and you can interact with them through the
course forum (double check your question hasn’t been asked already!).
But, outside of this course, you may not have too many data science savvy peers – what then?
“Rubber duck debugging” is a long-held tradition of solitary programmers everywhere. In the
book “The Pragmatic Programmer,” there is a story of how stumped programmers would explain
their problem to a rubber duck, and in the process of explaining the problem, identify the solution.
So next time you are stumped, bring out the bath toys!
Before you go ahead and just post your question, you need to consider how you can best ask your
question to garner (helpful) answers.
Bad:
These titles don’t give your potential helpers a lot to go off of – they don’t really know what the
problem is and if they are able to help you. Instead, you need to provide some details about what you
are having problems with. Answering what you were doing and what the problem is are two key
pieces of information that you need to provide. This way somebody who is on the forum will know
exactly what is happening and that they might be able to help!
Better:
R 3.4.3 lm() function produces seg fault with large data frame (Windows 10)
Applied PCA to a matrix - what are U, D, and Vt?
Even better:
Use titles that focus on the very specific core problem that you are trying to get help with. It signals
to people that you are looking for a very specific answer; the more specific the question, often, the
faster the answer.
Forum etiquette
Following a lot of the tips above will serve you well in posting on forums and observing forum
etiquette. You are asking for help, you are hoping somebody else will take time out of their day to
help you – you need to be courteous. Often this takes the form of asking specific questions, doing
some troubleshooting of your own, and giving potential problem solvers easy access to all the
information they need to help you. Formalizing some of these do’s and don’t’s, you get the
following lists:
Do’s
Let’s take a few seconds to talk a bit about this last point, as we have touched on the others already.
First, what do we mean by “follow up on the post”? You’ve asked your question and you’ve
received several answers and lo and behold one of them works! You are all set, get back to work!
No! Go back to your posting, reply to the solution that worked for you, explaining that they fixed
your problem and thanking them for their solution! Not only do the people helping you deserve
thanks, but this is helpful to anybody else who has the same problem as you, later on. They are going
to do their due diligence, search the forum and find your post – it is so helpful for you to have
flagged the answer that solved your problem.
Conversely, while you are waiting for a reply, perhaps you stumble upon the solution (go you!) –
don’t just close the posting or never check back on it. One, people who are trying to help you may be
replying and you are functionally ignoring them, or two, if you close it with no solution, somebody
with the same problem won’t ever learn what your solution was! Make sure to post the solution and
thank everybody for their help!
Don’t’s:
Additionally, for people who are active on multiple forums, it is always aggravating when the same
person posts the same question on five different forums… or when the same question is posted on
the same forum repeatedly. Be patient – pick the most relevant forum for your purposes, post once,
and wait.
Summary
In this lesson, we looked at how to effectively get help when you run into a problem. This is important
for this course, but also for your future as a data scientist!
We first looked at strategies to use before asking for help, including reading the manual, checking
the help files, and searching Google and appropriate forums. We also covered some common coding
problems you may face and some preliminary steps you can take on your own, including paying
special attention to error messages and examining how your code behaved compared to your goal.
Once you’ve exhausted these options, we turn to other people for help. We can ask peers for help or
explain our problems to our trusty rubber ducks (be it an actual rubber duck or an unsuspecting
coworker!). Our course forum is also a great resource for you all to talk with many of your peers! Go
introduce yourself!
And if all else fails, we can post on forums (be it in this class or at another forum, like
StackOverflow), with very specific, reproducible questions. Before doing so, be sure to brush up on
your forum etiquette - it never hurt anybody to be polite! Be a good citizen of our forums!
There is an art to problem solving, and the only way to get practice is to get out there and start
solving problems! Get to it!
The Question
When setting out on a data science project, it’s always great to have your question well-defined.
Additional questions may pop up as you do the analysis, but knowing what you want to answer with
your analysis is a really important first step. Hilary Parker’s question is included in bold in her post.
Highlighting this makes it clear that she’s interested in answering the following question:
Is Hilary/Hillary really the most rapidly poisoned name in recorded American history?
The Data
To answer this question, Hilary collected data from the Social Security website. This dataset
included the 1,000 most popular baby names from 1880 until 2011.
Data Analysis
As explained in the blog post, Hilary was interested in calculating the relative risk for each of the
4,110 different names in her dataset from one year to the next from 1880 to 2011. By hand, this
would be a nightmare. Thankfully, by writing code in R, all of which is available on GitHub, Hilary
was able to generate these values for all these names across all these years. It’s not important at this
point in time to fully understand what a relative risk calculation is (although Hilary does a great job
breaking it down in her post!), but it is important to know that after getting the data together, the
next step is figuring out what you need to do with that data in order to answer your question. For
Hilary’s question, calculating the relative risk for each name from one year to the next from 1880 to
2011 and looking at the percentage of babies named each name in a particular year would be what
she needed to do to answer her question.
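To make the idea of a year-to-year comparison a little more concrete, here is a minimal sketch using made-up percentages for a single hypothetical name. It simply divides each year's percentage by the previous year's. Hilary's actual code and her exact relative risk definition are in her post and GitHub repository, so treat this only as an illustration of the general idea.

```r
# Made-up data for one hypothetical name: percent of babies given the name
years <- 1990:1993
pct   <- c(0.10, 0.11, 0.12, 0.04)

# Ratio of each year's percentage to the previous year's percentage
# (values below 1 mean the name became less common)
ratio <- pct[-1] / pct[-length(pct)]
names(ratio) <- years[-1]
print(round(ratio, 2))

# The year with the sharpest relative drop
biggest_drop <- names(ratio)[which.min(ratio)]
print(biggest_drop)
```

Doing this for thousands of names across more than a century is the part that would be a nightmare by hand and is routine in code.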
In looking at the results of this analysis, the first five names appeared peculiar to Hilary Parker. (It’s
always good to consider whether or not the results were what you were expecting, from any
analysis!) None of them seemed to be names that were popular for long periods of time. To see if
this hunch was true, Hilary plotted the percent of babies born each year with each of the names from
this table. What she found was that, among these “poisoned” names (names that experienced a big
drop from one year to the next in popularity), all of the names other than Hilary became popular all
of a sudden and then dropped off in popularity. Hilary Parker was able to figure out why most of
these other names became popular, so definitely read that section of her post! The name, Hilary,
however, was different. It was popular for a while and then completely dropped off in popularity.
To figure out what was specifically going on with the name Hilary, she removed names that became
popular for short periods of time before dropping off, and only looked at names that were in the top
1,000 for more than 20 years. The results from this analysis definitively show that Hilary had the
quickest fall from popularity in 1992 of any female baby name between 1880 and 2011. (“Marian”’s
decline was gradual over many years.)
Figure 12 - 39 most poisoned names over time, controlling for fads
Communication
For the final step in this data analysis process, once Hilary Parker had answered her question, it was
time to share it with the world. An important part of any data science project is effectively
communicating the results of the project. Hilary did so by writing a wonderful blog post that
communicated the results of her analysis, answered the question she set out to answer, and did so in
an entertaining way.
Additionally, it’s important to note that most projects build off someone else’s work.
It’s really important to give those people credit. Hilary accomplishes this by:
Text analysis of Trump’s tweets confirms he writes only the (angrier) Android half, by David Robinson
Where to Live in the US, by Maelle Salmon
Sexual Health Clinics in Toronto, by Sharla Gelfand
Summary
In this lesson, we hope we’ve conveyed that sometimes data science projects are tackling difficult
questions (‘Can we predict the risk of opioid overdose?’) while other times the goal of the project is
to answer a question you’re interested in personally (‘Is Hilary the most rapidly poisoned baby name
in recorded American history?’). In either case, the process is similar. You have to form your
question, get data, explore and analyse your data, and communicate your results. With the tools
you’ll learn in this series of courses, you will be able to set out and carry out your own data science
projects, like the examples included in this lesson!
R is downloaded from the Comprehensive R Archive Network, or CRAN, and while this might be
your first brush with it, we will be returning to CRAN time and time again, when we install packages
- so keep an eye out!
The reasons for using R are myriad, but some big ones are:
1) Its popularity
R is quickly becoming the standard language for statistical analysis. This makes R a great language
to learn as the more popular a software is, the quicker new functionality is developed, the more
powerful it becomes, and the better the support there is! Additionally, as you can see in the graph
below, R is one of the top five languages asked for in data scientist job postings!
2) Its cost
FREE!
This one is pretty self-explanatory - every aspect of R is free to use, unlike some other stats packages
you may have heard of (e.g., SAS, SPSS), so there is no cost barrier to using R!
3) Its extensive functionality
R is a very versatile language - we’ve talked about its use in stats and in graphing, but its use can be
expanded to many different functions - from making websites, making maps using GIS data,
analyzing language… and even making these lectures and videos! For whatever task you have in
mind, there is often a package available for download that does exactly that!
4) Its community
And the reason that the functionality of R is so extensive is the community that has been built around
R. Individuals have come together to make “packages” that add to the functionality of R - and more
are being developed every day!
Particularly for people just getting started out with R, its community is a huge benefit - due to its
popularity, there are multiple forums that have pages and pages dedicated to solving R problems. We
talked about this in the Getting Help lesson; these forums are great both for finding other people who
have had the same problem as you, and posting your own new problems.
Installing R
Now that we’ve spent some time looking at the benefits of R, it is time to install it! We’ll go over
installation for both Windows and Mac below, but know that these are general guidelines and small
details are likely to change subsequent to the making of this lecture - use this as a scaffold.
For both Windows and Mac machines, we start at the CRAN homepage: https://cran.r-project.org/
Open the executable, and if prompted by a security warning, allow it to run. Select the language you
prefer during installation and agree to the licensing information. You will next be prompted for a
destination location - this will likely be defaulted to Program Files, in a subfolder called R, followed
by another directory of the version number. Unless you have any issues with this, the default
location is perfect.
Figure 16 - The install wizard for installing R
You will then be prompted to select which components should be installed. Unless you are running
short on disk space, installing all of the components is desirable. Next you’ll be asked about startup
options, and again, the defaults are fine for this. You will then be asked where Setup should place
shortcuts - this is completely up to you, you can allow it to add the program to the start menu, or you
can click the box at the bottom that says to not create a start menu link. Finally, you will be asked
whether you want a desktop or Quick Launch icon - up to you! I do not recommend changing the
defaults for the registry entries though.
After this window, the installation should begin. Test that the installation worked by opening R for
the first time!
Click on the link to the most recent version of R, which will download a .pkg file.
Figure 18 - Downloading the .pkg file for Macs
Open the .pkg file and follow the prompts as provided by the installer. First, click continue on the
welcome page and again on the important information window page. Next you will be presented
with the software license agreement, again, continue. Next you may be asked to select a destination
for R, either available to all users or to a specific disk - select whichever you feel is best suited to
your setup. Finally, you will be at the “Standard Install” page; R selects a default directory and if
you are happy with that location, go ahead and click install. At this point, you may be prompted to
type in the admin password, do so, and the install will begin!
Once the installation is finished, go to your Applications and find R. Test that the installation worked
by opening R for the first time!
Summary
In this lesson we first looked at what R is and why we might want to use it. We then focused on the
installation process for R on both Windows and Mac computers. Before moving on to the next
lecture, be sure that you have R installed properly.
Installing RStudio
We’ve installed R and can open the R interface to input code, but there are other ways to interface
with R - and one of those ways is using RStudio. In this lesson, we’ll get RStudio installed on your
computer.
What is RStudio?
RStudio is a graphical user interface for R, that allows you to write, edit and store code, generate,
view and store plots, manage files, objects and dataframes, and integrate with version control
systems – to name a few of its functions. We will be exploring exactly what RStudio can do for you
in future lessons, but for anybody just starting out with R coding, the visual nature of this program as
an interface for R is a huge benefit.
Installing RStudio
Thankfully, installation of RStudio is fairly straightforward. First, you go to the RStudio download
page. We want to download the RStudio Desktop version of the software, so click on the
appropriate “Download” button under that heading and you will see a list of “Installers for
supported platforms”.
Figure 21 - The various versions of RStudio available for different operating systems
At this point the installation process diverges for Macs and Windows, so follow the instructions for
the appropriate OS.
Following this, the installation wizard will open. Following the defaults on each of the windows of
the wizard is appropriate for installation. In brief, on the welcome screen, click next. If you want
RStudio installed elsewhere, “Browse” through your file system. Otherwise, it will likely default to
the “Program Files” folder - this is appropriate. Click next. On this final page, allow RStudio to
create a Start menu shortcut. Click Install. RStudio is now being installed. Wait for this process to
finish; RStudio is now installed on your computer. Click Finish.
Check that RStudio is working appropriately by opening it from your Start menu.
Figure 26 - Drag the RStudio file into your Applications folder to complete installation for RStudio
Figure 27 - RStudio is running!
Summary
In this lesson we installed RStudio, both for Macs and for Windows computers. Before moving on to
the next lecture, click through the available menus and explore the software a bit. We will have an
entire lesson dedicated to exploring RStudio, but having some familiarity beforehand will be helpful!
RStudio Tour
Now that we have RStudio installed, we should familiarize ourselves with the various components
and functionality of it! RStudio provides a cheatsheet of the RStudio environment - warning: this
link initiates a download of a PDF from the RStudio GitHub.
You may be missing the upper left quadrant and instead have the left side of the screen with just one
region, “Console” - if this is the case, go to File > New File > R Script and now it should more
closely resemble the image. You can change the sizes of each of the various quadrants by hovering
your mouse over the spaces between quadrants and click-dragging the divider to resize the sections.
We will go through each of the regions and describe some of their main functions. It would be
impossible to cover everything that RStudio can do, so we urge you to explore RStudio on your own
too!
Figure 29 - The four main quadrants of RStudio, plus the main menu bar
To start, let’s explore the main sections of the menu bar that you will use. The first is the File
menu. Here we can open new or saved files, open new or saved projects (we’ll have an entire lesson
in the future about R Projects, so stay tuned!), save our current document or close RStudio. If you
mouse over “New File”, a new menu will appear that suggests the various file formats available to
you. R Script and R Markdown files are the most common file types for use, but you can also
generate R notebooks, web apps, websites, or slide presentations. If you click on any one of these, a
new tab in the “Source” quadrant will open. We’ll spend more time in a future lesson on R
Markdown files and their use.
Figure 31 - The File menu
The Session menu has some R specific functions, in which you can restart, interrupt or terminate R -
these can be helpful if R isn’t behaving or is stuck and you want to stop what it is doing and start
from scratch.
The Tools menu is a treasure trove of functions for you to explore. For now, you should know that
this is where you can go to install new packages (see next lecture), set up your version control
software (see future lesson: Linking GitHub and RStudio), and set your options and preferences for
how RStudio looks and functions. For now, we will leave this alone, but be sure to explore these
menus on your own once you have a bit more experience with RStudio and see what you can change
to best suit your preferences!
To execute your first command, try typing 1 + 1 then enter at the > prompt. You should see the
output [1] 2 below your command.
Now copy and paste the following into your console and hit enter.
This creates a matrix with four rows and two columns, with the numbers 1 through 8.
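The command itself did not survive in this copy of the notes. A sketch consistent with the description (a matrix named "example", which is the name that shows up in the Environment quadrant, with four rows, two columns, and the numbers 1 through 8) would be the following. The exact original command is an assumption.

```r
# Create a 4 x 2 matrix holding the numbers 1 through 8 (filled column by column)
example <- matrix(1:8, nrow = 4, ncol = 2)

# Print it to the console to check the result
example
```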
The environment/history
To view this matrix, first look to the Environment quadrant, where you should see the following:
Click anywhere on the “example” line, and a new tab on the Source quadrant should appear,
showing the matrix you created. Any dataframe or matrix that you create in R can be viewed this
way in RStudio.
Figure 37 - Your newly made matrix, opened in a new tab of the source panel
RStudio also tells you some information about the object in the environment, like whether it is a list
or a dataframe or if it contains numbers, integers or characters. This is very helpful information to
have as some functions only work with certain classes of data. And knowing what kind of data you
have is the first step to that.
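The same information that RStudio shows in the Environment quadrant can also be requested from the console. The sketch below uses a small matrix and a made-up data frame (the object names are just for illustration) and three base R functions for inspecting them.

```r
# A small matrix and a small data frame to inspect
example <- matrix(1:8, nrow = 4, ncol = 2)
df <- data.frame(id = 1:3, name = c("a", "b", "c"))

# class() reports what kind of object you have
class(df)        # "data.frame"

# typeof() reports how the values are stored
typeof(example)  # "integer"

# str() gives a compact summary: dimensions, column types, first values
str(example)
str(df)
```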
The quadrant has two other tabs running across the top of it. We’ll just look at the History tab now.
Your history tab should look something like this:
Here you will see the commands that we have run in this session of R. If you click on any one of
them, you can click “To Console” or “To Source” and this will either rerun the command in the
console, or will move the command to the source, respectively. Do so now for your example matrix
and send it to Source.
Files/help/plots/packages panel
The final region we’ll look at occupies the bottom right of the RStudio window. In this quadrant,
five tabs run across the top: Files, Plots, Packages, Help, and Viewer.
In Files, you can see all of the files in your current working directory. If this isn’t where you want to
save or retrieve files from, you can also change the current working directory in this tab using the
ellipsis at the far right, finding the desired folder, and then under the “More” cogwheel, setting this
new folder as the working directory.
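The working directory can also be checked and changed from the console with base R functions. In this sketch, tempdir() stands in for your own folder so that the code runs on any machine; in practice you would supply the path to the folder you actually want.

```r
# Where is R currently reading and writing files?
old_dir <- getwd()
print(old_dir)

# Change the working directory; tempdir() is used only so this sketch
# runs anywhere - in practice you would give the path to your own folder
setwd(tempdir())

# List the files in the new working directory (what the Files tab shows)
list.files()

# Switch back to where we started
setwd(old_dir)
```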
In the Plots tab, if you generate a plot with your code, it will appear here. You can use the arrows to
navigate to previously generated plots. The Zoom function will open the plot in a new window, that
is much larger than the quadrant. Export is how you save the plot. You can either save it as an image
or as a PDF. The broom icon clears all plots from memory.
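If you would rather save a plot with code than with the Export button, base R can write a plot straight to a file. This is a minimal sketch that saves a simple example plot as a PDF; the file location comes from tempfile() only so the code runs anywhere.

```r
# Choose an output file; tempfile() is used so the sketch runs anywhere
out_file <- tempfile(fileext = ".pdf")

# Open a PDF graphics device, draw the plot, then close the device
pdf(out_file)
plot(1:10, (1:10)^2, main = "A simple example plot")
dev.off()

# Confirm that the file was written
file.exists(out_file)
```

Using png() in place of pdf() would save an image file instead.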
Figure 42 - The plots tab
The Packages tab will be explored more in depth in the next lesson on R packages. Here you can see
all the packages you have installed, load and unload these packages, and update them.
The Help tab is where you find the documentation for your R packages and various functions. In the
upper right of this panel there is a search function for when you have a specific function or package
in question.
Summary
In this lesson we took a tour of the RStudio software. We became familiar with the main menu and
its various menus. We looked at the Console, where R code is input and run. We then moved on to
the Environment panel that lists all of the objects that have been created within an R session and
allows you to view these objects in a new tab in Source. In this same quadrant, there is a History tab,
that keeps a record of all commands that have been run. It also presents the option to either rerun the
command in the Console, or send the command to Source, to be saved. Source is where you save
your R commands. And the bottom right quadrant contains a listing of all the files in your working
directory, displays generated plots, lists your installed packages, and supplies help files for when you
need some assistance! Take some time to explore RStudio on your own!
R packages
Now that we’ve installed R and RStudio and have a basic understanding of how they work together,
we can get at what makes R so special: packages.
What is an R package?
So far, anything we’ve played around with in R uses the “base” R system. Base R, or everything
included in R when you download it, has rather basic functionality for statistics and plotting but it
can sometimes be limiting. To expand upon R’s basic functionality, people have
developed packages. A package is a collection of functions, data, and code conveniently provided in
a nice, complete format for you. At the time of writing, there are just over 14,300 packages available
to download - each with their own specialized functions and code, all for some different purpose. For
a really in depth look at R Packages (what they are, how to develop them), check out Hadley
Wickham’s book from O’Reilly, “R Packages.”
Side note: A package is not to be confused with a library (these two terms are often conflated in
colloquial speech about R). A library is the place where the package is located on your computer. To
think of an analogy, a library is, well, a library… and a package is a book within the library. The library is
where the books/packages are located.
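You can see this distinction on your own computer. The sketch below asks R where its libraries (folders) are and then looks up which folder holds one particular package; "stats" is used because it ships with R.

```r
# The library: folder(s) on your computer where packages are installed
lib_paths <- .libPaths()
print(lib_paths)

# Each installed package is a sub-folder inside a library
head(list.files(lib_paths[1]))

# Which folder a particular package lives in
stats_path <- find.package("stats")
print(stats_path)
```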
Packages are what make R so unique. Not only does base R have some great functionality but these
packages greatly expand its functionality. And perhaps most special of all, each package is
developed and published by the R community at large and deposited in repositories.
Take a second to explore the links above and check out the various packages that are out there!
Figure 45 - The big three repositories for R packages
First, CRAN groups all of its packages by their functionality/topic into 35 “themes.” It calls this
its “Task view.” This at least allows you to narrow the packages you can look through to a topic
relevant to your interests.
More often, if you have a specific task in mind, Googling that task followed by “R package” is a
great place to start! From there, looking at tutorials, vignettes, and forums for people already doing
what you want to do is a great way to find relevant packages.
If you are installing from the CRAN repository, use the install.packages() function, with the name of
the package you want to install in quotes between the parentheses (note: you can use either single or
double quotes). For example, if you want to install the package “ggplot2”, you would
use: install.packages("ggplot2")
Try doing so in your R console! This command downloads the “ggplot2” package from CRAN and
installs it onto your computer.
If you want to install multiple packages at once, you can do so by using a character vector,
like: install.packages(c("ggplot2", "devtools", "lme4"))
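A common pattern is to install a package only when it is not already on your computer, so that re-running a script does not download everything again. The helper below, install_if_missing(), is a hypothetical function written for this sketch and is not part of R. It is demonstrated with "stats", which ships with R, so nothing is downloaded.

```r
# Hypothetical helper (not part of R): install only the packages that are missing
install_if_missing <- function(pkgs) {
  installed <- rownames(installed.packages())
  to_install <- setdiff(pkgs, installed)
  if (length(to_install) > 0) {
    install.packages(to_install)  # downloads from CRAN, needs an internet connection
  }
  invisible(to_install)           # return the names that had to be installed
}

# "stats" ships with R, so nothing is downloaded here
needed <- install_if_missing("stats")
print(needed)

# Typical use (uncomment to run):
# install_if_missing(c("ggplot2", "devtools", "lme4"))
```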
If you want to use RStudio’s graphical interface to install packages, go to the Tools menu, and the
first option should be “Install packages…” If installing from CRAN, select it as the repository and
type the desired packages in the appropriate box.
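This makes the main install function of BioConductor, biocLite(), available to you. Following this,
you call the package you want to install in quotes, between the parentheses of
the biocLite command, like so: biocLite("GenomicFeatures")
Note that biocLite() applies to the version of R used in this lecture; in Bioconductor 3.8 and later it
has been replaced by BiocManager::install().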
Figure 49 - Installing packages with BioConductor
1. install.packages("devtools") - only run this if you don’t already have devtools installed. If
you’ve been following along with this lesson, you may have installed it when we were
practicing installations using the R console
Loading packages
Installing a package does not make its functions immediately available to you. First you
must load the package into R; to do so, use the library() function. Think of this like any other
software you install on your computer. Just because you’ve installed a program, doesn’t mean it’s
automatically running - you have to open the program. Same with R. You’ve installed it, but now
you have to “open” it. For example, to “open” the “ggplot2” package, you would
run: library(ggplot2)
NOTE: Unlike when you are installing packages, you do not need to put the package name in quotes
here! The library() command accepts the name with or without quotes, and it is most commonly
written without them.
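To see loading in action without installing anything, the sketch below uses "splines", a package that ships with R but is not loaded when R starts. The same steps apply to ggplot2 once you have installed it.

```r
# "splines" ships with R, so it is installed but not loaded by default
"package:splines" %in% search()

# Load it, just as you would load ggplot2 after installing it
library(splines)

# The package now appears on the search path, and its functions can be called
loaded <- "package:splines" %in% search()
print(loaded)
basis <- bs(1:10, df = 4)  # bs() is a function provided by splines
dim(basis)
```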
Figure 51 - Step one of getting a package is installing it, but to use it, you must load it using library(); similar
to installing R and then loading it by opening the .exe file
There is an order to loading packages - some packages require other packages to be loaded first
(dependencies). That package’s manual/help pages will help you out in finding that order, if they
are picky.
If you want to load a package using the RStudio interface, in the lower right quadrant there is a tab
called “Packages” that lists out all of the packages and a brief description, as well as the version
number, of all of the packages you have installed. To load a package just click on the checkbox
beside the package name.
In RStudio, that package tab introduced earlier is another way to look at all of the packages you have
installed.
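From the console, base R offers similar views. This is a minimal sketch: installed.packages() lists what is installed, and (.packages()) lists what is currently loaded in your session.

```r
# All installed packages, with their version numbers
inst <- installed.packages()[, c("Package", "Version")]
head(inst)

# Is a particular package installed?
"ggplot2" %in% rownames(inst)

# Packages that are currently loaded and attached in this session
attached <- (.packages())
print(attached)
```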
Updating packages
You can check what packages need an update with a call to the function old.packages(). This will
identify all packages that have been updated since you installed them/last updated them.
To update all packages, use update.packages(). If you only want to update a specific package, just
use install.packages("packagename") once again.
Figure 53 - Functions used to see what packages are installed and update them
Within the RStudio interface, still in that Packages tab, you can click “Update,” which will list all of
the packages that are not up to date. It gives you the option to update all of your packages, or allows
you to select specific packages.
You will want to periodically check in on your packages and check if you’ve fallen out of date - be
careful though! Sometimes an update can change the functionality of certain functions, so if you re-
run some old code, the command may be changed or perhaps even outright gone and you will need
to update your code too!
Unloading packages
Sometimes you want to unload a package in the middle of a script - the package you have loaded
may not play nicely with another package you want to use.
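The command for unloading did not survive in this copy of the notes. In base R, a loaded package is removed from the search path with detach(). The sketch below demonstrates it with "splines", a package that ships with R, so it runs without installing anything.

```r
# Load a package that ships with R
library(splines)
"package:splines" %in% search()

# Unload it again by detaching it from the search path
detach("package:splines")
still_attached <- "package:splines" %in% search()
print(still_attached)

# Adding unload = TRUE also tries to unload the package's namespace:
# detach("package:ggplot2", unload = TRUE)
```

Within RStudio, unchecking the box beside a package name in the Packages tab has the same effect as detaching it.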
Uninstalling packages
If you no longer want to have a package installed, you can simply uninstall it using the
function remove.packages(). For example, remove.packages("ggplot2")
(Try that, but then actually re-install the ggplot2 package - it’s a super useful plotting package!)
Within RStudio, in the Packages tab, clicking on the “X” at the end of a package’s row will uninstall
that package.
Sometimes, when you are looking at a package that you might want to install, you will see that it requires
a certain version of R to run. To know if you can use that package, you need to know what version of R
you are running!
One way to know your R version is to check when you first open R/RStudio - the first thing it outputs in
the console tells you what version of R is currently running. If you didn’t pay attention at the beginning,
you can type version into the console and it will output information on the R version you are running.
Another helpful command is sessionInfo() - it will tell you what version of R you are running along
with a listing of all of the packages you have loaded. The output of this command is a great detail to
include when posting a question to forums - it tells potential helpers a lot of information about your OS,
R, and the packages (plus their version numbers!) that you are using.
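The commands described above can be tried directly in the console. The sketch also shows one way to compare your R version against a minimum requirement; the "3.4.0" threshold is just an example value.

```r
# A one-line description of the R version you are running
R.version.string

# Compare your version against the minimum a package requires
# ("3.4.0" is just an example threshold)
new_enough <- getRversion() >= "3.4.0"
print(new_enough)

# R version, operating system, and loaded packages with their versions;
# paste this output into forum posts
info <- sessionInfo()
print(info)
```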
Figure 57 - Ways to see what version of R you are running
First, you need to know what functions are included within a package. To do this, you can look at the
man/help pages included in all (well-made) packages. In the console, you can use the help() function
to access a package’s help files. Try help(package = "ggplot2") and you will see all of
the many functions that ggplot2 provides. Within the RStudio interface, you can access the help files
through the Packages tab (again) - clicking on any package name should open up the associated help
files in the “Help” tab, found in that same quadrant, beside the Packages tab. Clicking on any one of
these help pages will take you to that function’s help page, which tells you what that function is for
and how to use it.
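As a sketch of exploring a package from the console, the code below lists what a loaded package provides and then opens its help pages. It uses "stats", which ships with R and is loaded by default; the help pages are only opened when you run the code interactively.

```r
# List the functions and datasets that a loaded package provides
# ("stats" ships with R and is loaded by default)
stats_objects <- ls("package:stats")
length(stats_objects)
head(stats_objects)

# Help pages are meant to be read interactively
if (interactive()) {
  # Open the package-level help index (same as clicking the package name)
  print(help(package = "stats"))

  # Open the help page for a single function; ?median is shorthand for this
  print(help("median"))
}
```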
Once you know what function within a package you want to use, you simply call it in the console
like any other function we’ve been using throughout this lesson. Once a package has been loaded, it
is as if it were a part of the base R functionality.
If you still have questions about what functions within a package are right for you or how to use
them, many packages include “vignettes.” These are extended help files, that include an overview of
the package and its functions, but often they go the extra mile and include detailed examples of how
to use the functions in plain words that you can follow along with to see how to use the package. To
see the vignettes included in a package, you can use the browseVignettes() function. For example,
let’s look at the vignettes included in ggplot2: browseVignettes("ggplot2"). You should see that
there are two included vignettes: “Extending ggplot2” and “Aesthetic specifications.” The
Aesthetic specifications vignette is a great example of how helpful vignettes can be: it gives clear
instructions on how to use the included functions.
Summary
In this lesson, we’ve explored R packages in depth. We examined what a package is (and how it
differs from a library), what repositories are, and how to find a package relevant to your interests.
We investigated all aspects of how packages work: how to install them (from the various
repositories), how to load them, how to check which packages are installed, and how to update,
uninstall, and unload packages. We took a small detour and looked at how to check what version of
R you have, which is often an important detail to know when installing packages. And finally, we
spent some time learning how to explore help files and vignettes, which often give you a good idea
of how to use a package and all of its functions.
If you still want to learn more about R packages, here are two great resources! R Packages: A
Beginner’s Guide from Adolfo Álvarez on DataCamp and a lesson from the University of
Washington, on an Introduction to R Packages from Ken Rice and Timothy Thornton.
R Projects
One of the ways people organize their work in R is through the use of R Projects, a built-in
functionality of RStudio that helps to keep all your related files together. RStudio provides a great
guide on how to use Projects so definitely check that out!
What is an R Project?
When you make a Project, it creates a folder where all files will be kept, which is helpful for
organizing yourself and keeping multiple projects separate from each other. When you re-open a
project, RStudio remembers what files were open and will restore the work environment as if you
had never left - which is very helpful when you are starting back up on a project after some time off!
Functionally, creating a Project in R will create a new folder and assign that as the working directory
so that all files generated will be assigned to the same directory.
Also, since everything related to one project is all in the same place, it is much easier to share your
work with others - either by directly sharing the folder/files, or by associating it with version control
software. We’ll talk more about linking Projects in R with version control systems in a future lesson
entirely dedicated to the topic!
Finally, since RStudio remembers what documents you had open when you closed the session, it is
easier to pick a project up after a break - everything is set-up just as you left it!
Creating a Project
There are three ways to make a Project:
1) From scratch - this will create a new directory for all your files to go in.
2) From an existing folder - this will link an existing directory with RStudio
3) From version control - this will “clone” an existing project onto your computer (Don’t
worry too much about this one, you’ll get more familiar with it in the next few lessons)
Let’s create a Project from scratch, which is often what you will be doing!
Open RStudio, and under File, select “New Project”. You can also create a new Project by using the
Projects toolbar and selecting “New Project” in the drop-down menu, or there is a “New Project”
shortcut in the toolbar.
Since we are starting from scratch, select “New Project” and a window will appear. Select “New
Directory” and when prompted about the Project type, select “New Project”
Pick a name for your project and for this time, save it to your Desktop. This will create a folder on
your Desktop where all of the files associated with this Project will be kept. Click “Create Project.”
Figure 62 - Creating a new project
1) In the “Files” quadrant of the screen, you can see that RStudio has made this new directory
your working directory and generated a single file with the extension “.Rproj”
2) In the upper-right of the window, there is a Projects toolbar that states the name of your
current Project and has a drop down menu with a few different options that we’ll talk about
in a second.
Figure 64 - Note the new project file in the Files quadrant and the Project toolbar
Opening a project
Opening an existing Project is as simple as double clicking the .Rproj file on your computer. You
can accomplish the same from within RStudio by opening RStudio and going to File > Open Project.
You can also use the Project toolbar and open the drop down menu and select “Open Project…”
Figure 65 - Ways to open a project
All of these options will quit a Project and doing so will cause RStudio to write which documents are
currently open (so they can be restored when you start back up again) and it then closes the R
session. When you set up your Project, you can tell it to save the environment (so, for example, all of
your variables and data tables will be preloaded when you reopen the project), but this is not the
default behavior.
The Projects toolbar is also an easy way to switch between Projects - click on the drop-down menu
and choose “Open Project” and find the Project you want to open - this will save the current
Project, close it, and then open the new Project within the same window. If you want multiple
Projects open at the same time, do the same but instead select “Open Project in New Session”. This
can also be accomplished through the File menu, where those same options are available.
Figure 67 - Ways to switch between projects
Best practices
When you are setting up a project, it can be helpful to start out creating a few directories. Try a few
strategies and see what works best for you, but most file structures are set-up around having a
directory containing the raw data, a directory that you keep scripts/R files in, and a directory for the
output of your code.
For example:
If you set up these folders before you start, it can save you organizational headaches later on in a
project when you can’t quite remember where something is!
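The layout described above can be sketched from the command line. This is only a minimal sketch: the project name and the three folder names (raw_data, scripts, output) are hypothetical, and it builds everything in a temporary location so that nothing on your Desktop is touched.

```shell
# Hypothetical project location and folder names; rename them to suit your work.
project_dir="$(mktemp -d)/my_project"

# One directory each for the raw data, the scripts/R files, and the output.
mkdir -p "$project_dir/raw_data" "$project_dir/scripts" "$project_dir/output"

# List what was created.
ls "$project_dir"
```

You could just as easily make the same folders with the “New Folder” button in the Files quadrant of RStudio; the point is only to settle on a structure before the files start piling up.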
Summary
In this lesson we’ve covered what Projects in R are, why you might want to use them, how to open,
close, or switch between projects, and some best practices to best set you up for organizing yourself!
If you’ve ever worked collaboratively on a document before, this comic from PHD Comics might
resonate with you.
Which brings us to the next major benefit of version control: It keeps a record of all changes made to
the files. This can be of great help when you are collaborating with many people on the same files -
the version control software keeps track of who, when, and why those specific changes were made.
It’s like “Track changes” to the extreme!
Figure 69 - An example of the version control history for the development of this course!
This record is also helpful when developing code, if you realize after some time that you made a
mistake and introduced an error. You can find the last time you edited that particular bit of code, see
the changes you made, and revert back to that original, unbroken code, leaving everything else
you’ve done in the meanwhile untouched!
Finally, when working with a group of people on the same set of files, version control is helpful for
ensuring that you aren’t making changes to files that conflict with other changes. If you’ve ever
shared a document with another person for editing, you know the frustration of integrating their edits
with a document that has changed since you sent the original file - now you have two versions of
that same original document. Version control allows multiple people to work on the same file and
then helps merge all of the versions of the file and all of their edits into one cohesive file.
Figure 70 - Results of a StackOverflow survey asking which version control software their respondents use
And as you become more familiar with Git and how it works and interfaces with your projects,
you’ll begin to see why it has risen to the height of popularity. One of the main benefits of Git is that
it keeps a local copy of your work and revisions, which you can then edit offline, and then once you
return to internet service, you can sync your copy of the work, with all of your new edits and tracked
changes to the main repository online. Additionally, since all collaborators on a project have their
own local copy of the code, everybody can simultaneously work on their own parts of the code,
without disturbing that common repository.
Another big benefit that we’ll definitely be taking advantage of is the ease with which RStudio and
Git interface with each other. In the next lesson, we’ll work on getting Git installed and linked with
RStudio and making a GitHub account.
What is GitHub?
GitHub is an online interface for Git. Git is software used locally on your computer to record
changes. GitHub is a host for your files and the records of the changes made. You can sort of think
of it as being similar to DropBox - the files are on your computer, but they are also hosted online and
are accessible from any computer. GitHub has the added benefit of interfacing with Git to keep track
of all of your file versions and changes.
Repository: Equivalent to the project’s folder/directory - all of your version controlled files (and the
recorded changes) are located in a repository. This is often shortened to repo. Repositories are what
are hosted on GitHub and through this interface you can either keep your repositories private and
share them with select collaborators, or you can make them public - anybody can see your files and
their history.
Commit: To commit is to save your edits and the changes made. A commit is like a snapshot of
your files: Git compares the previous version of all of your files in the repo to the current version
and identifies those that have changed since then. Files that have not changed are left as previously
stored, untouched. For files that have changed, Git compares the versions, logs the changes,
and stores the new version of your file. We’ll touch on this in the next section, but when you
commit a file, typically you accompany that file change with a little note about what you changed
and why.
When we talk about version control systems, commits are at the heart of them. If you find a mistake,
you revert your files to a previous commit. If you want to see what has changed in a file over time,
you compare the commits and look at the messages to see why and who.
Push: Updating the repository with your edits. Since Git involves making changes locally, you need
to be able to share your changes with the common, online repository. Pushing is sending those
committed changes to that repository, so now everybody has access to your edits.
Pull: Updating your local version of the repository to the current version, since others may have
edited in the meanwhile. Because the shared repository is hosted online and any of your
collaborators (or even yourself on a different computer!) could have made changes to the files and
then pushed them to the shared repository, you are behind the times! The files you have locally
on your computer may be outdated, so you pull to check if you are up to date with the main
repository.
Figure 71 - Analogies to these concepts
Staging: The act of preparing a file for a commit. For example, if since your last commit you have
edited three files for completely different reasons, you don’t want to commit all of the changes in
one go; your message on why you are making the commit and what has changed will be complicated
since three files have been changed for different reasons. So instead, you can stage just one of the
files and prepare it for committing. Once you’ve committed that file, you can stage the second file
and commit it. And so on. Staging allows you to separate out file changes into separate commits.
Very helpful!
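Here is a minimal sketch of staging at the command line, assuming Git is already installed. The file names, commit messages, and the “Jane Doe” identity are placeholders, and everything happens in a scratch repository inside a temporary folder.

```shell
set -e
repo="$(mktemp -d)"
cd "$repo"
git init -q

# Placeholder identity, stored only in this scratch repository.
git config user.name "Jane Doe"
git config user.email "jane.doe@example.com"

# Two files edited for unrelated reasons.
echo "x <- 1" > analysis.R
echo "Project notes" > README.txt

# Stage and commit the first file on its own...
git add analysis.R
git commit -q -m "Add analysis script"

# ...then stage and commit the second file separately.
git add README.txt
git commit -q -m "Add project notes"

# Each change now has its own entry in the history.
git log --oneline
```

Because each commit touches one file for one reason, each message can stay short and specific.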
To summarize these commonly used terms so far and to test whether you’ve got the hang of this,
files are hosted in a repository that is shared online with collaborators. You pull the repository’s
contents so that you have a local copy of the files that you can edit. Once you are happy with your
changes to a file, you stage the file and then commit it. You push this commit to the shared
repository. This uploads your new file and all of the changes and is accompanied by a message
explaining what changed, why and by whom.
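The cycle summarized above can be tried end to end on your own machine. In this minimal sketch, a bare repository in a temporary folder stands in for the shared online repository (an assumption made so that no account or internet connection is needed), and the file name, commit message, and identity are placeholders.

```shell
set -e
work="$(mktemp -d)"
cd "$work"

# Stand-in for the shared repository (in practice this would be hosted online).
git init -q --bare shared.git

# Get a local copy of the repository that you can edit.
git clone -q shared.git my_copy
cd my_copy
git config user.name "Jane Doe"
git config user.email "jane.doe@example.com"

# Edit, stage, and commit with a message explaining the change.
echo "First draft" > notes.txt
git add notes.txt
git commit -q -m "Add first draft of notes"

# Push the commit to the shared repository.
branch="$(git rev-parse --abbrev-ref HEAD)"
git push -q origin "$branch"

# Later, pull to pick up anything collaborators have pushed in the meantime.
git pull -q origin "$branch"
```

Since the stand-in repository starts out empty, the sketch pushes before it pulls; on a real project with existing history you would normally pull first, as described above.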
Branch: When the same file has two simultaneous copies. When you are working locally and
editing a file, you have created a branch where your edits are not shared with the main repository
(yet) - so there are two versions of the file: the version that everybody has access to on the repository
and your local edited version of the file. Until you push your changes and merge them back into the
main repository, you are working on a branch. Following a branch point, the version history splits
into two: one line tracks changes to the original file in the repository, which others may be editing,
and the other tracks your changes on your branch. The two are later merged back together.
Merge: Independent edits of the same file are incorporated into a single, unified file. Independent
edits are identified by Git and are brought together into a single file, with both sets of edits
incorporated. But, you can see a potential problem here - if both people made an edit to the same
sentence that precludes one of the edits from being possible, we have a problem! Git recognizes this
disparity (conflict) and asks for user assistance in picking which edit to keep.
Conflict: When multiple people make changes to the same file and Git is unable to merge the edits.
You are presented with the option to manually try and merge the edits or to keep one edit over the
other.
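To see a branch and a merge in action, here is a minimal sketch using Git's own branch commands in a scratch repository. The branch name my-edits, the file draft.txt, and the identity are all hypothetical. Two independent edits are made to the same file, far enough apart that Git can combine them on its own.

```shell
set -e
repo="$(mktemp -d)"
cd "$repo"
git init -q
git config user.name "Jane Doe"
git config user.email "jane.doe@example.com"

# A small file with enough lines that edits at the top and bottom stay apart.
printf 'line 1\nline 2\nline 3\nline 4\nline 5\nline 6\nline 7\n' > draft.txt
git add draft.txt
git commit -q -m "Add draft"
git branch -M main    # make sure the starting branch is called "main"

# Edit the last line on a separate branch.
git checkout -q -b my-edits
sed 's/^line 7$/line 7, edited on my-edits/' draft.txt > draft.tmp && mv draft.tmp draft.txt
git commit -q -am "Edit last line"

# Meanwhile, the first line changes on main.
git checkout -q main
sed 's/^line 1$/line 1, edited on main/' draft.txt > draft.tmp && mv draft.tmp draft.txt
git commit -q -am "Edit first line"

# Merge: both sets of edits end up in a single file.
git merge -q -m "Merge my-edits into main" my-edits
cat draft.txt
```

Had both branches changed the same line instead, git merge would stop and report a conflict, leaving it to you to pick which edit to keep.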
Figure 72 - A visual representation of these concepts, from
https://www.atlassian.com/git/tutorials/using-branches/git-merge
Clone: Making a copy of an existing Git repository. If you have just been brought on to a project
that has been tracked with version control, you would clone the repository to get access to and create
a local version of all of the repository’s files and all of the tracked changes.
Fork: A personal copy of a repository that you have taken from another person. If somebody is
working on a cool project and you want to play around with it, you can fork their repository and then
when you make changes, the edits are logged on your repository, not theirs.
Best practices
It can take some time to get used to working with version control software like Git, but there are a
few things to keep in mind to help establish good habits that will help you out in the future.
One of those things is to make purposeful commits. Each commit should only address a single issue.
This way if you need to identify when you changed a certain line of code, there is only one place to
look to identify the change and you can easily see how to revert the code.
Similarly, making sure you write informative messages on each commit is a helpful habit to get into.
If each message is precise in what was being changed, anybody can examine the committed file and
identify the purpose for your change. Additionally, if you are looking for a specific edit you made in
the past, you can easily scan through all of your commits to identify those changes related to the
desired edit.
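As a rough illustration of why single-purpose commits and clear messages pay off, the sketch below scans a short history and then undoes one commit. It uses git revert, which records a new commit that reverses an earlier one; that is one way to undo a change, not the only one. The file, messages, and identity are placeholders.

```shell
set -e
repo="$(mktemp -d)"
cd "$repo"
git init -q
git config user.name "Jane Doe"
git config user.email "jane.doe@example.com"

echo "total <- 1 + 1" > sums.R
git add sums.R
git commit -q -m "Add script that computes a total"

# A second, single-purpose commit that turns out to be a mistake.
echo "total <- 1 + 2" > sums.R
git add sums.R
git commit -q -m "Change second term of total to 2"

# Scan the messages to find the commit of interest.
git log --oneline

# Undo only the most recent commit by recording a new commit that reverses it.
git revert --no-edit HEAD

# The file is back to its earlier contents, and the history keeps all three commits.
cat sums.R
```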
Finally, be cognizant of the version of files you are working on. Check that you are up to
date with the current repo by pulling frequently. Additionally, don’t hoard your edited files - once
you have committed your files (and written that helpful message!), you should push those changes to
the common repository. If you are done editing a section of code and are planning on moving on to
an unrelated problem, you need to share that edit with your collaborators!
Figure 73 - A summary of the main best practices to keep in mind as you work with version control
Summary
Now that we’ve covered what version control is and some of the benefits, you should be able to
understand why we have three whole lessons dedicated to version control and installing it. We
looked at what Git and GitHub are, and then covered much of the commonly used (and sometimes
confusing!) vocabulary inherent to version control work. We then quickly went over some best
practices for using Git – but the best way to get the hang of this all is to use it! Hopefully you feel like
you have a better handle on how Git works than the people in this XKCD comic! So let’s move
on to the next lesson and get it installed!
What is GitHub?
As we previously learned, GitHub is a cloud-based management system for your version-controlled
files. Like DropBox, your files are both locally on your computer and hosted online and easily
accessible. Its interface allows you to manage version control and provides users with a web-based
interface for creating projects, sharing them, updating code, etc.
Logging in to GitHub
You should now be logged in to GitHub! In the future, to log on to GitHub, go
to https://github.com/, where you will be presented with the homepage. If you aren’t already
logged in, click on the “Sign in” link at the top.
Once you’ve done that, you will see the log in page where you will enter in your username and
password that you created earlier.
Figure 75 - GitHub’s log in page
Once logged in, you will be back at https://github.com/, but this time the screen should look
like this:
The homepage
We’re going to take a quick tour of the GitHub website, and we’ll particularly focus on these
sections of the interface:
1. User settings
2. Notifications
3. Help files
4. The GitHub guide
Following this tour, we’ll make your very first repository using the GitHub guide!
Figure 77 - Some major features of GitHub
User settings
Now that you’ve logged on to GitHub, we should fill out some of your profile information and get
acquainted with the account settings. In the upper right corner, there is an icon with an arrow beside
it. Click this and go to “Your profile.”
This is where you control your account from and can view your contribution histories and
repositories.
Since you are just starting out, you aren’t going to have any repositories or contributions yet - but
hopefully we’ll change that soon enough! What we can do right now is edit your profile.
Go to “Edit profile” along the lefthand edge of the page. Here, take some time and fill out your name
and a little description of yourself in the “Bio” box, and if you like, upload a picture of yourself!
When you are done, click “Update profile”
Along the lefthand side of this page, there are many options for you to explore. Click through each
of these menus to get familiar with the options available to you. To get you started, go to the account
page.
Here, you can edit your password or if you are unhappy with your username, change it. Be careful
though, there can be unintended consequences when you change your username - if you are just
starting out and don’t have any content yet, you’ll probably be safe though.
Continue looking through the personal setting options on your own. When you are done, go back to
your profile.
Once you’ve had a bit more experience with GitHub, you’ll eventually end up with some
repositories to your name. To find those, click on the “Repositories” link on your profile. For now, it
will probably look like this:
Figure 82 - Your repositories page
By the end of the lecture though, check back to this page to find your newly created repository!
Notifications
Next, we’ll check out the notifications menu. Along the menu bar across the top of your
window, there is a bell icon, representing your notifications. Click on the bell.
Once you become more active on GitHub and are collaborating with others, here is where you can
find messages and notifications for all the repositories, teams, and conversations you are a part of.
Help files
Along the bottom of every. single. page. there is the “Help” button. GitHub has a great help
system in place - if you ever have a question about GitHub, this should be your first point to search!
Take some time now and look through the various help files, and see if any catch your eye.
Figure 85 - At the bottom of every page, you can find the Help page
Take some time to explore around the repository - Check out your commit history so far. Here you
can find all of the changes that have been made to the repository, and you can see who made the
change, when they made the change, and provided you wrote an appropriate commit message, you
can see why they made the change! It should look similar to this:
Once you’ve explored all of the options in the repository, go back to your user profile. It should look
a little different from before:
Figure 89 - Your profile now shows your first repository
Now when you are on your profile you can see your latest repository created and for a complete
listing of your repositories, click on the “Repositories” tab. Here you can see all of your repositories,
a brief description, the time of the last edit, and along the right hand side, there is an activity graph,
showing when and how many edits have been made on the repository.
Git
As you may remember from our last lecture, Git is the free and open source version control system
which GitHub is built on.
One of the main benefits of using the Git system is its compatibility with RStudio; however, in order
to link the two together, we first need to download and install Git on your computer.
Click on the appropriate download link for your operating system. This should initiate the download
process.
For Windows
Once the download is finished, open the .exe file to initiate the installation wizard. If you receive a
security warning, click “Run” and/or “Allow.” Following this, click through the installation wizard,
generally accepting the default options unless you have a compelling reason not to.
Click “Install” and allow the wizard to complete the installation process. Following this, check the
“Launch Git Bash” option, and unless you are curious, deselect the “View Release Notes” box, as
you are probably not interested in this right now.
Figure 93 - Finishing the install process
Doing so will open a command line environment. Provided you accepted the default options during
the installation process, there will now be a Start menu shortcut to launch Git Bash in the future. You
have now installed Git.
Figure 94 - Git Bash is the command line interface you will use to configure Git
For Mac
We will walk you through the most common installation process; however, there are multiple ways to
get Git onto your Mac. You can follow the tutorials
at https://www.atlassian.com/git/tutorials/install-git for alternative installation
routes.
After downloading the appropriate Git version for Macs, you should have a DMG file
for installation on your Mac. Open this file to start installing Git on your computer. A new window
will open.
Double click on the .pkg file and an installation wizard will open. Click through the options,
accepting the defaults. Click Install. When prompted, close the installation wizard. You have
successfully installed Git!
Figure 96 - Steps to a successful installation of Git!
Configuring Git
Now that Git is installed, we need to configure it for use with GitHub, in preparation for linking it
with RStudio.
We need to tell Git what your username and email are, so that it knows how to name each commit as
coming from you. To do so, in the command prompt (either Git Bash for Windows or Terminal for
Mac), type: git config --global user.name "Jane Doe" with your desired username in
place of “Jane Doe.” This is the name each commit will be tagged with.
Following this, in the command prompt, type: git config --global user.email followed by your
email address, MAKING SURE TO USE THE SAME EMAIL ADDRESS YOU
SIGNED UP FOR GITHUB WITH!
Figure 97 - Configuring Git to tag each commit with your name and interface with GitHub
To check that this worked, list your settings, for example by typing git config --list. You should see
the username and email you selected above among them. If you notice any problems or
want to change these values, just retype the original config commands from earlier with your desired
changes.
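If you would like to practice these commands without touching your real settings, here is a minimal sketch that does the same thing inside a scratch repository. Note the assumption: it leaves out --global, so the name and email (both placeholders) apply only to that one throwaway repository.

```shell
set -e
repo="$(mktemp -d)"
cd "$repo"
git init -q

# Repository-local settings (no --global), so your real configuration is untouched.
git config user.name "Jane Doe"
git config user.email "jane.doe@example.com"

# Read the values back to confirm them.
git config user.name
git config user.email
```

For your actual setup, use the --global form described above so that every repository on your computer picks up the same name and email.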
Once you are satisfied that your username and email are correct, exit the command line by
typing exit and hit Enter. At this point, you are all set up for the next lecture!
Summary
In this lesson, we signed up for a GitHub account and toured the GitHub website. We made your
first repository and filled in some basic profile information on GitHub. Following this, we installed
Git on your computer and configured it for compatibility with GitHub and RStudio.
Sometimes the default path to the Git executable is not correct. Confirm that git.exe resides in the
directory that RStudio has specified; if not, change the directory to the correct path. Otherwise, click
OK or Apply.
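If you are not sure where Git lives on your computer, one way to find out is to ask the shell (Terminal on a Mac, or Git Bash on Windows). This is a small sketch that assumes Git is on your PATH; the path it prints will vary from machine to machine, and in Git Bash it is shown in Unix style, so it may not match the Windows-style path that RStudio displays character for character.

```shell
# Print the full path of the git executable found on your PATH.
command -v git

# Print the installed version, to confirm that the executable runs.
git --version
```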
Figure 101 - Confirm that the directory RStudio points to for the Git executable is correct
Following this, in that same window again, click “View public key” and copy the string of numbers
and letters. Close this window.
Figure 102 - Generate an RSA key and copy the public key to your clipboard
You have now created a key that is specific to you which we will provide to GitHub, so that it knows
who you are when you commit a change from within RStudio.
To do so, go to github.com/, log-in if you are not already, and go to your account settings. There,
go to “SSH and GPG keys” and click “New SSH key”. Paste in the public key you have copied from
RStudio into the Key box and give it a Title related to RStudio. Confirm the addition of the key with
your GitHub password.
Figure 103 - Location of “SSH and GPG keys” on your profile settings
Figure 104 - Telling GitHub the public SSH key generated in RStudio
GitHub and RStudio are now linked. From here, we can create a repository on GitHub and link to
RStudio.
Create a new repository and edit it in RStudio
On GitHub, create a new repository (github.com > Your Profile > Repositories > New).
Name your new test repository and give it a short description. Click Create repository. Copy the
URL for your new repository.
In RStudio, go to File > New Project. Select Version Control. Select Git as your version control
software. Paste in the repository URL from before and select the location where you would like the
project stored. When done, click on “Create Project”. Doing so will initialize a new project, linked to
the GitHub repository, and open a new session of RStudio.
Create a new R script (File > New File > R Script) and copy and paste the following code:
Save the file. Note that when you do so, the default location for the file is within the new Project
directory you created earlier.
Figure 109 - Saving your first script for this project
Once that is done, looking back at RStudio, in the Git tab of the environment quadrant, you should
see your file you just created! Click the checkbox under “Staged” to stage your file.
Figure 110 - All files that have been modified since your last pull appear in the Git tab
Click “Commit”. A new window should open, that lists all of the changed files from earlier, and
below that shows the differences in the staged files from previous versions. In the upper quadrant, in
the “Commit message” box, write yourself a commit message. Click Commit. Close the window.
Click “Push” to send the commit to GitHub, then go to your GitHub repository and see that the commit has been recorded.
You’ve just successfully pushed your first commit from within RStudio to GitHub!
Summary
In this lesson, we linked Git and RStudio, so that RStudio recognizes you are using Git as your
version control software. Following that, we linked RStudio to GitHub, so that you can push and pull
repositories from within RStudio. To test this, we created a repository on GitHub, linked it with a
new project within RStudio, created a new file, and then staged, committed, and pushed the file to
your GitHub repository!
Thankfully, RStudio and GitHub recognize this can happen and have steps in place to help you
(admittedly, this is slightly more troublesome to do than just creating a repository on GitHub and
linking it with RStudio before starting the project…).
So first, let’s set up a situation where we have a local project that isn’t under version control. Go to
File > New Project > New Directory > New Project and name your project. Since we are
trying to emulate a time where you have a project not currently under version control, do NOT click
“Create a git repository”. Click Create Project.
Figure 113 - Creating a project that is not under version control
We’ve now created an R Project that is not currently under version control. Let’s fix that. First, let’s
set it up to interact with Git. Open Git Bash or Terminal and navigate to the directory containing
your project files. Move around directories by typing cd ~/dir/name/of/path/to/file
(for example, cd ~/Proyectos/Temporary_add_to_version_control).
When the command prompt in the line before the dollar sign shows the directory of
your project, you are in the correct location. Once here, type git init followed by git add .
This initializes (init) the directory as a Git repository and stages all of the files in the directory (.)
so they are ready to be committed. Commit these changes to the Git repository using git commit -m "Initial
commit".
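Put together, the commands above look something like the following sketch. To keep it safe to run, it creates a stand-in project folder in a temporary location rather than using your real project; the script.R file and the identity are placeholders.

```shell
set -e
# Stand-in for an existing project folder that is not yet under version control.
project="$(mktemp -d)/Temporary_add_to_version_control"
mkdir -p "$project"
cd "$project"
echo "x <- 1" > script.R

# Turn the folder into a Git repository.
git init -q
git config user.name "Jane Doe"
git config user.email "jane.doe@example.com"

# Stage every file in the directory, then record the first commit.
git add .
git commit -q -m "Initial commit"

# Nothing is listed here once everything has been committed.
git status --short

# The history now contains the initial commit.
git log --oneline
```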
Figure 114 - Linking the project folder with Git so it is now under version control
At this point, we have created an R Project and have now linked it to Git version control. The next
step is to link this with GitHub.
Figure 115 - Creating a repository on GitHub that is named the same as your R project
Upon creating the repository, you should see a page like this:
Figure 116 - Your new repository on GitHub containing code to push from the command line
You should see that there is an option to “Push an existing repository from the command line” with
instructions below containing code on how to do so. In Git Bash or Terminal, copy and paste these
lines of code to link your repository with GitHub. After doing so, refresh your GitHub page and it
should now look something like the image below.
When you re-open your project in RStudio, you should now have access to the Git tab in the upper
right quadrant and can push any future changes to GitHub from within RStudio.
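For a feel of what those “push an existing repository” lines do, here is a minimal sketch. It makes a few assumptions: a bare repository in a temporary folder stands in for your empty GitHub repository, the branch is named main, and the file and identity are placeholders. The exact lines GitHub shows you may differ, so copy those rather than these.

```shell
set -e
work="$(mktemp -d)"
cd "$work"

# Stand-in for the empty repository created on GitHub.
git init -q --bare remote.git

# Stand-in for the local project that is already under version control.
mkdir project && cd project
git init -q
git config user.name "Jane Doe"
git config user.email "jane.doe@example.com"
echo "x <- 1" > script.R
git add .
git commit -q -m "Initial commit"

# Tell Git where the online repository lives (on GitHub, this would be your repository URL).
git remote add origin "$work/remote.git"

# Name the current branch, then push it and remember where it goes for next time.
git branch -M main
git push -q -u origin main
```

The -u option links your local branch to the one on the remote, which is what lets later pushes and pulls work without spelling out the destination each time.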
Figure 117 - You’ve now pushed your R project repository to your GitHub repository of the same name
Figure 118 - Follow the same steps as previously done to clone your own repository to a new project in RStudio
Figure 119 - Clone an existing project from GitHub from within RStudio
All the existing files in the repository should now be stored locally on your computer and you have
the ability to push edits from your RStudio interface. The only difference from the last lesson is that
you did not create the original repository; instead, you cloned somebody else’s.
Summary
In this lesson, we went over how to convert an existing project to be under Git version control using
the command line. Following this, we linked your newly version controlled project to GitHub using
a mix of GitHub commands and the command line. We then briefly recapped how to clone an
existing GitHub repository to your local machine using RStudio.
What is R Markdown?
R Markdown is a way of creating fully reproducible documents, in which both text and code can be
combined. In fact, these lessons are written using R Markdown! That’s how we make things:
bullets
bold
italics
links
or run inline r code
And by the end of this lesson, you should be able to do each of those things too, and more!
Despite these documents all starting as plain text, you can render them into HTML pages, or PDFs,
or Word documents, or slides! The symbols you use to signal, for example, bold or italics are
compatible with all of those formats.
Why use R Markdown?
One of the main benefits of R Markdown is reproducibility. Since you can easily combine
text and code chunks in one document, you can integrate introductions, hypotheses, the code
you are running, the results of that code and your conclusions all in one place. Sharing what
you did, why you did it and how it turned out becomes simple - and the person you share it with
can re-run your code and get the exact same answers you got. That’s what we mean by
reproducibility. But also, sometimes you will be working on a project that takes many weeks to
complete; you want to be able to see what you did a long time ago (and perhaps be reminded exactly
why you were doing it). An R Markdown document lets you see exactly what you ran AND the
results of that code.
Another major benefit to R Markdown is that since it is plain text, it works very well with version
control systems. Unlike formats that aren’t plain text, it is easy to track exactly which characters
changed between commits. For example, in one version of this lesson, I may have forgotten to
bold this word. When I catch my mistake, I can make the plain text changes to signal I would like
that word bolded, and in the commit, you can see the exact character changes that occurred to now
make the word bold.
Check out this video that the RStudio developers have released about R Markdown and what it is!
Installation
Another (selfish) benefit of R Markdown is how easy it is to use! Like everything in R, this extended
functionality comes from an R package - “rmarkdown.” All you need to do to install it is
run install.packages("rmarkdown") in the R console.
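If you expect to re-run your setup code, a small wrapper can skip the installation when the package is already there. This is only a sketch: the helper name install_if_missing is hypothetical (it is not part of R or of rmarkdown) and simply combines the real requireNamespace() and install.packages() functions.

```r
# Hypothetical helper: install a package only if it is not already available.
# Returns TRUE when an installation was attempted, FALSE when nothing was needed.
install_if_missing <- function(pkg) {
  if (requireNamespace(pkg, quietly = TRUE)) {
    return(FALSE)
  }
  install.packages(pkg)
  TRUE
}

# "stats" ships with every R installation, so this call installs nothing.
already_there <- install_if_missing("stats")
```

Calling install_if_missing("rmarkdown") would then install the package the first time and do nothing on later runs.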
I’ve filled in a title and an author and switched the output format to a PDF. Explore around this
window and the tabs along the left to see all the different formats that you can output to. When you
are done, click OK, and a new window should open with a little explanation on R Markdown files.
There are three main sections of an R Markdown document. The first is the header at the top,
bounded by the three dashes (---). This is where you can specify details like the title, your name,
the date, and what kind of document you want output. If you filled in the blanks in the window
earlier, these should be filled out for you.
Also on this page, you can see text sections; for example, one section starts with “## R Markdown”.
We’ll talk more about what this means in a second, but this section will render as text when you
produce the PDF of this file - and all of the formatting you will learn generally applies to this
section.
And finally, you will see code chunks. These are bounded by the triple backticks (```). These are
pieces of R code (“chunks”) that you can run right from within your document - and the output of
this code will be included in the PDF when you create it.
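To make the three sections concrete, here is a minimal sketch that builds a tiny R Markdown document as plain text from within R and then locates the header and the code chunk. The title, author and paragraph text are placeholders, and the variable names are our own; your default document from RStudio will contain more than this.

```r
# Build the three parts of a minimal R Markdown document as plain text.
fence <- strrep("`", 3)  # three backticks, which open and close a code chunk

header <- c(
  "---",
  'title: "My First Document"',  # placeholder title
  'author: "Your Name"',         # placeholder author
  "output: pdf_document",
  "---"
)

text_section <- c(
  "## R Markdown",
  "",
  "This paragraph will render as ordinary text."
)

code_chunk <- c(
  paste0(fence, "{r}"),
  "summary(cars)",
  fence
)

rmd_lines <- c(header, "", text_section, "", code_chunk)

# Write the document to a temporary file and read it back.
rmd_path <- tempfile(fileext = ".Rmd")
writeLines(rmd_lines, rmd_path)
saved <- readLines(rmd_path)

# The header sits between the first two lines made of three dashes.
header_bounds <- which(saved == "---")[1:2]
# The code chunk sits between the two lines that start with three backticks.
chunk_bounds <- which(startsWith(saved, fence))

cat(saved, sep = "\n")
```

Printing the file this way shows that an R Markdown document really is just plain text: the dashes and backticks are the only things that mark where the header and the code chunk begin and end.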
The easiest way to see how each of these sections behaves is to produce the PDF!
“Knitting” documents
When you are done with a document in R Markdown, you are said to “knit” your plain text and
code into your final document. To do so, click on the “Knit” button along the top of the source panel.
RStudio will then prompt you to save the document as an R Markdown (.Rmd) file. Do so.
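If you would rather knit from the console than click the button, the rmarkdown package provides the render() function, which does roughly the same job. The sketch below makes a few assumptions: the file name and contents are placeholders, the output is set to HTML rather than PDF (PDF output additionally needs a LaTeX installation), and the render step is skipped if rmarkdown or pandoc is not found on your machine.

```r
# Write a tiny R Markdown file to render (contents are placeholders).
fence <- strrep("`", 3)  # three backticks
rmd_path <- file.path(tempdir(), "knit-example.Rmd")
writeLines(
  c(
    "---",
    'title: "Knit example"',
    "output: html_document",
    "---",
    "",
    "## R Markdown",
    "",
    paste0(fence, "{r}"),
    "summary(cars)",
    fence
  ),
  rmd_path
)

# Render only when the rmarkdown package and pandoc are both available.
can_render <- requireNamespace("rmarkdown", quietly = TRUE) &&
  rmarkdown::pandoc_available()

if (can_render) {
  # render() returns the path of the output file it created.
  output_path <- rmarkdown::render(rmd_path, quiet = TRUE)
  message("Rendered document: ", output_path)
} else {
  message("rmarkdown or pandoc not found; skipping the render step.")
}
```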
Figure 122 - The rendered PDF you created by knitting your markdown file
So here you can see that the content of the header was rendered into a title, followed by your name
and the date. The text chunks produced a section header called “R Markdown”, which is followed by
two paragraphs of text. Following this, you can see the R code summary(cars) and, importantly,
the output of running that code. Further down you will see code that ran to produce
a plot, and then that plot. This is one of the huge benefits of R Markdown - rendering the results of
code inline.
Go back to the R Markdown file that produced this PDF and see if you can see how you signify you
want text bolded. (Hint: Look at the word “Knit” and see what it is surrounded by).
What are some easy Markdown commands?
At this point, I hope we’ve convinced you that R Markdown is a useful way to keep your code and
data together, and that we have set you up to be able to play around with it. To get you started,
we’ll practice some of the formatting that is inherent to R Markdown documents.
To start, let’s look at bolding and italicizing text. To bold text, you surround it with two asterisks on
either side. Similarly, to italicize text, you surround it with a single asterisk on either
side: **bold** and *italics*, respectively.
We’ve also seen from the default document that you can make section headers. To do this, you start a
line with a series of hash marks (#), followed by a space and the header text. The number of hash
marks determines the level of the heading. One hash is the highest level and will make the largest
text (see the first line of this lecture), two hashes is the next highest level, and so on. Play around
with this formatting and make a series of headers, like so:
# Header level 1
## Header level 2
### Header level 3...
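If you ever generate Markdown text from R itself, the same rules can be written as small functions. This is just an illustrative sketch: the helper names md_bold, md_italic and md_header are hypothetical and not part of any package.

```r
# Hypothetical helpers that wrap text in Markdown formatting symbols.
md_bold   <- function(text) paste0("**", text, "**")
md_italic <- function(text) paste0("*", text, "*")

# A header is `level` hash marks, a space, then the header text.
md_header <- function(text, level = 1) {
  paste(strrep("#", level), text)
}

md_bold("bold")                         # "**bold**"
md_italic("italics")                    # "*italics*"
md_header("Header level 2", level = 2)  # "## Header level 2"
```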
The other thing we’ve seen so far is code chunks. To make an R code chunk, you type three
backticks followed by a lower case r in curly brackets ({r}), put your code on a new line
and end the chunk with three more backticks. Thankfully, RStudio recognized you’d be doing this a
lot and there are shortcuts, namely Ctrl+Alt+I (Windows) or Cmd+Option+I (Mac).
Additionally, along the top of the source quadrant, there is the “Insert” button, which will also produce
an empty code chunk. Try making an empty code chunk. Inside it, type the code
print("Hello world"). When you knit your document, you will see this code chunk and the (admittedly
simplistic) output of that chunk.
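As a rough sketch of what that chunk looks like as plain text, the R code below assembles the chunk line by line and then captures what the code inside it prints. The variable names are our own, and the three backticks are built with strrep() only so that this example does not end its own code block early.

```r
# The code that goes inside the chunk.
chunk_code <- 'print("Hello world")'

# Wrap it in a chunk: three backticks plus {r}, the code, three backticks.
fence <- strrep("`", 3)
chunk <- c(paste0(fence, "{r}"), chunk_code, fence)
cat(chunk, sep = "\n")

# Running the code by itself shows the text the knitted document will display.
output <- capture.output(print("Hello world"))
cat(output, sep = "\n")  # prints: [1] "Hello world"
```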
If you aren’t ready to knit your document yet, but want to see the output of your code, select the line
of code you want to run and use Ctrl+Enter or hit the “Run” button along the top of your source
window. The text “Hello world” should be output in your console window. If you have multiple
lines of code in a chunk and you want to run them all in one go, you can run the entire chunk by
using Ctrl+Shift+Enter OR hitting the green arrow button on the right side of the chunk OR
going to the Run menu and selecting Run current chunk.
One final thing we will go into detail on is making bulleted lists, like the one at the top of this lesson.
Lists are easily created by preceding each prospective bullet point with a single dash, followed by a
space. Importantly, end each bullet’s line with TWO spaces. In Markdown, two trailing spaces force a
line break, and leaving them out can cause spacing problems.
- Try
- Making
- Your
- Own
- Bullet
- List!
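The bullet rule can also be sketched as a short R function, which is handy if you want to turn a vector of items into list lines. The helper name md_bullets is hypothetical; it follows this lesson's recommendation of a dash and a space in front and two spaces at the end of each line.

```r
# Hypothetical helper: turn a character vector into Markdown bullet lines,
# with a dash and a space in front and two trailing spaces at the end.
md_bullets <- function(items) {
  paste0("- ", items, "  ")
}

bullets <- md_bullets(c("Try", "Making", "Your", "Own", "Bullet", "List!"))
cat(bullets, sep = "\n")
```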
This is a great starting point and there is so much more you can do with R Markdown. Thankfully,
RStudio developers have produced an “R Markdown cheatsheet” that we urge you to go check out
and see everything you can do with R Markdown! The sky is the limit!
Summary
In this lesson we’ve delved into R Markdown, starting with what it is and why you might want to use
it. We hopefully got you started with R Markdown, first by installing it, and then by generating and
knitting our first R Markdown document. We then looked at some of the various formatting options
available to you and practiced generating code and running it within the RStudio interface.