STATS 2DA3 Introduction to Data Science Methods
Introduction
• Take a look at the course outline.
• Notes will be posted online ahead of classes on Avenue to Learn. Please check Avenue regularly for course announcements and assignments.
• Assignments will be administered using Avenue to Learn.
• Office hours.
• We have 2 lectures and 1 lab per week. Please check Mosaic for information on labs.
• There is no mandatory textbook; however, many of my notes are taken from R in Action by Robert Kabacoff. There is a list of Suggested Reading on Avenue to Learn.
Data Analytics
• Data analysis can involve some or all of the following:
• transforming the data.
• imputation of missing values.
• variable selection.
• statistical modelling.
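As a minimal sketch of the steps listed above, the following uses R's built-in airquality dataset. The choice of dataset, the mean imputation, the log transform, and the object names (aq, aq_small, fit) are all assumptions made for illustration; they are not methods prescribed by the course.

```r
# Built-in dataset; the Ozone column contains missing values (NA)
data(airquality)
aq <- airquality

# Imputation: replace missing Ozone values with the mean of the observed values
# (a deliberately simple choice, used here only as an example)
ozone_mean <- mean(aq$Ozone, na.rm = TRUE)
aq$Ozone[is.na(aq$Ozone)] <- ozone_mean

# Transformation: log-transform the (now complete) Ozone column
aq$logOzone <- log(aq$Ozone)

# Variable selection: keep only the columns used in the model
aq_small <- aq[, c("logOzone", "Temp", "Wind")]

# Statistical modelling: a simple linear regression
fit <- lm(logOzone ~ Temp + Wind, data = aq_small)
summary(fit)
```

Each step stores its result in an object that the next step uses, which is how a typical analysis script is built up.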
Data Analytics II
• Modern Data Analytics also includes:
• pulling data from a variety of sources, such as database management systems, text files, spreadsheets, a variety of different statistical packages, and web pages.
• merging of data obtained from different sources.
• data cleaning.
• analysis with modern techniques such as Machine Learning techniques.
• creating graphical displays of results.
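To make the idea of merging data from different sources concrete, here is a small sketch using base R's merge(). The two tables (students, grades) and their contents are made up for illustration; in practice they might come from read.csv(), a database, or a spreadsheet.

```r
# Hypothetical source 1: text in CSV format (stands in for a file on disk)
students <- read.csv(text = "id,name\n1,Ana\n2,Ben\n3,Cy",
                     stringsAsFactors = FALSE)

# Hypothetical source 2: a data frame created directly in R
grades <- data.frame(id = c(1, 2, 4), grade = c(85, 72, 90))

# Merge on the shared key; by default only ids present in both tables are kept
merged <- merge(students, grades, by = "id")

# all.x = TRUE keeps every student and fills missing grades with NA
merged_all <- merge(students, grades, by = "id", all.x = TRUE)

merged
merged_all
```

The NA that appears in merged_all is the kind of gap that data cleaning and imputation would then have to deal with.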
Data Analysis Flowchart
[Figure: data analysis flowchart. Image from R in Action by Robert Kabacoff.]
About R
• R is an environment and programming language used for statistical computing.
• It is open-source (and hence free!).
• There are many powerful graphics packages, such as ggplot2.
• Results from any step in an analysis can be saved, manipulated, and used as a new input.
• R functionality can be integrated into other languages, e.g. C++, Python, SAS.
• R can run on basically any platform, e.g. Mac OS, Windows, Unix.
Installing R
• R is open source (and free) and can be downloaded from the Comprehensive R Archive Network (CRAN).
• Go to http://cran.r-project.org and download the version appropriate for your operating system (probably Windows or Mac).
• We will download additional libraries, such as ggplot2, later.
R Basics
• R is:
• case sensitive.
• an interpreted language (more on this later).
• You can enter commands at the prompt (>) and they will be executed one at a time; however, I recommend running your commands from a source file.
• R uses lots of data types, e.g. vectors, data frames, matrices, and lists (more on this later).
• There are lots of built-in functions, and users can create their own.
• Statements consist of functions and assignments.
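The points above can be sketched in a few lines. All object names here (x, X, v, m, df, l, square) are arbitrary choices for illustration.

```r
# R is case sensitive: x and X are two different objects
x <- 10
X <- 20

# Some common data types
v  <- c(1.5, 2.5, 4)                       # numeric vector
m  <- matrix(1:6, nrow = 2)                # matrix with 2 rows and 3 columns
df <- data.frame(id = 1:3, score = v)      # data frame
l  <- list(name = "example", values = v)   # list

# A built-in function
mean(v)

# A user-defined function
square <- function(z) z^2
square(4)
```

Saving these lines in a source file and running them from there, rather than typing them at the prompt, makes the work easy to repeat.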
R Basics II
• Objects can be created and manipulated. An object is anything that can be assigned a value, e.g. data, results...
• An object must have a class attribute, which tells R how to handle it correctly.
• <- is used for assignments (rather than =).
Note: value -> x is treated the same as x <- value, but don't use ->; it's not standard convention.
• To comment out text, use #. R will ignore anything that comes after #.
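A short sketch of assignment, classes, and comments. The object names y and z are arbitrary, chosen only for this example.

```r
y <- 5          # standard assignment
5 -> z          # rightward assignment: legal, but not conventional
identical(y, z) # both names now refer to the same value

# class() reports the class R uses to decide how to handle an object
class(y)              # "numeric"
class("hello")        # "character"
class(c(TRUE, FALSE)) # "logical"
class(mean)           # "function"

# This whole line is a comment, so R ignores it
```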
Language Type
• Any program is basically a set of instructions.
• Both compiled and interpreted languages take human-readable code and convert it into machine code, which can be read by a computer.
• With compiled languages, a compiler translates the whole program into machine code before it is run, and the target machine then executes that machine code directly.
• With interpreted languages, an interpreter program reads and executes the code line by line.
Compiled Languages
• Compiled Languages:
• are directly converted into machine code, which the target machine executes.
• are fast.
• allow control of aspects such as memory and CPU use.
• have to be manually compiled before execution.
• If a change to the program is desired, once the change is made the whole program needs to be recompiled.
• Examples include C and C++.
Interpreted Language
• Interpreted Languages:
• use an interpreter, which executes the program line by line.
• are usually slower to execute, relative to compiled languages.
• The main advantage of interpreted languages, besides the ease of editing the code, is that the interpreter executes the source code directly. Hence the same code can run on any platform that has an interpreter.
• Examples include R and Python.
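As a rough illustration of "reading and executing the code line by line", the sketch below uses R's parse() and eval() to split some source text into statements and evaluate them one at a time. This is only an analogy for what an interpreter does, not a description of how R is implemented internally; the code string and the names a, b, exprs, and value are made up for the example.

```r
# Three statements stored as text (a stand-in for the lines of a source file)
code <- "a <- 2\nb <- a * 3\nb + 1"

# parse() splits the text into separate expressions without running them
exprs <- parse(text = code)
length(exprs)

# Evaluate the expressions one at a time, in order
for (e in exprs) {
  value <- eval(e)
  print(value)
}
```

Because each statement is evaluated in turn, the second statement can use the object a created by the first, and no separate compilation step is needed.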
RStudio
• The R interface is very simple (and I like it).
• However, most people use RStudio.
• RStudio Desktop is available at http://www.rstudio.com, specifically https://www.rstudio.com/products/rstudio/#rstudio-desktop. The last time I checked, you could get a free version there.
• RStudio uses multiple windows and has tools for importing data, visualizing output, and writing reports (R Markdown).
• RStudio is just an interface. Make sure you install R before installing RStudio!
The Workspace
• Your workspace includes all user-defined objects, such as vectors, matrices, functions, data frames and lists.
• Your working directory is where R reads files from, and will save files to by default unless told otherwise.
• The function getwd() tells you your current working directory.
• The function setwd() allows you to change the working directory.
• You can read in a file that is not in the current working directory by using the full path name.
• Use quotation marks (" ") around file and directory names.
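A minimal sketch of these functions in use. It switches to R's temporary directory, tempdir(), purely as a safe example location; the file name example.csv and the object names are hypothetical.

```r
getwd()                      # print the current working directory

old_dir <- getwd()           # remember where we started
setwd(tempdir())             # change the working directory (note: no quotes needed
                             # here because tempdir() already returns a string)
getwd()
setwd(old_dir)               # change back

scores <- c(90, 85)
ls()                         # list the user-defined objects in the workspace

# A file outside the working directory is reached through its full path
path <- file.path(tempdir(), "example.csv")   # hypothetical file name
write.csv(data.frame(a = 1:2), path, row.names = FALSE)
read.csv(path)
```

When a path is typed by hand it must be quoted, for example setwd("C:/myproject") on Windows, where the directory shown is only a placeholder.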
Packages
• Packages are collections of R functions, data, and code.
• R comes with many built-in packages, and you can download and install other packages that are of interest to you.
• You must load a package into your coding session to be able to access it.
• Packages are stored in the library directory.
• The function .libPaths() tells you where your library is located.
• The function library() shows you what packages are in your library.
• The function search() displays what packages are currently loaded.
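The functions above can be tried as follows. The stats package is used because it ships with R; the ggplot2 lines are commented out since installing requires an internet connection and only needs to be done once.

```r
.libPaths()                            # where the package library is located
head(rownames(installed.packages()))   # names of a few installed packages
search()                               # packages (and environments) currently attached

library(stats)                         # load a package that ships with R

# install.packages("ggplot2")          # download and install from CRAN (run once)
# library(ggplot2)                     # then load it in each session that needs it
```

Calling library() with no arguments lists every installed package, which can be a long list; installed.packages() is used here only to keep the output short.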
Examples
• Let’s now look at some basic examples in R.
• Please download R and make sure you can run and understand the examples before the next class.
2023-01-18