Data Engineering MG-GY 8411 | Fall 2020
What is Data Engineering?
The growth and adoption of technology have allowed for the collection, transmission and storage of large and diverse datasets. Companies gather numbers, text, images, and recordings to better understand customer habits in sectors ranging from healthcare to telecommunications to finance. The information behind datasets can provide support for strategic goals or operational needs in industry. Data has helped companies tackle a number of tasks including
• Matching buyers and sellers subject to constraints on inventory
• Dividing customers into related groups based on similar behavior
• Identifying duplication in fragments of text to aggregate news
• Finding frequently occurring combinations of retail products
• Devising targeted marketing campaigns based on transaction records
While the volume, velocity and variety of datasets provide the possibility for advances in many industries, they can present hurdles to companies without appropriate data governance. Data engineering is the process of managing the availability, utility, and integrity of data in enterprise systems for the determination of reliable and reproducible information.
Course Description
The course will teach students concepts, methods and tools for data mining. Students will learn to extract information behind large and diverse datasets. Experience with languages for programming, querying and data-flow will enable students to process, summarize and analyze data. The course will help students to add value and insights to business operations and management.
Learning Objectives
The course will teach students skills for determining patterns and associations in datasets at scale. Learning objectives include
Enterprise systems contain a variety of components. Some components will become obsolete with developments and innovations in technology. While we will learn about languages for programming, querying and data-flow, we will focus on transferable skills underlying these tools.
Prerequisites
Students are required to have programming experience, particularly with the Python programming language. Students can meet the prerequisite by completing MG-GY 8401: Programming for Business Intelligence and Analytics or comparable classes in another department. Students should have analytical skills related to MG-GY 8413: Business Analytics, MG-GY 6103: Management Science or comparable classes in another department.
Students have a wide range of backgrounds. However, if students have a willingness to learn, then they will succeed in class. The instructional staff can work together with students to meet learning needs.
Resources
We have a primary textbook
• Mining of Massive Datasets by Leskovec, Rajaraman, and Ullman, Cambridge University Press (2014)
and two secondary textbooks
• Learning SQL by Alan Beaulieu, O'Reilly Media (2020)
• Spark: The Definitive Guide by Bill Chambers and Matei Zaharia, O'Reilly Media (2018)
Some students have benefited from online tutorials. You are encouraged to explore the following videos
• Pandas Essential Training by Jonathan Fernandes
• MySQL Essential Training by Bill Weinman
• Apache Spark Essential Training by Ben Sullins
Schedule
Please check the Weekly Agenda on NYU Classes for more information
Course Policies
Grading will be based on seven homework assignments and one project along with participation. Each homework assignment has the same weight. I will drop the lowest homework grade. You can participate in lecture and office hours along with Slack.
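As a rough illustration of the homework component, here is a minimal sketch that averages equally weighted homework scores after dropping the lowest one. The function name and the example scores are hypothetical, and the sketch assumes every assignment is scored on the same scale. The weights of homework, project and participation in the final grade are not stated here, so they are left out.

```python
def homework_average(scores):
    """Average equally weighted homework scores after dropping the lowest one."""
    if len(scores) < 2:
        raise ValueError("need at least two scores to drop one")
    kept = sorted(scores)[1:]  # drop the single lowest score
    return sum(kept) / len(kept)


# Hypothetical scores for the seven homework assignments (out of 100).
example_scores = [90, 85, 70, 100, 95, 80, 60]
print(homework_average(example_scores))
```

Only one score is dropped, so a second low score still counts toward the average.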
Collaboration
You can collaborate on homework and project. However, you are responsible for mentioning your collaborators. Any submission without acknowledgement of collaborators violates course policies. You should avoid duplicating code in your homework and project. If you copy code into your submission, then you must provide comments that attribute the source.
Late Assignments
You get 3 extension days for the semester
• Extensions are rounded up to the nearest day.
o For example, 1 minute late would mean 1 extension day.
• After you use all 3 extension days, any late homework loses 25% per day.
• No homework will be accepted more than 2 days late.
Assignments will be due before 11:59 PM on the day of the deadline. Please check the Calendar on NYU Classes for deadlines.
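The late policy can be sketched in Python as follows. This is an illustration under stated assumptions, not the official grading procedure: the function names are hypothetical, remaining extension days are assumed to be applied automatically before any penalty, and the 25% per day is assumed to be taken from the score earned on the assignment.

```python
import math
from datetime import datetime


def late_days(deadline, submitted):
    """Whole days late, with any partial day rounded up."""
    seconds_late = (submitted - deadline).total_seconds()
    if seconds_late <= 0:
        return 0
    return math.ceil(seconds_late / 86400)  # 86400 seconds per day


def apply_late_policy(score, days_late, extension_days_left,
                      max_days_late=2, penalty_per_day=0.25):
    """Return (adjusted score, extension days left) under the assumptions above."""
    if days_late > max_days_late:
        # Not accepted: no credit, and no extension days are consumed.
        return 0.0, extension_days_left
    covered = min(days_late, extension_days_left)  # days absorbed by extensions
    penalized = days_late - covered                # days that lose 25% each
    adjusted = score * max(0.0, 1 - penalty_per_day * penalized)
    return adjusted, extension_days_left - covered


# One minute past an 11:59 PM deadline counts as one full day late.
deadline = datetime(2020, 10, 1, 23, 59)
submitted = datetime(2020, 10, 2, 0, 0)
print(late_days(deadline, submitted))
```

For instance, under these assumptions a submission that is one day late with no extension days left keeps 75% of its score.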
Educational Technologies
Students will use three learning management systems linked to NYU Classes.
If you encounter any issues with these educational technologies throughout the semester, then please reach out to the teaching fellow who can assist you in troubleshooting.
Gradescope
We will manage grades through Gradescope. We will provide
• Scores
• Feedback
• Solutions
for homework and project on Gradescope. You can raise a regrade request to indicate any issues with scoring.
Slack
We will use Slack to manage communications. Throughout the semester, students can post publicly in channels or privately send direct messages to instructional staff or other students. The Slack workspace contains channels for
• Homework + Project
• Lecture + Office Hours
• Logistics
Students can use the channels for homework and projects to collaborate on assignments. Please adhere to the following guidelines
• Designated Posts
o Before posting about assignments, please search messages for relevant keywords.
o If you reference another post, then please insert a link in the message to help us locate it.
• No answers in Posts
o While you can explain your reasoning, you should not divulge an answer.
• Public Posts
o If you have personal questions, then please privately send a direct message to the instructional staff.
o Otherwise you should share your questions with classmates to avoid duplication and foster collaboration among the class.
• Posts vs Office Hours
o If you think your questions may require a detailed discussion, then please attend office hours.
• Self-Contained Posts
o Try to explain your understanding of the problem. Posting screenshots will help us to understand possible issues.
JupyterHub
You are not responsible for installing any applications to program in Python. Instead you can access the language, libraries and dependencies on the JupyterHub platform. You will find materials on JupyterHub for
• Lecture + Lab
• Homework + Project
You can connect to JupyterHub with your NYU credentials. Note that you need Multi-Factor Authentication. While the platform has many components, we will use the Files tab, the Assignments tab and the Control Panel page.
• Control Panel
o The Logout button will not stop your server. You can navigate to the Control Panel page with the Control Panel button to stop your server.
o After three hours, your server will stop running. You can restart your server from the Control Panel page.
• Files
o shared/ : You have read access to the folder. The folder contains datasets.
o your_materials/ : You have read access and write access to the folder. The folder contains materials for lecture and labs. Any changes to the materials will be stored between sessions.
o class_materials/from_github/ : You have read access and write access to the folder. The folder contains the same materials for lecture and labs in your_materials/. However, any changes to the materials will not be stored between sessions.
• Assignments tab
o For homework and project, you can
▪ fetch + modify
▪ validate + submit
o After you fetch an assignment you will find a folder with the corresponding name under the Files tab.
o Please see the instructional video for more information.
You have a storage limit on JupyterHub. To avoid exceeding the storage limit, you should not upload any files to JupyterHub.
University Policies
Please contact Elizabeth Spock ([email protected]) and Rebecca Menzer ([email protected]) with any questions about registration.
Please contact Deanna Rayment ([email protected]) or Paige Christian McAdams ([email protected]) about any personal issues.
Health Precautions
Prior to the semester, students must indicate their anticipated location on Albert. If students elect to come to campus, then they must abide by NYU policies surrounding COVID testing and quarantining.
Before entering any academic, administrative or residential buildings, students are required to report symptoms, particularly temperature, on the daily screening tool. Faculty, students and staff are required to wear personal protective equipment at all times on campus. Strict adherence to signage in hallways, stairwells, elevators and other communal spaces is required to ensure social distancing.
Classrooms will provide six feet of separation between students and ten feet of separation from the instructor. Since public spaces on campus cannot exceed 50% capacity, if enrollment exceeds a limit determined by departments, then students will be divided at random into cohorts. Instructors will assign the cohorts to separate weeks for classroom attendance. For contact tracing, students must sit in the same seat throughout the semester. Students can indicate their elected seats on NYU Classes.
The academic and operational policies surrounding the epidemic are meant to provide flexibility to students without compromising their health or the health of faculty and staff. Since local and state agencies hold NYU accountable for the transmission of COVID on campus, the administration requires compliance from NYU affiliates. Failure to comply with policies will result in academic or legal consequences. Students should report any unsafe behavior to the anonymous hotline [email protected].
Preferred name and pronouns
You are always welcome to write your preferred name on all class assignments, exams, etc. If you have a name and/or pronoun that doesn’t match the class roster delivered from the registrar, please let me know and I will ensure that you are addressed correctly in our class.
Academic Integrity
Work you submit should be your own. Please consult the academic integrity policy for more information. Penalties for violations of academic integrity may include failure of the course, suspension from NYU, or even expulsion.
Observances and Sick Days
As a nonsectarian, inclusive institution, NYU policy permits members of any religious group to absent themselves from classes without penalty when required for compliance with their religious obligations. The policy and principles to be followed by students and faculty may be found under the NYU calendar policies on religious holidays.
If you are unwell, then please do not attend lecture, section or office hours. Please contact the instructional staff about the circumstances. If the absence impacts your completion of an activity, then the instructional staff will work with you to find an alternative time.
Disability Disclosure Statement
Academic accommodations are available for students with disabilities. The Moses Center website is www.nyu.edu/csd. Please contact the Moses Center for Students with Disabilities (212-998-4980 or [email protected]) for further information. Students who are requesting academic accommodations are advised to reach out to the Moses Center as early as possible in the semester for assistance.
2021-06-01