Data Engineering

MG-GY 8411 | Fall 2020


What is Data Engineering?

The growth and adoption of technology have allowed for the collection, transmission and storage of large and diverse datasets. Companies gather numbers, text, images, and recordings to better understand customer habits in sectors ranging from healthcare to telecommunications to finance. The information behind datasets can support strategic goals or operational needs in industry. Data has helped companies tackle a number of tasks, including:

• Matching buyers and sellers subject to constraints on inventory

• Dividing customers into related groups based on similar behavior

• Identifying duplication in fragments of text to aggregate news

• Finding frequently occurring combinations of retail products

• Devising targeted marketing campaigns based on transaction records

While the volume, velocity and variety of datasets create opportunities for advances in many industries, they can present hurdles to companies without appropriate data governance. Data engineering is the process of managing the availability, utility, and integrity of data in enterprise systems so that the data yields reliable and reproducible information.


Course Description

The course will teach students concepts, methods and tools for data mining. Students will learn to extract information behind large and diverse datasets. Experience with languages for programming, querying and data-flow will enable students to process, summarize and analyze data. The course will help students to add value and insights to business operations and management.


Learning Objectives

The course will teach students skills for determining patterns and associations in datasets at scale. Learning objectives include

Enterprise systems contain a variety of components. Some components will become obsolete with developments and innovations in technology. While we will learn about languages for programming, querying and data-flow, we will focus on transferable skills underlying these tools.


Prerequisites

Students are required to have programming experience, particularly with the Python programming language. Students can meet the prerequisite by completing MG-GY 8401: Programming for Business Intelligence and Analytics or comparable classes in another department. Students should have analytical skills related to MG-GY 8413: Business Analytics, MG-GY 6103: Management Science or comparable classes in another department.

Students have a wide range of backgrounds. However, if students have a willingness to learn, then they will succeed in class. The instructional staff can work together with students to meet learning needs.


Resources

We have a primary textbook

• Mining of Massive Datasets by Leskovec, Rajaraman, and Ullman, Cambridge University Press (2014)

and two secondary textbooks

• Learning SQL by Alan Beaulieu, O'Reilly Media (2020)

• Spark: The Definitive Guide by Bill Chambers and Matei Zaharia, O'Reilly Media (2018)

Some students have benefited from online tutorials. You are encouraged to explore the following videos

• Pandas Essential Training by Jonathan Fernandes

• MySQL Essential Training by Bill Weinman

• Apache Spark Essential Training by Ben Sullins


Schedule

Please check the Weekly Agenda on NYU Classes for more information.


Course Policies

Grading will be based on seven homework assignments and one project, along with participation. Each homework assignment has the same weight. I will drop the lowest homework grade. You can participate in lecture, in office hours, and on Slack.
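As an illustration of the homework rule above (equal weights, lowest grade dropped), here is a minimal sketch. The function name homework_average and the 0-100 score scale are assumptions made for this example; the syllabus does not state how homework, project and participation are weighted against each other, so the sketch covers only the homework component.

```python
def homework_average(scores):
    """Average equally weighted homework scores after dropping the lowest one.

    Hypothetical helper for illustration; assumes at least two scores.
    """
    if len(scores) < 2:
        raise ValueError("need at least two scores to drop one")
    kept = sorted(scores)[1:]  # discard the single lowest score
    return sum(kept) / len(kept)


# Seven made-up homework scores; the 60 is dropped before averaging.
example_scores = [90, 80, 70, 100, 60, 85, 95]
example_average = homework_average(example_scores)
```

With seven assignments, the average is taken over the best six.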


Collaboration

You can collaborate on homework and the project. However, you are responsible for mentioning your collaborators. Any submission without acknowledgement of collaborators violates course policies. You should avoid duplicating code in your homework and project. If you copy code into your submission, then you must provide comments that attribute the source.


Late Assignments

You get 3 extension days for the semester.

• Extensions are rounded up to the nearest day.

o For example, 1 minute late would mean 1 extension day.

• After you have used all 3 extension days, any late homework loses 25% per day.

• No homework will be accepted more than 2 days late.

Assignments will be due before 11:59 PM on the day of the deadline. Please check the Calendar on NYU Classes for deadlines.
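The late policy above can be sketched as a small calculation. This is one possible reading, not the official grading procedure: it assumes that extension days are applied automatically before any penalty, that the 25% is taken from the earned score for each late day not covered by an extension, and that a submission more than 2 days late receives no credit. The function names and the minute-based input are hypothetical.

```python
import math

EXTENSION_DAYS = 3      # extension days per semester (from the policy)
PENALTY_PER_DAY = 0.25  # fraction lost per late day not covered by an extension
MAX_DAYS_LATE = 2       # homework later than this is not accepted


def days_late(minutes_late):
    """Round lateness up to whole days, so 1 minute late counts as 1 day."""
    if minutes_late <= 0:
        return 0
    return math.ceil(minutes_late / (24 * 60))


def apply_late_policy(score, minutes_late, extensions_left=EXTENSION_DAYS):
    """Return (adjusted_score, extensions_left) under the assumptions above."""
    days = days_late(minutes_late)
    if days > MAX_DAYS_LATE:
        return 0.0, extensions_left  # assumed: not accepted means no credit
    covered = min(days, extensions_left)  # late days absorbed by extensions
    penalized = days - covered            # late days that incur the penalty
    adjusted = score * max(0.0, 1 - PENALTY_PER_DAY * penalized)
    return adjusted, extensions_left - covered
```

For instance, under this reading a homework submitted one minute late with all three extension days available keeps its full score and leaves two extension days for the rest of the semester.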


Educational Technologies

Students will use three learning management systems linked to NYU Classes: Gradescope, Slack, and JupyterHub.

If you encounter any issues with these educational technologies throughout the semester, then please reach out to the teaching fellow, who can assist you in troubleshooting.


Gradescope

We will manage grades through Gradescope. We will provide

• Scores

• Feedback

• Solutions

for homework and the project on Gradescope. You can submit a regrade request to indicate any issues with scoring.


Slack

We will use Slack to manage communications. Throughout the semester, students can post publicly in channels or send private direct messages to instructional staff or other students. The Slack workspace contains channels for

• Homework + Project

• Lecture + Office Hours

• Logistics

Students can use the channels for homework and projects to collaborate on assignments. Please adhere to the following guidelines

• Designated Posts

o Before posting about assignments, please search messages for relevant keywords.

o If you reference another post, then please insert a link in the message to help us locate it.

• No answers in Posts

o While you can explain your reasoning, you should not divulge an answer.

• Public Posts

o If you have personal questions, then please privately send a direct message to the instructional staff.

o Otherwise you should share your questions with classmates to avoid duplication and foster collaboration among the class.

• Posts vs Office Hours

o If you think your questions may require a detailed discussion, then please attend office hours.

• Self-Contained Posts

o Try to explain your understanding of the problem. Posting screenshots will help us to understand possible issues.


JupyterHub

You are not responsible for installing any applications to program in Python. Instead you can access the language, libraries and dependencies on the JupyterHub platform. You will find materials on JupyterHub for

• Lecture + Lab

• Homework + Project

You can connect to JupyterHub with your NYU credentials. Note that you need Multi-Factor Authentication. While the platform has many components, we will use the Files tab, the Assignments tab and the Control Panel page.

• Control Panel

o The Logout button will not stop your server. You can navigate to the Control Panel page with the Control Panel button to stop your server.

o After three hours, your server will stop running. You can restart your server from the Control Panel page.

• Files

o shared/: You have read access to the folder. The folder contains datasets.

o your_materials/: You have read access and write access to the folder. The folder contains materials for lecture and labs. Any changes to the materials will be stored between sessions.

o class_materials/from_github/: You have read access and write access to the folder. The folder contains the same materials for lecture and labs as your_materials/. However, any changes to the materials will not be stored between sessions.

• Assignments tab

o For homework and project, you can

▪ fetch + modify

▪ validate + submit

After you fetch an assignment, you will find a folder with the corresponding name under the Files tab.

Please see the instructional video for more information.

You have a storage limit on JupyterHub. To avoid exceeding the limit, please do not upload any files to JupyterHub.


University Policies

Please contact Elizabeth Spock ([email protected]) and Rebecca Menzer ([email protected]) with any questions about registration.

Please contact Deanna Rayment ([email protected]) or Paige Christian McAdams ([email protected]) about any personal issues.


Health Precautions

Prior to the semester, students must indicate their anticipated location on Albert. If students elect to come to campus, then they must abide by NYU policies surrounding COVID testing and quarantining.

Before entering any academic, administrative or residential buildings, students are required to report symptoms, particularly temperature, on the daily screening tool. Faculty, students and staff are required to wear personal protective equipment at all times on campus. Strict adherence to signage in hallways, stairwells, elevators and other communal spaces is required to ensure social distancing.

Classrooms will provide six feet of separation between students and ten feet of separation from the instructor. Since public spaces on campus cannot exceed 50% capacity, if enrollment exceeds a limit determined by departments, then students will be divided at random into cohorts. Instructors will assign the cohorts to separate weeks in which they may attend class in person. For contact tracing, students must sit in the same seat throughout the semester. Students can indicate their elected seats on NYU Classes.

The academic and operational policies surrounding the epidemic are meant to provide flexibility to students without compromising their health or the health of faculty and staff. Since local and state agencies hold NYU accountable for the transmission of COVID on campus, the administration requires compliance from NYU affiliates. Failure to comply with policies will result in academic or legal consequences. Students should report any unsafe behavior to the anonymous hotline at [email protected].


Preferred name and pronouns

You are always welcome to write your preferred name on all class assignments, exams, etc. If you have a name and/or pronoun that doesn’t match the class roster delivered from the registrar, please let me know and I will ensure that you are addressed correctly in our class.


Academic Integrity

Work you submit should be your own. Please consult the academic integrity policy for more information. Penalties for violations of academic integrity may include failure of the course, suspension from NYU, or even expulsion.


Observances and Sick Days

As a nonsectarian, inclusive institution, NYU policy permits members of any religious group to absent themselves from classes without penalty when required for compliance with their religious obligations. The policy and principles to be followed by students and faculty may be found under the NYU calendar policies on religious holidays.

If you are unwell, then please do not attend lecture, section or office hours. Please contact the instructional staff about the circumstances. If the absence impacts your completion of an activity, then the instructional staff will work with you to find an alternative time.


Disability Disclosure Statement

Academic accommodations are available for students with disabilities. The Moses Center website is www.nyu.edu/csd. Please contact the Moses Center for Students with Disabilities (212-998-4980 or [email protected]) for further information. Students who are requesting academic accommodations are advised to reach out to the Moses Center as early as possible in the semester for assistance.