Note: This syllabus is subject to change.
This course will cover the principles and practices of managing data at scale, with a focus on use cases in data analysis and machine learning. We will cover the entire lifecycle of data management and data science, ranging from data preparation to exploration, visualization and analysis, to machine learning and collaboration.
The class will balance foundational concerns with exposure to practical languages, tools, and real-world concerns. We will study the foundations of prevalent data models in use today, including relations, tensors, and dataframes, and mappings between them. We will study SQL as a means to query and manipulate data at scale, including analytical challenges like sampling, aggregation and windowing, and performance concerns like views and indexes, data models, query processing and optimization, and transactions, all from a user perspective. We will study the foundations and realities of data preparation, including hands-on work with real-world data using standard Python and SQL frameworks. We will explore data exploration modalities for non-programmers, including the fundamentals behind spreadsheet systems and interactive visual analytics packages. We will look at approaches for managing the lifecycle of data science, including the establishment, monitoring, and maintenance of data pipelines for both analytics and machine learning. Time permitting, we will look at the specifics of ML pipelines including data validation, training, prediction serving and feedback loops, as well as technologies for representing, moving, sharing, and caching data including event streaming systems, key-value/document stores, in-memory and on-disk representation formats, log analytics, and search engines.
COMPSCI C100/DATA C100/STAT C100 or COMPSCI 189 or INFO 251 or DATA 144/INFO 254 or equivalent upper-division course in data science. COMPSCI 61B or INFO 206B or equivalent courses in programming. This class will not assume deep experience with databases or big data solutions.
The class is currently full with a long waitlist. Sadly, due to teaching budget cuts, we won’t be able to expand the class or accept concurrent enrollment students this semester; we will let the waitlist play out on its own.
Please make sure you are enrolled on Ed and Gradescope. Ed is our primary channel for communication and announcements, and you are responsible for checking it frequently. We plan to use bCourses only for lecture webcasts. All assignments are submitted on Gradescope.
Projects account for 60% of your grade, multivitamins for 25%, and the final exam for 15%. Within each category, assignments are weighted equally: every project counts the same as every other project, and every multivitamin counts the same as every other multivitamin.
The final exam will be held on Thursday, December 15th, from 11:30am to 2:30pm. The final exam is offered in-person only (location TBD), and we will not offer alternate exams. It is your responsibility to ensure that you are not enrolled in another class that conflicts with our exam time.
Throughout the semester, we will release 5 programming assignments via Ed and the website. The 5th project is optional for undergraduate students enrolled in Data 101 and required for graduate students enrolled in Info 258. For Data 101 students, the first four projects are each worth 15% of your grade, and the fifth project can replace your lowest project score. For Info 258, each of the five projects is worth 12% of your grade.
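To make the project arithmetic concrete, here is a minimal sketch of how the project portion of a grade could be computed from the weights described above. It is an illustration, not the official grade calculator. The function name `project_component`, the representation of scores as fractions between 0 and 1, and the reading that the fifth project replaces your lowest score only when doing so helps you are all assumptions made for this example.

```python
def project_component(scores, course):
    """Return the project contribution to the final grade, out of 60 points.

    `scores` is a list of fractional project scores in [0, 1], in project order.
    `course` is either "Data 101" or "Info 258".
    Hypothetical helper for illustration only.
    """
    if course == "Info 258":
        # All five projects are required and each is worth 12%.
        if len(scores) != 5:
            raise ValueError("Info 258 requires exactly five project scores")
        return sum(12.0 * s for s in scores)

    if course == "Data 101":
        # The first four projects are each worth 15%.
        if len(scores) not in (4, 5):
            raise ValueError("Data 101 expects four or five project scores")
        counted = list(scores[:4])
        if len(scores) == 5:
            optional = scores[4]
            lowest = min(counted)
            # Assumption: the optional fifth project replaces the lowest
            # score only when it is an improvement.
            if optional > lowest:
                counted[counted.index(lowest)] = optional
        return sum(15.0 * s for s in counted)

    raise ValueError(f"unknown course: {course!r}")


if __name__ == "__main__":
    example = [0.9, 0.8, 0.5, 1.0, 0.95]
    print(project_component(example, "Data 101"))
    print(project_component(example, "Info 258"))
```

With the example scores, a Data 101 student would have the 0.5 replaced by the 0.95 and earn about 54.75 of the 60 project points, while an Info 258 student with the same five scores would earn about 49.8.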
Multivitamins are short written assignments designed to keep you on schedule and check your understanding of the basics from lecture. They will mostly consist of multiple-choice questions covering material that is not covered in the projects. If you are struggling with any of the questions on a multivitamin, you are encouraged to come to office hours for help. We will have 5 multivitamins throughout the semester.
Office hours are a great place to go for help with multivitamins, projects, or any other content-related questions. You can find a list of regular office hours under the Staff tab on this page. We may also host project parties throughout the semester ahead of project deadlines. The course calendar under the Calendar tab will always show the most up-to-date times and locations for office hours and project parties.
You will get 4 slip days for projects and 4 slip days for multivitamins. Note that these are separate, so you will not receive extra project slip days if you do not use all of your multivitamin slip days. Likewise, using a slip day for a project will not use up one of your multivitamin slip days. Slip days are applied automatically in whatever way maximizes your score. Once you have used all of your slip days in a category, each additional late day will cost you 15% of your score on that assignment. This applies to both projects and multivitamins. Note that submission times are rounded up to the next day. That is, 2 minutes late = 1 day late.
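The rounding and penalty rules above can be sketched in a few lines of Python. This is only an illustration for a single assignment, not the script the course staff uses. The function names, the example dates, and the reading of the penalty as a linear 15% reduction per unexcused late day (never dropping below zero) are assumptions. The sketch also takes the number of slip days applied to the assignment as an input rather than modeling how slip days are distributed across assignments.

```python
import math
from datetime import datetime


def late_days(deadline, submitted):
    """Number of late days, rounding any partial day up to a full day."""
    seconds_late = (submitted - deadline).total_seconds()
    if seconds_late <= 0:
        return 0  # on time or early
    return math.ceil(seconds_late / 86400)  # 86400 seconds per day


def penalized_score(raw_score, days_late, slip_days_applied):
    """Score after the late penalty, for one assignment.

    Assumption: each late day not covered by a slip day removes 15% of the
    earned score, and the score cannot go below zero.
    """
    unexcused = max(0, days_late - slip_days_applied)
    multiplier = max(0.0, 1.0 - 0.15 * unexcused)
    return raw_score * multiplier


if __name__ == "__main__":
    # Hypothetical deadline and submission time, two minutes apart.
    deadline = datetime(2022, 10, 3, 23, 59)
    submitted = datetime(2022, 10, 4, 0, 1)
    days = late_days(deadline, submitted)
    print(days)  # 2 minutes late counts as 1 day late
    print(penalized_score(90.0, days_late=3, slip_days_applied=1))
```

In the second call, a submission that is 3 days late with 1 slip day applied has 2 unexcused late days, so under these assumptions a raw score of 90 would be reduced by 30% to roughly 63.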
We do not allow collaboration on assignments, since we expect you to complete all assignments individually; however, you are free (and encouraged!) to discuss concepts from lecture. We will follow the EECS departmental policy on academic honesty, so be sure you are familiar with it. And hey, don’t cheat. Not cool.
For administrative and logistics issues, deadline extension requests, alternate exam requests, DSP accommodations, or special accommodations (for emergencies or personal issues), please make a private post on Ed. If you need an extension, please include your reason for the request and any relevant documentation. If you are a DSP student and your accommodation letter allows for extensions on assignments, you will be given 2 extra days per assignment deadline on top of your slip day allocation. If you require any extensions beyond that, please make a private post on Ed. For issues that you do not feel comfortable posting on Ed, feel free to email the TAs or the instructor. However, we recommend posting on Ed where possible to ensure a quicker response.