Evaluation of Scalable Data Processing Methods
Project: Evaluation of Scalable Data Processing Methods
Course Overview
This course focuses on Big Data Processing and Analysis, covering core topics such as:
· Indexing and search in large-scale data systems
· Algorithmic design for scalability
· Learning-based methods for data processing and analysis
· Scale-out frameworks
Students will explore both theoretical foundations and practical techniques used in modern data-intensive methodologies.
Project Objective
The primary goal of this course project is to provide hands-on experience with state-of-the-art methods in big data processing. Students will select and adopt an existing method recently published in a top-tier research venue (e.g., SIGMOD, VLDB, ICDE, KDD, NeurIPS, ICML, or any other Core A+ venue; see https://people.iiti.ac.in/~artiwari/cseconflist.html) and evaluate its performance on a dataset that was not used in the original study. The method should be at least somewhat related to scalability issues in data processing, analysis, and/or learning. Credit will be awarded for any new designs, ideas, or implementations that improve upon the existing method.
Project Tasks
1. Method Selection
o Choose a data processing method from a top-tier research publication.
o Clearly summarize the original goal, assumptions, and evaluation methodology.
2. Dataset Selection
o Identify or construct a new dataset that was not part of the original evaluation.
o Justify why this dataset is suitable for testing the generalizability of the method.
3. Implementation and Adaptation
o Re-implement the chosen method or adapt publicly available code.
o Make necessary modifications to ensure compatibility with the new dataset.
4. Performance Evaluation
o Compare the results with those reported in the original paper.
o Analyze performance discrepancies and provide insights into the method’s robustness.
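The comparison in Task 4 can be sketched as follows. This is a minimal illustration, not a required format: the helper names (relative_gap, summarize, compare), the metric names, and all numbers are hypothetical placeholders rather than values from any paper. It assumes each metric is measured over several repeated runs and reports the mean, the sample standard deviation, and the signed relative gap to the value reported in the original paper.

```python
from statistics import mean, stdev


def relative_gap(reported: float, measured: float) -> float:
    """Signed relative difference of the measured value vs. the reported one."""
    if reported == 0:
        raise ValueError("reported value must be non-zero")
    return (measured - reported) / abs(reported)


def summarize(runs):
    """Return (mean, sample standard deviation) over repeated runs."""
    spread = stdev(runs) if len(runs) > 1 else 0.0
    return mean(runs), spread


def compare(reported, measured_runs):
    """Build one row per metric: (name, reported, mean, std, relative gap)."""
    rows = []
    for metric, ref in reported.items():
        avg, spread = summarize(measured_runs[metric])
        rows.append((metric, ref, avg, spread, relative_gap(ref, avg)))
    return rows


if __name__ == "__main__":
    # Placeholder numbers for illustration only; they come from no real paper.
    reported = {"recall@10": 0.90, "query_ms": 2.0}
    measured_runs = {
        "recall@10": [0.80, 0.82, 0.81],
        "query_ms": [2.4, 2.6, 2.5],
    }
    for metric, ref, avg, spread, gap in compare(reported, measured_runs):
        print(f"{metric:10s} reported={ref:.3f} "
              f"measured={avg:.3f} +/- {spread:.3f} gap={gap:+.1%}")
```

Reporting the spread across runs alongside the gap helps separate a genuine robustness issue on the new dataset from ordinary run-to-run variance.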
Deliverables
· Presentation slides (at most 20 pages): A brief outline of the method selection, dataset selection, implementation, and your findings.
· Code and datasets: A reproducible codebase.
· Appendix report [Optional]: In ACM SIG Conf format (https://www.overleaf.com/latex/templates/association-for-computing-machinery-acm-sig-proceedings-template/bmvfhcdnxfty).
· All deliverables should be submitted to a Gitea repository. The teaching team will create a repository for each team.
Evaluation Criteria
· Relevance and Quantity of the chosen method(s) and dataset(s) (40%)
· Depth of analysis, findings, and experimental rigor (60%)
Notes
· Students may work individually (not recommended) or in teams of at most 3 people.
· Reproducibility and clarity of documentation will be emphasized.
· Projects with potential for further research or publication are highly encouraged.
2025-05-14