type: Post
status: Published
date: Aug 22, 2025
slug: projects
summary: Data science projects
tags: Python, R, SAS, SPSS, SQL, STATA
category: Data
Here is some key information about, and examples of, my contributions to these projects.
📝 Details
Traffic Flow Visualization Dashboard – Mead Hunt Volunteer Project
Project description
- Initiated and completed a volunteer data visualization project focused on traffic flow analysis and interactive mapping for a local region, in collaboration with Mead Hunt.
- Built a custom interactive dashboard to visualize vehicle traffic patterns, congestion hotspots, and movement trends using Python (Plotly, Dash) and Leaflet.js / Folium for map-based visualizations.
- Developed interactive HTML outputs for easy sharing and access, embedding maps and dashboard components directly into web files without server hosting (a minimal export sketch follows this list).
- Self-taught the required programming and visualization skills, demonstrating initiative, adaptability, and technical growth in a real-world context.
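To illustrate the no-server export step, here is a minimal Python/Folium sketch (not the original project code); all coordinates, traffic counts, thresholds, and the output file name are made-up placeholders.

```python
# Minimal sketch: export an interactive traffic map to a standalone HTML file.
# All coordinates, counts, thresholds, and the output file name are hypothetical.
import folium

# Hypothetical traffic-count observations: (latitude, longitude, vehicles per hour)
observations = [
    (43.0731, -89.4012, 1200),
    (43.0760, -89.3900, 450),
    (43.0700, -89.4100, 2100),
]

m = folium.Map(location=[43.0731, -89.4012], zoom_start=13)
for lat, lon, count in observations:
    folium.CircleMarker(
        location=[lat, lon],
        radius=max(4, count / 200),                        # scale marker size by volume
        color="crimson" if count > 1500 else "steelblue",  # flag congestion hotspots
        fill=True,
        tooltip=f"{count} vehicles/hour",
    ).add_to(m)

# The saved file is self-contained and opens directly in a browser,
# so it can be shared without any server hosting.
m.save("traffic_flow_map.html")
```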
Find more here:
Map-related project | Yunzhu's Personal Website
Airbnb Superhost Strategy – Machine Learning Application May 2025
Project description
As part of my master's coursework, I worked on a project to build a predictive model for Airbnb’s Superhost program using machine learning methods.
In a group of six, we conducted a full data pipeline process—from raw data cleaning to feature engineering, model training, and insight generation—to identify the key drivers of Airbnb's Superhost designation. Early in the project, our team selected Superhost status as the primary target variable, as it was the only categorical outcome suited for classification modeling. I helped identify key predictive variables by analyzing correlations and manually reviewing variable distributions to understand their relevance. I also engineered new features based on guest reviews, stay frequency, and rating patterns to capture behavioral signals more effectively.
I led the model implementation phase, running multiple machine learning algorithms—including K-Nearest Neighbors (KNN), Support Vector Machines (SVM), Random Forest, and XGBoost—and validating results across different environments to ensure stability. I evaluated models using accuracy, precision, and recall, ultimately selecting XGBoost for its strong performance and interpretability. I then conducted feature importance analysis using SHAP visualizations to uncover the most impactful predictors. Our findings revealed that review volume and rating score were the strongest drivers of Superhost status, suggesting that both consistent feedback and high-quality service are essential for recognition on the platform.
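As a rough illustration of that workflow (our coursework code was written in R, as noted below, so this is not the actual project code), the Python sketch below trains an XGBoost classifier and inspects SHAP feature importance; the synthetic data, feature names, and hyperparameters are hypothetical stand-ins.

```python
# Illustrative sketch only: the coursework itself was done in R/RStudio.
# The synthetic data and hyperparameters below are hypothetical stand-ins
# for the real listing-level features (review counts, ratings, stay frequency, ...).
import pandas as pd
import shap
import xgboost as xgb
from sklearn.datasets import make_classification
from sklearn.metrics import accuracy_score, precision_score, recall_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the listing table; the binary target plays the role of Superhost status.
X, y = make_classification(n_samples=2000, n_features=4, n_informative=3,
                           n_redundant=0, random_state=42)
X = pd.DataFrame(X, columns=["review_count", "avg_rating",
                             "stay_frequency", "response_rate"])

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

model = xgb.XGBClassifier(n_estimators=300, max_depth=4, learning_rate=0.1)
model.fit(X_train, y_train)

# Evaluate with the same metrics used in the project: accuracy, precision, recall.
pred = model.predict(X_test)
print("accuracy :", accuracy_score(y_test, pred))
print("precision:", precision_score(y_test, pred))
print("recall   :", recall_score(y_test, pred))

# SHAP values show which features drive the predicted label.
explainer = shap.TreeExplainer(model)
shap_values = explainer.shap_values(X_test)
shap.summary_plot(shap_values, X_test)
```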
We also applied cohort analysis to identify high-potential host groups that demonstrated steady growth and stable performance, providing recommendations on where Airbnb could focus additional support or intervention to improve Superhost rates.
One technical challenge involved datasets that were too large to process directly on our online collaboration platform. As a solution, we performed parts of the modeling locally in RStudio to avoid system crashes and maintain processing efficiency. In some stages, we also leveraged AI tools to generate preliminary code templates, which accelerated coding under time constraints.
This project helped me deepen my understanding of machine learning workflows, especially in managing end-to-end data science projects. I learned not only how to build and evaluate different models but also how to engineer variables that improve prediction quality. The experience also taught me how to balance collaboration with technical efficiency, especially when working with large datasets that could not be processed online. Overall, this project enhanced both my modeling skills and my ability to translate data science results into actionable business recommendations.
T-Mobile Customer Loyalty Action Learning Project Mar 2025
A 10-week course-based project
- Held weekly meetings with team members and faculty advisors to review interim findings, align on the analytical approach, and assign next-phase tasks; incorporated stakeholder feedback from each session into iterative data models and visualizations, ensuring insights were both statistically sound and business-relevant.
New Product Development for Philips Coffee Maker – Choice-Based Conjoint Analysis Dec 2024
Project description
Our team was tasked with advising Philips on the development of a new coffee maker by analyzing consumer preferences from a choice-based conjoint experiment. The study sought to determine whether adding a water filter and/or an auto-grinder would increase market share. We worked with 1,480 choice observations from 185 respondents, each evaluating hypothetical coffee makers varying in brand, capacity, price, filter presence, and auto-grinder inclusion. The business goal was to provide Philips with actionable recommendations to optimize product design and pricing for maximum market penetration.
I prepared and cleaned the dataset, converting product attributes into effect-coded variables suitable for modeling. Using GLIMMIX, I applied a mixture multinomial logit model to identify distinct market segments and determine the optimal number of segments by minimizing the Bayesian Information Criterion (BIC). The analysis revealed two key consumer segments: one feature-focused group preferring smaller capacity and mid-range pricing, and a larger cost-sensitive group favoring bigger capacity, lower price, and inclusion of an auto-grinder. Based on these insights, we recommended that Philips target the cost-sensitive segment with a 15-cup coffee maker priced at $59, equipped with an auto-grinder but without a water filter. Market share simulations showed this product design could capture approximately 68% of the segment, significantly outperforming alternatives and positioning Philips strongly against competitors. This actionable recommendation combined rigorous statistical modeling with practical business strategy to guide Philips’s new product development.
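To make the share-simulation step concrete: once segment-level part-worths are estimated, each candidate product's utility is the sum of its attribute part-worths, and its predicted share within a choice set follows the standard multinomial logit formula. The sketch below illustrates this with purely hypothetical coefficients; the actual estimation in this project was done with a mixture MNL in SAS (GLIMMIX).

```python
# Hypothetical illustration of an MNL market-share simulation.
# The part-worth values below are made up for the example; the project's
# coefficients came from a mixture multinomial logit estimated in SAS (GLIMMIX).
import numpy as np

# Hypothetical part-worths for a cost-sensitive segment (effect-coded levels).
partworths = {
    ("capacity", "15-cup"): 0.8,  ("capacity", "10-cup"): -0.8,
    ("price", "$59"): 0.6,        ("price", "$89"): -0.6,
    ("grinder", "yes"): 0.5,      ("grinder", "no"): -0.5,
    ("filter", "yes"): 0.1,       ("filter", "no"): -0.1,
}

def utility(product):
    """Sum the part-worths of a product's attribute levels."""
    return sum(partworths[(attr, level)] for attr, level in product.items())

# Candidate designs competing in the simulated choice set.
products = {
    "recommended":  {"capacity": "15-cup", "price": "$59", "grinder": "yes", "filter": "no"},
    "competitor_a": {"capacity": "10-cup", "price": "$59", "grinder": "no",  "filter": "yes"},
    "competitor_b": {"capacity": "15-cup", "price": "$89", "grinder": "yes", "filter": "yes"},
}

utilities = np.array([utility(p) for p in products.values()])
shares = np.exp(utilities) / np.exp(utilities).sum()   # MNL share of preference

for name, share in zip(products, shares):
    print(f"{name}: {share:.1%}")
```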
This project reinforced the value of combining advanced statistical techniques with market-oriented thinking. I gained hands-on experience with mixture MNL models in SAS, effect coding, and market share simulations. More importantly, I learned how to translate technical modeling results into a clear, persuasive business case — a skill critical when advising stakeholders who need actionable recommendations, not just statistical findings. The experience underscored the importance of tailoring analysis not only to what the data says but to how decision-makers can act on it.
Multinomial Model for Whole Foods Market Entry Nov 2024
Project description
As part of my master’s coursework, I participated in a project to develop an international market-entry strategy for Whole Foods.

In this project, our team developed a consumer-centric segmentation strategy to guide Whole Foods’ entry into the European market, leveraging a retail image survey dataset collected by the European Commission (1,669 respondents across 105 NUTS2 regions in seven countries). Rather than segmenting by country, we applied mixture regression modeling to uncover latent consumer groups based on their sensitivity to store image drivers: service quality, atmosphere, and price. I took a leading role in model implementation—running iterations, selecting the optimal number of segments using BIC, and ensuring model stability. Once the two segments were identified, I helped interpret the segment-specific coefficients and t-values, revealing that Segment 1 was price-driven, while Segment 2 prioritized service and atmosphere.
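The mixture regression itself was estimated with specialized software, but the step of choosing the number of segments by minimizing BIC can be illustrated generically. The sketch below swaps in scikit-learn's GaussianMixture purely to show the BIC sweep; it is not the project's mixture regression, and the synthetic data simply stands in for the survey's service, atmosphere, and price drivers.

```python
# Illustrative only: the project used mixture regression; a Gaussian mixture is
# swapped in here purely to show BIC-based selection of the segment count.
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)
# Synthetic respondents drawn from two hypothetical latent segments
# (columns stand in for service, atmosphere, and price sensitivity).
segment_1 = rng.normal(loc=[3.0, 3.2, 4.5], scale=0.5, size=(500, 3))
segment_2 = rng.normal(loc=[4.4, 4.3, 3.1], scale=0.5, size=(1100, 3))
X = np.vstack([segment_1, segment_2])

# Sweep candidate segment counts and keep the one with the lowest BIC.
best_k, best_bic = None, float("inf")
for k in range(1, 6):
    gm = GaussianMixture(n_components=k, n_init=5, random_state=0).fit(X)
    bic = gm.bic(X)
    print(f"k={k}: BIC={bic:.1f}")
    if bic < best_bic:
        best_k, best_bic = k, bic

print("Selected number of segments:", best_k)
```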
Beyond modeling, I also led the effort to visualize regional insights for market entry recommendations. Our team faced technical challenges in mapping regions because Tableau could not properly interpret the format of the NUTS2 region names. After troubleshooting various options, I manually mapped the key regions using Photoshop, drawing reference lines to highlight the geographical distribution of Segment 2—primarily in eastern Germany, France, and northern areas of Spain and Italy. These visualizations supported our final recommendation to target service-sensitive regions with a high-touch customer experience strategy. My dual contributions in model interpretation and creative visualization helped ensure our recommendations were both data-driven and clearly communicated to stakeholders.
Segment 2 (69.48% of regions) shows significantly higher sensitivity to service quality and atmosphere, while Segment 1 is more price-driven. Specifically, in Segment 2, increases in service quality and atmosphere ratings have a stronger positive impact on store image compared to Segment 1, while price plays a smaller role. In contrast, Segment 1’s store image is most influenced by competitive pricing, suggesting discount strategies would be more effective there.
Based on these findings, we recommend that Whole Foods prioritize Segment 2 for its market entry, targeting regions such as eastern Germany, France, northern Spain, and northern Italy. In these areas, Whole Foods can differentiate itself by focusing on high service quality and a localized store atmosphere, supported by investments in employee hiring and training as well as partnerships with local businesses to tailor the shopping experience to local cultures. While price remains important, excessive discounting is not necessary for this segment.
The results provide actionable, data-driven insights for Whole Foods to refine its entry strategy, rollout locations, and store positioning to meet the preferences of high-value customers. However, the analysis is limited by data availability in certain regions and the exclusion of other potential drivers of store image beyond service, atmosphere, and price. Further research may uncover additional factors to support even more precise targeting and long-term success.
Based on these results, we developed a prioritized market entry list and recommended strategies for product assortment and pricing adjustments in different regions.
Reflection:
Through this project, I gained hands-on experience in applying regression modeling to real-world market entry problems. It also helped me improve my ability to turn complex modeling results into clear business recommendations and visualize findings effectively for decision-making.
Eye-Tracking Data for Ad Efficiency Oct 2024
- Boosted ad engagement time by 25% by analyzing 1M+ eye-tracking datapoints to assess ad effectiveness, identifying drivers of consumer attention to improve ad design features
- Quantified user engagement using logit, multinomial logit, and semi-log regression models in SAS when studying variables such as fixation duration and gaze time
- Evaluated the impact of promotions on brand choice and profitability using scanner panel data, outlier detection, and missing data handling, and recommended flexible discount strategies to build loyalty without eroding margins
As part of a five-member team, I led the data modeling efforts in a project analyzing over one million gaze points collected through eye-tracking experiments, aimed at understanding which magazine ad layouts captured the most sustained viewer attention. Our goal was to quantify consumer engagement and translate eye-movement behavior into actionable insights for advertisers.
My main responsibilities included selecting key variables related to ad design features (e.g., text density, image prominence, layout symmetry) and constructing engineered variables to better reflect attention dynamics, such as first fixation duration and total gaze time. We applied a three-stage sequence of regression models in SAS, aligning our approach with established consumer behavior modeling frameworks that assess decision-making in stages.
Specifically, I implemented (illustrated in the sketch after this list):
- Binary logit models to examine whether an ad captured any attention (initial engagement),
- Multinomial logit models to assess viewer preference among ad types,
- Semi-log models to estimate the duration of attention sustained per ad.
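The actual models were estimated in SAS; purely as an illustration of the same three-stage structure, the Python sketch below fits a binary logit, a multinomial logit, and a semi-log (log gaze time) regression with statsmodels on synthetic, hypothetical variables.

```python
# Illustrative sketch of the three-stage modeling structure (the project itself used SAS).
# All variables below are synthetic stand-ins for the real gaze-level data.
import numpy as np
import pandas as pd
import statsmodels.api as sm

rng = np.random.default_rng(1)
n = 1000
df = pd.DataFrame({
    "text_density":     rng.uniform(0, 1, n),
    "image_prominence": rng.uniform(0, 1, n),
    "layout_symmetry":  rng.uniform(0, 1, n),
})
# Hypothetical outcomes: attention indicator, chosen ad type (0/1/2), gaze time in ms.
df["attended"] = (rng.random(n) < 0.3 + 0.4 * df["image_prominence"]).astype(int)
df["ad_type_chosen"] = rng.integers(0, 3, n)
df["gaze_ms"] = np.exp(5 + 1.5 * df["image_prominence"] + rng.normal(0, 0.3, n))

X = sm.add_constant(df[["text_density", "image_prominence", "layout_symmetry"]])

# Stage 1 - binary logit: did the ad capture any attention at all?
attention_model = sm.Logit(df["attended"], X).fit(disp=False)

# Stage 2 - multinomial logit: which ad type did the viewer prefer?
choice_model = sm.MNLogit(df["ad_type_chosen"], X).fit(disp=False)

# Stage 3 - semi-log regression: log of gaze time on the design features.
duration_model = sm.OLS(np.log(df["gaze_ms"]), X).fit()

for name, res in [("binary logit", attention_model),
                  ("multinomial logit", choice_model),
                  ("semi-log OLS", duration_model)]:
    print(name, "coefficients:\n", res.params, "\n")
```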
I also wrote the data methodology section of the final report, detailing our model selection criteria (BIC, adjusted R-squared), variable transformations, and data preprocessing techniques, including outlier detection and handling missing values. My work ensured the statistical robustness of our findings and helped the team deliver empirically grounded recommendations on ad layout optimization.
The final analysis revealed which design features increased engagement time, improving the company's understanding of ad effectiveness by 25%. I also supported the team by preparing clear visualizations and reports that guided creative teams in optimizing future ad designs.
In a parallel analysis of promotion pass-through using scanner panel data, we found that while price cuts effectively increase purchase rates and brand choice, their impact depends on the brand and promotional strategy. Full pass-through (100%) significantly boosts retailer profits for high-demand brands like Tide, but at lower pass-through rates (70%), manufacturers bear a larger share of the discount burden, reducing their profits for brands such as All and Tide. To balance profitability, we recommended flexible pass-through rates: higher for popular brands and moderate for lower-demand ones. In-store promotions should be prioritized to build loyalty without eroding margins, while excessive discounting should be avoided to protect brand value. The study is limited to one product category and region; future research should expand to other markets and explore the long-term effects of pass-through strategies on brand perception and profitability.
This project allowed me to practice data cleaning, modeling, and communicating technical results to non-technical stakeholders. I also learned how consumer attention data can directly support creative and marketing decisions.
📨 In the end
Read more:
- My learning experience: My educational journey
- My career preparation: Code Preparation
- Case Study
- Author: Yunzhu HUANG
- URL: /article/projects
- Copyright: All articles in this blog, unless otherwise stated, are licensed under the BY-NC-SA agreement. Please indicate the source!