Mathematical Modeling Paper Showcase (Part 1)

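After nearly five days, the Mathematical Contest in Modeling (MCM) has finally come to a close. I took part with two classmates from my university who major in economics and management. I served as the team's programmer: once the mathematical models were built, my main job was to implement them in code and produce the final results. We chose Problem C, the data-science problem. Although it overlaps somewhat with my own major, our discussions made the ability and commitment of my two teammates obvious, and I felt genuinely humbled to be on a team with them. Fortunately, we finished the paper within the time limit, and whatever the result, it was a fun experience. What follows is a description of the contest itself.

As mentioned above, we chose Problem C. Problem C usually involves big data and data mining. It hardly counts as big data, but the data volume is large compared with the other MCM/ICM problems. Teams that pick it therefore need to know the basic data-processing methods, including preprocessing and post-processing, and need the corresponding programming skills or familiarity with the relevant software. The models and methods tend to come mainly from statistics and pattern recognition.

First, here is the problem statement: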

2023 MCM
Problem C: Predicting Wordle Results

Background

Wordle is a popular puzzle currently offered daily by the New York Times. Players try to solve the puzzle by guessing a five-letter word in six tries or less, receiving feedback with every guess. For this version, each guess must be an actual word in English. Guesses that are not recognized as words by the contest are not allowed. Wordle continues to grow in popularity and versions of the game are now available in over 60 languages.

The New York Times website directions for Wordle state that the color of the tiles will change after you submit your word. A yellow tile indicates the letter in that tile is in the word, but it is in the wrong location. A green tile indicates that the letter in that tile is in the word and is in the correct location. A gray tile indicates that the letter in that tile is not included in the word at all (see Attachment 2)[2]. Figure 1 is an example solution where the correct result was found in three tries.

Figure 1: Example Solution of Wordle Puzzle from July 21, 2022[3]

| ©2023 by COMAP, Inc. | www.comap.com | www.mathmodels.org | info@comap.com |

Players can play in regular mode or “Hard Mode.” Wordle’s Hard Mode makes the game more difficult by requiring that once a player has found a correct letter in a word (the tile is yellow or green), those letters must be used in subsequent guesses. The example in Figure 1 was played in Hard Mode.

Many (but not all) users report their scores on Twitter. For this problem, MCM has generated a file of daily results for January 7, 2022 through December 31, 2022 (see Attachment 1). This file includes the date, contest number, word of the day, the number of people reporting scores that day, the number of players on hard mode, and the percentage that guessed the word in one try, two tries, three tries, four tries, five tries, six tries, or could not solve the puzzle (indicated by X). For example, in Figure 2 the word on July 20, 2022 was “TRITE” and the results were obtained by mining Twitter. Although the percentages in Figure 2 sum to 100%, in some cases this may not be true due to rounding.

Figure 2: Distribution of the Reported Results for July 20, 2022 to Twitter[4]

Requirement

You have been asked by the New York Times to do an analysis of the results in this file to answer several questions.

  • The number of reported results vary daily. Develop a model to explain this variation and use your model to create a prediction interval for the number of reported results on March 1, 2023. Do any attributes of the word affect the percentage of scores reported that were played in Hard Mode? If so, how? If not, why not?
  • For a given future solution word on a future date, develop a model that allows you to predict the distribution of the reported results. In other words, to predict the associated percentages of (1, 2, 3, 4, 5, 6, X) for a future date. What uncertainties are associated with your model and predictions? Give a specific example of your prediction for the word EERIE on March 1, 2023. How confident are you in your model’s prediction?

  • Develop and summarize a model to classify solution words by difficulty. Identify the attributes of a given word that are associated with each classification. Using your model, how difficult is the word EERIE? Discuss the accuracy of your classification model.
  • List and describe some other interesting features of this data set.
    Finally, summarize your results in a one- to two-page letter to the Puzzle Editor of the New York Times. Your PDF solution of no more than 25 total pages should include:
    • One-page Summary Sheet.
    • Table of Contents.
    • Your complete solution.
    • One- to two-page letter.
    • Reference List.
    Note: The MCM Contest has a 25-page limit. All aspects of your submission count toward the 25-page limit (Summary Sheet, Table of Contents, Report, Reference List, and any Appendices). You must cite the sources for your ideas, images, and any other materials used in your report.

After some thought, I decided to split the paper into two parts for this site, so this post covers the first half.

2322465.docx

Cracking the Wordle Code: Analyzing the Difficulty of Word Guessing Games

Wordle is a word-guessing game. After comprehending the game rules and analyzing the data set of player responses, our team constructed a mathematical model to elucidate the potential relationship between word characteristics and player responses, and hence made predictions for future data.

To begin with, we preprocessed the dataset provided in the problem: erroneous and abnormal data were removed, and an acceptable range of error was defined. In addition, the percentage data were normalized for use in later calculations.

Regarding the first question, we first studied the features of the data and the conditions for applicable models, and established an ARIMA-based time-series forecasting model to predict the number of reported results. To explain the trend features and practical implications of the prediction results, we incorporated real-world news events, search trends, and other relevant information to account for external factors and reduce the influence of errors on the prediction. Predictions were made for a given date with and without considering external factors, yielding a range of predicted values. We then summarized two word characteristics, namely the number of vowels and the number of singly-occurring letters, and performed single-factor analysis of variance to demonstrate the impact of word characteristics on the percentage of attempts made in Hard Mode.

For the second question, we constructed three common machine learning models (neural network, linear regression, and random forest regression) and used the summarized word characteristics as input features to predict the percentage distribution of player attempts on future dates. Each learning algorithm was trained on a series of input-output pairs and refined through classification, regression, prediction, and gradient boosting, so that the supervised model could predict label values for additional unlabeled data. The uncertainty factors of the model were summarized, and the confidence of the predicted results was tested. Results showed that the model accuracy was generally over 95%. An example simulation of the attempt results for the word “EERIE” on March 1st, 2023 was provided.

For the third question, we aimed to define word difficulty and categorize words into three levels: easy, moderate, and difficult. Based on the word feature data summarized in questions one and two, we constructed a hierarchical clustering model using letter encoding and word features, drew a dendrogram, calculated the clustering silhouette coefficient, and drew a word difficulty distribution chart. We also analyzed the difficulty classification of the word “EERIE”.

Finally, we presented other interesting features of the dataset, and summarized our conclusions and suggestions in a letter to the Wordle editor of the New York Times, outlining our views on the relationship between word guessing in Wordle and human language learning and cognition.

Keywords: ARIMA; Machine learning; Hierarchical clustering; Single-factor analysis of variance; Word features

Contents

1 Introduction
1.1 Problem Background
1.2 Restatement of the Problem
1.3 Our Work
2 Assumptions and Justifications
3 Data Preprocessing
3.1 Data Cleaning
3.2 Removal of Anomalous Data
3.3 Percentage normalization
4 ARIMA forecasting model based on time series
4.1 Data Description
4.2 The Establishment of Model 1
4.3 The Solution of Model 1
5 Single-Factor Analysis of Variance Model Based on Percentage of Attempts Under Word Features
5.1 The Establishment of Model 2
5.2 The Solution of Model 2
6 Multi-input&output Distribution Prediction Model Based on Common Machine Learning
6.1 Data Description
6.2 The Establishment of Model 3
6.3 The Solution of Model 3
6.4 The Uncertainties of Model 3
6.5 Confidence Assessment of Model 3
7 Hierarchical Clustering Model based on Letter Encoding and Word Features
7.1 The Establishment of Model 4
7.2 The Solution of Model 4
8 Other interesting features in the data set
9 Model Evaluation
9.1 Strengths
9.2 Weakness
10 Conclusion
11 A Letter to the Editor of Wordle From New York Times

1 Introduction

1.1 Problem Background

Wordle is a popular puzzle game offered daily by The New York Times, in which players aim to guess a five-letter word within six attempts or less. The artificially created scarcity enhances players’ sense of challenge and anticipation, making the game widely popular. The game interface is a 5×6 grid of letter blocks. After each guess, the game color-codes the letter blocks to indicate the accuracy of the guess: green tiles indicate the presence of the correct letter in the correct position, yellow tiles indicate the presence of the correct letter in a different position, and gray tiles indicate the absence of the letter in the puzzle. Players continue to guess based on the feedback received until the correct answer is guessed or all six attempts are used up.
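The color-coding rule described above can be written down in a few lines. This is only a sketch for illustration and is not part of our submission: the function name `score_guess` and the `G`/`Y`/`-` symbols are made up here, and the handling of repeated letters (each letter of the answer can color at most one tile) is an assumption based on the commonly described behavior of the game, since the rules quoted above do not spell it out.

```python
from collections import Counter


def score_guess(guess: str, answer: str) -> str:
    """Return one symbol per tile: G = green, Y = yellow, - = gray."""
    guess, answer = guess.upper(), answer.upper()
    if len(guess) != len(answer):
        raise ValueError("guess and answer must have the same length")

    result = ["-"] * len(guess)
    # Answer letters that were NOT matched in place are still available
    # for yellow tiles (assumed rule for repeated letters).
    remaining = Counter(a for g, a in zip(guess, answer) if g != a)

    for i, (g, a) in enumerate(zip(guess, answer)):
        if g == a:
            result[i] = "G"            # right letter, right position
        elif remaining[g] > 0:
            result[i] = "Y"            # right letter, wrong position
            remaining[g] -= 1          # use up one copy of that letter
    return "".join(result)


print(score_guess("ELITE", "EERIE"))  # G-Y-G
print(score_guess("TATTY", "TRITE"))  # G--G-
```

In the second call, the third tile stays gray even though the answer contains a T, because both T's of the answer are already accounted for by green tiles.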

We believe that Wordle is not just a word-guessing game, but involves linguistic and psychological principles. The educational and entertaining process of Wordle inspires players to explore and learn English vocabulary, which can expand their lexicon and enhance their learning and processing of the material. By building meaningful connections between textual, linguistic, and visual information, players’ vocabulary and logical skills are exercised. We believe that Wordle cannot rely solely on user psychology and social media for long-term promotion and application, and must seek new development and challenges.

Therefore, we study the Wordle puzzle and response data and establish a model to describe the underlying connections between the date of the puzzle, the characteristics of the Wordle words, and people’s response patterns. We predict the response results for specific words on future dates, and provide suggestions for Wordle development to optimize the word-guessing and learning process, to facilitate language learning and knowledge expansion, and even to improve language cognition.

1.2 Restatement of the Problem

Considering the background information and restricted conditions identified in the problem statement, we need to solve the following problems:

  • Problem 1

This study aims to explain the temporal variations of “the number of reported results” on Twitter by developing a mathematical model. The proposed model is then used to forecast “the number of reported results” on March 1, 2023. Additionally, we investigate whether the features of the words affect the percentage of attempts made in Hard Mode.

  • Problem 2

The objective of this research is to establish a mathematical model for predicting the percentage composition (1, 2, 3, 4, 5, 6, X) of attempts for a given word on future dates, while discussing the uncertainties associated with the model. We evaluate the model’s accuracy by predicting the attempts for the word “EERIE” on March 1, 2023.

  • Problem 3

This research aims to develop a mathematical model to identify the specific attributes of words for their classification based on difficulty level. The model is applied to classify the word “EERIE” and determine its level of difficulty, with a discussion on the accuracy of the classification model.

  • Problem 4

To explore other interesting data features in the data set.

1.3 Our Work

Figure 1: Our work

2 Assumptions and Justifications

Players do not differ significantly in knowledge or vocabulary, and they can complete the game by guessing valid existing words each time, whether or not they play in Hard Mode. Personal preference exists only in the choice of words.

Players’ guessing strategies favor words with more diverse letters, which yield more information, and more familiar words, i.e. words with a higher frequency of use.

3 Data Preprocessing

Data analysis problems frequently involve the presence of unrealistic and erroneous data within data sets. Without preprocessing, this can significantly compromise the accuracy of modeling and the validity of conclusions, particularly when working with large amounts of raw data.

For the data file “Problem_C_Data_Wordle.xlsx”, a critical step is to identify and remove erroneous or inaccurate data and to tidy the remaining data so that models can be built efficiently in the subsequent sections. It is imperative to ensure the accuracy of the data throughout this process.

3.1 Data Cleaning

The provided data set includes the date, the contest number, the word of the day, the number of users who reported scores, the number of players in Hard Mode on that day, and the percentage of players who solved the puzzle in 1 to 6 tries or failed to solve it (denoted by “X”). For ease of subsequent analysis, the data were sorted chronologically and indexed from 1 to 359 using Excel, and the percentages for each row were summed.

3.2 Removal of Anomalous Data

In accordance with the game rules of Wordle, puzzle words must be exactly five letters long. Therefore, we excluded words that do not meet this requirement. Specifically, the four-letter entries with contest number 113 (“tash”) and 324 (“clen”) were removed.

The percentages of (1, 2, 3, 4, 5, 6, X) in the data set represent the proportion of participants who guessed the corresponding word correctly after 1 to 6 attempts or failed to guess the word. In some cases, the sum of the percentages may not equal 100% due to rounding. We calculated the sum of the percentages of (1, 2, 3, 4, 5, 6, X) for each word and found that, for nearly all words, the sum fell between 98% and 102%. Considering the possible errors resulting from rounding, we deemed a deviation of ±2% acceptable and included such data in our analysis. However, for the word “nymph” with contest number 281, the sum of the percentages was 126%, indicating a data recording error, and thus we removed it.

Moreover, the word “study” with contest number 529 had a markedly different number of reported results from the other words, indicating a potential error in data collection.
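The two filters in this section (word length and percentage sum) are easy to express in code. The following is a minimal sketch, not the script we actually ran: the helper name `is_valid_row` is hypothetical, and the percentage values in `rows` are invented for illustration rather than copied from the data file. Only the five-letter rule and the ±2% tolerance come from the text above.

```python
def is_valid_row(word: str, percentages: list[float], tolerance: float = 2.0) -> bool:
    """Keep a row only if the word has five letters and the
    (1, 2, 3, 4, 5, 6, X) percentages sum to 100 within the tolerance."""
    word = word.strip()
    if len(word) != 5 or not word.isalpha():
        return False
    return abs(sum(percentages) - 100.0) <= tolerance


# Illustrative rows only: the numbers below are made up, not real data.
rows = [
    ("trite", [0, 2, 14, 33, 30, 17, 4]),   # sums to 100: kept
    ("tash",  [0, 4, 21, 32, 26, 14, 3]),   # four letters: dropped
    ("nymph", [1, 2, 56, 26, 25, 13, 3]),   # sums to 126: dropped
]

kept = [word for word, pcts in rows if is_valid_row(word, pcts)]
print(kept)  # ['trite']
```

A row like “study”, flagged because of its number of reported results, would need a separate outlier check on that column; that step is not sketched here.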

3.3 Percentage normalization

In order to simplify the utilization and calculation of the data in the subsequent model building, we normalized the percentages (1 try-X) with a total sum that is not equal to 100%.
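One simple way to do this normalization is proportional rescaling, so that every row sums to exactly 100%. The paper does not state the exact formula, so treat the sketch below as an assumption; the function name `normalize_percentages` and the sample row are made up for illustration.

```python
def normalize_percentages(percentages: list[float]) -> list[float]:
    """Rescale a (1, 2, 3, 4, 5, 6, X) row proportionally so it sums to 100."""
    total = sum(percentages)
    if total <= 0:
        raise ValueError("percentages must have a positive sum")
    return [100.0 * p / total for p in percentages]


# Illustrative row whose rounded percentages add up to 102.
row = [1, 5, 20, 35, 25, 12, 4]
normalized = normalize_percentages(row)
print(round(sum(normalized), 6))  # 100.0
```

Rescaling keeps the relative shape of the distribution unchanged, which is what matters when the rows are later used as prediction targets.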

4 ARIMA forecasting model based on time series

4.1 Data Description

Initially, a scatter plot of the number of reported results against date was created in order to observe the trend of results as they varied with date. However, as shown in the figure of the scatter plot, we were unable to account for the observed phenomenon of results peaking and then gradually declining at a decreasing rate. Since the date was not identified as the sole determinant of the variation in the number of reported results, it became necessary to further explore external factors that affect the variation in results over different date ranges.

4.2 The Establishment of Model 1

In order to forecast the number of reported results for future dates, taking 3/1/2023 as an example, we established an ARIMA model, which is one of the time series forecasting analysis methods. Firstly, we need to determine if the data characteristics meet the conditions for applying the ARIMA model. After examination, the sequence satisfies stationarity, and according to the ADF test, the null hypothesis of non-stationarity can be significantly rejected (P<0.05). The model is purely random and the residuals are white noise. Therefore, the data is a stationary time series, and the model can be applied.

ADF test

Variable                    Differencing order  t       P         AIC       Critical value
                                                                            1%      5%      10%
Number of reported results  0                   -3.853  0.002***  7098.201  -3.45   -2.87   -2.571
                            1                   -4.228  0.001***  7088.465  -3.45   -2.87   -2.571
                            2                   -9.795  0.000***  7070.834  -3.45   -2.87   -2.571

Note: ***, **, and * indicate the significance levels of 1%, 5%, and 10%, respectively.
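To show what the test statistic in this table measures, here is a stripped-down version of the idea in plain Python. It is a simplified Dickey-Fuller regression with a constant and no lag augmentation, so it is not the full ADF procedure and will not reproduce the numbers above, which came from a statistics package. The function name and the two toy series are assumptions made up for this sketch; the 5% critical value is the one listed in the table.

```python
import math


def dickey_fuller_t(series):
    """t-statistic of gamma in: diff[t] = alpha + gamma * series[t-1] + error.

    A strongly negative value speaks against a unit root (non-stationarity).
    """
    lagged = series[:-1]
    diffs = [b - a for a, b in zip(series, series[1:])]
    n = len(diffs)
    mean_x = sum(lagged) / n
    mean_y = sum(diffs) / n
    sxx = sum((x - mean_x) ** 2 for x in lagged)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(lagged, diffs))
    gamma = sxy / sxx                      # OLS slope
    alpha = mean_y - gamma * mean_x        # OLS intercept
    ssr = sum((y - alpha - gamma * x) ** 2 for x, y in zip(lagged, diffs))
    std_err = math.sqrt(ssr / (n - 2) / sxx)
    return gamma / std_err


CRITICAL_5_PERCENT = -2.87  # 5% critical value from the ADF table above

# Toy series, not contest data: one oscillates around zero, one keeps rising.
mean_reverting = [math.sin(t) for t in range(200)]
trending = [t + math.sin(t) for t in range(200)]

for name, data in [("mean_reverting", mean_reverting), ("trending", trending)]:
    t_stat = dickey_fuller_t(data)
    print(name, "rejects unit root:", t_stat < CRITICAL_5_PERCENT)
```

The decision rule is the same as in the table: the null hypothesis of non-stationarity is rejected when the statistic falls below the critical value.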

As a popular game in current times, Wordle utilizes social media platforms as a means of publishing its game results, and the public’s attention to Wordle directly determines the number of players who come into contact with the game, thus significantly influencing the daily number of reported results for each puzzle. In order to understand the public’s level of interest in Wordle, we conducted a search through Google Trends for the search trend of “Wordle” among Google users worldwide within the date range of the MCM-provided dataset.

The trend in Google searches for “wordle” showed a pattern similar to the scatter plot, with a peak followed by a decline and a subsequent flattening. The date of the peak was relatively consistent (around 1/31/2022). In conjunction with the news events surrounding that time, the acquisition of Wordle by The New York Times was reported on the same day, which could potentially affect the future dissemination trend of Wordle, thereby affecting the number of reported results of the game.

Based on the comparison of the trends shown in the above two charts, it can be inferred that the changes in the number of reported results are consistent with the level of public attention, which may serve as a potential influencing factor for future predictions. Considering the potential forecasting errors that may arise from external factors such as public attention, we need to make a reasonable selection of the time series range in the prediction process. We selected two time ranges for forecasting the number of reported results: a. the entire range of dates, and b. the range of dates where the decline rate slows down. We then compared the impact of these two different ranges on forecasting errors.

4.3 The Solution of Model 1

  a. the entire range of dates, as shown in Fig. 2.

Figure 2: Number of reported results

  b. the range of dates where the decline rate slows down, as shown in Fig. 3.

Figure 3: Number of reported results

Based on the analysis of the time series over these two date ranges, the prediction interval for the number of reported results on 3/1/2023 is from 10342 to 13353.
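The mechanics of forecasting over two date ranges and reporting the spread can be sketched as follows. This is an illustration under loud assumptions, not our actual pipeline: the paper does not state the ARIMA order, so the sketch uses ARIMA(1,1,0) fitted by conditional least squares in plain Python; the function names, the `tail_start` parameter, and the toy series are all hypothetical. The resulting range is simply the span between two point forecasts, not a statistical prediction interval with a confidence level.

```python
def fit_ar1_on_differences(series):
    """Fit diff[t] = c + phi * diff[t-1] by least squares (ARIMA(1,1,0))."""
    diffs = [b - a for a, b in zip(series, series[1:])]
    x, y = diffs[:-1], diffs[1:]
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    phi = sxy / sxx if sxx else 0.0
    c = mean_y - phi * mean_x
    return c, phi, diffs[-1]


def forecast(series, steps):
    """Point forecast `steps` periods past the end of the series."""
    c, phi, last_diff = fit_ar1_on_differences(series)
    level = series[-1]
    for _ in range(steps):
        last_diff = c + phi * last_diff   # next predicted daily change
        level += last_diff                # accumulate back to the level
    return level


def forecast_range(series, tail_start, steps):
    """Forecast from (a) the whole series and (b) only the tail that
    starts at index tail_start, and return the (low, high) pair."""
    full = forecast(series, steps)
    tail = forecast(series[tail_start:], steps)
    return min(full, tail), max(full, tail)


# Toy series whose daily drop halves every day (not contest data).
toy = [100, 92, 88, 86, 85, 84.5, 84.25]
print(round(forecast(toy, 2), 4))  # 84.0625
low, high = forecast_range(toy, tail_start=2, steps=2)
```

With the contest data, the last observation is December 31, 2022, so forecasting March 1, 2023 corresponds to `steps=60` daily steps.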
