Mathematical Modeling Paper Presentation (Part 2)

1 Single-Factor Analysis of Variance Model Based on Percentage of Attempts Under Word Features

1.1 The Establishment of Model 2

In the difficult mode of the Wordle game, unlike in the regular mode, once a player correctly guesses a letter (shown in yellow or green), the next word combination they guess must include that letter. Although the data file “Problem_C_Data_Wordle.xlsx” does not provide a separate percentage of scores achieved solely in the difficult mode, we believe that the influence of letter features on the guessing results is not limited to the choice of mode.

Based on the basic assumptions of this study, we propose that players’ answering strategies prioritize choosing words that provide more diverse information, while assuming that each guess is a valid word that allows for a smooth game flow. Word features affect the information gained in the guessing process. For example, in the difficult mode, the available candidate words for selection are reduced to ensure the use of the correct letter from the previous guess. However, the number of candidate words that satisfy different word features differs significantly, which affects the difficulty of guessing the word and, therefore, the percentage composition of guesses in the difficult mode.

After observing the words in the dataset and combining this with our personal experience playing Wordle, we identified two features of words: a. the number of vowels in the word, and b. the number of unique letters in the word. By changing only one word feature at a time, we analyzed multiple sets of data on the percentage composition of 1-7 attempts using a one-way ANOVA model, to test whether there are significant differences within and between groups under different influencing factors, i.e., whether word features affect word difficulty and guessing results.

To eliminate the inherent error in the data in the absence of any feature influence, a control group was set up. Before conducting the one-way ANOVA, we also need to perform a normality test on the percentage composition of each group of 1-7 attempts to understand the distribution characteristics of the data.
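The normality check described above can be sketched with `scipy.stats`; the array below is a synthetic stand-in for one attempt-percentage column (the real data come from "Problem_C_Data_Wordle.xlsx"):

```python
# A sketch of the pre-ANOVA normality check, assuming the percentage
# composition for one attempt count is held in a NumPy array.
# The data here are synthetic stand-ins, not the real "4 tries" column.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
pct_4_tries = rng.normal(loc=32.9, scale=5.3, size=355)

# Shapiro-Wilk: a small p-value (< 0.05) rejects normality
w_stat, w_p = stats.shapiro(pct_4_tries)

# Kolmogorov-Smirnov against a normal distribution fitted to the sample
ks_stat, ks_p = stats.kstest(
    pct_4_tries, "norm", args=(pct_4_tries.mean(), pct_4_tries.std(ddof=1))
)
print(f"Shapiro-Wilk W={w_stat:.3f} (p={w_p:.3f}), KS D={ks_stat:.3f} (p={ks_p:.3f})")
```

Running both tests per column reproduces the structure of the table below (one Shapiro-Wilk and one Kolmogorov-Smirnov result per attempt count).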

1.2 The Solution of Model 2

The normality test results for attempts 1-7 are as follows:

| Variable name | Sample size | Median | Mean | Standard deviation | Skewness | Kurtosis | Shapiro-Wilk test | Kolmogorov-Smirnov test |
|---|---|---|---|---|---|---|---|---|
| 1 try | 355 | 0 | 0.465 | 0.782 | 3.472 | 19.065 | 0.559 (0.000***) | 0.344 (0.000***) |
| 2 tries | 355 | 5 | 5.786 | 3.99 | 1.559 | 3.295 | 0.873 (0.000***) | 0.166 (0.000***) |
| 3 tries | 355 | 23 | 22.651 | 7.756 | -0.006 | -0.471 | 0.992 (0.046**) | 0.056 (0.204) |
| 4 tries | 355 | 34 | 32.918 | 5.342 | -0.665 | 1.177 | 0.97 (0.000***) | 0.102 (0.001***) |
| 5 tries | 355 | 24 | 23.707 | 5.915 | 0.071 | -0.24 | 0.994 (0.182) | 0.051 (0.303) |
| 6 tries | 355 | 10 | 11.569 | 6.165 | 1.035 | 1.025 | 0.924 (0.000***) | 0.127 (0.000***) |
| 7 or more tries (X) | 355 | 2 | 2.803 | 4.128 | 5.492 | 45.756 | 0.509 (0.000***) | 0.281 (0.000***) |
Note: *** , ** , and * indicate the significance levels of 1%, 5%, and 10%, respectively.

Only the “5 tries” group is consistent with a normal distribution (P > 0.05), while the other six groups do not meet the normal distribution assumption.

  1. Number of letters appearing individually

Test for homogeneity of variance

Group standard deviations by number of letters appearing individually (group sizes in parentheses):

| Attempts | 5.0 (n=254) | 3.0 (n=94) | 1.0 (n=4) | 2.0 (n=2) | 4.0 (n=1) | F | P |
|---|---|---|---|---|---|---|---|
| 1 try | 0.866 | 0.387 | 0.5 | 0 | 0 | 7.533 | 0.000*** |
| 2 tries | 3.947 | 3.175 | 0.957 | 0.707 | 0 | 3.515 | 0.008*** |
| 3 tries | 7.079 | 7.018 | 1.708 | 0 | 0 | 2.623 | 0.035** |
| 4 tries | 4.883 | 6.194 | 4.425 | 7.778 | 0 | 1.608 | 0.172 |
| 5 tries | 5.313 | 5.569 | 1.5 | 12.021 | 0 | 2.315 | 0.057* |
| 6 tries | 5.495 | 6.332 | 4.69 | 9.899 | 0 | 1.493 | 0.204 |
| 7 or more tries (X) | 3.006 | 5.975 | 1.5 | 9.899 | 0 | 4.504 | 0.001*** |

Note: *** , ** , and * indicate the significance levels of 1%, 5%, and 10%, respectively.

The non-significant P values for “4 tries” and “6 tries” indicate homogeneity of variance, suggesting stronger explanatory power for these groups. Conversely, the significant P values for “1 try”, “2 tries”, “3 tries”, “5 tries”, and “7 or more tries” indicate heterogeneity of variance and weaker explanatory power.
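Since the table above reports only F and P, the exact homogeneity test used is not stated; Levene's test is one common choice that produces this form of output, sketched here on synthetic groups whose sizes mirror the table:

```python
# Sketch of a homogeneity-of-variance check across word-feature groups.
# Levene's test is our assumed choice (the paper reports only F and P);
# the group data are synthetic, with sizes mirroring the table above.
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
g5 = rng.normal(33.1, 4.9, 254)  # words with 5 individually appearing letters
g3 = rng.normal(32.7, 6.2, 94)   # words with 3
g1 = rng.normal(31.8, 4.4, 4)    # words with 1

f_stat, p_val = stats.levene(g5, g3, g1)
# p > 0.05: equal variances are not rejected, so homogeneity holds and the
# subsequent ANOVA F-test is more trustworthy for this grouping.
print(f"Levene F={f_stat:.3f}, p={p_val:.3f}")
```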

Figure 4: One-way ANOVA comparison chart

| Variable name | Variable value | Sample size | Mean | Standard deviation | F | P |
|---|---|---|---|---|---|---|
| 1 try | 5 | 254 | 0.579 | 0.866 | 5.008 | 0.001*** |
| | 3 | 94 | 0.181 | 0.387 | | |
| | 1 | 4 | 0.25 | 0.5 | | |
| | 2 | 2 | 0 | 0 | | |
| | 4 | 1 | 0 | 0 | | |
| | Total | 355 | 0.465 | 0.782 | | |
| 2 tries | 5 | 254 | 6.68 | 3.947 | 13.513 | 0.000*** |
| | 3 | 94 | 3.649 | 3.175 | | |
| | 1 | 4 | 1.75 | 0.957 | | |
| | 2 | 2 | 0.5 | 0.707 | | |
| | 4 | 1 | 6 | 0 | | |
| | Total | 355 | 5.786 | 3.99 | | |
| 3 tries | 5 | 254 | 24.598 | 7.079 | 20.4 | 0.000*** |
| | 3 | 94 | 18.277 | 7.018 | | |
| | 1 | 4 | 10.75 | 1.708 | | |
| | 2 | 2 | 4 | 0 | | |
| | 4 | 1 | 24 | 0 | | |
| | Total | 355 | 22.651 | 7.756 | | |
| 4 tries | 5 | 254 | 33.126 | 4.883 | 3.437 | 0.009*** |
| | 3 | 94 | 32.702 | 6.194 | | |
| | 1 | 4 | 31.75 | 4.425 | | |
| | 2 | 2 | 19.5 | 7.778 | | |
| | 4 | 1 | 32 | 0 | | |
| | Total | 355 | 32.918 | 5.342 | | |
| 5 tries | 5 | 254 | 22.28 | 5.313 | 18.971 | 0.000*** |
| | 3 | 94 | 26.84 | 5.569 | | |
| | 1 | 4 | 34.75 | 1.5 | | |
| | 2 | 2 | 35.5 | 12.021 | | |
| | 4 | 1 | 24 | 0 | | |
| | Total | 355 | 23.707 | 5.915 | | |
| 6 tries | 5 | 254 | 10.327 | 5.495 | 14.574 | 0.000*** |
| | 3 | 94 | 14.266 | 6.332 | | |
| | 1 | 4 | 18 | 4.69 | | |
| | 2 | 2 | 30 | 9.899 | | |
| | 4 | 1 | 11 | 0 | | |
| | Total | 355 | 11.569 | 6.165 | | |
| 7 or more tries (X) | 5 | 254 | 2.311 | 3.006 | 4.921 | 0.001*** |
| | 3 | 94 | 3.957 | 5.975 | | |
| | 1 | 4 | 2.75 | 1.5 | | |
| | 2 | 2 | 11 | 9.899 | | |
| | 4 | 1 | 3 | 0 | | |
| | Total | 355 | 2.803 | 4.128 | | |
Note: *** , ** , and * indicate the significance levels of 1%, 5%, and 10%, respectively.

Figure 5: Analysis of variance results table

The “Number of letters appearing individually” takes values from 1 to 5 across the groups, and for every number of attempts the analysis of variance shows significant between-group differences.

Combining the above findings, “Number of letters appearing individually” has a significant effect with strong explanatory power on the “4 tries” and “6 tries” groups.
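The per-group ANOVA comparison behind these tables can be sketched with `scipy.stats.f_oneway`; the groups below are synthetic stand-ins (in the real analysis they come from splitting the 355 words by feature value):

```python
# Minimal one-way ANOVA sketch: does the "4 tries" percentage differ across
# groups of the word feature? (Synthetic stand-in data; the real groups come
# from splitting the dataset by the feature value.)
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)
groups = [
    rng.normal(33.1, 4.9, 254),  # feature value 5
    rng.normal(32.7, 6.2, 94),   # feature value 3
    rng.normal(31.8, 4.4, 4),    # feature value 1
]
f_stat, p_val = stats.f_oneway(*groups)
print(f"F={f_stat:.3f}, p={p_val:.3f}")  # p < 0.05 means group means differ
```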

  2. Number of vowels

Test for homogeneity of variance

Group standard deviations by number of vowels (group sizes in parentheses):

| Attempts | 1.0 (n=103) | 2.0 (n=215) | 3.0 (n=35) | 0.0 (n=2) | F | P |
|---|---|---|---|---|---|---|
| 1 try | 0.505 | 0.885 | 0.775 | 0 | 2.672 | 0.047** |
| 2 tries | 3.35 | 4.065 | 4.984 | 0.707 | 2.317 | 0.075* |
| 3 tries | 8.651 | 7.331 | 7.868 | 2.121 | 1.325 | 0.266 |
| 4 tries | 5.779 | 5.253 | 3.544 | 5.657 | 2.081 | 0.102 |
| 5 tries | 6.106 | 5.777 | 6.414 | 2.121 | 0.897 | 0.443 |
| 6 tries | 7.162 | 5.698 | 5.808 | 0.707 | 2.297 | 0.077* |
| 7 or more tries (X) | 4.133 | 4.38 | 2.131 | 0 | 1.61 | 0.187 |

Note: *** , ** , and * indicate the significance levels of 1%, 5%, and 10%, respectively.

The significant P values for “1 try”, “2 tries”, and “6 tries” indicate that those groups do not meet the assumption of homogeneity of variance, suggesting weak explanatory power. The non-significant P values for “3 tries”, “4 tries”, “5 tries”, and “7 or more tries” are consistent with homogeneity of variance, suggesting stronger explanatory power for those groups.

Figure 6: One-way ANOVA comparison chart

Analysis of variance results table

| Variable name | Variable value | Sample size | Mean | Standard deviation | F | P |
|---|---|---|---|---|---|---|
| 1 try | 1 | 103 | 0.369 | 0.505 | 1.197 | 0.311 |
| | 2 | 215 | 0.493 | 0.885 | | |
| | 3 | 35 | 0.6 | 0.775 | | |
| | 0 | 2 | 0 | 0 | | |
| | Total | 355 | 0.465 | 0.782 | | |
| 2 tries | 1 | 103 | 5.068 | 3.35 | 2.315 | 0.076* |
| | 2 | 215 | 6.033 | 4.065 | | |
| | 3 | 35 | 6.571 | 4.984 | | |
| | 0 | 2 | 2.5 | 0.707 | | |
| | Total | 355 | 5.786 | 3.99 | | |
| 3 tries | 1 | 103 | 22.243 | 8.651 | 0.28 | 0.84 |
| | 2 | 215 | 22.921 | 7.331 | | |
| | 3 | 35 | 22.086 | 7.868 | | |
| | 0 | 2 | 24.5 | 2.121 | | |
| | Total | 355 | 22.651 | 7.756 | | |
| 4 tries | 1 | 103 | 33.194 | 5.779 | 3.947 | 0.009*** |
| | 2 | 215 | 32.819 | 5.253 | | |
| | 3 | 35 | 32.029 | 3.544 | | |
| | 0 | 2 | 45 | 5.657 | | |
| | Total | 355 | 32.918 | 5.342 | | |
| 5 tries | 1 | 103 | 23.68 | 6.106 | 0.329 | 0.805 |
| | 2 | 215 | 23.609 | 5.777 | | |
| | 3 | 35 | 24.514 | 6.414 | | |
| | 0 | 2 | 21.5 | 2.121 | | |
| | Total | 355 | 23.707 | 5.915 | | |
| 6 tries | 1 | 103 | 12.117 | 7.162 | 1.106 | 0.346 |
| | 2 | 215 | 11.298 | 5.698 | | |
| | 3 | 35 | 11.971 | 5.808 | | |
| | 0 | 2 | 5.5 | 0.707 | | |
| | Total | 355 | 11.569 | 6.165 | | |
| 7 or more tries (X) | 1 | 103 | 3.194 | 4.133 | 0.591 | 0.621 |
| | 2 | 215 | 2.698 | 4.38 | | |
| | 3 | 35 | 2.4 | 2.131 | | |
| | 0 | 2 | 1 | 0 | | |
| | Total | 355 | 2.803 | 4.128 | | |
Note: *** , ** , and * indicate the significance levels of 1%, 5%, and 10%, respectively.

With the number of vowels taking values 0, 1, 2, and 3, the analysis of variance shows a significant difference only in the case of “4 tries” (P = 0.009), with at most a marginal effect for “2 tries” (P = 0.076).

Therefore, based on the above, the number of vowels has a significant and relatively strong effect on the “4 tries” group, exhibiting higher explanatory power there.

2 Multi-Input and Multi-Output Distribution Prediction Model Based on Common Machine Learning

2.1 Data Description

We used the dataset processed in the previous question and calculated the number of individually appearing letters and the number of vowels for each word. The textual data were thus transformed into a numerical format that a computer can process. The two features summarized earlier serve both as characteristics of the words and as features for machine learning.
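The encoding step can be sketched as follows. Note that the paper does not define its two features precisely, so this is one plausible reading (distinct letters, and vowel occurrences counting repeats):

```python
# One plausible encoding of the two word features used as model inputs.
# Our assumed reading: "letters appearing individually" = distinct letters,
# and the vowel feature = number of vowel occurrences (repeats counted).
VOWELS = set("aeiou")

def word_features(word: str) -> tuple[int, int]:
    """Return (number of distinct letters, number of vowel occurrences)."""
    w = word.lower()
    return len(set(w)), sum(ch in VOWELS for ch in w)

print(word_features("eerie"))  # (3, 4): distinct letters {e, r, i}; vowels e, e, i, e
```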

2.2 The Establishment of Model 3

In order to predict the percentage of attempts for a given word on future dates, we combine each word's existing percentage distribution with the features summarized in the previous section as prediction indicators. Because of the large amount of data involved and the difficulty of directly describing the underlying internal relationships, we decided to use machine learning as a data analysis method to construct the model automatically. By using iterative algorithms to learn from the data, machine learning allows computers to discover hidden patterns without being explicitly programmed where to look. Iteration is particularly important because it enables the model to adapt to new data independently, building on the calculations, decisions, and outcomes of previous iterations.

We believe that machine learning can produce models faster and automatically to analyze larger and more complex data, resulting in more accurate results. The algorithm constantly trains to discover patterns and correlations from large datasets, then makes the best decisions and predictions based on data analysis results.

For this task, the number of guesses respondents took until they correctly guessed each word, collected between January 7, 2022 and December 31, 2022, is used as the label for the data, together with the properties of the word as inputs. The learning algorithm is fed a series of inputs with corresponding correct outputs, and the model is adjusted accordingly. Through classification, regression, and gradient boosting, supervised learning uses the learned patterns to predict labels for additional unlabeled data.

2.3 The Solution of Model 3

Regarding the aforementioned machine learning models, we imported the sklearn module to call the interfaces of several common machine learning models and selected those suited to multiple input variables and multiple outputs, namely neural networks, linear regression, random forests, and LGBM. In developing a machine learning model, the goal is for the trained model to perform well on new, unseen data. To simulate such data, we partitioned the available data into two parts. Prior to training, we used the hold-out validation method (alternatives include leave-one-out and k-fold cross-validation) to divide the dataset into training and testing sets, in order to evaluate the model's performance. In this experiment, we used a training-to-testing ratio of 7:3.
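The split and the candidate model families can be sketched with scikit-learn on synthetic data (the real X stacks the word features; LGBM is omitted here to keep the sketch to scikit-learn alone):

```python
# Sketch of the 7:3 hold-out split and three of the candidate model families.
# The data are synthetic stand-ins for the word-feature matrix and the
# percentage composition of 1-7 tries for each of the 355 words.
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPRegressor

rng = np.random.default_rng(3)
X = rng.random((355, 2))        # two word features per word
Y = rng.random((355, 7)) * 100  # percentages for 1-7 tries

X_train, X_test, Y_train, Y_test = train_test_split(
    X, Y, test_size=0.3, random_state=42  # training : testing = 7 : 3
)

models = {
    "linear regression": LinearRegression(),
    "random forest": RandomForestRegressor(n_estimators=100, random_state=0),
    "neural network": MLPRegressor(hidden_layer_sizes=(32,), max_iter=500),
}
for name, model in models.items():
    model.fit(X_train, Y_train)  # all three accept multi-output targets directly
```

LGBM does not fit multi-output targets natively, so in practice it would be wrapped (e.g. one model per output column) rather than called like the three estimators above.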

The goal of machine learning is to minimize the loss function. This means not only predicting well on the training data (a very low training loss) but, more fundamentally, predicting well on new data (generalization ability). We trained each model by fitting it to the training inputs x, which represent the attributes of each word, and the training labels y, which represent the distribution of the number of guesses players needed for each word. We used a for loop to implement a grid search over hyperparameters and evaluated accuracy on the held-out testing set. Based on the validation results, several of the machine learning models achieve high accuracy, with an average percentage error on the testing set below 5%, i.e., accuracy around 95%, so the predicted results are close to the actual results. After combining the two word features proposed in the first question, each machine learning model produced its final prediction.
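The for-loop grid search scored by mean absolute percentage error can be sketched as follows; the data and the grid values are illustrative, not the paper's actual settings:

```python
# Sketch of the for-loop grid search scored by mean absolute percentage error
# (MAPE) on the held-out set. Data and grid values are illustrative only.
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(4)
X = rng.random((355, 2))
y = 30 + 10 * X[:, 0] + rng.normal(0, 1, 355)  # stand-in for one percentage column

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=42)

best_mape, best_params = np.inf, None
for n_estimators in (50, 100, 200):
    for max_depth in (3, 5, None):
        model = RandomForestRegressor(
            n_estimators=n_estimators, max_depth=max_depth, random_state=0
        ).fit(X_tr, y_tr)
        mape = float(np.mean(np.abs((model.predict(X_te) - y_te) / y_te))) * 100
        if mape < best_mape:
            best_mape, best_params = mape, (n_estimators, max_depth)

print(f"best MAPE = {best_mape:.2f}% at (n_estimators, max_depth) = {best_params}")
```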

The following arrays show the predicted percentages for 1-7 attempts for the word “eerie”, as produced by three of the models, in order.

neural networks:

random forests:

LGBM:

2.4 The Uncertainties of Model 3

Selection of reference indicators: in the machine learning models, we only included each word's existing percentage of attempts and the two summarized word features as predictors. We did not consider the possible influence of date, nor the variation of the “Number of reported results” column across dates. It is difficult to determine whether the number of reported results correlates with word difficulty, and how to weight different dates is an uncertainty the model faces during prediction.

Proportion of the training/testing split: to ensure an adequate number of test instances, we need to increase the proportion of the test set, but this reduces the amount of data available for fitting the training set.

2.5 Confidence Assessment of Model 3

The table below presents the results of the error analysis.

3 Hierarchical Clustering Model based on Letter Encoding and Word Features

3.1 The Establishment of Model 4

Through the aforementioned process, we discovered that a word's difficulty can be determined by its own characteristics, and that the percentage distribution of player attempts can serve as an indicator of a puzzle's difficulty. To categorize the words in the dataset by difficulty, we conducted hierarchical clustering, combining the percentage of attempts with the word features we previously derived, such as the number of vowels, consonants, and standalone letters. Since the dataset is relatively small, the computational complexity is acceptable and this approach is feasible. We anticipate that the words will be divided into three difficulty classes by building a dendrogram and pruning it at a reasonable threshold.
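The clustering step can be sketched with `scipy.cluster.hierarchy`; the feature matrix here is synthetic, and Ward linkage is our assumed choice (the paper does not name its linkage method):

```python
# Sketch of the hierarchical clustering step: build a linkage tree over the
# combined feature/percentage vectors, then cut it into three difficulty levels.
# The feature matrix is synthetic; the real one stacks the word features and
# the 1-7 attempt percentages for each of the 355 words.
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage

rng = np.random.default_rng(5)
features = rng.random((355, 9))  # 2 word features + 7 attempt percentages

Z = linkage(features, method="ward")             # Ward linkage, Euclidean distance
labels = fcluster(Z, t=3, criterion="maxclust")  # prune the tree at 3 clusters

print(sorted(set(labels.tolist())))  # the three difficulty classes: [1, 2, 3]
```

The dendrogram itself (Figure 7) can be drawn from the same linkage matrix `Z` via `scipy.cluster.hierarchy.dendrogram`.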

3.2 The Solution of Model 4

 The hierarchical clustering results are shown in Figure 7, and the difficulty levels are clearly divided.

Figure 7:The hierarchical clustering results

As a result, the word “EERIE” is classified as difficult.

To validate the accuracy of the model, cross-validation was conducted: the majority of the given sample was used to build the model, while a small portion was set aside and predicted with the newly established model, and the sum of squared prediction errors for this held-out portion was recorded.
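The recorded error described above can be sketched as follows, using a regression stand-in for the fitted model (the real model is the clustering pipeline, and the data are synthetic):

```python
# Sketch of the validation described above: fit on most of the sample, predict
# the small held-out portion, and record the sum of squared prediction errors.
# (Synthetic regression stand-in; the real model is the clustering pipeline.)
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(6)
X = rng.random((355, 9))
y = X @ rng.random(9) + rng.normal(0, 0.1, 355)

X_fit, X_hold, y_fit, y_hold = train_test_split(X, y, test_size=0.1, random_state=0)
model = RandomForestRegressor(random_state=0).fit(X_fit, y_fit)
sse = float(np.sum((model.predict(X_hold) - y_hold) ** 2))  # recorded error
print(f"held-out SSE = {sse:.3f}")
```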

4 Other Interesting Features in the Dataset

It may come as a surprise that as many as 38.3% of words are guessed correctly on the first attempt, which may be attributed to people's preferences for certain types of words as well as the frequency of commonly used words in Wordle's word selection. A significant majority of words (81.7%) can be correctly guessed within four attempts or fewer, indicating that success is achievable for most individuals.

5 Model Evaluation

5.1 Strengths

Data preprocessing: we eliminated abnormal data from the dataset to improve its reliability, and pre-normalized the percentages involved in the calculations to simplify the computation.

Use of real-world information to justify the data: evidence of dataset changes from Google Trends and related news helps explain the influence of exogenous factors when fitting the data.

Control-group settings in the variance test: before the one-way analysis of variance, the original data are tested for normality separately, which accounts for the inherent characteristics of the data themselves and prevents those characteristics from disturbing the accuracy of the test's conclusions.

Good scalability and flexibility: the model can explain the relationship between features of English words and the level of cognitive difficulty associated with those words. As a result, it can be used to improve the Wordle game or to conduct further research in linguistics.

5.2 Weaknesses

Only common machine learning methods are adopted: there is no algorithm improvement or model optimization tailored to this specific topic; that is, the classification and regression algorithms have not been extended to the specific situation, so even if the amount of data increases, the subsequent improvement in accuracy may not be obvious.

The content of the dataset is limited: too little data is available when learning the weight distribution, so the unusual behavior of individual data points has an outsized impact on the overall model fit.

6 Conclusion

For question one, the ARIMA-based time-series forecasting model was employed under two distinct assumptions to derive a prediction interval for the number of reported results on March 1, 2023, ranging from 10,342 to 13,353. The characteristics of a word were shown to influence the percentage distribution of guesses: features such as the number of letters appearing individually have a notable and statistically significant effect on the “4 tries” and “6 tries” groups, while the number of vowels exhibits a stronger effect on the “4 tries” group, displaying higher explanatory power.

Regarding question two, machine learning models were used to predict the percentage distribution of player attempts on future dates, with the summarized word characteristics serving as feature labels. Prediction for “EERIE” on March 1, 2023:

neural networks:

random forests:

LGBM:

In addressing problem three, a hierarchical clustering model was employed that utilized both letter encoding and word features to classify the words according to their level of difficulty. The assessment revealed that the word “EERIE” was classified as being of high difficulty.

7 A Letter to the Wordle Editor of The New York Times

Dear Editor,

We are writing to you with regards to the popular game Wordle, which was recently published on the New York Times website. As avid players and data analysts, we have been studying the game’s historical player data and would like to share our insights with you.

Firstly, we believe that the game’s popularity is rooted in its ability to offer players a challenging yet satisfying experience. The game’s difficulty is not solely determined by the length of the word, but also by its composition. Various factors such as the number of vowels, the presence of repeated letters, and the distribution of consonants can significantly impact the difficulty of guessing the correct word.

Furthermore, by analyzing the historical player data, we have discovered a correlation between the difficulty of the puzzle words and the answers submitted by players. In other words, the distribution of answers can provide insights into the difficulty level of the puzzle words, which can help players strategize and improve their performance. We can predict the percentage of player attempts on future dates based on machine learning models and classify solutions by difficulty with the help of the hierarchical clustering model.

As passionate data analysts and word-composition enthusiasts, we suggest adding words of other categories, with more than five letters, to increase puzzle variety as players grow more practiced. We believe there is great potential to further explore the relationship between the difficulty of the puzzle words and the corresponding answers in Wordle. We would welcome the opportunity to discuss this topic with you and potentially collaborate on future projects related to the game.

Thank you for your time and consideration.

Yours Sincerely,

MCM TEAM 2322465

References

[1]  Yan Wang. Applied Time Series Analysis [M]. Beijing: China Renmin University Press, 2005.

[2]  Zhang, M. (2022). Research on the Characteristics and Laws of Public Opinion Dissemination Based on the ARIMA Model: An Example of Early Data on Weibo Platform. Advances in Applied Mathematics, 11, 2764.

[3]  Zhao, Y., Karypis, G., & Fayyad, U. (2005). Hierarchical clustering algorithms for document datasets. Data mining and knowledge discovery, 10, 141-168.

[4]  Settles, B., T. LaFlair, G., & Hagiwara, M. (2020). Machine learning–driven language assessment. Transactions of the Association for computational Linguistics, 8, 247-263.

[5]  Taamneh, M., Taamneh, S., & Alkheder, S. (2017). Clustering-based classification of road traffic accidents using hierarchical clustering and artificial neural networks. International journal of injury control and safety promotion, 24(3), 388-395.

That is the full content of the paper. Some formatting may have shifted in the process of porting it here, but the content is presented in its entirety. Whatever the final result, this was a hard-won piece of work, and it will serve as a memento.

For the next while, the blog will lean toward the study and understanding of programming languages, though there will also be posts in other directions. That is all for this post; thanks for reading.
