Most platforms have revised their policies over time, typically announcing updates once or twice a year. However, they may also make minor or substantial changes without prior notice. To track how these policies evolved, we examined several lexical metrics: sentence and word counts, unique vocabulary, readability, complexity, and richness. The lexical analysis of TikTok’s and Douyin’s policy documents is presented below as an example.
We used the `sent_tokenize` and `word_tokenize` functions from the `nltk` package to count the number of sentences, words, and unique words.
Readability was computed with the `flesch_kincaid` function in the `readability` package, which applies the Flesch-Kincaid Grade Level, an estimate of the school grade needed to understand the text.
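As a rough illustration of the three counts, the sketch below uses regular expressions as a simplified stand-in for `nltk`'s tokenizers, so it runs without downloading tokenizer data. The function name `count_lexical_units` and the sample text are hypothetical. Real `nltk` output will differ, for instance because `word_tokenize` also emits punctuation tokens, and the report does not say whether punctuation was filtered or words were lowercased.

```python
import re

def count_lexical_units(text):
    """Return (sentences, words, unique words) for a piece of English text."""
    # Simplified stand-in for nltk.sent_tokenize:
    # split after '.', '!' or '?' when followed by whitespace.
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]
    # Simplified stand-in for nltk.word_tokenize:
    # keep alphabetic tokens only, lowercased (an assumption).
    words = re.findall(r"[a-z]+(?:'[a-z]+)*", text.lower())
    return len(sentences), len(words), len(set(words))

# Hypothetical policy-like sample text.
sample = "We collect your data. We share your data with partners. You can opt out!"
n_sentences, n_words, n_unique = count_lexical_units(sample)
print(n_sentences, n_words, n_unique)  # 3 14 11
```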
Flesch-Kincaid Grade Level = (0.39 × average number of words per sentence) + (11.8 × average number of syllables per word) − 15.59
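The formula can be applied directly once words, sentences, and syllables have been counted. The following is a minimal sketch, not the `readability` package's implementation: the syllable counter is a crude vowel-group heuristic chosen for illustration, and all function names are hypothetical.

```python
import re

def count_syllables(word):
    """Heuristic syllable count: vowel groups, minus a silent trailing 'e'."""
    word = word.lower()
    count = len(re.findall(r"[aeiouy]+", word))
    # Treat a final 'e' as silent, except in '-le' endings such as 'table'.
    if word.endswith("e") and not word.endswith("le") and count > 1:
        count -= 1
    return max(count, 1)

def flesch_kincaid_grade(n_words, n_sentences, n_syllables):
    """Apply the Flesch-Kincaid Grade Level formula to raw counts."""
    return 0.39 * (n_words / n_sentences) + 11.8 * (n_syllables / n_words) - 15.59

def grade_for_text(text):
    """Estimate the grade level of a text with regex-based tokenization."""
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]
    words = re.findall(r"[A-Za-z]+", text)
    syllables = sum(count_syllables(w) for w in words)
    return flesch_kincaid_grade(len(words), len(sentences), syllables)

# Worked example: 100 words, 5 sentences, 150 syllables
# -> 0.39 * 20 + 11.8 * 1.5 - 15.59
print(round(flesch_kincaid_grade(100, 5, 150), 2))  # 9.91
```

Longer sentences and longer words both raise the grade, so a higher value indicates text that requires more schooling to follow.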
Text richness was measured using hapax richness: the number of words that occur only once (hapaxes) divided by the total number of words. It is somewhat similar to text complexity but focuses on the words that appear only once.
Hapax Richness = Number of Hapaxes / Total number of words
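A small sketch of this ratio, assuming lowercased alphabetic tokens (the report does not specify its preprocessing) and a hypothetical helper name `hapax_richness`:

```python
import re
from collections import Counter

def hapax_richness(text):
    """Number of words occurring exactly once divided by total words."""
    tokens = re.findall(r"[a-z]+", text.lower())
    if not tokens:
        return 0.0
    counts = Counter(tokens)
    hapaxes = sum(1 for c in counts.values() if c == 1)
    return hapaxes / len(tokens)

# 'the' occurs twice; 'cat', 'sat', 'on', 'mat' occur once: 4 hapaxes / 6 tokens
print(round(hapax_richness("the cat sat on the mat"), 3))  # 0.667
```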
Text complexity was measured by the Type-Token Ratio (TTR), obtained from the `ttr` attribute in the `LexicalRichness` package: the number of unique words (types) divided by the total number of words (tokens). A larger value means higher text complexity, but the value may stabilize once the corpus is large enough, because it becomes difficult to add more unique words.
Type-Token Ratio = Total number of unique words / Total number of words
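For comparison with hapax richness, the ratio can be sketched with the standard library alone. This is an illustration under the same tokenization assumption (lowercased alphabetic tokens), not the `LexicalRichness` implementation, and `type_token_ratio` is a hypothetical name.

```python
import re

def type_token_ratio(text):
    """Unique words (types) divided by total words (tokens)."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return len(set(tokens)) / len(tokens) if tokens else 0.0

# 5 types ('the', 'cat', 'sat', 'on', 'mat') over 6 tokens
print(round(type_token_ratio("the cat sat on the mat"), 3))  # 0.833
```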




The word clouds above show TikTok’s terms of use, privacy policy, and community guidelines.
TikTok’s and Douyin’s policy documents differ in several respects. For example, TikTok has added sections such as “Community Principles” and “Youth Safety and Well-Being” to its community guidelines since March 2023, making the structure more detailed but also more complex. In contrast, Douyin’s policy documents have a cleaner structure and are easier to access.
Another difference appears in TikTok’s terms of use, which evolved from a single set of terms applicable to all regions to region-specific terms displayed on separate pages. This change produces differences in the number of sentences and words, whereas Douyin has fewer trivial changes, such as changes in wording or punctuation. These differences can affect the lexical overview and diversity of each platform’s policies over time.


In these graphs, there is a noticeable peak in the number of words on the left, as well as peaks in both the normalized number of words and the normalized number of sentences on the right. During this period (2020-07-29 to 2020-10-10), the platform displayed the policy versions for different regions together on one page, which increased the overall length and therefore the word and sentence counts.

Combining the policy versions for different regions also strongly influenced the readability, complexity, and richness trends in 2020. Although the policies became longer, they contained many duplicated words and phrases as well as a large number of simple words, so complexity and richness decreased. After the platform changed its display approach and separated the policies by region, both complexity and richness increased because the length of the policies dropped significantly, which also improved readability.
Overall, however, the complexity and richness of the policy dropped from 2017 to 2021 and then rose quickly, returning to their initial level.
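The effect of showing several regional versions on one page can be reproduced on toy data. In the sketch below, the two "regional" sentences are invented for illustration and the helper `richness_metrics` is hypothetical. Concatenating near-duplicate texts doubles the token count while adding few new types, so both the type-token ratio and hapax richness fall.

```python
import re
from collections import Counter

def richness_metrics(text):
    """Return (token count, type-token ratio, hapax richness)."""
    tokens = re.findall(r"[a-z]+", text.lower())
    counts = Counter(tokens)
    ttr = len(counts) / len(tokens)
    hapax = sum(1 for c in counts.values() if c == 1) / len(tokens)
    return len(tokens), ttr, hapax

# Invented near-duplicate regional variants (not taken from any real policy).
us = "You agree to these terms when you use the service in the United States."
eu = "You agree to these terms when you use the service in the European Union."

single = richness_metrics(us)
combined = richness_metrics(us + " " + eu)

print("single:   tokens=%d ttr=%.3f hapax=%.3f" % single)
print("combined: tokens=%d ttr=%.3f hapax=%.3f" % combined)
```

On this toy input, the combined text has twice as many tokens but a clearly lower ratio on both measures, which matches the direction of the 2020 dip described above.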


The number of words, unique words, and sentences remained stable from 2017 to 2019. However, significant changes occurred from 2019 to the beginning of 2023. By observing the trends in the number of sentences and words, we can deduce that the policy documents were the longest during the period from 2020 to the middle of 2021. However, since then, the length of the policy has returned to its normal level and has experienced slight fluctuations up until now.

The complexity and text richness of TikTok’s privacy policy stayed stable at the beginning, rose and dropped dramatically from 2019 to 2021, and returned to their original level in 2022. Readability behaved slightly differently: it remained generally stable and has increased since June 2021. Since 2022, the indexes have fluctuated within a reasonable range.


In general, the numbers of words, unique words, and sentences have been increasing since 2017, which indicates that the text has been getting longer.

The trends in readability, complexity, and richness are completely different. Overall, the complexity and richness of the text have been decreasing, while readability has been increasing. This means that TikTok’s community guidelines are becoming more and more user-friendly and easier to understand.




The word clouds below show Douyin’s user agreement and privacy policy.
For Douyin’s data, we followed a process similar to the one used for TikTok’s policy documents. However, before calculating readability, we translated the Chinese text into English, which allowed us to use the Flesch-Kincaid Grade Level formula. Douyin updated its user agreement 16 times and its privacy policy 21 times.


From the graphs, we can observe that the general trends of the number of words, unique words, and sentences are quite similar. However, the count of unique words in the policy text appears to be more stable, while the number of sentences and words underwent significant changes both at the beginning and in the past two years.

It is concerning to note that the overall readability of the user agreements has been decreasing, indicating that they are becoming less accessible for users to comprehend. Conversely, the general trends of richness and complexity are increasing, suggesting that the policies are becoming more intricate and detailed. Furthermore, these three indexes remained stable from the middle of 2018 to 2022.


Douyin’s privacy policy text exhibits a clear upward trend in the number of words, unique words, and sentences.

However, the readability, complexity, and richness of Douyin’s privacy policy show a declining trend overall. In 2017, all three metrics had high values, indicating that the policy was initially difficult to read. Since then, the readability has slightly improved, but it has mostly remained at a similar level. On the other hand, both the richness and complexity of the policy have decreased in recent years.