CALL – Is white space tokenization enough?
For the first example, we used the sentence, “I can’t believe how fast the time flew by during our weekend getaway.” Whitespace seems efficient enough to tokenize English text here, because the only problematic token was “can’t”: it was separated into “ca” and “n’t”, which is not correct as a whole word, though the split does fall at the boundary of the two words that make up the contraction.
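As a quick illustration, a plain Python `split()` (one literal reading of “whitespace tokenization”) actually keeps “can’t” as a single token; the “ca”/“n’t” split described above matches Treebank-style contraction handling (as in NLTK’s `word_tokenize`). The sketch below is an assumption about the tool used, with the contraction rule approximated by a simple regex:

```python
import re

sentence = "I can't believe how fast the time flew by during our weekend getaway."

# Plain whitespace tokenization: split on spaces only.
# This keeps "can't" together as one token.
whitespace_tokens = sentence.split()

def treebank_style(text):
    """Rough sketch of Treebank-style contraction splitting (an assumption
    about the tokenizer that produced the splits described above)."""
    # Split off "n't", so "can't" -> "ca", "n't".
    text = re.sub(r"(?<=\w)n't\b", r" n't", text)
    # Split off other clitics such as 've, 're, 'll, 's, 'd, 'm.
    text = re.sub(r"(?<=\w)('(?:ve|re|ll|s|d|m))\b", r" \1", text)
    return text.split()

print(whitespace_tokens)          # "can't" stays whole
print(treebank_style(sentence))   # "can't" becomes "ca", "n't"
```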
For the second example, we used the sentence, “I don’t know if I can handle the pressure of this last-minute decision.” Here, too, whitespace seemed efficient enough to tokenize English text: the contraction was again separated at the boundary of its two underlying words, yielding “do” and “n’t”. Moreover, “last-minute” was correctly tokenized as a single token rather than being split apart.
Finally, the last example we used was, “They’ve been working nonstop to finish the project on time.” This sentence was also tokenized efficiently: as in the previous examples, “they’ve” was separated into “they” and “’ve”, while “nonstop” was kept as one word. Therefore, across all of the examples, we would say whitespace is fairly efficient at tokenizing English text.