Probabilistic PoS Tagging: limited scope for syntactic dependencies

by Laurens Ludovicus Michielsen -
Number of replies: 2

Hi everyone,

I have a question about the probability we are maximizing. After applying both hypotheses we get the following:

    argmax over T_1, ..., T_n of  P(T_1^k) · ∏_{i=k+1..n} P(T_i | T_{i-k+1}, ..., T_{i-1}) · ∏_{i=1..n} P(w_i | T_i)

I am just wondering what P(T_1^k) is equal to. Just as an example, let's take k = 3. Is this probability equal to

  • P(T_1) * P(T_2) * P(T_3), or
  • P(T_1) * P(T_2 | T_1) * P(T_3 | T_2, T_1)?

I am inclined to go for the second option, as it makes more sense, but the notation on the slide confuses me slightly, so I just want to be sure.
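
To see the difference between the two options concretely, here is a minimal numeric sketch (the toy distribution and the helper marginal are made up for illustration, not from the course): only the chain-rule expansion recovers the joint probability P(T_1, T_2, T_3).

    # Toy joint distribution over three tag positions, each tag in {"N", "V"}.
    # The numbers are arbitrary (they only need to sum to 1) -- purely illustrative.
    joint = {
        ("N", "N", "N"): 0.10, ("N", "N", "V"): 0.15,
        ("N", "V", "N"): 0.20, ("N", "V", "V"): 0.05,
        ("V", "N", "N"): 0.12, ("V", "N", "V"): 0.08,
        ("V", "V", "N"): 0.18, ("V", "V", "V"): 0.12,
    }

    def marginal(positions, values):
        # P(T_positions = values), summing the joint over the remaining positions.
        return sum(p for seq, p in joint.items()
                   if all(seq[i] == v for i, v in zip(positions, values)))

    t1, t2, t3 = "N", "V", "N"

    # First option: full independence -- in general NOT equal to the joint.
    opt1 = marginal([0], [t1]) * marginal([1], [t2]) * marginal([2], [t3])

    # Second option: chain rule -- equal to the joint by definition.
    opt2 = (marginal([0], [t1])
            * marginal([0, 1], [t1, t2]) / marginal([0], [t1])
            * marginal([0, 1, 2], [t1, t2, t3]) / marginal([0, 1], [t1, t2]))

    print(round(marginal([0, 1, 2], [t1, t2, t3]), 4))  # 0.2    (the joint itself)
    print(round(opt1, 4))                               # 0.165  (differs from the joint)
    print(round(opt2, 4))                               # 0.2    (matches the joint)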


Thanks for your help!

Laurens

In reply to Laurens Ludovicus Michielsen

Re: Probabilistic PoS Tagging: limited scope for syntactic dependencies

by Jean-Cédric Chappelier -
It's nothing more than P(T_1, ..., T_k): the probability of starting with that k-gram of tags. Notice that k is the "size" of the model: the "support" for the parameters is k-grams.
If k = 3, this is P(T_1 T_2 T_3), and that's it! (As usual, something is implicit in the notation: the subscripts _1, _2 and _3 indicate that it is an initial probability, i.e. the probability of starting with that sequence.)
Sure, you can rewrite it with your second formula (not the first one, which is wrong), but I don't see the point: for k = 3, the values P(T_1 = t_1, T_2 = t_2, T_3 = t_3) are themselves parameters (to be learned).
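For concreteness, here is a minimal sketch of that last point (the corpus and names are made up for illustration, not course code): the P(T_1, ..., T_k) parameters can be estimated as relative frequencies of sentence-initial k-grams in a tagged corpus.

    from collections import Counter

    # Hypothetical tagged corpus: one tag sequence per sentence (made-up data).
    tag_sequences = [
        ["DET", "NOUN", "VERB", "DET", "NOUN"],
        ["DET", "NOUN", "VERB", "ADV"],
        ["PRON", "VERB", "DET", "NOUN"],
        ["DET", "ADJ", "NOUN", "VERB"],
    ]

    def initial_kgram_probs(sequences, k):
        # Maximum-likelihood estimate of P(T_1, ..., T_k): the relative
        # frequency of each sentence-initial k-gram of tags.
        counts = Counter(tuple(tags[:k]) for tags in sequences if len(tags) >= k)
        total = sum(counts.values())
        return {kgram: count / total for kgram, count in counts.items()}

    for kgram, p in sorted(initial_kgram_probs(tag_sequences, k=3).items()):
        print(kgram, p)
    # ('DET', 'ADJ', 'NOUN') 0.25
    # ('DET', 'NOUN', 'VERB') 0.5
    # ('PRON', 'VERB', 'DET') 0.25

In practice these relative frequencies would typically be smoothed, but the raw counts already illustrate what "parameters to be learned" means here.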
Does that clarify it?