955G5 Applied Natural Language Processing
955G5
MSc EXAMINATION
August 2021 (A3)
Applied Natural Language Processing
1. You are building a classifier to determine whether posts to a popular social media site are positive or negative in sentiment.
(a) Study the following code.
def my_function(list_of_posts):
    somedict = {}
    for text, label in list_of_posts:
        tokens = text.split()
        for token in tokens:
            somedict[token] = somedict.get(token, 0) + 1
    return somedict

training = [("This is post 1.", "POS"),
            ("This is post 2.", "NEG"),
            ("this is another post.", "POS")]
my_function(training)
i. State the output when this code is run. [5 marks]
ii. Explain how the code produces this output. [5 marks]
(b) You collect 100,000 posts which have been labelled as positive or negative by humans and divide them into a training and testing set with 50,000 posts in each.
i. You run my_function() on your training set. In order to visualise the word type frequency distribution, what do you need to do with the output of my_function()? [4 marks]
ii. Sketch the shape of the word type frequency distribution you would expect to see. [4 marks]
iii. Explain why this shape of distribution is a problem for NLP systems. [4 marks]
iv. Explain 4 modifications or pre-processing techniques you could introduce to my_function() which would alleviate this problem to some extent. Include examples to illustrate your answer. Which do you think would be the most effective and why? [10 marks]
(c) You decide to build a Naive Bayes classifier.
i. What probabilities need to be estimated based on the training data? Why? [4 marks]
ii. Explain how to calculate these from the training data. [4 marks]
iii. Explain how and why you would smooth the probabilities. [5 marks]
(d) You test your classifier on the testing set and its accuracy is 90%. Give 2 possible different conclusions you could reach from this evidence.
Explain how an evaluation based on precision and recall could help you come to the correct conclusion. [5 marks]
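As an illustration of the measures named in part (d), the following is a minimal sketch of how per-class precision and recall can be computed alongside accuracy. The function name `precision_recall`, the 90/10 class split and the always-positive classifier are hypothetical choices made for this sketch; they are not taken from the question. Precision and recall are set to 0.0 when their denominator is zero, which is one common convention.

```python
def precision_recall(gold, predicted, target):
    """Return (precision, recall) for the class `target`."""
    pairs = list(zip(gold, predicted))
    tp = sum(1 for g, p in pairs if g == target and p == target)
    fp = sum(1 for g, p in pairs if g != target and p == target)
    fn = sum(1 for g, p in pairs if g == target and p != target)
    # Convention assumed here: an undefined ratio (0/0) is reported as 0.0.
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


# Hypothetical, made-up test set: 90 POS posts and 10 NEG posts,
# scored against a classifier that always predicts POS.
gold = ["POS"] * 90 + ["NEG"] * 10
predicted = ["POS"] * 100

accuracy = sum(g == p for g, p in zip(gold, predicted)) / len(gold)
pos_scores = precision_recall(gold, predicted, "POS")
neg_scores = precision_recall(gold, predicted, "NEG")

print("accuracy:", accuracy)
print("POS (precision, recall):", pos_scores)
print("NEG (precision, recall):", neg_scores)
```

In this made-up setting the accuracy figure alone does not reveal how the classifier behaves on the minority class, whereas the per-class scores do.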
2. (a) Describe 2 applications in NLP where it could be useful to be able to compute the similarity between pairs of documents. [5 marks]
(b) Consider Table 1, which gives the frequencies of 5 potential word features in a very small corpus of 5 documents. Assume that there are no other words and no other documents in the corpus.
Word Feature | Doc A | Doc B | Doc C | Doc D | Doc E
bridge       |   0   |   1   |   2   |   1   |   0
capital      |   1   |   1   |   0   |   0   |   0
the          |   5   |   4   |   3   |   4   |   4
ancient      |   0   |   1   |   5   |   3   |   3
inhospitable |   1   |   0   |   0   |   0   |   1
Table 1: Frequencies of 5 Word Features in 5 Documents
i. For documents A and B compute the tf-idf score associated with each of the 5 word features given. [10 marks]
ii. For documents A and B compute the positive pointwise mutual information (PPMI) between the document and each of the 5 word features given. [10 marks]
iii. Why is a representation based on tf-idf or PPMI better than one based on raw frequency when considering document similarity? [5 marks]
(c) Consider the following snippet of code.
def afunc(docA, docB):
    the_sum = 0
    for (key, value) in docA.items():
        the_sum += value * docB.get(key, 0)
    return the_sum
Assuming that the input to afunc() consists of two dictionaries, storing the frequency-based representations of Doc A and Doc B given in Table 1:
i. What would be the output? [5 marks]
ii. Explain how the output is produced by the code. [5 marks]
iii. Explain how this is related to the similarity of 2 documents in the vector space. [5 marks]
(d) Give an example of how variation and ambiguity each might cause unexpected results when calculating document similarity in this way. [5 marks]
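For reference on the vector-space view mentioned in part (c)iii, the following is a minimal sketch of cosine similarity over sparse dictionary vectors. The function names `dot` and `cosine` and the three small example documents are hypothetical and are not taken from Table 1. Returning 0.0 for an all-zero vector is a convention assumed for this sketch.

```python
import math


def dot(vec_a, vec_b):
    """Dot product of two sparse vectors stored as {feature: weight} dicts."""
    return sum(value * vec_b.get(key, 0) for key, value in vec_a.items())


def cosine(vec_a, vec_b):
    """Cosine of the angle between two sparse vectors."""
    norm_a = math.sqrt(dot(vec_a, vec_a))
    norm_b = math.sqrt(dot(vec_b, vec_b))
    if norm_a == 0 or norm_b == 0:
        # Assumed convention: similarity with an empty vector is 0.0.
        return 0.0
    return dot(vec_a, vec_b) / (norm_a * norm_b)


# Hypothetical documents, made up for illustration only.
doc_x = {"river": 3, "bank": 4}
doc_y = {"river": 6, "bank": 8}   # same direction as doc_x, twice the length
doc_z = {"money": 2}              # shares no features with doc_x

print(dot(doc_x, doc_y))     # unnormalised: grows with document length
print(cosine(doc_x, doc_y))  # normalised: depends only on direction
print(cosine(doc_x, doc_z))
```

Dividing by the two vector lengths removes the effect of document length, so two documents with the same proportions of features score the same regardless of how long they are.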
3. Consider the following document snippet, taken from Wikipedia:
On 28 June 1999, Jack Ma and 17 friends and students founded Alibaba.com, a China-based B2B marketplace site, in his Hangzhou apartment. In October 1999, Alibaba received a US$25 million investment from Goldman Sachs and SoftBank. Alibaba.com was expected to improve the domestic e-commerce market and perfect an e-commerce platform for Chinese enterprises, especially small and medium-sized enterprises (SMEs), to help export Chinese products to the global market as well as address World Trade Organization (WTO) challenges. In 2002, Alibaba.com became profitable three years after launch. Ma wanted to improve the global e-commerce system, so from 2003 onward, Alibaba launched Taobao Marketplace, Alipay, Alimama.com, and Lynx.
(a) Identify 4 different types of named entity mentioned in the text. Give at least one example of each. [8 marks]
(b) What is the difference between a system which carries out coreference resolution and a system which carries out named entity linking? [5 marks]
(c) Give 2 examples of variation and 2 examples of ambiguity for named entities. Explain why variation and ambiguity are problems for named entity linking. [8 marks]
(d) Outline a plan for building a named entity linking system. How would you account for variation and ambiguity? [12 marks]
(e) For each of the following applications, would Named Entity Recognition and Named Entity Linking components be useful? Justify your answers. [12 marks]
i. Information-Retrieval based Factoid Question Answering system.
ii. Knowledge-based Factoid Question Answering system.
iii. Automatic Speech Transcription.
iv. Machine Translation.
(f) Describe 2 additional challenges for Named Entity Recognition in the context of 1 or more of the applications from question 3e. [5 marks]