
100 NLP Interview Questions (Big Tech and Foreign Companies)

CLASSIC NLP

TF-IDF & ML (8)
  1. Write TF-IDF from scratch. (A minimal sketch follows this list.)

  2. What is normalization in TF-IDF?

  3. Why do you still need to know about TF-IDF today, and how can you use it in complex models?

  4. Explain how Naive Bayes works. What can you use it for?

  5. How can an SVM be prone to overfitting?

  6. Explain possible methods for text preprocessing (lemmatization and stemming). What algorithms do you know for this, and in what cases would you use them?

  7. What metrics for text similarity do you know?

  8. Explain the difference between cosine similarity and cosine distance. Which of these values can be negative? How would you use them?
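
For question 1 (TF-IDF from scratch), a minimal sketch in plain Python. The smoothed IDF and L2 normalization mirror common library defaults, but the exact variant is a choice; the toy corpus is only for illustration.

```python
import math
from collections import Counter

def tfidf(corpus):
    """Compute L2-normalized TF-IDF vectors for whitespace-tokenized documents."""
    docs = [doc.lower().split() for doc in corpus]
    vocab = sorted({tok for doc in docs for tok in doc})
    n_docs = len(docs)
    # document frequency: in how many documents each term appears
    df = Counter(tok for doc in docs for tok in set(doc))
    # smoothed inverse document frequency
    idf = {tok: math.log((1 + n_docs) / (1 + df[tok])) + 1 for tok in vocab}

    vectors = []
    for doc in docs:
        counts = Counter(doc)
        # term frequency: raw count divided by document length
        tf = {tok: counts[tok] / len(doc) for tok in counts}
        vec = [tf.get(tok, 0.0) * idf[tok] for tok in vocab]
        # L2 normalization makes documents of different lengths comparable
        norm = math.sqrt(sum(v * v for v in vec)) or 1.0
        vectors.append([v / norm for v in vec])
    return vocab, vectors

vocab, vectors = tfidf(["the cat sat on the mat", "the dog sat on the log"])
print(vocab)
print(vectors[0])
```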

METRICS (7)
  1. Explain precision and recall in simple words. What would you look at in the absence of the F1 score? (A short computation sketch follows this list.)

  2. In what case would you observe changes in specificity?

  3. When would you look at macro, and when at micro metrics? Why does the weighted metric exist?

  4. What is perplexity? What can we use it to evaluate?

  5. What is the BLEU metric?

  6. Explain the difference between the different types of ROUGE metrics.

  7. What is the difference between BLEU and ROUGE?
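
For question 1 (precision and recall), a tiny worked example; the confusion-matrix counts are made up for illustration.

```python
# Hypothetical confusion-matrix counts for a binary classifier
tp, fp, fn, tn = 80, 20, 40, 860

precision = tp / (tp + fp)        # of everything we flagged, how much was correct
recall = tp / (tp + fn)           # of everything relevant, how much we found
specificity = tn / (tn + fp)      # of all negatives, how many we correctly left alone
f1 = 2 * precision * recall / (precision + recall)

print(f"precision={precision:.2f} recall={recall:.2f} "
      f"specificity={specificity:.2f} f1={f1:.2f}")
```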

WORD2VEC (9)
  1. Explain how Word2Vec learns. What is the loss function? What is maximized?

  2. What methods of obtaining embeddings do you know? When will each be better?

  3. What is the difference between static and contextual embeddings?

  4. What are the two main Word2Vec architectures you know, and which one trains faster?

  5. What is the difference between GloVe, ELMo, FastText, and Word2Vec?

  6. What is negative sampling and why is it needed? What other tricks for Word2Vec do you know, and how can you apply them? (A negative-sampling sketch follows this list.)

  7. What are dense and sparse embeddings? Provide examples.

  8. Why might the dimensionality of embeddings be important?

  9. What problems can arise when training Word2Vec on short textual data, and how can you deal with them?
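
For question 6 (negative sampling), a numpy sketch of the skip-gram negative-sampling loss for a single (center, context) pair. The vocabulary size, dimensionality, word ids, and number of negatives are arbitrary placeholders; a real trainer would also sample negatives from a smoothed unigram distribution and update the vectors by gradient descent.

```python
import numpy as np

rng = np.random.default_rng(0)
vocab_size, dim = 1000, 50

# two embedding tables, as in Word2Vec: one for center words, one for context words
W_in = rng.normal(scale=0.1, size=(vocab_size, dim))
W_out = rng.normal(scale=0.1, size=(vocab_size, dim))

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def sgns_loss(center_id, context_id, negative_ids):
    """Skip-gram with negative sampling: pull the true pair together,
    push k randomly sampled 'negative' pairs apart."""
    v = W_in[center_id]                # center word vector
    u_pos = W_out[context_id]          # true context vector
    u_neg = W_out[negative_ids]        # (k, dim) sampled negatives
    pos_term = -np.log(sigmoid(u_pos @ v))
    neg_term = -np.log(sigmoid(-(u_neg @ v))).sum()
    return pos_term + neg_term

negatives = rng.integers(0, vocab_size, size=5)   # k = 5 negative samples
print(sgns_loss(center_id=3, context_id=17, negative_ids=negatives))
```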

RNN & CNN (7)
  1. How many trainable parameters are there in a simple 1-layer RNN? (A worked count follows this list.)

  2. How does RNN training occur?

  3. What problems exist in RNNs?

  4. What types of RNNs do you know? Explain the difference between GRU and LSTM.

  5. What parameters can we tune in such networks? (Stacking, number of layers)

  6. What are vanishing gradients in RNNs? How do you solve this problem?

  7. Why use a Convolutional Neural Network in NLP, and how can you use it? How does a CNN compare with the attention paradigm?
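
For question 1 (parameter count), a simple 1-layer Elman RNN with input size D and hidden size H has H·D + H·H + 2·H trainable parameters (input weights, recurrent weights, and two bias vectors in PyTorch's layout). A quick check, with D and H chosen arbitrarily:

```python
import torch.nn as nn

D, H = 300, 128                      # example input and hidden sizes
rnn = nn.RNN(input_size=D, hidden_size=H, num_layers=1)

expected = H * D + H * H + 2 * H     # W_ih + W_hh + b_ih + b_hh
actual = sum(p.numel() for p in rnn.parameters())
print(expected, actual)              # both 55040 for D=300, H=128
```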

NLP and TRANSFORMERS

ATTENTION AND TRANSFORMER ARCHITECTURE (15)
  1. How do you compute attention? (Follow-up: for what task was it originally proposed, and why?)

  2. What is the computational complexity of attention? Compare it with the complexity of an RNN.

  3. Compare RNN and attention. In what cases would you use attention, and when an RNN?

  4. Write attention from scratch. (A minimal sketch follows this list.)

  5. Explain masking in attention.

  6. What is the dimensionality of the self-attention matrix?

  7. What is the difference between BERT and GPT in terms of attention calculation?

  8. What is the dimensionality of the embedding layer in the transformer?

  9. Why are embeddings called contextual? How does it work?

  10. What is used in transformers, layer norm or batch norm, and why?

  11. Why do transformers have PreNorm and PostNorm?

  12. Explain the difference between soft and hard (local/global) attention.

  13. Explain multihead attention.

  14. What other types of attention mechanisms do you know? What are the purposes of these modifications?

  15. How does self-attention become more complex with an increase in the number of heads?
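
For question 4 (attention from scratch), a minimal scaled dot-product attention in numpy with an optional mask; the sequence length, dimensionality, and causal mask are illustrative choices.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)   # numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def scaled_dot_product_attention(Q, K, V, mask=None):
    """Attention(Q, K, V) = softmax(Q K^T / sqrt(d_k)) V."""
    d_k = Q.shape[-1]
    scores = Q @ K.swapaxes(-1, -2) / np.sqrt(d_k)   # (..., seq_q, seq_k)
    if mask is not None:
        scores = np.where(mask, scores, -1e9)        # masked positions get ~-inf
    weights = softmax(scores, axis=-1)
    return weights @ V, weights

seq, d = 4, 8
rng = np.random.default_rng(0)
Q, K, V = (rng.normal(size=(seq, d)) for _ in range(3))
causal = np.tril(np.ones((seq, seq), dtype=bool))    # causal (decoder-style) mask
out, w = scaled_dot_product_attention(Q, K, V, mask=causal)
print(out.shape, w.round(2))
```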

TRANSFORMER MODEL TYPES (7)

  1. Why does BERT largely lag behind RoBERTa, and what can you take from RoBERTa?

  2. What are T5 and BART models? How do they differ?

  3. What are task-agnostic models? Provide examples.

  4. Explain transformer models by comparing BERT, GPT, and T5.

  5. What major problem exists in BERT, GPT, etc., regarding model knowledge? How can this be addressed?

  6. How does a decoder model like GPT work during training and inference? What is the difference?

  7. Explain the difference between heads and layers in transformer models.

POSITIONAL ENCODING (6)

  1. Why is information about positions lost in embeddings of transformer models with attention?

  2. Explain approaches to positional embeddings and their pros and cons. (A sinusoidal-encoding sketch follows this list.)

  3. Why can’t we simply add the token’s position index to its embedding?

  4. Why don’t we train positional embeddings?

  5. What is relative and absolute positional encoding?

  6. Explain in detail the working principle of rotary positional embeddings.
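
For question 2 (approaches to positional embeddings), a numpy sketch of the fixed sinusoidal (absolute) encoding from the original Transformer paper; the maximum length and model dimension are arbitrary.

```python
import numpy as np

def sinusoidal_positional_encoding(max_len, d_model):
    """PE[pos, 2i] = sin(pos / 10000^(2i/d_model)), PE[pos, 2i+1] = cos(...)."""
    positions = np.arange(max_len)[:, None]                      # (max_len, 1)
    div_terms = np.exp(np.arange(0, d_model, 2) * (-np.log(10000.0) / d_model))
    pe = np.zeros((max_len, d_model))
    pe[:, 0::2] = np.sin(positions * div_terms)
    pe[:, 1::2] = np.cos(positions * div_terms)
    return pe

pe = sinusoidal_positional_encoding(max_len=128, d_model=64)
print(pe.shape)   # (128, 64); added to token embeddings before the first layer
```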

PRETRAINING (4)
  1. How does causal language modeling work? (A sketch of the shifted-label loss follows this list.)

  2. When do we use a pretrained model?

  3. How to train a transformer from scratch? Explain your pipeline, and in what cases would you do this?

  4. What models, besides BERT and GPT, do you know for various pretraining tasks?
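
For question 1 (causal language modeling), the objective is to predict token t+1 from tokens up to t, which in code is a one-position shift between logits and labels. A PyTorch sketch with random logits standing in for a model's output:

```python
import torch
import torch.nn.functional as F

vocab_size, seq_len, batch = 100, 8, 2
input_ids = torch.randint(0, vocab_size, (batch, seq_len))
logits = torch.randn(batch, seq_len, vocab_size)   # stand-in for model(input_ids)

# shift: the prediction at position t is scored against the token at position t+1
shift_logits = logits[:, :-1, :].reshape(-1, vocab_size)
shift_labels = input_ids[:, 1:].reshape(-1)
loss = F.cross_entropy(shift_logits, shift_labels)
print(loss.item())
```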

TOKENIZERS (9)
  1. What types of tokenizers do you know? Compare them.

  2. Can you extend a tokenizer? If yes, in what case would you do this? When would you retrain a tokenizer? What needs to be done when adding new tokens?

  3. How do regular tokens differ from special tokens?

  4. Why is lemmatization not used in transformers? And why do we need tokens?

  5. How is a tokenizer trained? Explain with examples of WordPiece and BPE. (A BPE merge sketch follows this list.)

  6. What position does the CLS vector occupy? Why?

  7. What tokenizer is used in BERT, and which one in GPT?

  8. Explain how modern tokenizers handle out-of-vocabulary words?

  9. What does the tokenizer vocab size affect? How will you choose it in the case of new training?
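
For question 5 (how BPE is trained), a compact pure-Python sketch of the merge loop: count adjacent symbol pairs over the corpus and repeatedly merge the most frequent one. The toy word frequencies and the number of merges are arbitrary, and real implementations merge symbol-wise rather than via string replacement.

```python
from collections import Counter

def get_pair_counts(word_freqs):
    pairs = Counter()
    for word, freq in word_freqs.items():
        symbols = word.split()
        for a, b in zip(symbols, symbols[1:]):
            pairs[(a, b)] += freq
    return pairs

def merge_pair(pair, word_freqs):
    a, b = pair
    return {word.replace(f"{a} {b}", f"{a}{b}"): freq
            for word, freq in word_freqs.items()}

# words as space-separated characters with an end-of-word marker
word_freqs = {"l o w </w>": 5, "l o w e r </w>": 2,
              "n e w e s t </w>": 6, "w i d e s t </w>": 3}

merges = []
for _ in range(10):                       # number of merges ~ target vocab growth
    pairs = get_pair_counts(word_freqs)
    if not pairs:
        break
    best = max(pairs, key=pairs.get)      # most frequent adjacent pair
    word_freqs = merge_pair(best, word_freqs)
    merges.append(best)

print(merges)   # learned merge rules, e.g. ('e', 's'), ('es', 't'), ...
```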

TRAINING (14)
  1. What is class imbalance? How can it be identified? Name all approaches to solving this problem.

  2. Can dropout be used during inference, and why?

  3. What is the difference between the Adam optimizer and AdamW?

  4. How do consumed resources change with gradient accumulation? (A sketch follows this list.)

  5. How to optimize resource consumption during training?

  6. What ways of distributed training do you know?

  7. What is textual augmentation? Name all methods you know.

  8. Why is padding used less frequently now? What is done instead?

  9. Explain how learning-rate warm-up works.

  10. Explain the concept of gradient clipping.

  11. How does teacher forcing work? Provide examples.

  12. Why and how should skip connections be used?

  13. What are adapters? Where and how can we use them?

  14. Explain the concept of metric learning. What approaches do you know?
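
For question 4 (gradient accumulation), a PyTorch sketch: gradients from several micro-batches are summed in .grad before one optimizer step, so the effective batch size grows while peak activation memory stays at the micro-batch level. The model, data, and step counts are toy placeholders.

```python
import torch
from torch import nn

model = nn.Linear(10, 2)                      # toy stand-in for a real network
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3)
loss_fn = nn.CrossEntropyLoss()
accum_steps = 4                               # effective batch = 4 micro-batches

optimizer.zero_grad()
for step in range(16):
    x = torch.randn(8, 10)                    # micro-batch of 8 samples
    y = torch.randint(0, 2, (8,))
    loss = loss_fn(model(x), y) / accum_steps # scale so the accumulated sum averages
    loss.backward()                           # gradients add up in .grad
    if (step + 1) % accum_steps == 0:
        optimizer.step()
        optimizer.zero_grad()
```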

INFERENCE (4)
  1. What does the temperature in softmax control? What value would you set?

  2. Explain the types of sampling used in generation: top-k and top-p (nucleus) sampling. (A sketch follows this list.)

  3. What is the complexity of beam search, and how does it work?

  4. What is sentence embedding? What are the ways you can obtain it?
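
For questions 1 and 2 (temperature, top-k, and top-p/nucleus sampling), a numpy sketch applied to a single logits vector; the logits and cut-off values are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(x):
    x = x - x.max()
    e = np.exp(x)
    return e / e.sum()

def sample(logits, temperature=1.0, top_k=None, top_p=None):
    """Temperature sharpens/flattens the distribution; top-k keeps the k most
    likely tokens; top-p (nucleus) keeps the smallest set whose mass >= p."""
    logits = np.asarray(logits, dtype=float) / temperature
    probs = softmax(logits)
    if top_k is not None:
        cutoff = np.sort(probs)[-top_k]
        probs = np.where(probs >= cutoff, probs, 0.0)
    if top_p is not None:
        order = np.argsort(probs)[::-1]
        cum = np.cumsum(probs[order])
        keep = order[: int(np.searchsorted(cum, top_p) + 1)]
        mask = np.zeros_like(probs)
        mask[keep] = 1.0
        probs = probs * mask
    probs = probs / probs.sum()          # renormalize after truncation
    return rng.choice(len(probs), p=probs)

logits = [2.0, 1.0, 0.5, 0.1, -1.0]
print(sample(logits, temperature=0.7, top_k=3))
print(sample(logits, temperature=1.0, top_p=0.9))
```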

LLM (10)
  1. How does LoRA work? How would you choose parameters? Imagine that we want to fine-tune a large language model and apply LoRA with a small rank r, but the model still doesn’t fit in memory. What else can be done? (A sketch follows this list.)

  2. What is the difference between prefix tuning, p-tuning, and prompt tuning?

  3. Explain the scaling law.

  4. Explain all stages of LLM training. From which stages can we abstain, and in what cases?

  5. How does RAG work? How does it differ from few-shot KNN?

  6. What quantization methods do you know? Can we fine-tune quantized models?

  7. How can you prevent catastrophic forgetting in LLM?

  8. Explain the working principle of the KV cache, Grouped-Query Attention, and Multi-Query Attention.

  9. Explain the technology behind Mixtral. What are its pros and cons?

  10. How are you? How are things going?
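
For question 1 (LoRA), a minimal PyTorch sketch of a LoRA-wrapped linear layer: the pretrained weight is frozen and only a low-rank update B·A, scaled by alpha/r, is trained. The layer sizes, rank, and alpha are placeholder choices. For the memory follow-up in that question, common directions are quantizing the frozen base weights (QLoRA-style), gradient checkpointing, smaller micro-batches with gradient accumulation, or offloading.

```python
import torch
from torch import nn

class LoRALinear(nn.Module):
    def __init__(self, base: nn.Linear, r: int = 8, alpha: int = 16):
        super().__init__()
        self.base = base
        self.base.weight.requires_grad_(False)        # freeze pretrained weights
        if self.base.bias is not None:
            self.base.bias.requires_grad_(False)
        self.lora_A = nn.Parameter(torch.randn(r, base.in_features) * 0.01)
        self.lora_B = nn.Parameter(torch.zeros(base.out_features, r))  # zero init: no change at start
        self.scaling = alpha / r

    def forward(self, x):
        # frozen path + trainable low-rank update
        return self.base(x) + (x @ self.lora_A.T @ self.lora_B.T) * self.scaling

layer = LoRALinear(nn.Linear(768, 768), r=8, alpha=16)
trainable = sum(p.numel() for p in layer.parameters() if p.requires_grad)
print(trainable)   # 2 * 8 * 768 = 12288 trainable params vs the frozen 768*768 + 768
```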


Original article: https://blog.csdn.net/longwo888/article/details/137693278
