Integrating the functions above:
```python
def to_words(sentence, words):
    return list(map(lambda x: words[x], sentence))
```
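to_words() simply maps a sequence of word ids back to the corresponding tokens, which is handy for eyeballing generated text later. A minimal usage sketch (the toy vocabulary below is hypothetical, not taken from PTB):

```python
# Hypothetical id-to-word list, for illustration only.
words = ['the', 'cat', 'sat']
print(to_words([0, 1, 2, 1], words))  # ['the', 'cat', 'sat', 'cat']
```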
The part above is fairly similar to the official example. The processing from here on differs considerably from the official code; it mainly follows the way the Keras example scripts handle text:
```python
# Assumes `import os` plus the _build_vocab() and _read_words()
# helpers defined earlier in this article.
def ptb_raw_data(data_path=None):
    train_path = os.path.join(data_path, 'ptb.train.txt')
    valid_path = os.path.join(data_path, 'ptb.valid.txt')
    test_path = os.path.join(data_path, 'ptb.test.txt')

    # The vocabulary is built from the training set only; all three
    # splits are then converted to word ids with that vocabulary.
    words, word_to_id = _build_vocab(train_path)
    train_data = _file_to_word_ids(train_path, word_to_id)
    valid_data = _file_to_word_ids(valid_path, word_to_id)
    test_data = _file_to_word_ids(test_path, word_to_id)
    return train_data, valid_data, test_data, words, word_to_id

def _file_to_word_ids(filename, word_to_id):
    data = _read_words(filename)
    # Out-of-vocabulary words are silently dropped.
    return [word_to_id[x] for x in data if x in word_to_id]
```
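A minimal sketch of calling the loader, assuming the three ptb.*.txt files sit under ./data (the path and the print statements are illustrative, not part of the original script):

```python
train_data, valid_data, test_data, words, word_to_id = ptb_raw_data('./data')
print(len(train_data))   # total word ids in the training split
print(len(words))        # vocabulary size
print(to_words(train_data[:10], words))  # first ten tokens decoded back to text
```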
Parameter breakdown (each argument is explained after the code):
```python
import numpy as np

def ptb_producer(raw_data, batch_size=64, num_steps=20, stride=1):
    data_len = len(raw_data)
    sentences = []
    next_words = []
    # Slide a window of num_steps ids over the data; the word right
    # after each window becomes its prediction target.
    for i in range(0, data_len - num_steps, stride):
        sentences.append(raw_data[i:(i + num_steps)])
        next_words.append(raw_data[i + num_steps])
    sentences = np.array(sentences)
    next_words = np.array(next_words)

    # Keep only a whole number of batches, then reshape.
    batch_len = len(sentences) // batch_size
    x = np.reshape(sentences[:(batch_len * batch_size)],
                   [batch_len, batch_size, -1])
    y = np.reshape(next_words[:(batch_len * batch_size)],
                   [batch_len, batch_size])
    return x, y
```
- raw_data: the data produced by the ptb_raw_data() function.
- batch_size: the network is trained with stochastic gradient descent, so the data is fed in batches; this sets the number of samples per batch.
- num_steps: the length of each sentence, i.e. the n described earlier; in a recurrent neural network this is also called the sequence (time-step) length.
- stride: the step size used when sliding the window over the data, which determines how many samples are generated (a shape sketch follows this list).
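To make the resulting shapes concrete, a small sketch on synthetic ids (the numbers are illustrative only):

```python
# 1000 fake word ids standing in for a PTB split.
raw_data = list(range(1000))
x, y = ptb_producer(raw_data, batch_size=64, num_steps=20, stride=1)

print(x.shape)  # (15, 64, 20): 15 batches of 64 windows, 20 steps each
print(y.shape)  # (15, 64): the id of the word that follows each window

# Each target is the word immediately after its window:
assert y[0, 0] == raw_data[20]
```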