Datasets for Data Mining, Analytics and Knowledge Discovery. Any paid dataset or resource must be marked as such in the title with [PAID].

We collected receipts to build a corpus of genuine and anonymized documents as a benchmark for evaluating fraud detection approaches. The dataset currently comprises 1,969 receipt images, each with its associated OCR result; 250 of them have been altered.

In the OPUS project we convert and align free online data, add linguistic annotation, and provide the community with a publicly available parallel corpus. OPUS is built on open-source tools, and the corpus is also delivered as an open-content package.

The Hindustani Music Rhythm Dataset is a sub-collection of 151 excerpts (5 hours) in four taals of Hindustani music, with audio, taal-related metadata, and time-aligned markers indicating progression through the taal cycles. It is useful as a test corpus for many automatic rhythm analysis tasks in Hindustani music.
Ironic Corpus: 1,950 sentences labeled for ironic content.

```r
library(data.table)

# Create the feature data set for training
processed_data <- as.data.table(as.matrix(new_docterm_corpus))

# Combine the document-term features with the id and target columns
data_one <- cbind(data.table(listing_id = tdata$listing_id,
                             interest_level = tdata$interest_level),
                  processed_data)

# Merge the remaining features by listing_id
data_one <- fdata[data_one, on = "listing_id"]

# Split the data set into train ...
```
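The snippet above binds a document-term matrix to the id and target columns, then merges additional features on `listing_id`. A rough Python equivalent of the same steps, with made-up toy frames standing in for `tdata` and `fdata` (everything here is invented for illustration):

```python
from collections import Counter
import pandas as pd

# Toy stand-ins for tdata (text + target) and fdata (extra features)
tdata = pd.DataFrame({
    "listing_id": [1, 2, 3],
    "interest_level": ["low", "high", "medium"],
    "description": ["cozy studio", "bright loft", "cozy bright flat"],
})
fdata = pd.DataFrame({"listing_id": [1, 2, 3], "price": [1500, 2400, 1900]})

# Document-term matrix: one row per listing, one column per vocabulary term
counts = [Counter(doc.split()) for doc in tdata["description"]]
vocab = sorted(set().union(*counts))
dtm = pd.DataFrame([[c[t] for t in vocab] for c in counts], columns=vocab)

# Bind the id/target columns to the term counts, then merge the extra features
data_one = pd.concat([tdata[["listing_id", "interest_level"]], dtm], axis=1)
data_one = data_one.merge(fdata, on="listing_id")
print(data_one.shape)  # (3, 8)
```

A real pipeline would use a trained tokenizer and a sparse matrix instead of dense `Counter` rows, but the bind-then-merge shape of the data is the same.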
The dataset is divided into three disjoint sets: a balanced evaluation set, a balanced training set, and an unbalanced training set. In the balanced evaluation and training sets, we strived for each...

The raw data were extracted from curated data in the CTD-Pfizer collaboration with document-level annotations. The NLM-Chem corpus is a manually annotated full-text resource on...

The LDC-IL Telugu Speech data set consists of several types of data: word lists, sentences, running texts, and date formats. Each speaker recorded these datasets, which are randomly selected from a master dataset. Speech is in .wav format and metadata is in .txt format.

pseudogen: SMT-based pseudo-code generator (unavailable now). Papers: Methods - IEEE/ACM ASE 2015 (PDF); Software - IEEE/ACM ASE 2015 (PDF).

BioCreative corpus: dataset produced by the BioCreative assessment; text passages relevant for GO annotations of human proteins. GENIA corpus: annotated corpus of literature related to the MeSH terms Human, Blood Cells, and Transcription Factors. Yapex corpus: training and test data for the YAPEX protein tagger (NER).

This corpus contains the documents used for training and testing our company-focused named entity recognition system. It contains records for 1,000 documents in JSON format, structured as follows for each article: annotations - the companies we annotated within the article; url - the URL where the article can be found.

Within this framework, we present a series of experiments on different corpus-level recognition datasets. The team uses a Convolutional Neural Network (CNN) to perform semantic segmentation of a speech signal. Compared with previous methods, the proposed approach achieves better performance on both test datasets.
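The company-NER corpus above describes per-article JSON records with `annotations` and `url` fields. Reading such a record could be sketched as follows; the sample record itself is fabricated, with only the two field names taken from the description:

```python
import json

# One made-up record following the structure described above:
# "annotations" holds the companies marked in the article, "url" its source.
record_json = '''
{
  "annotations": [
    {"text": "Acme Corp"},
    {"text": "Globex"}
  ],
  "url": "https://example.com/article-1"
}
'''

record = json.loads(record_json)
companies = [a["text"] for a in record["annotations"]]
print(companies)      # ['Acme Corp', 'Globex']
print(record["url"])  # https://example.com/article-1
```

The exact annotation sub-fields (offsets, normalized names, etc.) would depend on the corpus release and are not specified in the description above.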
The Enron Email Corpus is a massive dataset containing roughly 500,000 messages from senior management executives at the Enron Corporation. Enron was a large American corporation investigated by the Federal Energy Regulatory Commission (FERC) in 2001 following its rather spectacular bankruptcy and dissolution.
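The Enron messages are distributed as plain RFC 822 text files, which Python's standard `email` module can parse directly. A minimal sketch, with a fabricated message standing in for a real corpus file:

```python
from email import message_from_string

# A fabricated message in the plain RFC 822 format the corpus files use.
raw = """\
From: jane.doe@enron.com
To: john.smith@enron.com
Subject: Q3 forecast
Date: Mon, 14 May 2001 09:30:00 -0700

Please send the updated Q3 numbers before the board call.
"""

msg = message_from_string(raw)
print(msg["From"])               # jane.doe@enron.com
print(msg["Subject"])            # Q3 forecast
print(msg.get_payload().strip()) # prints the message body
```

For the real corpus one would walk the maildir-style folder tree and feed each file to `email.message_from_file` instead.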
What's in a dataset? In NLP, both corpora and models are typically the result of a longer pipeline, which includes steps like crawling (a specific website or database), text filtering (removing boilerplate, document subsampling), and preprocessing (text normalization, encoding, tokenization).

Corpus of Music Listening Events for Music Recommendation: this page hosts the LFM-1b dataset of more than one billion listening events, intended for various music retrieval and recommendation tasks.

Semantic Scholar API, Open Research Corpus, Supp.ai Dataset, CORD-19 Dataset: Semantic Scholar provides a RESTful API for convenient linking to Semantic Scholar pages and for pulling information about individual records on demand (subject to the dataset license agreement).

We introduce the Spotify Podcast Dataset, a new corpus of 100,000 podcasts. We demonstrate the complexity of the domain with a case study of two tasks: (1) passage search and (2) summarization. This corpus is orders of magnitude larger than previous speech corpora used for search and summarization.
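The filtering/normalization/tokenization pipeline mentioned above can be sketched in a few lines. This is a deliberately toy version: real pipelines use trained tokenizers and far more elaborate filters, and the threshold below is an arbitrary assumption.

```python
import re
import unicodedata

def preprocess(doc, min_tokens=3):
    """Toy pipeline: normalize text, tokenize, and filter short documents."""
    # Text normalization: unify Unicode forms, collapse whitespace
    doc = unicodedata.normalize("NFKC", doc)
    doc = re.sub(r"\s+", " ", doc).strip()
    # Tokenization: crude lowercase word split
    tokens = re.findall(r"[a-z0-9]+", doc.lower())
    # Document filtering: drop near-empty documents
    return tokens if len(tokens) >= min_tokens else None

print(preprocess("  The  quick\u00a0brown fox. "))  # ['the', 'quick', 'brown', 'fox']
print(preprocess("Hi"))                             # None
```

Each stage here mirrors one step of the pipeline described above; in practice each would be a separate, configurable component.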
This corpus is significantly larger than the corpus described in our EMNLP 2017 paper. The dataset is made available for research, teaching, and scholarship purposes only, with further parameters in the spirit of a Creative Commons Attribution-NonCommercial license. Contact Prof. Norman Sadeh with any questions.

MDT-ASR-D003 Bahasa Indonesia Speech Corpus: read speech, indoor environments, mobile; consumer robot controls, security and authentication; Bahasa Indonesia.

Sogou news corpus: a combination of the SogouCA and SogouCS news corpora, containing in total 2,909,551 news articles in various topic channels. We labeled each piece of news using its URL, by manually classifying the domain names. This gives a large corpus of news articles labeled with their categories.

There are errors in the Tatoeba Corpus. To minimize their number, I only used sentences owned by identified native speakers working on the Tatoeba Project, plus English sentences that I personally checked and did not reject.
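Labeling articles by the domain of their URL, as described for the Sogou corpus, can be sketched like this. The domain-to-category mapping below is invented for illustration and is not the real Sogou channel map:

```python
from urllib.parse import urlparse

# Hypothetical mapping from URL host to topic channel (not the real Sogou map)
DOMAIN_TO_CATEGORY = {
    "sports.sohu.com": "sports",
    "business.sohu.com": "finance",
    "it.sohu.com": "technology",
}

def label_by_url(url):
    """Assign a category based on the article URL's host name."""
    host = urlparse(url).netloc
    return DOMAIN_TO_CATEGORY.get(host, "unknown")

print(label_by_url("http://sports.sohu.com/some-article.shtml"))  # sports
print(label_by_url("http://news.example.com/x"))                  # unknown
```

The manual step described above corresponds to building the mapping table once; labeling the 2.9 million articles is then a single lookup per URL.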
In the GAP corpus, the meetings were recorded with a portable audio recorder placed at the center of the group, with a webcam in front of each participant to capture a frontal upper-body view. The publicly available dataset from this corpus contains audio recordings, meeting transcripts, and post-task questionnaires.

The detailed motion-capture information, the interactive setting used to elicit authentic emotions, and the size of the database make this corpus a valuable addition to the existing databases in the community for the study and modeling of multimodal, expressive human communication.

Color Reference (English). Size: 948 games; 53,365 utterances. Description: players saw three color swatches. Trials were split evenly among three conditions manipulating the context to give rise to different pragmatic language use.
The likability database is a subset of the AGender database, both produced at the Telekom Innovation Laboratories. From the AGender data, which contains sentences from German speakers recorded over the telephone and distributed equally across seven age-gender groups, 800 utterances were taken...