THE GDELT EVENT DATABASE
DATA FORMAT CODEBOOK V2.0
This codebook provides a quick overview of the fields in the GDELT Event file format and their
descriptions. GDELT Event records are stored in an expanded version of the dyadic CAMEO format,
capturing two actors and the action performed by Actor1 upon Actor2. A wide array of variables break
out the raw CAMEO actor codes into their respective fields to make it easier to interact with the data,
the Action codes are broken out into their hierarchy, the Goldstein ranking score is provided, a unique
array of georeferencing fields offer estimated landmark-centroid-level geographic positioning of both
actors and the location of the action, and a new “Mentions” table records the network trajectory of the
story of each event “in flight” through the global media system.
At present, only records from February 19, 2015 onwards are available in the GDELT 2.0 file format;
however, in late Spring 2015 the entire historical backfile back to 1979 will be released in the GDELT 2.0
format. Records are stored one per line, separated by a newline (\n), and are tab-delimited (note
that files have a “.csv” extension, but are actually tab-delimited).
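Because the “.csv” extension can mislead tools into splitting on commas, it may help to set the delimiter explicitly. The following is a minimal sketch in Python using only the standard library. The sample record is made up for illustration, and only the first five field names from this codebook are used; a real Event file carries many more fields per line.

```python
import csv
import io

# Only the first five fields described in this codebook are named here;
# a real Event record has many more columns.
FIRST_COLUMNS = ["GlobalEventID", "Day", "MonthYear", "Year", "FractionDate"]


def read_event_rows(handle):
    """Yield each record as a list of strings, splitting on tabs only."""
    # QUOTE_NONE: treat quote characters as ordinary text, not as field wrappers.
    reader = csv.reader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
    for row in reader:
        yield row


# Made-up sample record for illustration (not real GDELT data).
sample = "1\t20150219\t201502\t2015\t2015.1342\n"

rows = list(read_event_rows(io.StringIO(sample)))
first = dict(zip(FIRST_COLUMNS, rows[0]))
print(first["Day"])
```

For a file on disk, the same function could be given a handle from `open(path, newline="", encoding="utf-8")`, where `path` and the encoding are assumptions to check against the actual files.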
With the release of GDELT 2.0, the daily GDELT 1.0 Event files will still be generated each morning at
least through the end of Spring 2015 to enable existing applications to continue to function without
modification. Please note that at present, since GDELT 2.0 files are only available for events beginning
February 19, 2015, you will need to use GDELT 1.0 to examine longitudinal patterns (since it stretches
back to January 1, 1979) and use GDELT 2.0 moving forward for realtime events.
There are now two data tables created every 15 minutes for the GDELT Event dataset. The first is the
traditional Event table. This table is largely identical to the GDELT 1.0 format, but does have several
changes as noted below. In addition to the Event table there is now a new Mentions table that records
all mentions of each event. As an event is mentioned across multiple news reports, each of those
mentions is recorded in the Mentions table, along with several key indicators about that mention,
including the location within the article where the mention appeared (in the lead paragraph versus
being buried at the bottom) and the “confidence” of the algorithms in their identification of the event
from that specific news report. The Confidence measure is a new feature in GDELT 2.0 that makes it
possible to adjust the sensitivity of GDELT towards specific use cases. Those wishing to find the earliest
glimmers of breaking events or reports of very small-bore events that tend to appear only as part of
periodic “round up” reports can use the entire event stream, while those wishing to find only the largest
events with strongly detailed descriptions can filter the Event stream to find only those events with the
highest Confidence measures. This allows the GDELT Event stream to be dynamically filtered for each
individual use case (learn more about the Confidence measure below). It also makes it possible to
identify the “best” news report to return for a given event (filtering all mentions of an event for those
with the highest Confidence scores, most prominent positioning within the article, and/or in a specific
source language – such as Arabic coverage of a protest versus English coverage of that protest).
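The filtering described above can be sketched as follows. This is an illustration only: the `Mention` class and its attribute names are hypothetical and are not the actual Mentions table columns, the sample values are made up, and the sketch assumes that a larger Confidence value means higher confidence.

```python
from dataclasses import dataclass


@dataclass
class Mention:
    # Hypothetical attribute names, not the real Mentions table fields.
    event_id: int
    source: str
    confidence: int   # assumed: larger means more confident
    in_lead: bool     # assumed: True if the mention is in the lead paragraph
    language: str


def filter_by_confidence(mentions, minimum):
    """Keep only mentions at or above a chosen Confidence threshold."""
    return [m for m in mentions if m.confidence >= minimum]


def best_mention(mentions, language=None):
    """Pick the mention with the highest confidence, preferring lead placement."""
    candidates = [m for m in mentions if language is None or m.language == language]
    if not candidates:
        return None
    return max(candidates, key=lambda m: (m.confidence, m.in_lead))


# Made-up mentions of a single event.
mentions = [
    Mention(1, "a.example", 40, False, "eng"),
    Mention(1, "b.example", 90, True, "eng"),
    Mention(1, "c.example", 90, False, "ara"),
]

strong = filter_by_confidence(mentions, 50)
overall = best_mention(mentions)
arabic = best_mention(mentions, language="ara")
```

Ranking on the tuple `(confidence, in_lead)` is one possible design choice: confidence decides first, and placement in the lead paragraph only breaks ties.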
EVENTID AND DATE ATTRIBUTES
The first few fields of an event record capture its globally unique identifier number, the date the event
took place on, and several alternatively formatted versions of the date designed to make it easier to
work with the event records in different analytical software programs that may have specific date
format requirements. The parenthetical after each variable name gives the datatype of that field.
Note that even though GDELT 2.0 operates at a 15 minute resolution, the date fields in this section still
record the date at the daily level, since this is the resolution that event analysis has historically been
performed at. To examine events at the 15 minute resolution, use the DATEADDED field (the
second-to-last field in this table).
GlobalEventID. (integer) Globally unique identifier assigned to each event record that uniquely
identifies it in the master dataset. NOTE: While these will often be sequential with date, this is
NOT always the case and this field should NOT be used to sort events by date: the date fields
should be used for this. NOTE: There is a large gap in the sequence between February 18, 2015
and February 19, 2015 with the switchover to GDELT 2.0 – these are not missing events, the ID
sequence was simply reset at a higher number so that it is possible to easily distinguish events
created after the switchover to GDELT 2.0 from those created using the older GDELT 1.0 system.
Day. (integer) Date the event took place in YYYYMMDD format. See the DATEADDED field for the 15 minute resolution.
MonthYear. (integer) Alternative formatting of the event date, in YYYYMM format.
Year. (integer) Alternative formatting of the event date, in YYYY format.
FractionDate. (floating point) Alternative formatting of the event date, computed as YYYY.FFFF,
where FFFF is the percentage of the year completed by that day. This collapses the month and
day into a fractional range from 0 to 0.9999, capturing the 365 days of the year. The fractional
component (FFFF) is computed as (MONTH * 30 + DAY) / 365. This is an approximation and does
not correctly take into account the differing numbers of days in each month or leap years, but
offers a simple single-number sorting mechanism for applications that wish to estimate the
rough temporal distance between dates.
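As a worked illustration of the date fields above, the sketch below derives MonthYear, Year, and FractionDate from a Day value, using the formula exactly as stated, (MONTH * 30 + DAY) / 365. The function name and sample records are hypothetical. Taken literally, this formula gives a fractional part above 1 for dates late in the year (month 12, day 31 gives 391 / 365), so the result should be read as an illustration of the stated approximation rather than a reproduction of the values stored in the files.

```python
def alternative_dates(day):
    """Derive (MonthYear, Year, FractionDate) from a YYYYMMDD integer."""
    year = day // 10000
    month = (day // 100) % 100
    day_of_month = day % 100
    month_year = day // 100  # YYYYMM
    # Formula as stated in the codebook text; an approximation only.
    fraction = (month * 30 + day_of_month) / 365
    return month_year, year, year + fraction


month_year, year, fraction_date = alternative_dates(20150219)

# Sort by the date field, not by GlobalEventID (made-up records).
events = [
    {"GlobalEventID": 3, "Day": 20150219},
    {"GlobalEventID": 7, "Day": 20150101},
]
by_date = sorted(events, key=lambda e: e["Day"])
```

Because YYYYMMDD integers increase with the calendar date, sorting on Day directly gives chronological order without parsing.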
ACTOR ATTRIBUTES
The next fields describe attributes and characteristics of the two actors involved in the event. This
includes the complete raw CAMEO code for each actor, its proper name, and associated attributes. The
raw CAMEO code for each actor contains an array of coded attributes indicating geographic, ethnic, and
religious affiliation and the actor’s role in the environment (political elite, military officer, rebel, etc).
These 3-character codes may be combined in any order and are concatenated together to form the final
raw actor CAMEO code. To make it easier to utilize this information in analysis, this section breaks these
codes out into a set of individual fields that can be separately queried. NOTE: all attributes in this
section other than CountryCode are derived from the TABARI ACTORS dictionary and are NOT
supplemented from information in the text. Thus, if the text refers to a group as “Radicalized
terrorists,” but the TABARI ACTORS dictionary labels that group as “Insurgents,” the latter label will be
used. Use the GDELT Global Knowledge Graph to enrich actors with additional information from the rest
of the article. NOTE: the CountryCode field reflects a combination of information from the TABARI
ACTORS dictionary and text, with the ACTORS dictionary taking precedence, and thus if the text refers to
“French Assistant Minister Smith was in Moscow,” the CountryCode field will list France, while the
geographic fields discussed at the end of this manual may list Moscow as
his/her location. NOTE: One of the two actor fields may be blank in complex or single-actor situations or
may contain only minimal detail for actors such as “Unidentified gunmen.”
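Since the raw actor code is a concatenation of 3-character codes, it can be split back into its components with simple slicing. The sketch below assumes every component is exactly three characters long, as the text above describes; the function name is hypothetical and “USAGOV” is used only as an illustrative value. Which attribute each piece represents still has to be looked up in the CAMEO taxonomy, because the pieces may appear in any order.

```python
def split_actor_code(raw_code):
    """Split a raw CAMEO actor code into consecutive 3-character pieces."""
    if len(raw_code) % 3 != 0:
        raise ValueError(f"length of {raw_code!r} is not a multiple of 3")
    return [raw_code[i:i + 3] for i in range(0, len(raw_code), 3)]


parts = split_actor_code("USAGOV")

# A blank actor field (no actor identified) yields an empty list.
blank = split_actor_code("")
```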
GDELT currently uses the CAMEO version 1.1b3 taxonomy. For more information on what each specific
code in the fields below stands for and the complete available taxonomy of the various fields below,
please see the CAMEO User Manual or the GDELT website for crosswalk files.
Actor1Code. (string) The complete raw CAMEO code for Actor1 (includes geographic, class,
ethnic, religious, and type classes). May be blank if the system was unable to identify an Actor1.
Actor1Name. (string) The actual name of the Actor1. In the case of a political leader or
organization, this will be the leader’s formal name (GEORGE W BUSH, UNITED NATIONS), for a
geographic match it will be either the country or capital/major city name (UNITED STATES /
PARIS), and for ethnic, religious, and type matches it will reflect the root match class (KURD,
CATHOLIC, POLICE OFFICER, etc). May be blank if the system was unable to identify an Actor1.
Actor1CountryCode. (string) The 3-character CAMEO code for the country affiliation of Actor1.
May be blank if the system was unable to identify an Actor1 or determine its country affiliation
(such as “UNIDENTIFIED GUNMEN”).
Actor1KnownGroupCode. (string) If Actor1 is a known IGO/NGO/rebel organization (United
Nations, World Bank, al-Qaeda, etc) with its own CAMEO code, this field will contain that code.
Actor1EthnicCode. (string) If the source document specifies the ethnic affiliation of Actor1 and
that ethnic group has a CAMEO entry, the CAMEO code is entered here. NOTE: a few special
groups like ARAB may also have entries in the type column due to legacy CAMEO behavior.
NOTE: this behavior is highly experimental and may not capture all affiliations properly – for
more comprehensive and sophisticated identification of ethnic affiliation, it is recommended
that users use the GDELT Global Knowledge Graph’s ethnic, religious, and social group
taxonomies and post-enrich actors from the GKG.
Actor1Religion1Code. (string) If the source document specifies the religious affiliation of Actor1
and that religious group has a CAMEO entry, the CAMEO code is entered here. NOTE: a few
special groups like JEW may also have entries in the geographic or type columns due to legacy
CAMEO behavior. NOTE: this behavior is highly experimental and may not capture all affiliations
properly – for more comprehensive and sophisticated identification of religious affiliation, it is
recommended that users use the GDELT Global Knowledge Graph’s ethnic, religious, and social
group taxonomies and post-enrich actors from the GKG.
Actor1Religion2Code. (string) If multiple religious codes are specified for Actor1, this contains
the secondary code. Some religion entries automatically use two codes, such as Catholic, which
invokes Christianity as Code1 and Catholicism as Code2.
Actor1Type1Code. (string) The 3-character CAMEO code of the CAMEO “type” or “role” of
Actor1, if specified. This can be a specific role such as Police Forces, Government, Military,
Political Opposition, Rebels, etc, a broad role class such as Education, Elites, Media, Refugees, or