Last modified by Normann P. Nielsen on 2025/12/12 15:13

Show last authors
1 I stumpled upon the [[INSERT INTO LOGSTASH SELECT DATA FROM DATABASE>>https://www.elastic.co/blog/logstash-jdbc-input-plugin]] article and decided to play around.
2
3 Installing PostgreSQL and loading the database and the data into Elasticsearch was just as described; the manual for MBSlave is very good.
4
5 Refer to the [[https://musicbrainz.org/doc/MusicBrainz_Database]] for more
6
7
8 {{success}}I decided to use elkserver3 and a new logstash on that one, to avoid messing elkserver1 up.
9
10 Is possible to have a different Logstash config on each server in the cluster. Also - I decided to name the Index "musicbrainz-%{+YYYY.MM.dd}" to have some control over the load and a possible cleanup afterwards.{{/success}}
11
12
13
14
15 My Input file:
16
17
18 {{code language="10-musicbrainz.conf"}}
19 input {
20 jdbc {
21 jdbc_driver_library => "/etc/logstash/postgresql-9.4.1212.jre6.jar"
22 jdbc_driver_class => "org.postgresql.Driver"
23 jdbc_connection_string => "jdbc:postgresql://localhost:5432/musicbrainz?user=musicbrainz&password=*******"
24 jdbc_user => "musicbrainz"
25 statement_filepath => "/etc/logstash/query.sql"
26 schedule => "0 15 * * *"
27 }
28 }
29 {{/code}}
30
31 Notice the Schedule - the Query runs one time each day at 15:00 - but the data is static, so its not nessesary....But the schedule makes sure I know when the Query is runned one time (only)
32
33 My Output file:
34
35
36 {{code}}
37 output {
38
39 elasticsearch
40 {
41 hosts => "localhost:9200"
42 sniffing => false
43 manage_template => false
44 index => "musicbrainz-%{+YYYY.MM.dd}"
45 }
46 }
47 {{/code}}
48
49 The load gave this in Kibana:
50
51 [[image:Screen Shot 2016-12-30 at 15.58.06.png||width="900"]]
52
53 And verifying the row count in PostgreSQL with:
54
55
56 {{code}}
57 SELECT count(*) AS Dummy FROM (
58 SELECT
59 release_group.gid AS album_id,
60 release_group.type AS album_primary_type_id,
61 release_group_primary_type.name AS album_primary_type_name,
62 release.name AS release_name,
63 artist.name AS artist_name,
64 artist.gid AS artist_gid,
65 artist_credit.id AS artist_credit_id,
66 artist.type AS artist_type_id,
67 artist_type.name AS artist_type_name,
68 artist.begin_date_year artist_begin_date_year,
69 area.name AS artist_country_name,
70 release_country.date_year AS release_year,
71 release_country.date_month AS release_month,
72 release_country.date_day AS release_day
73 FROM
74 musicbrainz.artist
75 INNER JOIN musicbrainz.artist_credit_name
76 ON artist_credit_name.artist = artist.id
77 INNER JOIN musicbrainz.artist_credit
78 ON artist_credit.id = artist_credit_name.artist_credit
79 INNER JOIN musicbrainz.release_group
80 ON release_group.artist_credit = artist_credit.id
81 INNER JOIN musicbrainz.release
82 ON release.release_group = release_group.id
83 INNER JOIN musicbrainz.release_country
84 ON release.id = release_country.release
85 INNER JOIN musicbrainz.artist_type
86 ON artist.type = artist_type.id
87 INNER JOIN musicbrainz.area
88 ON artist.area = area.id
89 INNER JOIN musicbrainz.release_group_primary_type
90 ON release_group_primary_type.id = release_group.type
91 WHERE
92 ((release_country.date_year IS NOT NULL) AND
93 (release_country.date_month IS NOT NULL) AND
94 (release_country.date_day IS NOT NULL))
95 ) As Dummy2
96 {{/code}}
97
98 Gave:
99
100
101 {{code}}
102 dummy
103 --------
104 622527
105 (1 row)
106 {{/code}}
107
108 Success - same row count :)
109
110 I do notice that some rows seems to be the same:
111
112 [[image:Screen Shot 2016-12-30 at 16.14.01.png||width="900"]]
113
114 Or not? The "album_id" is the same, but in one row the "release_year" differs from the two others....
115
116 Running the SQL
117
118
119 {{code}}
120 SELECT Distinct * FROM (
121 SELECT
122 release_group.gid AS album_id,
123 release_group.type AS album_primary_type_id,
124 release_group_primary_type.name AS album_primary_type_name,
125 release.name AS release_name,
126 artist.name AS artist_name,
127 artist.gid AS artist_gid,
128 artist_credit.id AS artist_credit_id,
129 artist.type AS artist_type_id,
130 artist_type.name AS artist_type_name,
131 artist.begin_date_year artist_begin_date_year,
132 area.name AS artist_country_name,
133 release_country.date_year AS release_year,
134 release_country.date_month AS release_month,
135 release_country.date_day AS release_day
136 FROM
137 musicbrainz.artist
138 INNER JOIN musicbrainz.artist_credit_name
139 ON artist_credit_name.artist = artist.id
140 INNER JOIN musicbrainz.artist_credit
141 ON artist_credit.id = artist_credit_name.artist_credit
142 INNER JOIN musicbrainz.release_group
143 ON release_group.artist_credit = artist_credit.id
144 INNER JOIN musicbrainz.release
145 ON release.release_group = release_group.id
146 INNER JOIN musicbrainz.release_country
147 ON release.id = release_country.release
148 INNER JOIN musicbrainz.artist_type
149 ON artist.type = artist_type.id
150 INNER JOIN musicbrainz.area
151 ON artist.area = area.id
152 INNER JOIN musicbrainz.release_group_primary_type
153 ON release_group_primary_type.id = release_group.type
154 WHERE
155 ((release_country.date_year IS NOT NULL) AND
156 (release_country.date_month IS NOT NULL) AND
157 (release_country.date_day IS NOT NULL))
158 ) As Dummy2
159 {{/code}}
160
161 Gave
162
163
164 {{code}}
165 560155 rows
166 {{/code}}
167
168 So, there is a possible redundancy in the SQL provided from [[https://www.elastic.co/blog/logstash-jdbc-input-plugin]]
169
170 Reloading the data (after deleting the Index) gives:
171
172 [[image:Screen Shot 2016-12-30 at 17.08.50.png||width="900"]]
173
174 = Samples =
175
176 Here is a sample of all albums from "Denmark" with "Peter" in the Artist name:
177
178 [[image:Screen Shot 2016-12-30 at 16.20.58.png||width="900"]]
179
180 A few Visualizations..
181
182 [[image:Screen Shot 2016-12-31 at 09.14.14.png||width="900"]]
183
184 = Whats Next.. =
185
186 Well, this is unfinished business...there are so much more data to combine......
187
188 And another project could be parsing IMDB data ... [[http://www.imdb.com/interfaces]]